Thursday, April 30, 2009

Day 30 - Performance Tests

There's a set of performance tests to do before choosing the final migration infrastructure.
There are tests over the initial data extraction from OS/390 and loading into AS/400.
There are tests over the data transformation rules.
There are tests over the final data loading into GIS.

From this set of tests, only ours, the data transformation rules, have been performed so far.
As expected, the AS/400 is not a fast machine for running the Data Fusion transformation engine. We already knew this, but the AS/400 system administrators always say we're wrong and that they have a fantastic, fast machine. Their pride has been hurt...

As an example, running entirely on the AS/400, 400,000 source records migrated at ~800 records per second. Note that this was a local data migration execution: the source and target databases are on the same machine as the transformation engine.
But running the same transformation engine on my not very fast desktop computer, with 1 GB of RAM, the same 400,000 records migrated at ~1,400 records per second. Note that in this test the source data was retrieved from the AS/400, transformed on my desktop computer, and sent back to the AS/400, i.e. the data had to travel over a common 100 Mbit network.
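
To put those rates in perspective, 400,000 records at ~800 records per second is roughly 500 seconds (over 8 minutes) entirely on the AS/400, while ~1,400 records per second on the desktop brings it down to roughly 286 seconds (under 5 minutes), network round trip included.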

The protocol loading interface was also subject to tests and it's CPU bound. We have a single processor, of a 4 processor machine, and it consumes 100% of that CPU. When running on the AS/400, it performed at ~12,000 records per second. Running on the desktop it does only ~8,000 records per second.

The scalability tests on the AS/400, with a single CPU for us, proved that a linear execution is faster than two parallel executions. We haven't had the chance to test parallel execution with more CPUs.

We've been discussing the best approach to allow us to scale, and one thing is for sure: we need to run the transformation engine outside the AS/400, because on D-day we'll have to share the machine with others, as is happening now, specifically with the GIS data loading.
Probably we'll have a 6 or 8 CPU machine running Linux and connected to the AS/400 over a 1 Gbit network. This will allow us to run multiple transformation engines and free the AS/400 for the other data migration processes.
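
Just to make the idea concrete, here's a minimal sketch, under a lot of assumptions, of how the source extraction could be split by key range across several engine instances running on that Linux box. The table, key column and key range are invented for illustration, and this is not how Data Fusion itself is configured.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: partition the source extraction by key range so that
    // several transformation engine instances can run in parallel. The table
    // (SRCLIB.POLICY), key column (POLICY_ID) and key range are made up.
    public class ParallelExtractSketch {

        private static final String URL = "jdbc:as400://as400-host"; // JTOpen JDBC URL
        private static final int PARTITIONS = 6;                     // e.g. one per CPU

        public static void main(String[] args) throws Exception {
            Class.forName("com.ibm.as400.access.AS400JDBCDriver");
            final long minId = 1L, maxId = 400000L;                  // known key range
            final long step = (maxId - minId + 1) / PARTITIONS;

            ExecutorService pool = Executors.newFixedThreadPool(PARTITIONS);
            for (int p = 0; p < PARTITIONS; p++) {
                final long lo = minId + p * step;
                final long hi = (p == PARTITIONS - 1) ? maxId : lo + step - 1;
                pool.submit(new Runnable() {
                    public void run() {
                        extractAndTransform(lo, hi);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        private static void extractAndTransform(long lo, long hi) {
            String sql = "SELECT * FROM SRCLIB.POLICY WHERE POLICY_ID BETWEEN ? AND ?";
            try (Connection c = DriverManager.getConnection(URL, "user", "password");
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setLong(1, lo);
                ps.setLong(2, hi);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // hand each record to a transformation engine instance here
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }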

Day 28 - DB2 Problems

We've already used the AS/400 and DB2 on previous migrations.
Our experience tells us that they're not very good when it comes to using them as the main data migration machine.

There's a set of problems in DB2 that we were afraid of finding again in this AS/400 version. This is the latest AS/400 version and, in the words of the system administrator, it may currently be the fastest AS/400 in Portugal.

Our pessimistic expectations proved to be right. DB2 continues to have problems, which seem to be bugs in its engine.
For instance, executing a query works fine, but creating a view over that same query is impossible because DB2 crashes with an internal error!
I had to create a new table with the data from the resulting query instead.
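
For the record, the workaround is nothing more than materializing the query result into a work table instead of wrapping it in a view. A minimal JDBC sketch of the idea follows; the library, table and query are placeholders, and the exact CREATE TABLE AS syntax may vary between DB2 versions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Minimal sketch of the workaround: materialize the query into a table
    // instead of creating a view over it. WORKLIB.CONTACT_CLEAN and the
    // SELECT are placeholders.
    public class MaterializeQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("com.ibm.as400.access.AS400JDBCDriver"); // JTOpen driver
            try (Connection c = DriverManager.getConnection("jdbc:as400://as400-host", "user", "password");
                 Statement s = c.createStatement()) {
                s.executeUpdate(
                    "CREATE TABLE WORKLIB.CONTACT_CLEAN AS (" +
                    "  SELECT ID, NAME FROM SRCLIB.CONTACT WHERE NAME IS NOT NULL" +
                    ") WITH DATA");
            }
        }
    }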

I still have not found the "query exploded" problem, which seems related to the DB2 optimizer crashing when trying to optimize the query.

Our confidence in DB2 on the AS/400 is a bit low right now, since we proved to be right when we said we were expecting problems from it.

Friday, April 24, 2009

Day 24 - New Mapping Rules

There was a new meeting for mapping rule definition. Most of the transformation rules are closed; there are just some tweaks left, related to foreign addresses and some internal codes.

Next week there will be more full migration cycles with fresh data, since we've discovered that the current test data was not correct.

Meanwhile, performance tests are being performed on the AS/400, and we were able to convince the system administrators to reconfigure the anti-virus not to scan Java .jar files, which improves performance on our desktops.
And yesterday, the desktops were automatically updated and rebooted. That policy has also been removed from our desktops, since it is far too dangerous to perform such operations while a data migration is being controlled from our desktops.

Saturday, April 18, 2009

Day 17 - First Try

The protocol loading interface is ready; it's cool, but I didn't code it.
And I have the first mappings ready to roll.

We've done a full cycle, with about 85% of the transformation rules implemented, and we got fewer problems than I expected for a first try.
Foreign zip codes were wrong, which we were expecting since this mapping specification is still under discussion.
People's names were wrongly specified, since the surname was being lost.
There were some empty honorific titles, since they are also still under specification.
A couple of bytes were in the wrong place in the protocol file.

We've tried it with one hundred records from the source, which resulted in approximately 2,600 records in the protocol file.

The whole process was executed on our desktop computers, since the AS/400 is still not ready for us: Java 1.5 is missing. Next week I'm expecting to execute the whole process on the AS/400.

Tuesday, April 14, 2009

Day 14 - First Mapper

I've just set up the Data Fusion IDE and server on my machine. Until we install the server on AS/400, my desktop will have to do the trick. I'm using the JTOpen JDBC driver to connect to AS/400 files via DB2.

In order to create the Data Fusion table layouts automatically, I've set up a simple command line application that connects to the AS/400 through the same JTOpen JDBC driver and queries the files/tables metadata, in particular their structure: column names, data types, precision, etc.
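
For the curious, the metadata lookup is just standard JDBC DatabaseMetaData calls going through the JTOpen driver. A simplified sketch, where host, library and credentials are obviously placeholders:

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    // Simplified sketch of the metadata extraction: list the columns of every
    // table in a given library (schema) using standard JDBC metadata calls
    // over the JTOpen driver.
    public class As400TableLayout {
        public static void main(String[] args) throws Exception {
            Class.forName("com.ibm.as400.access.AS400JDBCDriver");
            try (Connection c = DriverManager.getConnection("jdbc:as400://as400-host", "user", "password")) {
                DatabaseMetaData md = c.getMetaData();
                try (ResultSet cols = md.getColumns(null, "SRCLIB", "%", "%")) {
                    while (cols.next()) {
                        System.out.printf("%s.%s  type=%s  size=%d  scale=%d%n",
                            cols.getString("TABLE_NAME"),
                            cols.getString("COLUMN_NAME"),
                            cols.getString("TYPE_NAME"),
                            cols.getInt("COLUMN_SIZE"),
                            cols.getInt("DECIMAL_DIGITS"));
                    }
                }
            }
        }
    }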

I've also implemented some transformation rules to run some tests and everything worked fine.
I'm still waiting for some target specification information from the target team, but I believe that I'll have two mappers ready for some serious testing real soon.

I've also received the production data, which means now I can test my data profiler with real data.

Monday, April 13, 2009

Day 13 - Data Profiler Enhanced

Today the data profiler received a ranking function that ranks the profiled data and cuts off the data that should be discarded.

When data is profiled, a tuple with the term and its context is saved.
The context helps in the evaluation of the term, especially to know if it's a false positive.
The context is also used to cut off data.
It is possible to keep only terms that have a ranking index, calculated from the term context, equal to or higher than a given value. Or it is possible to keep only the terms up to the point where the rank drops too sharply, i.e., if the current term differs by more than a given percentage from the immediately higher-ranked term, the current term and all terms with lower ranks are discarded.
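
A rough sketch of the two cut-off strategies, just to make them concrete; the term and rank representation here is simplified, not the profiler's actual data structures:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the two cut-off strategies described above.
    public class RankCutoff {

        // Keep only terms whose rank is equal to or higher than a minimum value.
        static List<String> cutByThreshold(List<String> terms, List<Double> ranks, double min) {
            List<String> kept = new ArrayList<String>();
            for (int i = 0; i < terms.size(); i++) {
                if (ranks.get(i) >= min) {
                    kept.add(terms.get(i));
                }
            }
            return kept;
        }

        // Terms are assumed sorted by rank, highest first. As soon as a term's
        // rank drops more than maxDropPct (e.g. 0.5 = 50%) relative to the
        // previous term, that term and everything below it is discarded.
        static List<String> cutByRelativeDrop(List<String> terms, List<Double> ranks, double maxDropPct) {
            List<String> kept = new ArrayList<String>();
            for (int i = 0; i < terms.size(); i++) {
                if (i > 0) {
                    double previous = ranks.get(i - 1);
                    double drop = (previous - ranks.get(i)) / previous;
                    if (drop > maxDropPct) {
                        break;
                    }
                }
                kept.add(terms.get(i));
            }
            return kept;
        }
    }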

With these enhancements, the data profiler is finished.
Unless there's some special need to profile other data, it is done for now.

Thursday, April 9, 2009

Day 9 - Data Profiling

I'm continuing to develop the data cleaning function for the honorific titles.
Currently I'm developing a data profiler, a standalone Java application that queries DB2 via the JTOpen JDBC driver, and it has already found some anomalies in the person contact names. There are the expected typos, some ad-hoc markers, and finally, the usual user creativity.

These user creativity anomalies are the funniest of them all. In the person contact name there are random characters, like "USO USSSSS", training data, like "TEST", and totally nonsensical person names, like "EXITING NOW", "ALSO EXITING" and "NEW".

Day 8 - Honorific Title Cleaning

I've started to work on a data cleaning function to detect honorific titles in a free text field.
The text field holds the contact name of a person, which might be preceded or followed by the honorific title.

I'm working on several distinct approaches. The first one is the classical one, using a dictionary. The second one is automatic, where the algorithm will try to infer whether an honorific title exists or not. And the last one is an extension of the automatic approach, using an exclusion dictionary.

In the first tests I've performed, I found a lot of false positives, therefore I'll use a first name dictionary to exclude them.
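
To give an idea of the dictionary approach combined with the first name exclusion, here's a tiny sketch; both dictionaries are made-up samples and the real matching is more elaborate:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Tiny sketch of the dictionary approach: look for a known honorific title
    // at the beginning or end of the free text contact name, and use a first
    // name dictionary to throw away false positives. Both dictionaries here
    // are made-up samples.
    public class HonorificDetector {

        private static final Set<String> TITLES =
            new HashSet<String>(Arrays.asList("DR", "DRA", "ENG", "SR", "SRA", "PROF"));
        private static final Set<String> FIRST_NAMES =
            new HashSet<String>(Arrays.asList("MARIA", "JOSE", "ANA", "JOAO"));

        /** Returns the detected title, or null if none was found. */
        static String detectTitle(String contactName) {
            String[] tokens = contactName.trim().toUpperCase().split("[\\s.]+");
            if (tokens.length < 2) {
                return null; // a lone token is the name itself, not a title
            }
            String first = tokens[0];
            String last = tokens[tokens.length - 1];
            if (TITLES.contains(first) && !FIRST_NAMES.contains(first)) {
                return first;
            }
            if (TITLES.contains(last) && !FIRST_NAMES.contains(last)) {
                return last;
            }
            return null;
        }

        public static void main(String[] args) {
            System.out.println(detectTitle("Dr. Joao Silva"));   // DR
            System.out.println(detectTitle("Maria Santos"));     // null
        }
    }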

Tuesday, April 7, 2009

Day 7 - First Mappings

The first mappings have been specified.

I've spent almost all day with the target team and the source teams, one from the main system and another from a secondary system.
It was a productive day, since we were able to map over 90% of both systems, which will migrate into the same target specification. The doubts and problems detected will be easily solved.

As expected, some cleaning will be performed by the transformation rules. For instance, in some circumstances, the honorific title will have to be automatically inferred from a free form text field.
It will also be necessary to perform duplicate detection and entity fusion before executing the migration.

None of these seems really problematic at this time.

Monday, April 6, 2009

Day 6 - Setup for First Mapping

The plan for this phase has been rolled out and we've prepared for the first mapping meeting tomorrow.
We've also set up the development environment on the computer that has arrived.
My computer has not arrived yet, and that will become a problem if it does not arrive in the next two days.
We have also requested a new data load into DB2, but this time the structure will be preserved.

Friday, April 3, 2009

Day 3 - Target specification

The target specification arrived today.
It's a hierarchical structure that will result in a flat file where each line represents a specific piece of information, identified by a code.
We've started to work on it, but since the specification is more like a meta-model, we had to make some adjustments to comply with such a dynamic structure.
All problems have been solved; they were actually simple decisions that the source team helped to make.
The DB2 model has been created and our scripts have turned it into mapping sheets, which will be used in the functional mapping performed by the source and target teams.

A piece of the source data has become available. It has been extracted from the OS/390 in EBCDIC and in text formats in order to allow some AS/400 loading tests.
The data looks correct, but each row has been loaded as a single record: the copybook rules have not been applied and therefore it will have to be loaded again. Nevertheless, the fact that the information looks correct is a good enough result for a first try.

The first workstation also arrived today, allowing us to install and configure some of the software required for the data migration and to access the development AS/400 machine.

It will not depend on us, but if next week things work out the way we've planned, we'll be able to recover the one day delay.

Day 2 - First delay

The second day of the data migration project was just like any other second day of any other project: the basic stuff requested from the sponsor and the target team, and guaranteed by the project sponsor, was not ready.
Management tried to get things done, but was unable to comply with the plan, therefore the project is delayed by one day.

Unfortunately this is so usual that it is actually becoming the de facto standard project start-up: no matter what kind of technological project it is, it usually starts delayed.

Wednesday, April 1, 2009

Day 1 - Kick off

The project has been officially kicked off.

There was the traditional meeting where all teams were present and an overview of the project was presented. All teams involved have previous data migration experience, which is good.

The project should be completed around mid October, but I felt that management's confidence in meeting that date is not very high. This is not as bad a signal as one might think, since management showed that it is aware of some real difficulties. Those difficulties may become technical challenges and may delay the project.
Nevertheless, I do believe that the original due date can be achieved.

There will be two phases. The first phase, the current one, will use a small entity subset in order to test the performance of the ETL and to tune the process.
The second phase will be a continuation of the first phase, but now including all applications and entities.


The project itself consists of migrating the current insurance application, running on an OS/390 mainframe, into the new system, GIS Non-Life, running on AS/400, all using DB2, SAM and VSAM.
There are some other satellite applications running on SQL Server 2005 on Windows.
The ETL tool will be Data Fusion and it will run on AS/400 under the Qshell.

The ETL process will be implemented in a slightly different way from the usual scenario. Usually the data is extracted from the source database and loaded into the target database. This is not the case in this project.
Eventually, the data will be loaded into the final target database, but there's an intermediate step. The ETL process performed by Data Fusion will deliver the data in flat files that will later be consumed by the GIS loader. The GIS loader will validate and load the data into the final database.
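
Just to illustrate the intermediate step, here's a toy sketch of writing fixed-width protocol lines; the record layout below is invented, the real format is defined by the GIS loader:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    // Illustration only: the idea is that each line carries a record code
    // followed by fixed-width fields, so every byte has to land in the
    // right position.
    public class ProtocolFileSketch {

        static String pad(String value, int width) {
            if (value.length() > width) {
                return value.substring(0, width);      // truncate overflow
            }
            StringBuilder sb = new StringBuilder(value);
            while (sb.length() < width) {
                sb.append(' ');                         // pad with spaces to the fixed width
            }
            return sb.toString();
        }

        public static void main(String[] args) throws IOException {
            BufferedWriter out = new BufferedWriter(new FileWriter("protocol.txt"));
            try {
                // hypothetical record: code "0100", a 10-char id and a 40-char name
                out.write("0100" + pad("12345", 10) + pad("JOAO SILVA", 40));
                out.newLine();
            } finally {
                out.close();
            }
        }
    }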

Since our (ETL team) experience tells us that performance on AS/400 may be a problem during the ETL process, there was some discussion over the performance tests to be performed during phase one.

If performance becomes an issue, the data migration may have to be incremental instead of one-shot. This will be a problem because the source team cannot identify which data has changed and needs to be migrated again to refresh the target system. One challenge we've been given is to think of a fast way to identify whether a record has changed and needs to be migrated again, in case a plan B is required.
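
One possible direction, nothing decided, just a sketch to make the idea concrete, would be to fingerprint each source record with a hash and compare fingerprints between extraction runs; whatever changed since the last run would be migrated again:

    import java.security.MessageDigest;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;

    // Sketch only, not a decision: fingerprint each source record by hashing the
    // concatenation of its column values. Storing the fingerprint from the last
    // run would let us detect which records changed and need to be migrated again.
    public class RecordFingerprint {

        static String fingerprint(ResultSet rs) throws Exception {
            ResultSetMetaData md = rs.getMetaData();
            StringBuilder sb = new StringBuilder();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                String value = rs.getString(i);
                sb.append(value == null ? "<null>" : value).append('|'); // '|' as field separator
            }
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(sb.toString().getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        }
    }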

Data Migration Diary

I'm involved in a new data migration project, and since I'm really interested in this area, I'm starting this blog as a diary.

I won't write entries every day, since there will be times when only "boring" stuff happens.
But, I'll write as much as possible about the project management, the technical difficulties, the solutions, the workarounds, the mapping process, the business rules, and, of course, the ETL process.

I have experience in data migration projects in several areas, such as banking and public administration. This will be my first data migration project in the insurance area, which will be quite a challenge because the business rules in this area are very complex.

For confidentiality reasons, I'll keep the project sponsor and all parties involved incognito.
The only exception will be references to the technical stuff involved.