Friday, January 22, 2010
The migration scope has increased.
New source data has come into play, and new mapping specifications have been written for it.
This has resulted in a serious data loading performance problem.
Our data transformation procedure is still fast, but the creation of the GIS loader files from the transformed data is starting to give us some headaches.
The GIS data loader uses a very verbose flat file structure: each single value of each record is loaded through a 300-character text line.
This means that a single database table row is transformed into roughly as many text lines as the row has columns, plus the file and record header and footer structures.
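To make the verbosity concrete, here is a minimal sketch of how a single row might explode into loader lines. The column layout and field widths below are invented for illustration; the real GIS record format is not reproduced here.

# Illustration only: a hypothetical fixed-width layout showing the
# "one 300-character line per column value" explosion described above.
def row_to_loader_lines(table, row_id, row):
    """Turn one database row into one 300-character line per column value."""
    lines = []
    for column, value in row.items():
        # hypothetical layout: table (20), row id (20), column (30), value (230)
        line = f"{table:<20}{row_id:<20}{column:<30}{str(value):<230}"
        lines.append(line[:300])
    return lines

# a 3-column row becomes 3 lines (plus, in the real format, file and record headers/footers)
sample = {"POLICY_NO": "H-000123", "START_DATE": "2009-01-01", "PREMIUM": "450.00"}
for line in row_to_loader_lines("HOUSE_POLICY", "42", sample):
    print(len(line), line[:60].rstrip(), "...")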
As an example, the house insurance data transformation runs in around 4 hours of sequential time and generates about 184 million records, all on the transformation server.
These records are then exported in the GIS data loader file format, written from the Windows server directly onto the AS/400. This step is now taking over 6 hours of sequential time in the best-case scenario.
This is obviously too much time, so we are exploring several hypotheses, some of which can, and will, be combined (a sketch of the first two follows below):
- creating a parallel file write process;
- writing the files locally, with and without compression, and transferring them via FTP to the AS/400;
- maintaining clustered indexes with full coverage on a different disk;
- splitting the database schemas across several disks.
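Here is a minimal sketch combining the first two ideas: writing the loader files locally in parallel, compressed, and transferring them to the AS/400 via FTP. The host, credentials, chunk size and file naming are placeholders, not decisions we have made.

import gzip
from ftplib import FTP
from multiprocessing import Pool

def write_chunk(args):
    """Write one chunk of loader lines to a local gzip-compressed file."""
    chunk_id, lines = args
    path = f"loader_{chunk_id:03d}.txt.gz"
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    return path

def write_parallel(lines, processes=4, chunk_size=500_000):
    """Split the loader lines into chunks and write them from parallel processes."""
    chunks = [(i, lines[i * chunk_size:(i + 1) * chunk_size])
              for i in range((len(lines) + chunk_size - 1) // chunk_size)]
    with Pool(processes) as pool:
        return pool.map(write_chunk, chunks)

def ftp_to_as400(paths, host="as400.example", user="loader", password="secret"):
    """Transfer the compressed files to the AS/400 via FTP (placeholder credentials)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        for path in paths:
            with open(path, "rb") as f:
                ftp.storbinary(f"STOR {path}", f)

if __name__ == "__main__":
    files = write_parallel([f"line {i}".ljust(300) for i in range(1_000_000)])
    ftp_to_as400(files)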
We have to tune the process from our side, since there is not much GIS or the AS/400 can do when it comes to massive data load tuning.
We are facing a lot of hard work over the next few days.
Wednesday, November 25, 2009
Day 238 - Migration Status
Finally I got a set of entities that migrated correctly.
It was actually quite simple once the mapping rules were correctly specified according to the data contents. The main problem was that the source team had, as usual, some difficulty admitting that their data is dirty and its quality is actually a lot lower than they expected. In this particular case, it was the zip codes.
Once they admitted that the zip codes needed specific rules because they were dirty, it became an easy migration task.
In the next few days I expect to migrate other entity sets without errors.
The overall migration performance is quite slow, even using 5 parallel executions. We got some timings and, as already known, the GIS loading procedure is responsible for the biggest share of the migration time.
Usually we perform the whole ETL cycle within Data Fusion, which means we deliver the data directly to the target database, i.e. normally we would only have the Data Fusion transformation time; but in this particular scenario we also have to create a text file which GIS will then use to load the data.
Here's an example of the performance times we're having now:
Car Insurance Claims (470,563 records):
- Data Fusion transformation procedure: 51m
- File creation for GIS loading: 2h 10m
- GIS file loading: 238h sequential time (extrapolated from 16h of actual loading)
Personal Accident Insurance Claims (29,303 records):
- Data Fusion transformation procedure: 1m 51s
- File creation for GIS loading: 1m 40s
- GIS file loading: 2h 17m with 3 parallel processes (5h 9m sequential time)
Entities (682,569 records):
- Data Fusion transformation procedure: 6m 3s
- File creation for GIS loading: 23m
- GIS file loading: 5h 55m with 5 parallel processes (27h 17m sequential time)
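To put these numbers in perspective, here is a quick back-of-the-envelope check for the Car Insurance Claims entity, using only the figures above (the 238h is itself an extrapolation from 16h of actual loading):

records = 470_563                  # Car Insurance Claims
transform_s = 51 * 60              # Data Fusion transformation: 51m
load_s = 238 * 3600                # GIS file loading: 238h sequential

print(f"transformation: {records / transform_s:.1f} records/s")   # ~153.8 records/s
print(f"GIS loading:    {records / load_s:.2f} records/s")        # ~0.55 records/s
print(f"GIS loading is ~{load_s / transform_s:.0f}x slower")      # ~280x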
Tonight GIS will be cleaned up and the AS/400 will be tuned for better file-reading performance. In the next few days I'll get some new timings, hopefully better ones.
Thursday, November 5, 2009
Day 219 - New Convert Sequence Function
We have a functional requirement that needed our technical expertise to be solved; it was a small challenge, but an interesting one.
Data Fusion already has a convert sequence function, which receives a key and returns a sequence number.
It is very useful to convert things like keys and identifiers between the old system and the new system.
It works on a very simple basis: the new system key is requested based on the old system key; if it exists, it is returned, otherwise it is created by adding 1 to the current maximum value present and then returned.
Here is an example:
Old Key   | New Key
ACC0005   | 1
ACC0008   | 2
ACC0017   | 3
[...]     | [...]
This is persistent between data migration executions, which means only new keys will be added, and it is quite efficient since it is entirely loaded into memory.
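As an illustration only (this is not Data Fusion's actual implementation), the basic mechanism can be sketched like this:

class ConvertSequence:
    """Sketch of a convert sequence function: old system key -> new system key."""
    def __init__(self):
        self.table = {}     # old system key -> new system key (persisted between runs in the real thing)
        self.maximum = 0    # current maximum value already handed out

    def convert(self, old_key):
        if old_key in self.table:            # already converted: just return it
            return self.table[old_key]
        self.maximum += 1                    # otherwise add 1 to the current maximum...
        self.table[old_key] = self.maximum   # ...register the new key...
        return self.maximum                  # ...and return it

seq = ConvertSequence()
print(seq.convert("ACC0005"))   # 1
print(seq.convert("ACC0008"))   # 2
print(seq.convert("ACC0017"))   # 3
print(seq.convert("ACC0005"))   # 1 again, the key already exists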
This has suited us for almost a decade: none of our previous data migrations ever required identifier conversion, or creation, based on anything other than a set of identifiers.
But here a new challenge arose. We still must return the new system key based on the old system key, but the creation of a new system key is no longer based on the old system key.
The creation of the new system key is now based on a subset of the old system key.
Here is how it works:
- Request the new system key based on the old system key
- If it exists: return the value
- If it does not exist:
  - Select the next value from the old system sequence key
  - Add an entry to the conversion table registering the old system key, the old system sequence key and the new system key
  - Return the new system key
Here is a very simple example for better understanding.
Consider the old system key to be based on the source system account number (ACC) and historical image (IMG). Consider the sequence being incremented only by the account number. A conversion table would be something like:
Old Key            | Old Sequence Key | New Key
ACC0005 # IMG0001  | ACC0005          | 1
ACC0005 # IMG0002  | ACC0005          | 2
ACC0008 # IMG0001  | ACC0008          | 1
ACC0017 # IMG0001  | ACC0017          | 1
ACC0017 # IMG0002  | ACC0017          | 2
ACC0017 # IMG0003  | ACC0017          | 3
[...]              | [...]            | [...]
Note that the old system sequence key varies based on a subset of the old system key. Obviously, the example above is purely academic, but this easy-to-use mechanism is being used to solve a complex transformation problem and has become extremely useful.
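Purely as an illustration again (not Data Fusion's real code), the new variant can be sketched by keeping one counter per old system sequence key:

class SubsetConvertSequence:
    """Sketch of the new variant: the sequence is incremented per subset of the old key."""
    def __init__(self, sequence_key_of):
        self.sequence_key_of = sequence_key_of  # maps an old system key to its old sequence key
        self.table = {}                         # old system key -> new system key
        self.counters = {}                      # old sequence key -> current maximum value

    def convert(self, old_key):
        if old_key in self.table:                        # already converted: return it
            return self.table[old_key]
        seq_key = self.sequence_key_of(old_key)          # e.g. "ACC0005 # IMG0002" -> "ACC0005"
        self.counters[seq_key] = self.counters.get(seq_key, 0) + 1
        self.table[old_key] = self.counters[seq_key]     # register old key, sequence key and new key
        return self.table[old_key]

# the account number (before the "#") acts as the old system sequence key
seq = SubsetConvertSequence(lambda key: key.split("#")[0].strip())
for key in ["ACC0005 # IMG0001", "ACC0005 # IMG0002", "ACC0008 # IMG0001",
            "ACC0017 # IMG0001", "ACC0017 # IMG0002", "ACC0017 # IMG0003"]:
    print(key, "->", seq.convert(key))   # 1, 2, 1, 1, 2, 3 as in the table above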
Wednesday, April 1, 2009
Day 1 - Kick off
The project has been officially kicked off.
There was the traditional kick-off meeting where all teams were present and an overview of the project was presented. All teams involved have previous data migration experience, which is good.
The project should be completed around mid October, but I felt that management's confidence in meeting that date is not very high. This is not as bad a signal as one might think, since management showed that it is aware of some real difficulties. Those difficulties may become technical challenges and may delay the project.
Nevertheless, I do believe that the original due date can be achieved.
There will be two phases. The first phase, the current one, will use a small entity subset in order to test the performance of the ETL and to tune the process.
The second phase will be a continuation of the first, but now including all applications and entities.
The project itself consists of migrating the current insurance application, running on an OS/390 mainframe, into the new system, GIS Non-Life, running on AS/400, all using DB2, SAM and VSAM.
There are some other satellite applications running on SQL Server 2005 on Windows.
The ETL tool will be Data Fusion and it will run on AS/400 under the Qshell.
The ETL process will be implemented in a slightly different way from the usual scenario. Usually the data is extracted from the source database and loaded into the target database. This is not the case in this project.
Eventually, the data will be loaded into the final target database, but there's an intermediate step. The ETL process performed by Data Fusion will deliver the data on flat files that will be later consumed by the GIS loader. The GIS loader will validate and load the data into the final database.
Since our (ETL team) experience tells us that performance on AS/400 may be a problem during the ETL process, there was some discussion over the performance tests to be performed during phase one.
If performance becomes an issue, the data migration may have to be incremental instead of one-shot. This will be a problem because the source team cannot identify which data has changed and needs to be migrated again to refresh the target system. One challenge we have been given is to think of a fast way to identify whether a record has changed and needs to be migrated again, in case a plan B is required.
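One possible way to attack that challenge, sketched here purely as an assumption on our side and not a decision from the project, is to keep a hash of every source record per business key and compare it against the hashes persisted from the previous run:

import hashlib

def record_hash(record):
    """Hash a record's values in a stable column order."""
    payload = "|".join(f"{column}={record[column]}" for column in sorted(record))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_keys(previous_hashes, current_records):
    """Business keys whose records are new or changed since the previous run."""
    return {
        key for key, record in current_records.items()
        if previous_hashes.get(key) != record_hash(record)
    }

# previous_hashes would be the key -> hash map persisted after the last run
previous_hashes = {"POL-1": record_hash({"PREMIUM": 100})}
current_records = {"POL-1": {"PREMIUM": 120}, "POL-2": {"PREMIUM": 80}}
print(changed_keys(previous_hashes, current_records))   # {'POL-1', 'POL-2'}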
Data Migration Diary
I'm involved in a new data migration project, and since I'm really interested in this area, I'm starting this blog as a diary.
I won't write entries every day, since there will be times when only "boring" stuff happens.
But, I'll write as much as possible about the project management, the technical difficulties, the solutions, the workarounds, the mapping process, the business rules, and, of course, the ETL process.
I have experience in data migration projects in several areas, such as banking and public administration. This will be my first data migration project in the insurance area, which will be quite a challenge because the business rules in this area are very complex.
For confidentiality reasons, I'll keep the project sponsor and all parties involved incognito.
The only exception will be references to the technology involved.