Wednesday, November 25, 2009

Day 238 - Migration Status

Finally, I got a set of entities that migrated correctly.
It was actually quite simple once the mapping rules were correctly specified according to the actual data contents. The main problem was that the source team had, as usual, some difficulty admitting that their data is dirty and that its quality is a lot lower than they expected. In this particular case, the culprit was the zip codes.
Once they admitted that the zip codes were dirty and needed specific rules, it became an easy migration task.
In the next few days I expect to migrate other entity sets without errors.

Overall migration performance is quite slow, even using 5 parallel executions. We collected some timings and, as expected, the GIS loading procedure is responsible for the largest share of the migration time.
Usually we perform the ETL cycle entirely within Data Fusion, which means we deliver the data directly to the target database, i.e. normally we would only have the Data Fusion transformation procedure time. In this particular scenario, however, we have to create a text file which GIS then uses to load the data.
Here are the performance times we're seeing now:

Car Insurance Claims (470,563 records):
  • Data Fusion transformation procedure: 51m
  • File creation for GIS loading: 2h 10m
  • GIS file loading: 238h sequential time (estimated from 16h of actual loading)

Personal Accident Insurance Claims (29,303 records):
  • Data Fusion transformation procedure: 1m 51s
  • File creation for GIS loading: 1m 40s
  • GIS file loading: 2h 17m with 3 parallel processes (5h 9m sequential time)

Entities (682,569 records):
  • Data Fusion transformation procedure: 6m 3s
  • File creation for GIS loading: 23m
  • GIS file loading: 5h 55m with 5 parallel processes (27h 17m sequential time)
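Just to put these figures side by side, here's a quick throwaway Python sketch (not part of the migration code) that converts the sequential times above into records per hour. Only the record counts and timings reported above are used.

# Quick comparison of GIS loading throughput for the entity sets above;
# the record counts and sequential times are the ones reported in this
# post, nothing else is assumed.

def records_per_hour(records: int, sequential_hours: float) -> float:
    return records / sequential_hours

print(f"Car insurance claims:     {records_per_hour(470_563, 238.0):>8.0f} rec/h")        # ~1,977
print(f"Personal accident claims: {records_per_hour(29_303, 5 + 9 / 60):>8.0f} rec/h")    # ~5,690
print(f"Entities:                 {records_per_hour(682_569, 27 + 17 / 60):>8.0f} rec/h") # ~25,018

The car insurance claims are clearly the outlier, which is consistent with the loading problem described in the previous post.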

Tonight GIS will be cleaned up and the AS/400 will be tuned for better file-reading performance. In the next few days I'll collect new timings, hopefully better ones.

Wednesday, November 18, 2009

Day 232 - Loading Performance Problem

Finally we were able to start loading last weekend's data into GIS for testing purposes.
The entities were loaded with fewer than five thousand rejections out of nearly one million records.
It is not a bad ratio, but I was expecting a lower rejection rate by now.

The loading of a subset of the car insurance claims, around 450,000 records, has also started.
Most of the car insurance claims are loading without errors, but a critical problem has arisen.
GIS is loading around 2,100 records per hour.
This means that 50,000 records will require about 24 hours to load, and the full set would take an impossible 10 days.
That is sequential time; even if we use 6 CPUs at a time it will still take more than a day and a half to finish. On top of that, the AS/400 where GIS is running is unavailable for 3 hours every night for maintenance procedures.
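The arithmetic behind these estimates is simple; here's a rough Python sketch of it. The 2,100 records/hour rate and the 3-hour maintenance window are the figures above, and the 6-CPU case assumes a perfectly even split of the records, which is optimistic.

# Rough estimate of GIS loading duration at the observed throughput.
RATE = 2_100            # records loaded per hour (observed)
USABLE_HOURS = 24 - 3   # the AS/400 is unavailable 3 hours every night

def days_to_load(records: int, parallel: int = 1) -> float:
    hours = records / (RATE * parallel)
    return hours / USABLE_HOURS

print(f"50,000 records, sequential:  {50_000 / RATE:.1f} hours")            # ~23.8 hours
print(f"450,000 records, sequential: {days_to_load(450_000):.1f} days")     # ~10.2 days
print(f"450,000 records, 6 CPUs:     {days_to_load(450_000, 6):.1f} days")  # ~1.7 days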

This is more than a technical challenge; it is a critical situation that will require management involvement to find a solution.

Tuesday, November 17, 2009

Day 230 - Test Team Assembled

Finally the test team has been defined and the tests will start, with a two-week delay.

Unfortunately, the data we tried to load into GIS this weekend was almost entirely rejected.
We don't know what happened, but the loading logs indicate a change in the system configuration. This means that almost all the data that was previously loading without problems was rejected!

There is also a serious side effect: the testing team has no data to test.

We hope this will be fixed soon.

Thursday, November 5, 2009

Day 219 - New Convert Sequence Function

We had a functional requirement that needed our technical expertise to be solved; it was a small challenge, but an interesting one.

Data Fusion already has a convert sequence function, which receives a key and returns a sequence value.
It is very useful for converting things like keys and identifiers between the old system and the new system.
It works on a very simple basis: the new system key is requested based on the old system key; if it exists it is returned, otherwise it is created by adding 1 to the current maximum value and then returned (there's a small sketch of this below).
Here is an example:

Old Key      New Key
ACC0005      1
ACC0008      2
ACC0017      3
[...]        [...]

This is persistent between data migration executions, which means only new keys will be added, and it is quite efficient since it is entirely loaded into memory.
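To make the idea concrete, here's a minimal Python sketch of such a mechanism. It is not Data Fusion's actual implementation (and it skips the persistence between executions), just an illustration of the max-plus-one behaviour:

# Illustrative sketch only, not Data Fusion's real implementation: an
# in-memory map where an unknown old key gets max + 1 and a known old
# key always returns the value it was given the first time.

class ConvertSequence:
    def __init__(self):
        self._map: dict[str, int] = {}   # old system key -> new system key

    def convert(self, old_key: str) -> int:
        if old_key not in self._map:
            self._map[old_key] = max(self._map.values(), default=0) + 1
        return self._map[old_key]

seq = ConvertSequence()
print(seq.convert("ACC0005"))   # 1
print(seq.convert("ACC0008"))   # 2
print(seq.convert("ACC0017"))   # 3
print(seq.convert("ACC0005"))   # 1 (already converted, same value returned)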

This has suited us for almost a decade. None of our previous data migrations ever required identifier conversions, or creations, based on anything other than a set of identifiers.
But here a new challenge arose. We still must return the new system key based on the old system key, but the creation of a new system key is no longer based on the old system key.
The creation of the new system key is now based on a subset of the old system key.
Here is how it works:
  1. Request the new system key based on the old system key
  2. If it exists: return the value
  3. If it does not exist:
    1. Select the next value from the old system sequence key
    2. Add an entry to the conversion table registering the old system key, the old system sequence key and the new system key
    3. Return the new system key

Here is a very simple example for better understanding (a small sketch of the mechanism follows below).
Consider the old system key to be composed of the source system account number (ACC) and the historical image (IMG), and the sequence to be incremented per account number only. A conversion table would look something like this:


Old Key             Old Sequence Key    New Key
ACC0005 # IMG0001   ACC0005             1
ACC0005 # IMG0002   ACC0005             2
ACC0008 # IMG0001   ACC0008             1
ACC0017 # IMG0001   ACC0017             1
ACC0017 # IMG0002   ACC0017             2
ACC0017 # IMG0003   ACC0017             3
[...]               [...]               [...]
Note that the old system sequence key varies based on a subset of the old system key.

Obviously the above example is purely academic, but this easy-to-use mechanism is being used to solve a complex transformation problem and has become extremely useful.
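For the record, here's a minimal Python sketch of the new variant, along the same lines as the sketch above. Again, this is not the actual Data Fusion function, just an illustration: the lookup still uses the full old system key, but the counter is kept per old system sequence key (the account number in the example).

# Illustrative sketch only, not the real Data Fusion function.
class SubsetConvertSequence:
    def __init__(self):
        self._map: dict[str, int] = {}       # full old system key -> new system key
        self._counters: dict[str, int] = {}  # old system sequence key -> last value used

    def convert(self, old_key: str, old_sequence_key: str) -> int:
        if old_key not in self._map:                                   # step 3: not converted yet
            next_value = self._counters.get(old_sequence_key, 0) + 1   # 3.1 next value in the sequence
            self._counters[old_sequence_key] = next_value
            self._map[old_key] = next_value                            # 3.2 register the conversion
        return self._map[old_key]                                      # step 2 / 3.3: return the new key

seq = SubsetConvertSequence()
print(seq.convert("ACC0005 # IMG0001", "ACC0005"))   # 1
print(seq.convert("ACC0005 # IMG0002", "ACC0005"))   # 2
print(seq.convert("ACC0008 # IMG0001", "ACC0008"))   # 1
print(seq.convert("ACC0017 # IMG0001", "ACC0017"))   # 1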

Monday, November 2, 2009

Day 212 - Replanning

The data migration is just one item in the entire project, and since the main project is always changing, it was time to re-plan the data migration sub-project.
The result of that exercise showed that we're around 5% behind schedule.

The problem is not the number itself; it is not worrying, since it can be recovered in one week.
The problem is the reason behind that number.
We are behind schedule because the project sponsor is unable to close data migration tasks.
We do several rounds of functional mapping, and every time we leave some fields unmapped because doubts exist, which is totally normal, but then inertia sets in!
If we do not push the project sponsor teams by scheduling extra mapping sessions, they will not move and will not resolve the open doubts.
We could live with this model before, since we managed to find time for it, but now our hands are full and we cannot take part in extra mapping sessions.