Wednesday, November 25, 2009

Day 238 - Migration Status

Finally I got a set of entities that migrated correctly.
It was actually quite simple, once the mapping rules were correctly specified according to the data contents. The main problem is that the source team had, as usual, some difficulty admitting that their data is dirty and that its quality is actually a lot lower than they expected. In this particular case, it was the zip codes.
Once they admitted that the zip codes needed specific rules because they were dirty, it became an easy migration task.
In the next few days I expect to migrate other entity sets without errors.

Overall, the migration performance is quite slow, even using 5 parallel executions. We collected some timings and, as already known, the GIS loading procedure accounts for the biggest share of the migration time.
Usually we perform the whole ETL cycle within Data Fusion, which means we deliver the data directly to the target database, i.e. we would only have the Data Fusion transformation time. In this particular scenario, however, we have to create a text file which GIS then uses to load the data.
Here's an example of the performance times we're having now:

Car Insurance Claims (470,563 records):
  • Data Fusion transformation procedure: 51m
  • File creation for GIS loading: 2h 10m
  • GIS file loading: 238h sequential time (time calculated based on 16h of loading)

Personal Accident Insurance Claims (29,303 records):
  • Data Fusion transformation procedure: 1m 51s
  • File creation for GIS loading: 1m 40s
  • GIS file loading: 2h 17m with 3 parallel processes (5h 9m sequential time)

Entities (682,569 records):
  • Data Fusion transformation procedure: 6m 3s
  • File creation for GIS loading: 23m
  • GIS file loading: 5h 55m with 5 parallel processes (27h 17m sequential time)

Tonight GIS will be cleaned up and the AS/400 will be tuned for better file-reading performance. In the next few days I'll get some new, hopefully better, times.

Wednesday, November 18, 2009

Day 232 - Loading Performance Problem

Finally we were able to start loading last weekend's data into GIS for testing purposes.
The entities were loaded with fewer than five thousand rejects out of nearly one million records.
It is not a bad ratio, but I was expecting a lower rejection rate by now.

The loading of a subset of the car insurance claims, around 450000 records, has also started.
Most of the car insurance claims are loading without errors, but a critical problem has arisen.
GIS is loading around 2100 records per hour.
This means that 50000 records will require 24 hours to load, and the full set will require an impossible 10 days.
This is sequential time, but even if we use 6 CPUs at a time it will still require more than a day and a half to accomplish this task. Plus, the AS/400 where GIS is running is unavailable 3 hours every night for maintenance procedures.
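For reference, the arithmetic behind these estimates: 450000 records at roughly 2100 records per hour is about 214 hours of pure loading time, which is nearly 9 days sequentially and roughly 10 once the nightly 3-hour maintenance stop is taken into account. Even an ideal split across 6 CPUs gives 214 / 6 ≈ 36 hours, i.e. more than a day and a half.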

This is more than a technical challenge: it is a critical situation that will require the involvement of management in finding a solution.

Tuesday, November 17, 2009

Day 230 - Test Team Assembled

Finally the test team has been defined and the tests will start, with a two-week delay.

Unfortunately, the data we've tried to load into GIS this weekend was almost entirely rejected.
We don't know what has happened, but the loading logs indicate a change in the system configuration. This means that almost all the data that was previously being loaded without problems was rejected!

There is also a major side effect: the testing team has no data to test.

We hope this will be fixed soon.

Thursday, November 5, 2009

Day 219 - New Convert Sequence Function

We have a functional requirement that needed our technical expertise to be solved; it was a small challenge, but an interesting one.

Data Fusion already has a convert sequence function, which receives a key and returns a sequence value.
It is very useful to convert things like keys and identifiers between the old system and the new system.
It works on a very simple basis: the new system key is requested based on the old system key; if it exists, it is returned, otherwise it is created by adding 1 to the current maximum value present, and then returned.
Here is an example:

Old Key    New Key
ACC0005    1
ACC0008    2
ACC0017    3
[...]      [...]

This is persistent between data migration executions, which means only new keys will be added, and it is quite efficient since it is entirely loaded into memory.

This has suited us for almost a decade. All our previous data migrations never required identifier conversions, or creations, based on anything other than a set of identifiers.
But here a new challenge arose. We still must return the new system key based on the old system key, but the creation of a new system key is no longer based on the old system key.
The creation of the new system key is now based on a subset of the old system key.
Here is how it works:
  1. Request the new system key based on the old system key
  2. If it exists: return the value
  3. If it does not exist:
    1. Select the next value from the old system sequence key
    2. Add an entry to the conversion table registering the old system key, the old system sequence key and the new system key
    3. Return the new system key

Here is a very simple example for better understanding.
Consider the old system key to be based on the source system account number (ACC) and the historical image (IMG), and consider that the sequence is incremented by account number only. A conversion table would be something like:


Old Key              Old Sequence Key    New Key
ACC0005 # IMG0001    ACC0005             1
ACC0005 # IMG0002    ACC0005             2
ACC0008 # IMG0001    ACC0008             1
ACC0017 # IMG0001    ACC0017             1
ACC0017 # IMG0002    ACC0017             2
ACC0017 # IMG0003    ACC0017             3
[...]                [...]               [...]
Note that the old system sequence key varies based on a subset of the old system key.

Obviously the above example is purely academic, but this easy-to-use mechanism is being used to solve a complex transformation problem and it has become extremely useful.
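For illustration only, here is a minimal sketch in Java of how such a keyed convert sequence could be modelled. This is not Data Fusion's actual code: persistence between executions and database access are left out, and the class and method names are made up.

import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a convert sequence whose new keys are generated
// per old system sequence key (a subset of the old system key).
public class KeyedConvertSequence {

    private final Map<String, Integer> conversion = new HashMap<String, Integer>(); // old key -> new key
    private final Map<String, Integer> counters = new HashMap<String, Integer>();   // old sequence key -> last value used

    // oldKey: e.g. "ACC0005 # IMG0002"; oldSequenceKey: e.g. "ACC0005"
    public synchronized int convert(String oldKey, String oldSequenceKey) {
        Integer existing = conversion.get(oldKey);
        if (existing != null) {
            return existing;                          // step 2: already converted, return it
        }
        Integer last = counters.get(oldSequenceKey);
        int next = (last == null) ? 1 : last + 1;     // step 3.1: next value of the old sequence key
        counters.put(oldSequenceKey, next);
        conversion.put(oldKey, next);                 // step 3.2: register the conversion
        return next;                                  // step 3.3: return the new system key
    }
}

Calling convert("ACC0005 # IMG0001", "ACC0005"), convert("ACC0005 # IMG0002", "ACC0005") and then convert("ACC0008 # IMG0001", "ACC0008") returns 1, 2 and 1, exactly as in the table above.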

Monday, November 2, 2009

Day 212 - Replanning

The data migration is just one item in the entire project, and since the main project is always changing, it was time to re-plan the data migration sub-project.
The re-planning showed that we're around 5% behind schedule.

The problem is not the number itself; it is not worrying, since it can be recovered in one week.
The problem is the reason for that number.
We are behind schedule because the project sponsor is unable to close data migration tasks.
We do several rounds of functional mapping; every time, we leave some fields unmapped because doubts exist, which is totally normal, but then inertia sets in!
If we do not push the project sponsor teams by holding extra mapping sessions, they will not move and will not find the answers to those doubts.
We could live with this model previously, since we managed to find time for that, but now we have our hands full and cannot take part in extra mapping sessions.

Friday, October 23, 2009

Day 205 - First Change Requests

As expected, lots of change requests and lots of mapping meetings are taking place this week.
My guess is that by December, the maximum number of change requests covered by the contract will be reached. But this is just my guess...

Wednesday, October 21, 2009

Day 202 - First Entity Migration Test Results

As expected, we did have a low error rate.
Overall, we got less than 1% of data that has been rejected, meaning that 99% of the data has been correctly migrated.

The numbers say it all: fewer than 6000 records rejected, across 6 different types of errors.

We had the data split into 11 files to allow parallel loading, and each file took an average of 45 minutes to load.
The bad news is that the GIS loading procedure took 490 minutes sequentially, and it will hardly be capable of loading more than 4 files at the same time.

As always, the next few days will bring transformation rule change requests.

Day 199 - First Entity Migration Test

Today the entities, clients only, will be migrated for the first time for real.
The data transformation process itself is quite fast: fewer than 500000 records will be transformed in less than 30 minutes. But then we have to create structured files that gather all the transformed information into the GIS loading file format, and that will take around an hour.

I expect a low error count from this first real client migration test, but the results will only be available on Monday.

Tuesday, October 6, 2009

Day 182 - New Sponsor Team Member

Management has quickly replaced the mapping team member that left.
We performed a mapping session, and he knows the source system well from the technical side, but he knows less about the business and cannot make decisions.
Some of the current sponsor mapping team members will attend these mapping sessions; they know the business and they can make decisions.

This seems enough to solve the problem.

Day 178 - Team Member Change

One of the critical sponsor mapping team members is leaving.
She's changing jobs, and that is a critical issue for the project: she is one of the most knowledgeable sponsor people, she knows the business and the source system well, and she is one of the most active and proactive people on the team.
The problem is that she's leaving tomorrow.
Management has to do something about it quickly.

Day 174 - Focus on Mapping

It seems that the entire sponsor team has come back from the holidays and we'll be able to perform the functional mapping faster.
If we don't start gaining some speed, the project will not comply with the plan and the project timings will slip.

Wednesday, September 9, 2009

Day 154 - New Team Member

The team has just grown: we now have a new member who will be responsible for the daily runs and the tests.

Day 153 - Back To Work

The vacations are over for us, but some sponsor people are still on vacation, including management, so it's a slow start.

This return brings some changes: we no longer have a partner on the data migration task, so we are now entirely responsible for the data migration.
This means that we are now doing what we know best: data migrations at all levels.

We had a problem with the DB2 license, which seems to have expired during the vacations, but this will soon be solved, since the system administrators are also back at work.

Monday, August 3, 2009

Day 121 - Interview

Just before the vacations, I was called in to interview the last team member.
There is a time constraint and it's been really hard to find good people on the market.
I was not ready for the interview, since the phone call came during lunch, so I had just a couple of hours to prepare.

The goal is to hire the last team member. The role is actually quite easy. All that is required is to run internal tests through SQL queries, execute the daily runs and gather the log results into a report, so that we know what's wrong and can fix it.

The interview went OK.
She knows her way around SQL, SQL Server mostly, and the team will have to make an extra effort to help her build the daily run script and error report.

So, in September she'll join the team and we'll be ready for another data migration.

Friday, July 24, 2009

Day 115 - Proposal Approved

Finally the proposal has been officially approved by the project sponsor.

The sponsor wanted to deal directly with us, since it became clear that our partner would not add any value to the project. Once this became official, the project sponsor dealt only with us, and that is what happened.

The project will still be stopped during August for vacations and we'll resume it on August 31st, with the newly approved rules that came out of the budget cut.

Wednesday, July 8, 2009

Day 98 - New Proposal

The project is stopped, but there's still management work to do.
The political battle, with the consequent interruption in July, resulted from the fact that the proposal price was considered too high by the sponsor and our partner.

The project was planned according to the Data Fusion methodology and our know-how, which spans over 10 years. Remember that we are data migration specialists; I don't have the numbers, but I believe there are not many people who have migrated as many banks as we have, nor as fast as we do. That said, we know what we're doing, but obviously the sponsor, and unfortunately the partner, do not.

That said, the proposal has been presented according to what is most convenient for the client.
But this time we had to cut costs, and in order to do that we had to cut one resource. This means we've rethought the planning with one less specialist, which forced us to impose some constraints:
  1. The client must deliver the functional mapping rules in up to three blocks. The first block must contain the minimum required to make the mapping executable. The second block must cover at least 75% of the rules and the third block must be complete.
  2. There's a deadline to deliver the last block. It depends on the complexity and development time of the area being mapped, but it usually must not exceed 4 weeks after the first block, which in turn must be delivered by the end of the initial week of the area being mapped.
  3. The functional mapping deliveries must be of high quality. This means that there's a maximum number of changes planned. This threshold has been calculated as 25% of the total number of mapping rules, e.g. a project that has 1000 mapping rules can have up to 250 change requests. If the client requests changes above the planned maximum, that will cost him extra.
We had to include these rules because we need high-quality functional mappings delivered fast and we can't spend much time changing and recoding the rules; that's why there's a threshold on the number of changes. There's one less resource, so there must be less work to do.

The main difference from our regular methodology is convenience for the client.
In the usual model, the client can submit as many change requests as he wants and we usually accept that functional mappings are finished late in the project.
That has obvious advantages for the client, in contrast with a limited number of changes and no late deliveries.

I think this is a great example where the client has not been well treated, since the price battle was actually more a problem for our partner than for the sponsor.
But the main goal has been achieved: the proposal is now cheaper, at an obvious cost.

Wednesday, July 1, 2009

Day 91 - Project Stop for Two Months

For political and economic reasons, management has decided to interrupt the project in July. This, added to the August vacation interruption, means that the project will be stopped for two consecutive months.

This was totally out of the blue! The team had just come back from lunch and was told "no need to come tomorrow"!
It is also totally absurd; even the project sponsor was a bit upset about it.
This was probably one of the worst examples I've seen of political wars interfering with a project, and the client suffered directly from it.

Unless there is some equally out-of-the-blue news reversing this, we'll only be back on September 1.

Day 88 - Mappings at Cruise Speed

Insurance claims mapping is rolling at cruise speed and we're already implementing some of the mappings.
Other areas are already eager to start their functional mapping, but we've decided to close the insurance claims before starting up new areas.

Since the due date has been shifted from October to April next year, the project sponsor decided to interrupt the project in August for a full month of vacation. This makes sense, since October was only possible if the sponsor, and ourselves, worked full time, meaning no vacation for anyone.

Tuesday, June 23, 2009

Day 84 - DB2 JDBC Useful Setting

When using the DB2 Command Editor, there's always a lot more error information available than when using any other generic database tool.
Why is that, we wondered.

We dug a little and found out that there's a connection string parameter one can enable to get the same level of detail from any generic database tool. The parameter is retrieveMessagesFromServerOnGetMessage and must be included directly in the connection string:
jdbc:db2://{server}:{port}/{database}:retrieveMessagesFromServerOnGetMessage=true;
Please note that the connection string does not follow the standard syntax: it has a : before the parameter instead of a ;. That is not a mistake; it really is like that.
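As a quick illustration, here is how the setting could be used from plain JDBC code (the server name, port, database, credentials and query below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Db2FullMessages {
    public static void main(String[] args) {
        // Note the ':' that separates the database name from the property,
        // and the trailing ';' -- that really is the expected syntax.
        String url = "jdbc:db2://dbhost:50000/MYDB:retrieveMessagesFromServerOnGetMessage=true;";
        try {
            Connection con = DriverManager.getConnection(url, "user", "password");
            con.createStatement().execute("select * from LIBWRK.ZIP_CODES");
            con.close();
        } catch (SQLException e) {
            // With the property enabled, getMessage() carries the full server-side
            // message text instead of just a terse SQL code.
            System.err.println(e.getMessage());
        }
    }
}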

./M6

Monday, June 22, 2009

Day 83 - Data Loaded into DB2 on Windows

We have finally received the database data from OS/390 in a format that we were able to load without trouble into DB2 on Windows.

Some of the weird problems we had on the AS/400 will no longer bother us, but now we have to rewrite some scripts, because DB2 on Windows is different from DB2 on AS/400.

One of the problems was an INSERT INTO script that loads the postal codes, with almost 300000 records. I had to load it through the DB2 command line tool. Here's how one can do it:
  1. Open the DB2 Command Line Processor tool. Don't just open a regular shell, since it will not work; some environment settings are required.
  2. In the shell, execute the following commands:
db2 connect to {database} user {username} using {password}
db2 -vf {path/script.sql} -t > output.txt
Don't forget to have tail and grep installed if you wish to track the progress of the script execution and check for success.

The entire process took about 6 hours; this includes extracting the data from OS/390, moving it to Windows, loading it into DB2 on Windows, and preparing the staging area with lookup and convert tables. Way too much time. This process will definitely be optimized.

We've also found one of the most idiotic pieces of functionality I've ever seen in software.
After installing the database client on the local machines, and after configuring the server access, the Command Editor only allows one to open files from the database server computer. Yes, I mean it is impossible to open any local file using the Command Editor. And no, that is not configurable!

Friday, June 19, 2009

Day 79 - Insurance Claims Mapping

Finally, a new mapping session for insurance claims was scheduled.
The meeting went great and we're already producing mappings. This work will continue for about one more week, maybe 3 or 4 more mapping meetings, and we're done with it.

Monday, June 15, 2009

Day 76 - Back from Vacation

I've just come back from a one-week vacation.
Unfortunately there's no real news.

It was still not possible to load data into the new DB2 on Windows. Since some field values contain the " character, the DB2 LOAD messes up the data when loading. A new field qualifier will be tried soon.

A kick-off meeting for a new insurance claims mapping had also been scheduled, but unfortunately it did not happen.

I still hope to have new mappings during this week, though.

Friday, May 29, 2009

Day 59 - Performance Tests on Intel

We've finished the performance tests on the Intel server.
It's an 8 CPU machine with 8 GB of RAM running 64-bit Windows Server 2003. I was hoping to get it running Linux, but that seems not to be possible.

If you recall, our transformation engine is CPU bound and each transformation engine instance runs on a single CPU, therefore a direct comparison is easy to perform.

The server is powerful enough to hold both the transformation engine and the database itself. Therefore, we ran several tests combining the servers.
One test case used the AS/400 as the database server and the Intel machine as the transformation engine. In another test case, both the database and the transformation engine were on the Intel machine.
For convenience, the database on the Intel machine is a real DB2.

Amazingly, or not, the Intel CPU was a lot faster than the AS/400 CPU. A single Intel CPU allows us to migrate faster than any combination we've tried using the AS/400.

Here's a full graphic showing the results on both systems. It's easy to see that the Intel machine scales very well when we use several processes in parallel.

Day 58 - New Mapping

A new mapping meeting has taken place.
The new mapping is easy, since it's actually the same as one that's already implemented.

Unfortunately, new mappings will not be available soon. It looks like the target system specification is a long way from being closed, and I'm getting the impression that the project timings will slip...

Monday, May 25, 2009

Day 50 - New Mappings Postponed

Since the first stage is over, we're now waiting for the meetings to map the new entities.
Unfortunately, the target team still does not have the product fully configured; it looks like the requirements are not fully closed yet.
Therefore, the new mappings have been postponed.


./M6

Tuesday, May 19, 2009

Day 46 - Performance Tests

I know it's Saturday, but a lot of data migration work is done during weekends, especially during the last weeks before the end of a project.

But this Saturday we were focused on performance tests. We had the AS/400 entirely to ourselves for about 3 hours.
We executed performance tests for the transformation rules engine and for the GIS data loader.

As expected, the performance increased both with a single process and with multiple processes, on both systems.
In the transformation engine we executed up to 6 parallel migrations, but the maximum gain came below that, at the 4th or 5th process, depending on whether the transformation is more I/O bound or more CPU bound.

There was no time, though, to perform one interesting test: running the transformation engine at the same time as the GIS data loader.

Nevertheless, the outcome points to what we expected: a dedicated Intel machine running Linux for the exclusive use of Data Fusion. This will allow the GIS data loader to have more resources, since we won't be competing for them.

Here are some nice charts from the tests performed.


Monday, May 11, 2009

Day 41 - DB2 Problems, Part 2

As expected, and as I logged on Day 28, DB2 on AS/400 is giving us a hard time.
DB2 on AS/400 is not the "real" DB2. It is something awkward that usually behaves like DB2 and sometimes does not behave at all.
We've used DB2 on other data migrations and we've only had problems with the AS/400 version.

To give you a glimpse, here are some of the problems we're having:
  • trying to create a view is impossible, since DB2 returns a generic error saying it is not possible to complete the command;
  • trying to select specific fields from a view is impossible, since DB2 returns an error saying it cannot find such fields.
One could think I might be writing extremely complex queries; well, judge for yourself (real column and table names have been changed).
Running the following query works fine:

select substr(ZIP1, 1, 2) ZIPinit, city,
(ZIP1 || '-' || ZIP2) as zip_code
from (
select ZIP1, ZIP2, city,
row_number() over (partition by substr(ZIP1, 1, 2), city
order by ZIP1, ZIP2, city) as ZIP_ORDER
from LIBWRK.ZIP_CODES
) as ZIPs

but creating a view over it is impossible:

create view LIBWRK.ZIP_APROXIMADO as
select substr(ZIP1, 1, 2) ZIPinit, city,
(ZIP1 || '-' || ZIP2) as zip_code
from (
select ZIP1, ZIP2, city,
row_number() over (partition by substr(ZIP1, 1, 2), city
order by ZIP1, ZIP2, city) as ZIP_ORDER
from LIBWRK.ZIP_CODES
) as ZIPs
where ZIP_ORDER = 1

Also, it is not possible to perform all queries over this view:

create view LIBWRK.ZIP_CONVERTION as (
select ZIP.ZIP1 as ZIP1, ZIP.ZIP2 as ZIP2,
ZIP.cityDescricao as cityDescription,
ZIP1 || '-' || ZIP2 as ZIP_CODE,
substr(ZIP1, 1, 2) as ZIPinit,
case
when ZIP.ZIP2 = '000' then
ZIP1 || '-'
|| (select min(x.ZIP2)
from LIBWRK.ZIP_CODES as x
where x.ZIP2 <> '000' and x.ZIP1 = ZIP.ZIP1)
else ZIP1 || '-' || ZIP2
end as zip_code,
ZIP.NormalizedCity as NormalizedCity,
row_number() over (
partition by substr(ZIP1, 1, 2), NormalizedCity
order by ZIP1, ZIP2, NormalizedCity) as ZIP_ORDER
from LIBWRK.ZIP_CODES as ZIP
);

select * from LIBWRK.ZIP_CONVERTION

works fine, but

select ZIP_CODE from LIBWRK.ZIP_CONVERTION

does not work at all.


But DB2 on AS/400 has a lot more problems. We have not hit them here yet, but we have run into them in previous data migrations:
  • "query exploded" when running a query: sometimes the query runs fine, sometimes it explodes, and yes, "query exploded" is the exact term that DB2 writes in its log;
  • problems with summary queries: the optimizer was unable to optimize the query and it took so much time that it became impossible to run such queries.

The "pragmatic" solution for all these problems has been creating tables populated with the queries result. In cases like the summaries, it was necessary to build sereveal summary tables with small subsets of the data and then create a view, if possible, or a table that joinned all the pre-calculated data.

Tuesday, May 5, 2009

Day 34 - Denial Phase

The meeting with the project sponsor went almost as expected.
They loved the idea of having a separate server to execute the transformation rules, but the AS/400 administrators are in denial.
The AS/400 administrators don't want to believe that the AS/400 is not as fast as they all brag. If I had spent as much money as an AS/400 costs, I would be denying those facts too... It's just too painful to hear that, in certain tasks, the AS/400 can be beaten by a much cheaper 8 CPU Intel machine running Linux...

By the way, the AS/400 where we're testing, the development environment, has the following specs:
  • iSeries 570
  • 1 CPU Power 5 at 1.65GHz
  • 3000 CPW
  • 10GB RAM DDR1
  • Internal HD: 1.6TB

The AS/400 administrators now want us to perform tests on the qualification machine:
  • System i Power 6 570
  • 1 CPU at 4.7GHz (maximum of 4 processors)
  • 5000 CPW (maximum of 20000 CPWs)
  • 20GB RAM DDR2 400MHz (maximum of 90GB)
  • External HD: 2.4TB
I bet the tests will turn out the same results, since the transformation engine is CPU bound, but nevertheless we'll run them during a weekend when there's no activity on the machine.

./M6

Thursday, April 30, 2009

Day 30 - Performance Tests

There's a set of performance tests to do before choosing the final migration infrastructure.
There are tests of the initial data extraction from OS/390 and loading into the AS/400.
There are tests of the data transformation rules.
There are tests of the final data loading into GIS.

From this set of tests, only ours, the data transformation rules, has been performed so far.
As expected, the AS/400 is not a fast machine for running the Data Fusion transformation engine. We already knew this, but the AS/400 system administrators always say we're wrong and that they have a fantastic, fast machine. Their pride has been hurt...

As an example, running entirely on the AS/400, 400000 source records migrated at ~800 records per second. Note that this was a local data migration execution: the source and target databases are on the same machine as the transformation engine.
But running the same transformation engine on my, not very fast, desktop computer with 1GB of RAM, the same 400000 records migrated at ~1400 records per second. Note that in this test the source data was retrieved from the AS/400, transformed on my desktop computer, and sent back to the AS/400, i.e. the data had to travel over a common 100 Mbit network.

The protocol loading interface was also tested and it is CPU bound: it gets a single processor of a 4-processor machine and it consumes 100% of that CPU. Running on the AS/400 it performed ~12000 records per second; running on the desktop it does only ~8000 records per second.

The scalability tests on the AS/400, with a single CPU available to us, showed that a single sequential execution is faster than two parallel executions. We haven't had the chance to test parallel execution with more CPUs.

We've been discussing the best approach to allow us to scale, and one thing is for sure: we need to run the transformation engine outside the AS/400, because on D-day we will have to share the machine with other processes, as is happening now, specifically with the GIS data loading.
Probably we'll have a 6 or 8 CPU machine running Linux, connected to the AS/400 over a 1 Gbit network. This will allow us to run multiple transformation engines and free the AS/400 for the other data migration processes.

Day 28 - DB2 Problems

We've already used the AS/400 and DB2 on previous migrations.
Our experience tells us that they're not very good as the main data migration machine.

There's a set of DB2 problems that we were afraid of finding again in this AS/400 version. This is the latest AS/400 version and, in the words of the system administrator, it may currently be the fastest AS/400 in Portugal.

Our pessimistic expectations proved to be right. DB2 continues to have problems, with what seem to be bugs in its engine.
For instance, executing a query works fine, but creating a view over that same query is impossible because DB2 crashes with an internal error!
I had to create a new table with the data from the resulting query.

I still have not hit the "query exploded" problem, which seems related to the DB2 optimizer crashing when trying to optimize a query.

Our confidence in DB2 on AS/400 is a bit low right now; we proved to be right when we said we were expecting problems from it.

Friday, April 24, 2009

Day 24 - New Mapping Rules

There was a new meeting for mapping rule definition. Most of the transformation rules are closed; there are just some tweaks related to foreign addresses and some internal codes.

Next week there will be more full migration cycles with fresh data, since we've discovered that the current test data was not correct.

Meanwhile, performance tests are being performed on the AS/400, and we were able to convince the system administrators to reconfigure the anti-virus not to scan Java .jar files, allowing better performance on our desktops.
And yesterday the desktops were automatically updated and rebooted. This policy has also been removed from our desktops, since it is far too dangerous to perform such operations while a data migration is being controlled from them.

Saturday, April 18, 2009

Day 17 - First Try

The protocol loading interface is ready; it's cool, but I did not code it.
And I have the first mappings ready to roll.

We've done a full cycle, with about 85% of the transformation rules implemented, and we got fewer problems than I expected for a first try.
Foreign zip codes were wrong, which we were expecting, since this mapping specification is still under discussion.
People's names were wrongly specified, since the surname was being lost.
There were some empty honorific titles, since they are also still under specification.
A couple of bytes were in the wrong place in the protocol file.

We've tried it with one hundred records from the source, which resulted in approximately 2600 records in the protocol file.

The whole process was executed on our desktop computers, since the AS/400 is still not ready for us: Java 1.5 is missing. Next week I expect to execute the whole process on the AS/400.

Tuesday, April 14, 2009

Day 14 - First Mapper

I've just set up the Data Fusion IDE and server on my machine. Until we install the server on AS/400, my desktop will have to do the trick. I'm using the JTOpen JDBC driver to connect to AS/400 files via DB2.

In order to create the Data Fusion table layouts automatically, I've set up a simple command line application that connects to the AS/400 through the same JTOpen JDBC driver and queries the files/tables metadata, in particular their structure: the column names, data types, precision, etc.
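A minimal sketch of that kind of metadata query through JTOpen (the host, credentials and the LIBWRK library below are placeholders; the real application turns this information into Data Fusion layouts):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class As400ColumnDump {
    public static void main(String[] args) throws Exception {
        // JTOpen's JDBC driver.
        Class.forName("com.ibm.as400.access.AS400JDBCDriver");
        Connection con = DriverManager.getConnection("jdbc:as400://as400host", "user", "password");
        DatabaseMetaData meta = con.getMetaData();
        // Every column of every file/table in a given library (schema).
        ResultSet rs = meta.getColumns(null, "LIBWRK", "%", "%");
        while (rs.next()) {
            System.out.println(rs.getString("TABLE_NAME") + "." + rs.getString("COLUMN_NAME")
                    + "  " + rs.getString("TYPE_NAME")
                    + "(" + rs.getInt("COLUMN_SIZE") + "," + rs.getInt("DECIMAL_DIGITS") + ")");
        }
        con.close();
    }
}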

I've also implemented some transformation rules to run some tests and everything worked fine.
I'm still waiting for some target specification information from the target team, but I believe that I'll have two mappers ready for some serious testing real soon.

I've also received the production data, which means I can now test my data profiler with real data.

Monday, April 13, 2009

Day 13 - Data Profiler Enhanced

Today the data profiler received a ranking function that ranks the profiled data and cuts off the data that should be discarded.

When data is profiled, a tuple with the term and its context is saved.
The context helps in the evaluation of the term, especially to know whether it's a false positive.
The context is also used to cut off data.
It is possible to keep only the terms whose ranking index, calculated from the term context, is equal to or higher than a given value. Or it is possible to keep only the terms down to the point where one differs from the immediately higher-ranked term by more than a given percentage, i.e., if the current term differs by more than a given percentage from the previous term, the current term and all lower-ranked terms are discarded.
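A tiny sketch of these two cut-off strategies, assuming the ranking indexes are already sorted in descending order (an illustration only, not the profiler's real code):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two cut-off strategies described above.
public class ProfileCutoff {

    // Keep only the terms whose ranking index is at least minIndex.
    static List<Double> cutByAbsoluteIndex(List<Double> rankedDesc, double minIndex) {
        List<Double> kept = new ArrayList<Double>();
        for (double rank : rankedDesc) {
            if (rank >= minIndex) kept.add(rank);
        }
        return kept;
    }

    // Walk the descending list and stop as soon as a term drops more than
    // maxDropPct (e.g. 0.30 for 30%) from the previous one; that term and
    // everything below it are discarded.
    static List<Double> cutByRelativeDrop(List<Double> rankedDesc, double maxDropPct) {
        List<Double> kept = new ArrayList<Double>();
        for (int i = 0; i < rankedDesc.size(); i++) {
            if (i > 0) {
                double previous = rankedDesc.get(i - 1);
                double drop = (previous - rankedDesc.get(i)) / previous;
                if (drop > maxDropPct) break;
            }
            kept.add(rankedDesc.get(i));
        }
        return kept;
    }
}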

With these enhancements, the data profiler is finished.
Unless there's some special need to profile other data, it is done for now.

Thursday, April 9, 2009

Day 9 - Data Profiling

I'm continuing to develop the data cleaning function for the honorific titles.
Currently I'm developing a data profiler, a standalone Java application that queries DB2 via the JTOpen JDBC driver, and it has already found some anomalies in the person contact names. There are the expected typos, some ad-hoc markers and, finally, the usual user creativity.

These user-creativity anomalies are the funniest of them all. In the person contact name there are random characters, like "USO USSSSS", training data, like "TEST", and totally nonsensical person names, like "EXITING NOW", "ALSO EXITING" and "NEW".

Day 8 - Honorific Title Cleaning

I've started to work on a data cleaning function to detect honorific titles in a free text field.
The text field holds the contact name of a person, which might be preceded, or followed, by the honorific title.

I'm working on several distinct approaches. The first one is the classical one, using a dictionary. The second one is automatic, where the algorithm tries to infer whether an honorific title exists or not. And the last one is an extension of the automatic approach, using an exclusion dictionary.

In the first tests I performed I found a lot of false positives, therefore I'll use a first-name dictionary to exclude them.
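A minimal sketch of the dictionary approach combined with a first-name exclusion dictionary (the titles and names below are just example entries, not the project's real dictionaries):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: detect an honorific title at the start or end of a contact name.
public class HonorificDetector {

    private final Set<String> titles = new HashSet<String>(
            Arrays.asList("ENG", "DR", "DRA", "SR", "SRA"));   // example title entries only
    private final Set<String> firstNames = new HashSet<String>(
            Arrays.asList("MARIA", "JOSE", "ANA"));            // exclusion dictionary (examples)

    // Returns the detected title, or null when no leading/trailing token looks like one.
    public String detectTitle(String contactName) {
        String[] tokens = contactName.trim().toUpperCase().split("\\s+");
        String first = tokens[0].replace(".", "");
        String last = tokens[tokens.length - 1].replace(".", "");
        if (titles.contains(first) && !firstNames.contains(first)) return first;
        if (titles.contains(last) && !firstNames.contains(last)) return last;
        return null;
    }
}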

Tuesday, April 7, 2009

Day 7 - First Mappings

The first mappings have been specified.

I've spent almost all day with the target team and the source teams, one from the main system and another from a secondary system.
It was a productive day, since we were able to map over 90% of both systems, which will migrate into the same target specification. The doubts and problems detected will be easily solved.

As expected, some cleaning will be performed by the transformation rules. For instance, in some circumstances the honorific title will have to be automatically inferred from a free-form text field.
It will also be necessary to perform duplicate detection and entity fusion before executing the migration.

None of these seems really problematic at this time.

Monday, April 6, 2009

Day 6 - Setup for First Mapping

The plan for this phase has been rolled out and we've prepared for the first mapping meeting tomorrow.
We've also set up the development environment on the computer that has arrived.
My computer has not arrived yet, and that will become a problem if it does not arrive in the next two days.
We have also requested a new data load into DB2, but this time the structure will be preserved.

Friday, April 3, 2009

Day 3 - Target specification

The target specification arrived today.
It's a hierarchical structure that will result in a flat file where each line represents a specific piece of information, identified by a code.
We've started to work on it, but since the specification is more like a meta-model, we had to make some adjustments to comply with such a dynamic structure.
All problems have been solved, they were actually simple decisions that the source team helped to make.
The DB2 model has been created, and our scripts turned it into mapping sheets that will be used in the functional mapping performed by the source and target teams.

A piece of the source data has become available. It has been extracted from the OS/390 in EBCDIC and in text formats in order to allow some AS/400 loading tests.
The data looks correct, but each row has been loaded as a single record: the copybook rules have not been applied, and therefore it will have to be loaded again. Nevertheless, the fact that the information looks correct is good enough as the result of a first try.

The first workstation also arrived today, allowing us to install and configure some of the software required for the data migration, and to access the development AS/400 machine.

It does not depend only on us, but if next week things work out the way we've planned, we'll be able to recover the one-day delay.

Day 2 - First delay

The second day of the data migration project was just like the second day of any other project: the basic stuff requested from the sponsor and the target team, and guaranteed by the project sponsor, was not ready.
Management tried to get things done but was unable to comply with the plan, therefore the project is delayed by one day.

Unfortunately this is so common that it is becoming the de facto standard project start-up: no matter what kind of technology project it is, it usually starts delayed.

Wednesday, April 1, 2009

Day 1 - Kick off

The project has been officially kicked off.

There was the traditional meeting where all teams were present and an overview of the project was given. All teams involved have previous data migration experience, which is good.

The project should be completed around mid-October, but I felt that management's confidence in meeting that date is not very high. This is not as bad a sign as one might think, since management showed that it is aware of some real difficulties. Those difficulties may become technical challenges and may delay the project.
Nevertheless, I do believe that the original due date can be achieved.

There will be two phases. The first phase, the current one, will use a small entity subset in order to test the performance of the ETL and to tune the process.
The second phase will be a continuation of the first, but including all applications and entities.


The project itself consists of migrating the current insurance application, running on an OS/390 mainframe, into the new system, GIS Non-Life, running on AS/400, all using DB2, SAM and VSAM.
There are some other satellite applications running on SQL Server 2005 on Windows.
The ETL tool will be Data Fusion and it will run on AS/400 under the Qshell.

The ETL process will be implemented in a slightly different way from the usual scenario. Usually the data is extracted from the source database and loaded into the target database. This is not the case in this project.
Eventually the data will be loaded into the final target database, but there's an intermediate step: the ETL process performed by Data Fusion will deliver the data in flat files that will later be consumed by the GIS loader. The GIS loader will validate and load the data into the final database.

Since our (ETL team) experience tells us that performance on AS/400 may be a problem during the ETL process, there was some discussion over the performance tests to be performed during phase one.

If performance becomes an issue, the data migration may have to be incremental instead of one-shot. This will be a problem, because the source team cannot identify which data has changed and needs to be migrated again to refresh the target system. One challenge that has been given to us is to think of a fast way to identify whether a record has changed and needs to be migrated again, in case a plan B is required.
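One possible plan B building block, purely as an idea at this point and not something decided in the project, would be to fingerprint each source record so that unchanged records can be detected cheaply between runs. A sketch:

import java.security.MessageDigest;

// Illustration only: compute a per-record fingerprint; a record whose fingerprint
// matches the one stored from the previous run has not changed and can be skipped.
public class RecordFingerprint {
    public static String fingerprint(String... fieldValues) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (String value : fieldValues) {
            md.update((value == null ? "" : value).getBytes("UTF-8"));
            md.update((byte) 0); // field separator, so ("AB","C") differs from ("A","BC")
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}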

Data Migration Diary

I'm involved in a new data migration project, and since I'm really interested in this area, I'm starting this blog as a diary.

I won't write entries every day, since there will be times when only "boring" stuff happens.
But, I'll write as much as possible about the project management, the technical difficulties, the solutions, the workarounds, the mapping process, the business rules, and, of course, the ETL process.

I have experience in data migration projects in several areas, such as banking and public administration. This will be my first data migration project in the insurance area, which will be quite a challenge because the business rules in this area are very complex.

For confidentiality reasons, I'll keep the project sponsor and all parties involved anonymous.
The only exception will be the references to the technical stuff involved.