Friday, January 22, 2010
The migration scope has increased.
New source data has come into play, and new mapping specifications have been written for it.
This has resulted in a serious data loading performance problem.
Our data transformation procedure is still fast, but the creation of the GIS loader files from the transformed data is starting to give us some headaches.
The GIS data loader uses a very verbose flat file structure: each single value of each record is loaded through a 300-character text line.
This means that a single database table row is transformed into roughly as many text lines as the row has columns, plus the file and record header and footer structures.
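To make the verbosity concrete, here is a minimal sketch of how a single row might explode into loader lines. The column layout and field widths below are invented for illustration; the real GIS record format is not reproduced here.

# Illustration only: a hypothetical fixed-width layout showing the
# "one 300-character line per column value" explosion described above.
def row_to_loader_lines(table, row_id, row):
    """Turn one database row into one 300-character line per column value."""
    lines = []
    for column, value in row.items():
        # hypothetical layout: table (20), row id (20), column (30), value (230)
        line = f"{table:<20}{row_id:<20}{column:<30}{str(value):<230}"
        lines.append(line[:300])
    return lines

# a 3-column row becomes 3 lines (plus, in the real format, file and record headers/footers)
sample = {"POLICY_NO": "H-000123", "START_DATE": "2009-01-01", "PREMIUM": "450.00"}
for line in row_to_loader_lines("HOUSE_POLICY", "42", sample):
    print(len(line), line[:60].rstrip(), "...")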
As an example, the house insurance data transformation runs in around 4 hours of sequential time and generates about 184 million records, all on the transformation server.
These records are then exported in the GIS data loader file format, written from the Windows server directly onto the AS/400. This step is now taking over 6 hours of sequential time in the best-case scenario.
This is obviously too much time, so we are exploring several hypotheses, some of which can, and will, be combined (a sketch of the first two follows below):
- creating a parallel file write process;
- writing the files locally, with and without compression, and transferring them via FTP to the AS/400;
- maintaining clustered indexes with full coverage on a different disk;
- splitting the database schemas across several disks.
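Here is a minimal sketch combining the first two ideas: writing the loader files locally in parallel, compressed, and transferring them to the AS/400 via FTP. The host, credentials, chunk size and file naming are placeholders, not decisions we have made.

import gzip
from ftplib import FTP
from multiprocessing import Pool

def write_chunk(args):
    """Write one chunk of loader lines to a local gzip-compressed file."""
    chunk_id, lines = args
    path = f"loader_{chunk_id:03d}.txt.gz"
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    return path

def write_parallel(lines, processes=4, chunk_size=500_000):
    """Split the loader lines into chunks and write them from parallel processes."""
    chunks = [(i, lines[i * chunk_size:(i + 1) * chunk_size])
              for i in range((len(lines) + chunk_size - 1) // chunk_size)]
    with Pool(processes) as pool:
        return pool.map(write_chunk, chunks)

def ftp_to_as400(paths, host="as400.example", user="loader", password="secret"):
    """Transfer the compressed files to the AS/400 via FTP (placeholder credentials)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        for path in paths:
            with open(path, "rb") as f:
                ftp.storbinary(f"STOR {path}", f)

if __name__ == "__main__":
    files = write_parallel([f"line {i}".ljust(300) for i in range(1_000_000)])
    ftp_to_as400(files)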
We have to tune the process from our side, since there is not much GIS or the AS/400 can do when it comes to massive data load tuning.
We are facing a lot of hard work over the next few days.
Wednesday, November 25, 2009
Day 238 - Migration Status
Finally I got a set of entities that migrated correctly.
It was actually quite simple once the mapping rules were correctly specified according to the data contents. The main problem was that the source team had, as usual, some difficulty admitting that their data is dirty and its quality is actually a lot lower than they expected. In this particular case, it was the zip codes.
Once they admitted that the zip codes needed specific rules because they were dirty, it became an easy migration task.
In the next few days I expect to migrate other entity sets without errors.
The overall migration performance is quite slow, even using 5 parallel executions. We got some timings and, as already known, the GIS loading procedure is responsible for the biggest share of the migration time.
Usually we perform the whole ETL cycle within Data Fusion, which means we deliver the data directly to the target database, i.e. normally we would only have the Data Fusion transformation time; but in this particular scenario we also have to create a text file which GIS will then use to load the data.
Here's an example of the performance times we're having now:
Car Insurance Claims (470,563 records):
- Data Fusion transformation procedure: 51m
- File creation for GIS loading: 2h 10m
- GIS file loading: 238h sequential time (extrapolated from 16h of actual loading)
Personal Accident Insurance Claims (29,303 records):
- Data Fusion transformation procedure: 1m 51s
- File creation for GIS loading: 1m 40s
- GIS file loading: 2h 17m with 3 parallel processes (5h 9m sequential time)
Entities (682,569 records):
- Data Fusion transformation procedure: 6m 3s
- File creation for GIS loading: 23m
- GIS file loading: 5h 55m with 5 parallel processes (27h 17m sequential time)
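To put these numbers in perspective, here is a quick back-of-the-envelope check for the Car Insurance Claims entity, using only the figures above (the 238h is itself an extrapolation from 16h of actual loading):

records = 470_563                  # Car Insurance Claims
transform_s = 51 * 60              # Data Fusion transformation: 51m
load_s = 238 * 3600                # GIS file loading: 238h sequential

print(f"transformation: {records / transform_s:.1f} records/s")   # ~153.8 records/s
print(f"GIS loading:    {records / load_s:.2f} records/s")        # ~0.55 records/s
print(f"GIS loading is ~{load_s / transform_s:.0f}x slower")      # ~280x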
Tonight GIS will be cleaned up and the AS/400 will be tuned for better file-reading performance. In the next few days I'll get some new timings, hopefully better ones.
Thursday, November 5, 2009
Day 219 - New Convert Sequence Function
We have a functional requirement that needed our technical expertise to be solved; it was a small challenge, but an interesting one.
Data Fusion already has a convert sequence function, which receives a key and returns a sequence number.
It is very useful to convert things like keys and identifiers between the old system and the new system.
It works on a very simple basis: the new system key is requested based on the old system key; if it exists, it is returned, otherwise it is created by adding 1 to the current maximum value present and then returned.
Here is an example:
Old Key   | New Key
ACC0005   | 1
ACC0008   | 2
ACC0017   | 3
[...]     | [...]
This is persistent between data migration executions, which means only new keys will be added, and it is quite efficient since it is entirely loaded into memory.
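As an illustration only (this is not Data Fusion's actual implementation), the basic mechanism can be sketched like this:

class ConvertSequence:
    """Sketch of a convert sequence function: old system key -> new system key."""
    def __init__(self):
        self.table = {}     # old system key -> new system key (persisted between runs in the real thing)
        self.maximum = 0    # current maximum value already handed out

    def convert(self, old_key):
        if old_key in self.table:            # already converted: just return it
            return self.table[old_key]
        self.maximum += 1                    # otherwise add 1 to the current maximum...
        self.table[old_key] = self.maximum   # ...register the new key...
        return self.maximum                  # ...and return it

seq = ConvertSequence()
print(seq.convert("ACC0005"))   # 1
print(seq.convert("ACC0008"))   # 2
print(seq.convert("ACC0017"))   # 3
print(seq.convert("ACC0005"))   # 1 again, the key already exists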
This has suited us for almost a decade: none of our previous data migrations ever required identifier conversion, or creation, based on anything other than a set of identifiers.
But here a new challenge arose. We still must return the new system key based on the old system key, but the creation of a new system key is no longer based on the old system key.
The creation of the new system key is now based on a subset of the old system key.
Here is how it works:
- Request the new system key based on the old system key
- If it exists: return the value
- If it does not exist:
  - Select the next value from the old system sequence key
  - Add an entry to the conversion table registering the old system key, the old system sequence key and the new system key
  - Return the new system key
Here is a very simple example for better understanding.
Consider the old system key to be based on the source system account number (ACC) and historical image (IMG). Consider the sequence being incremented only by the account number. A conversion table would be something like:
Old Key            | Old Sequence Key | New Key
ACC0005 # IMG0001  | ACC0005          | 1
ACC0005 # IMG0002  | ACC0005          | 2
ACC0008 # IMG0001  | ACC0008          | 1
ACC0017 # IMG0001  | ACC0017          | 1
ACC0017 # IMG0002  | ACC0017          | 2
ACC0017 # IMG0003  | ACC0017          | 3
[...]              | [...]            | [...]
Note that the old system sequence key varies based on a subset of the old system key. Obviously, the example above is purely academic, but this easy-to-use mechanism is being used to solve a complex transformation problem and has become extremely useful.
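Purely as an illustration again (not Data Fusion's real code), the new variant can be sketched by keeping one counter per old system sequence key:

class SubsetConvertSequence:
    """Sketch of the new variant: the sequence is incremented per subset of the old key."""
    def __init__(self, sequence_key_of):
        self.sequence_key_of = sequence_key_of  # maps an old system key to its old sequence key
        self.table = {}                         # old system key -> new system key
        self.counters = {}                      # old sequence key -> current maximum value

    def convert(self, old_key):
        if old_key in self.table:                        # already converted: return it
            return self.table[old_key]
        seq_key = self.sequence_key_of(old_key)          # e.g. "ACC0005 # IMG0002" -> "ACC0005"
        self.counters[seq_key] = self.counters.get(seq_key, 0) + 1
        self.table[old_key] = self.counters[seq_key]     # register old key, sequence key and new key
        return self.table[old_key]

# the account number (before the "#") acts as the old system sequence key
seq = SubsetConvertSequence(lambda key: key.split("#")[0].strip())
for key in ["ACC0005 # IMG0001", "ACC0005 # IMG0002", "ACC0008 # IMG0001",
            "ACC0017 # IMG0001", "ACC0017 # IMG0002", "ACC0017 # IMG0003"]:
    print(key, "->", seq.convert(key))   # 1, 2, 1, 1, 2, 3 as in the table above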
Wednesday, April 1, 2009
Day 1 - Kick off
The project has been officially kicked off.
There was the traditional kick-off meeting where all teams were present and an overview of the project was presented. All teams involved have previous data migration experience, which is good.
The project should be completed around mid October, but I felt that management's confidence in meeting that date is not very high. This is not as bad a signal as one might think, since management showed that it is aware of some real difficulties. Those difficulties may become technical challenges and may delay the project.
Nevertheless, I do believe that the original due date can be achieved.
There will be two phases. The first phase, the current one, will use a small entity subset in order to test the performance of the ETL and to tune the process.
The second phase will be a continuation of the first, but now including all applications and entities.
The project itself consists of migrating the current insurance application, running on an OS/390 mainframe, into the new system, GIS Non-Life, running on AS/400, all using DB2, SAM and VSAM.
There are some other satellite applications running on SQL Server 2005 on Windows.
The ETL tool will be Data Fusion and it will run on AS/400 under the Qshell.
The ETL process will be implemented in a slightly different way from the usual scenario. Usually the data is extracted from the source database and loaded into the target database. This is not the case in this project.
Eventually, the data will be loaded into the final target database, but there's an intermediate step. The ETL process performed by Data Fusion will deliver the data on flat files that will be later consumed by the GIS loader. The GIS loader will validate and load the data into the final database.
Since our (ETL team) experience tells us that performance on AS/400 may be a problem during the ETL process, there was some discussion over the performance tests to be performed during phase one.
If performance becomes an issue, the data migration may have to be incremental instead of one-shot. This will be a problem because the source team cannot identify which data has changed and needs to be migrated again to refresh the target system. One challenge we have been given is to think of a fast way to identify whether a record has changed and needs to be migrated again, in case a plan B is required.
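One possible way to attack that challenge, sketched here purely as an assumption on our side and not a decision from the project, is to keep a hash of every source record per business key and compare it against the hashes persisted from the previous run:

import hashlib

def record_hash(record):
    """Hash a record's values in a stable column order."""
    payload = "|".join(f"{column}={record[column]}" for column in sorted(record))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_keys(previous_hashes, current_records):
    """Business keys whose records are new or changed since the previous run."""
    return {
        key for key, record in current_records.items()
        if previous_hashes.get(key) != record_hash(record)
    }

# previous_hashes would be the key -> hash map persisted after the last run
previous_hashes = {"POL-1": record_hash({"PREMIUM": 100})}
current_records = {"POL-1": {"PREMIUM": 120}, "POL-2": {"PREMIUM": 80}}
print(changed_keys(previous_hashes, current_records))   # {'POL-1', 'POL-2'}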
Data Migration Diary
I'm involved in a new data migration project, and since I'm really interested in this area, I'm starting this blog as a diary.
I won't write entries every day, since there will be times when only "boring" stuff happens.
But, I'll write as much as possible about the project management, the technical difficulties, the solutions, the workarounds, the mapping process, the business rules, and, of course, the ETL process.
I have experience in data migration projects in several areas, such as banking and public administration. This will be my first data migration project in the insurance area, which will be quite a challenge because the business rules in this area are very complex.
For confidentiality reasons, I'll keep the project sponsor and all parties involved incognito.
The only exception will be references to the technology involved.