
Friday, January 22, 2010

Day 297 - Data Loading Files Performance Problem

The migration scope has increased.
New source data has come into play and thus new mapping specifications have been made.

This has resulted in a serious data loading performance problem.
Our data transformation procedure is still fast, but the creation of the GIS loader files from the transformed data is starting to give us headaches.

The GIS data loader has a very verbose flat file structure: each single value of each record is loaded through a 300-character text file line.
This means a single database table row is transformed into roughly as many text lines as the row has columns, plus the file and record header and footer structure.
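As a rough sketch of why the files get so big: one row fans out into one fixed-width line per column value, plus header and footer lines. The real GIS loader format is not public, so the field layout and names below are invented for illustration only.

```python
# Hypothetical sketch of the fan-out: one fixed-width 300-character loader
# line per column value. The actual GIS line layout and tags are invented.
LINE_WIDTH = 300

def row_to_loader_lines(table, row_id, row):
    """Expand one database row (dict of column -> value) into loader lines."""
    lines = [f"HDR|{table}|{row_id}".ljust(LINE_WIDTH)]          # record header
    for column, value in row.items():
        lines.append(f"VAL|{column}|{value}".ljust(LINE_WIDTH))  # one line per value
    lines.append(f"FTR|{table}|{row_id}".ljust(LINE_WIDTH))      # record footer
    return lines

lines = row_to_loader_lines("CLAIM", 1, {"POLICY": "P-42", "AMOUNT": "1500"})
print(len(lines))  # 2 values -> 2 value lines + header + footer = 4
```

With this kind of expansion, a 20-column table multiplies its row count by more than 20 in line count, which is where the export time goes.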

As an example, the house insurance data transformation runs in around 4 hours of sequential time and generates about 184 million records. This is all performed on the transformation server.
These records are then exported in the GIS data loader file format from the Windows server directly to the AS/400. This step is now taking over 6 hours of sequential time in the best-case scenario.
This is obviously too much time, so we are exploring several hypotheses:
  • creating a parallel file write process;
  • writing the files locally, with and without compression, and transferring them via FTP to the AS/400;
  • maintaining clustered indexes with full coverage on a different disk;
  • splitting the database schemas across several disks.
Some of these techniques can, and will, be combined.
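The parallel file write hypothesis could look something like the sketch below. Since writing fixed-width lines is I/O bound, threads are enough; the chunking scheme, file names, and worker count here are illustrative, not our actual setup.

```python
# Illustrative sketch of parallel loader-file creation: each worker writes
# its own chunk of records to a separate file. The resulting files could
# then be gzip-compressed and sent via FTP to the AS/400 (not shown).
from concurrent.futures import ThreadPoolExecutor

LINE_WIDTH = 300

def write_chunk(chunk_id, records):
    """Write one chunk of loader records to its own fixed-width text file."""
    path = f"loader_part_{chunk_id}.txt"
    with open(path, "w") as f:
        for rec in records:
            f.write(rec.ljust(LINE_WIDTH) + "\n")  # pad to the loader line width
    return path

def parallel_export(records, workers=4):
    chunks = [records[i::workers] for i in range(workers)]  # round-robin split
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(write_chunk, range(workers), chunks))
```

Whether this actually helps depends on the disk layout, which is exactly why the separate-disk hypotheses above will be tested in combination with it.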

We have to tune the process from our side, since there is not much GIS or the AS/400 can do when it comes to massive data load tuning.
We are facing a lot of hard work over the next few days.

Day 293 - Loading Status Queries

It seems the loading procedure feedback queries cannot be optimized any further by us.
We don't even have a full understanding of the data model, since there is no documentation and we had to learn by direct observation and trial and error.

We have asked the GIS team to help us validate the queries and tune the process.

Wednesday, January 6, 2010

Day 266 - Christmas Break

The project will be stopped for a short Christmas break.
The overall amount of data being rejected by the GIS loader is decreasing every day.
But we are now facing a performance problem again. It takes too much time to query GIS to find out how many errors, and what type of errors, occurred during the loading procedure. I'll have to take a look at the performance of these queries so that the loading report does not take half a day for a simple auto claim data load.

Day 240 - Cleanup Procedure Status

The AS/400 cleanup and tuning procedures have resulted in better performance.
The increase was not substantial, but since the domain includes millions of records, any minor improvement results in an overall gain.

Wednesday, November 25, 2009

Day 238 - Migration Status

Finally I got a set of entities that migrated correctly.
It was actually quite simple once the mapping rules were correctly specified according to the data contents. The main problem is that the source team had, as usual, some difficulty admitting that their data is dirty and its quality is actually a lot lower than they expected. In this particular case, it was the zip codes.
Once they admitted that the zip codes needed specific rules because they were dirty, it became an easy migration task.
In the next few days I expect to migrate other entity sets without errors.

The overall migration performance is quite slow, even using 5 parallel executions. We collected some timings and, as already known, the GIS loading procedure accounts for the biggest share of the migration time.
Usually we perform the ETL cycle within Data Fusion, which means we deliver the data directly to the target database, i.e. normally we would only have the Data Fusion transformation procedure time; but in this particular scenario we have to create a text file which GIS will use to load the data.
Here's an example of the performance times we're having now:

Car Insurance Claims (470,563 records):
  • Data Fusion transformation procedure: 51m
  • File creation for GIS loading: 2h 10m
  • GIS file loading: 238h sequential time (time calculated based on 16h of loading)

Personal Accident Insurance Claims (29,303 records):
  • Data Fusion transformation procedure: 1m 51s
  • File creation for GIS loading: 1m 40s
  • GIS file loading: 2h 17m with 3 parallel processes (5h 9m sequential time)

Entities (682,569 records):
  • Data Fusion transformation procedure: 6m 3s
  • File creation for GIS loading: 23m
  • GIS file loading: 5h 55m with 5 parallel processes (27h 17m sequential time)
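The 238h figure for the car insurance claims is a simple proportional extrapolation: measure the load rate over a sample window and scale it to the full record count. A back-of-envelope sketch (the ~31,600-record sample size is derived from the figures above, not a measured number):

```python
# Sketch of the extrapolation behind the 238h car-claims estimate:
# observe how many records load in a sample window, then scale.
def estimate_sequential_hours(total_records, sample_records, sample_hours):
    rate = sample_records / sample_hours   # records per hour
    return total_records / rate

# If roughly 31,600 of the 470,563 records loaded in the 16h window,
# the full set extrapolates to about 238 hours of sequential time.
est = estimate_sequential_hours(470_563, 31_600, 16)
```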

Tonight GIS will be cleaned up and the AS/400 will be tuned for better performance for file reading. In the next few days I'll get some new times, hopefully better.

Wednesday, November 18, 2009

Day 232 - Loading Performance Problem

Finally we were able to start loading last weekend's data into GIS for testing purposes.
The entities were loaded with fewer than five thousand rejections out of nearly one million records.
It is not a bad ratio, but I was expecting a lower rejection rate by now.

The loading of a subset of the car insurance claims, around 450,000 records, has also started.
Most of the car insurance claims are loading without errors, but a critical problem has arisen.
GIS is loading around 2,100 records per hour.
This means that 50,000 records will require 24 hours to load, and the full set will require an impossible 10 days to complete.
This is sequential time, but even if we use 6 CPUs at a time it will still require more than a day and a half to accomplish this task. On top of that, the AS/400 where GIS is running is unavailable 3 hours every night for maintenance procedures.

This is more than a technical challenge, it is a critical situation that will require the involvement of management in the process of finding a solution.

Friday, May 29, 2009

Day 59 - Performance Tests on Intel

We've finished the performance tests on the Intel server.
It's an 8-CPU machine with 8 GB of RAM running Windows Server 2003 64-bit. I was hoping to get it running Linux, but that seems not to be possible.

If you recall, our transformation engine is CPU bound and each transformation engine runs on a single CPU, therefore a direct comparison is easy to perform.

The server is powerful enough to hold both the transformation engine and the database itself. Therefore, we made several tests combining the servers.
One test case used the AS/400 as the database server and the Intel machine as the transformation engine. In another test case, both the database and the transformation engine ran on the Intel machine.
For simplicity, the database on the Intel machine is a real DB2.

Amazingly, or not, the Intel CPU was a lot faster than the AS/400 CPU. A single Intel CPU allows us to migrate faster than any combination we've tried on the AS/400.

Here's a full graphic showing the results on both systems. It's easy to see that the Intel machine scales very well when we use several processes in parallel.

Tuesday, May 19, 2009

Day 46 - Performance Tests

I know it's Saturday, but a lot of data migration work is done during the weekends, especially during the last weeks before the end of a project.

But this Saturday we were focused on performance tests. We had the AS/400 entirely to ourselves for about 3 hours.
We executed performance tests for the transformation rules engine and for the GIS data loader.

As expected, performance increased both with a single process and with multiple processes, on both systems.
In the transformation engine we executed up to 6 parallel migrations, but the maximum gain came below that, at the 4th or 5th process, depending on whether the transformation was more I/O bound or more CPU bound.

There was no time to perform one interesting test, though: running the transformation engine at the same time as the GIS data loader.

Nevertheless, the outcome was what we expected: we should have a dedicated Intel machine running Linux for the exclusive use of Data Fusion. This will give the GIS data loader more resources, since we will no longer be competing for them.

Here are some nice charts from the tests performed.