
Wednesday, November 25, 2009

Day 238 - Migration Status

I finally got a set of entities that migrated correctly.
It was actually quite simple once the mapping rules were specified according to the actual data contents. The main problem is that the source team, as usual, had some difficulty admitting that their data is dirty and that its quality is actually much lower than they expected. In this particular case, it was the zip codes.
Once they admitted that the zip codes needed specific rules because they were dirty, it became an easy migration task.
In the next few days I expect to migrate other entity sets without errors.

The overall migration performance is quite slow, even using 5 parallel executions. We collected some timings and, as already known, the GIS loading procedure accounts for the biggest share of the migration time.
Usually we perform the whole ETL cycle within Data Fusion, which means we deliver the data directly to the target database, i.e. normally we would only have the Data Fusion transformation procedure time; but in this particular scenario we also have to create a text file which GIS then uses to load the data.
Here's an example of the performance times we're having now:

Car Insurance Claims (470.563 records):
  • Data Fusion transformation procedure: 51m
  • File creation for GIS loading: 2h 10m
  • GIS file loading: 238h estimated sequential time (extrapolated from the first 16h of loading)

Personal Accident Insurance Claims (29.303 records):
  • Data Fusion transformation procedure: 1m 51s
  • File creation for GIS loading: 1m 40s
  • GIS file loading: 2h 17m with 3 parallel processes (5h 9m sequential time)

Entities (682.569 records):
  • Data Fusion transformation procedure: 6m 3s
  • File creation for GIS loading: 23m
  • GIS file loading: 5h 55m with 5 parallel processes (27h 17m sequential time)
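Just to give an idea of what that intermediate file-creation step looks like, here's a minimal sketch in Java of dumping a staging query into a delimited text file. The connection URL, staging table, columns and pipe delimiter are all made up for the example; the real GIS load layout is obviously different.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GisFileExport {
    public static void main(String[] args) throws Exception {
        // Hypothetical staging table and columns; not the actual GIS load format.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:as400://as400host", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT CLAIM_ID, POLICY_ID, CLAIM_DATE, AMOUNT FROM STG.CAR_CLAIMS");
             BufferedWriter out = new BufferedWriter(new FileWriter("car_claims.txt"))) {

            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    if (i > 1) line.append('|');               // delimiter is an assumption
                    String value = rs.getString(i);
                    line.append(value == null ? "" : value.trim());
                }
                out.write(line.toString());
                out.newLine();
            }
        }
    }
}
```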

Tonight GIS will be cleaned up and the AS/400 will be tuned for better file-reading performance. In the next few days I'll get some new timings, hopefully better ones.

Thursday, April 9, 2009

Day 9 - Data Profiling

I'm continuing to develop the data cleaning function for the honorific titles.
Currently I'm also developing a data profiler, a stand-alone Java application that queries DB2 via JDBC (JTOpen), which has already found some anomalies in the person contact name. There are the expected typos, some ad-hoc markers and, finally, the usual user creativity.

These user-creativity anomalies are the funniest of them all. In the person contact name there are random characters, like "USO USSSSS", training data, like "TEST", and completely nonsensical person names, like "EXITING NOW", "ALSO EXITING" and "NEW".
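For the record, here's a stripped-down sketch of the kind of check the profiler performs. The jdbc:as400 URL prefix is the JTOpen one, but the host, schema and column names are placeholders, and the actual profiler does quite a bit more than this.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ContactNameProfiler {
    // Markers that keep showing up in the data: training rows and "creative" entries.
    private static final Set<String> KNOWN_JUNK = new HashSet<>(Arrays.asList(
            "TEST", "EXITING NOW", "ALSO EXITING", "NEW"));

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:as400://as400host", "user", "password");  // JTOpen JDBC URL
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT CONTACT_NAME FROM SRC.PERSON")) {
            while (rs.next()) {
                String name = rs.getString(1);
                if (name == null) continue;
                String n = name.trim().toUpperCase();
                boolean suspicious = KNOWN_JUNK.contains(n)
                        || n.matches(".*(.)\\1{3,}.*")      // runs of repeated chars, e.g. "USSSSS"
                        || !n.matches("[A-ZÀ-Ü' .-]+");     // unexpected characters
                if (suspicious) {
                    System.out.println("Suspicious contact name: " + name);
                }
            }
        }
    }
}
```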

Day 8 - Honorific Title Cleaning

I've started to work on a data cleaning function to detect honorific titles in a free text field.
The text field holds the contact name of a person, which may be preceded or followed by the honorific title.

I'm working on several distinct approaches. The first is the classical one, using a dictionary of known titles. The second is automatic: the algorithm tries to infer whether an honorific title is present. The last is an extension of the automatic approach, using an exclusion dictionary.

In the first tests I've performed I found a lot of false positives, therefore I'll use a first-name dictionary to exclude them.
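To make the dictionary approach a bit more concrete, here's a rough sketch of the title check with a first-name exclusion. Both dictionaries are tiny placeholders here; the real ones will obviously be much larger.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HonorificDetector {
    // Placeholder dictionaries: known titles and first names used to cut false positives.
    private static final Set<String> TITLES = new HashSet<>(Arrays.asList(
            "MR", "MRS", "MS", "DR", "ENG", "PROF"));
    private static final Set<String> FIRST_NAMES = new HashSet<>(Arrays.asList(
            "MARIA", "JOSE", "ANA", "JOAO"));

    /** Returns the detected title, or null if none was found. */
    public static String detectTitle(String contactName) {
        if (contactName == null || contactName.trim().isEmpty()) return null;
        String[] tokens = contactName.trim().toUpperCase().split("[\\s.]+");
        String first = tokens[0];
        String last = tokens[tokens.length - 1];
        // The title may precede or follow the name; skip tokens that are known first names.
        if (TITLES.contains(first) && !FIRST_NAMES.contains(first)) return first;
        if (TITLES.contains(last) && !FIRST_NAMES.contains(last)) return last;
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detectTitle("Eng. Joao Silva"));  // ENG
        System.out.println(detectTitle("Maria Santos"));     // null
    }
}
```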

Tuesday, April 7, 2009

Day 7 - First Mappings

The first mappings have been specified.

I've spent almost all day with the target team and the source teams, one from the main system and another from a secondary system.
It was a productive day, since we were able to map over 90% of both systems, which will migrate into the same target specification. The doubts and problems we detected will be easy to solve.

As expected, some cleaning will be performed by the transformation rules. For instance, in some circumstances the honorific title will have to be automatically inferred from a free-form text field.
It will also be necessary to perform duplicate detection and entity fusion before executing the migration.

None of these seems really problematic at this time.
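As a side note on the duplicate detection part, here's a very small sketch of a blocking pass over the entities, grouping candidate duplicates by a normalized key before any fusion happens. The record layout and the key choice (normalized name plus zip code) are invented for the illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateBlocking {

    /** Normalized blocking key: strip diacritics, collapse whitespace, upper-case. */
    static String key(String name, String zip) {
        String n = Normalizer.normalize(name, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")         // drop combining accent marks
                .toUpperCase()
                .replaceAll("[^A-Z0-9 ]", " ")
                .replaceAll("\\s+", " ")
                .trim();
        return n + "|" + zip.trim();
    }

    public static void main(String[] args) {
        // Invented sample records: {id, name, zip}
        String[][] entities = {
                {"1", "João Silva", "1000-001"},
                {"2", "JOAO  SILVA", "1000-001"},
                {"3", "Ana Costa", "4000-123"},
        };
        Map<String, List<String>> blocks = new HashMap<>();
        for (String[] e : entities) {
            blocks.computeIfAbsent(key(e[1], e[2]), k -> new ArrayList<>()).add(e[0]);
        }
        // Groups with more than one id are duplicate candidates for fusion.
        blocks.forEach((k, ids) -> {
            if (ids.size() > 1) System.out.println("Candidate duplicates: " + ids + " (" + k + ")");
        });
    }
}
```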