Today the data profiler received a ranking function that ranks the profiled data and cuts off the data to be discarded.
When data is profiled, a tuple with the term and its context is saved.
The context helps in evaluating the term, especially in deciding whether it's a false positive.
The context is also used to cut off data.
It is possible to keep only the terms whose ranking index, calculated from the term's context, is equal to or higher than a given value. Alternatively, it is possible to keep only the terms that do not fall too far below the term ranked immediately above them, i.e., if the current term's rank differs by more than a given percentage from the previous term's rank, the current term and all lower-ranked terms are discarded.
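A minimal sketch of the two cut-off rules, for illustration only. The class and method names (`RankCutoff`, `RankedTerm`, `cutByMinimumRank`, `cutByRelativeDrop`) are hypothetical, and the sketch assumes the ranking index has already been computed from the context and is a positive number; the actual profiler may well be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

class RankCutoff {
    /** Hypothetical shape: a profiled term plus the ranking index derived from its context. */
    static final class RankedTerm {
        final String term;
        final double rank;

        RankedTerm(String term, double rank) {
            this.term = term;
            this.rank = rank;
        }
    }

    /** Rule 1: keep only the terms whose ranking index is at least minRank. */
    static List<RankedTerm> cutByMinimumRank(List<RankedTerm> terms, double minRank) {
        List<RankedTerm> kept = new ArrayList<>();
        for (RankedTerm t : terms) {
            if (t.rank >= minRank) {
                kept.add(t);
            }
        }
        return kept;
    }

    /**
     * Rule 2: walk the terms from highest to lowest rank and stop at the first
     * term whose rank drops by more than maxDropPercent relative to the term
     * just above it. That term and everything below it are discarded.
     */
    static List<RankedTerm> cutByRelativeDrop(List<RankedTerm> terms, double maxDropPercent) {
        List<RankedTerm> sorted = new ArrayList<>(terms);
        sorted.sort((a, b) -> Double.compare(b.rank, a.rank)); // highest rank first
        List<RankedTerm> kept = new ArrayList<>();
        for (RankedTerm t : sorted) {
            if (!kept.isEmpty()) {
                double previous = kept.get(kept.size() - 1).rank;
                // Assumption: ranks are positive, so the relative drop is well defined.
                double dropPercent = previous > 0 ? (previous - t.rank) / previous * 100.0 : 0.0;
                if (dropPercent > maxDropPercent) {
                    break;
                }
            }
            kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<RankedTerm> terms = List.of(
                new RankedTerm("MR", 120), new RankedTerm("MRS", 100),
                new RankedTerm("DR", 90), new RankedTerm("TEST", 3));
        for (RankedTerm t : cutByRelativeDrop(terms, 50.0)) {
            System.out.println(t.term + " " + t.rank);
        }
    }
}
```

The second rule only compares neighbours, so a long, slowly declining list survives intact, while a sudden gap (typically where real terms end and noise begins) cuts the tail off.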
With these enhancements, the data profiler is finished.
Unless there's some special need to profile other data, it is done for now.
Monday, April 13, 2009
Thursday, April 9, 2009
Day 9 - Data Profiling
I'm continuing to develop the data cleaning function for the honorific titles.
Currently I'm developing a data profiler, a standalone Java application that queries DB2 via JDBC (JTOpen), which has already found some anomalies in the person contact name. There are the expected typos, some ad-hoc markers, and finally, the usual user creativity.
These user creativity anomalies are the funniest of them all. In the person contact name there are random characters, like "USO USSSSS", training data, like "TEST", and totally nonsensical person names, like "EXITING NOW", "ALSO EXITING" and "NEW".
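As a rough illustration of this kind of profiling, the sketch below reads one text column through plain JDBC and counts how often each leading token of a name occurs. Everything in it is an assumption made for the example: the names `NameProfiler`, `loadNames` and `countLeadingTokens` are hypothetical, the query is supplied by the caller, and treating the first token as the candidate honorific title is a simplification, not necessarily what the real profiler does. Obtaining the `Connection` (for instance through the JTOpen driver) is left to the caller.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class NameProfiler {
    /** Reads the first column of every row returned by the caller-supplied query. */
    static List<String> loadNames(Connection connection, String query) throws SQLException {
        List<String> names = new ArrayList<>();
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(query)) {
            while (rows.next()) {
                String name = rows.getString(1);
                if (name != null) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    /** Counts the occurrences of each leading token, ignoring case and blank names. */
    static Map<String, Integer> countLeadingTokens(List<String> names) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : names) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String token = trimmed.split("\\s+")[0].toUpperCase(Locale.ROOT);
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

Tokens that occur often are likely genuine titles, while tokens that occur once or twice, such as "EXITING" or "TEST", are the ones worth reviewing by hand.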