Data Science - Data Cleansing & Curation

Learn Aster
Teradata Employee

"It's not where you take things from, it's where you take them to - Jean-Luc Godard"

Data Cleansing & Curation plays an integral role in any analytic infrastructure.

Definitions of Data Cleansing and Curation (from Wikipedia):

Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting.

Data curation is a term used to indicate management activities related to the organization and integration of data collected from various sources, annotation of the data, and publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.

The fundamental difference between cleansing and curation is that cleansing largely deals with the data itself, while curation has to do with organizing the metadata so it's easy for data scientists to locate and reference the data source. If cleansing is like editing and proofreading a book, curation is more like organizing a library so books can be easily found and referenced.

There is a common misconception that data cleansing is front-loaded: once it's done, analytics can be performed and insights extracted from that data by the green-haired people. But if you talk to a data scientist who advises the business every day on *new* insights and trends, you might hear a slightly different story. Data cleansing in fact takes two different forms: one during ad hoc insight discovery, and another when that discovery gets operationalized (put into production) as part of a repeatable analytic process. This blog post is mostly about that distinction.

Data Cleansing in Discovery:

Data cleansing has traditionally played the role of whipping data into shape: joining data from multiple source systems, identifying missing timestamps/data, fixing wrong time zones, removing duplicate data, correcting errors in the source systems, creating surrogate keys, ripping key fields out of JSON, and so on. The roots are in the BI framework, where BI tools dip into well-organized data models in a relational database to pull BI insights for a user. There are a lot of vendors and open source tools in the market that support data wrangling, to 'fix' the data so it is consumable by a data scientist.
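To make that concrete, here is a minimal pandas sketch of this kind of traditional prep. It is an illustration only; the file names and columns (orders.csv, customers.csv, event_ts, payload_json, order_sk) are hypothetical and not from any particular system.

```python
# Hypothetical example of traditional data prep with pandas.
# File and column names are made up for illustration.
import json
import pandas as pd

orders = pd.read_csv("orders.csv")        # e.g. a transactional source system
customers = pd.read_csv("customers.csv")  # e.g. a CRM source system

# Join data from multiple source systems
df = orders.merge(customers, on="customer_id", how="left")

# Parse timestamps and normalize wrong/missing time zones to UTC
df["event_ts"] = pd.to_datetime(df["event_ts"], errors="coerce", utc=True)

# Flag rows with missing timestamps instead of silently dropping them
missing_ts = df["event_ts"].isna()
print(f"{missing_ts.sum()} rows have missing or unparseable timestamps")

# Remove duplicate records
df = df.drop_duplicates(subset=["order_id", "customer_id", "event_ts"])

# Rip a key field out of a JSON payload column
df["channel"] = df["payload_json"].map(
    lambda s: json.loads(s).get("channel") if isinstance(s, str) else None
)

# Create a simple surrogate key
df["order_sk"] = range(1, len(df) + 1)
```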

However, in machine learning/analytics/discovery projects, cleansing takes on additional meanings beyond traditional prep:

  • Removing the noise from the signal. Noise is an overloaded term that often cannot be defined before the data scientist sees the data and decides what is important and what is not. Noise distorts predictive models.
  • Strengthening the weak signals to create good models. Which signals are strong and which are weak?
  • Finding what might be missing to piece the story together, and asking IT for more data that would lead to more relevant insights.
  • Characterizing outliers for the problem, beyond naive min and max fencing (see the sketch after this list).
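As a small illustration of that last point, here is a hedged sketch comparing naive min/max fencing with a distribution-driven IQR fence. The data is synthetic and the thresholds are arbitrary; real outlier characterization would be tied to the business problem.

```python
# Hypothetical sketch: characterizing outliers beyond fixed min/max fencing.
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100, 10, 1000), [400.0, -150.0]])  # two injected outliers

# Naive fencing: anything outside fixed limits chosen up front
naive_outliers = (values < 0) | (values > 300)

# IQR fencing: fences derived from the observed distribution
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (values < lo) | (values > hi)

print(f"naive fence flags {naive_outliers.sum()}, IQR fence flags {iqr_outliers.sum()}")
```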

The situation is akin to a crime scene investigator asking everyone at the scene not to disturb anything until a couple of hypothesis bulbs go off in his or her head and every piece of raw evidence has been examined with its surrounding context.

Also, sometimes investigators want to check a couple of additional sources beyond what is at the scene. In our analogy, this is where a data scientist would ask IT for "more" data sources ...

When not to put the cart before the horse:

Data science is an art form, and it's used to solve business problems. Start with the business problem first, work backwards to the data requirements, and map those to the data model you already have, or create an ask for what you'd like. Data cleansing is a means to this end. Sometimes there is already data in Hadoop or an EDW that was originally designed for BI use, and that's OK; a data scientist should certainly consider it before evaluating a business problem. However, what gives the data scientist the best leverage is raw data with nothing removed, and/or an extremely sophisticated logical data model that captures all the complexities ahead of time ...

Deciding what to cleanse:

Cleansing should be context-sensitive and tied to the business challenge (usually 80% of the effort). If we are trying to find customer churn using non-monetary events in transactions, the data cleansing part could look something like this (a rough sketch of some of these steps appears after the list):

  • Analyze churn event histograms and build a transition probability matrix first. Understand the distribution and the event transitions, then decide which events to drop or keep.
  • Find the significant variables in a data-driven way. Removing columns prior to analysis using 'domain knowledge' alone can remove signal; these decisions need to be made carefully with quantitative techniques.
  • Sessionize the data. What should the session timeout be, and what should the partitions be? Tie these choices to the business problem rather than sessionizing your data ahead of time.
  • Mutual information removal: remove the less significant 'discriminating' events on both the churn and non-churn data streams.
  • Define outliers. There is a thin line (no pun intended) between a good non-linear model and an outlier; find the difference!
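Here is a rough Python sketch of a few of the steps above (sessionization, the event transition matrix, and a mutual-information screen). The column names (customer_id, event, event_ts, churned) and the 30-minute session timeout are assumptions for illustration, not recommendations.

```python
# Hypothetical churn-prep sketch with pandas and scikit-learn.
# Input schema (customer_id, event, event_ts, churned) and the session
# timeout are assumed purely for illustration.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

events = pd.read_parquet("events.parquet").sort_values(["customer_id", "event_ts"])

# Sessionize: start a new session when the gap to the previous event exceeds the timeout
gap = events.groupby("customer_id")["event_ts"].diff()
events["session_id"] = (gap > pd.Timedelta(minutes=30)).groupby(events["customer_id"]).cumsum()

# Event transition probability matrix: which event tends to follow which
events["next_event"] = events.groupby(["customer_id", "session_id"])["event"].shift(-1)
transitions = pd.crosstab(events["event"], events["next_event"], normalize="index")

# Data-driven variable screening: mutual information of event counts vs. the churn label
counts = pd.crosstab(events["customer_id"], events["event"])
labels = events.groupby("customer_id")["churned"].max()  # assumes a per-row churn flag
mi = pd.Series(
    mutual_info_classif(counts, labels.loc[counts.index], discrete_features=True),
    index=counts.columns,
).sort_values(ascending=False)

print(mi.head(10))  # strongest candidate events; weak ones are candidates to drop
```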

The data scientist will then decide on the predictive algorithms or ensembles with the above data, and rinse/repeat until noise and signal can be separated with good test predictions. Cleansing is an iterative process, exactly like insight discovery itself. Cleansing takes 80% to 90% of the time, depending on the methods and the data volume; the predictive analytics piece only takes 10% to 20% of the time.

Read my colleague John Thuma's blog post on how to maximize the time available for analytics (using technologies like Teradata Aster): "Why Aster flips the 80/20 rule".

What if it's a new business question?

The long list of data-massaging steps in the previous section may no longer be relevant if it's a completely different business question. Unfortunately, the existing cleansing process has to be thrown away unless we come up with a framework to leverage the cleansed data and manage it carefully for reuse. Say we want to find influencers in a graph to measure viral adoption of a product: how do you do that with the above? We have to start over and find transactions between people, which requires a different type of data preparation.

Operationalization (Production): How does Cleansing work here?

Once a repeatable method is discovered for a business problem, the intermediate steps can be handed over to the Data Lake/ETL/IT folks close to the data sources and made part of a repeatable big data pipeline with data quality checks in place. Without a business problem at hand, doing any of these steps ahead of time as guesswork is problematic and often creates headaches for the data scientist and rework for IT. As more business problems get solved by data scientists, the intermediate steps can become part of a framework that the ETL/ELT/IT team maintains to support ongoing use.
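As a sketch of what "data quality checks in place" might look like once a cleansing recipe is handed to IT, here is a minimal example. The thresholds, column names, and file paths are assumptions for illustration only.

```python
# Hypothetical sketch: a productionized cleansing step wrapped with simple
# data-quality checks. Thresholds and schema are assumed for illustration.
import pandas as pd

def cleanse_batch(raw: pd.DataFrame) -> pd.DataFrame:
    """The repeatable cleansing recipe handed over from discovery."""
    df = raw.drop_duplicates().copy()
    df["event_ts"] = pd.to_datetime(df["event_ts"], errors="coerce", utc=True)
    return df.dropna(subset=["customer_id", "event_ts"])

def quality_checks(raw: pd.DataFrame, clean: pd.DataFrame) -> None:
    """Fail loudly instead of silently passing bad data downstream."""
    if clean.empty:
        raise ValueError("cleansed batch is empty")
    dropped = 1 - len(clean) / len(raw)
    if dropped > 0.05:  # arbitrary threshold: more than 5% of rows lost
        raise ValueError(f"cleansing dropped {dropped:.1%} of rows")
    if clean["customer_id"].isna().any():
        raise ValueError("null customer_id after cleansing")

raw = pd.read_parquet("daily_events.parquet")
clean = cleanse_batch(raw)
quality_checks(raw, clean)
clean.to_parquet("cleansed/daily_events.parquet")
```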

Data Cleansing with advanced Machine Learning & Regression techniques:

Machine learning, model building/scoring, and data science are often thought of as something only data scientists engage in. Given the maturity and broader understanding of some of these techniques, machine learning can today also be used for cleansing by IT teams who have access to the right tools. Take text data, for example. Content with disclaimers, signatures, and boilerplate language often needs to be cleaned before it can be used for topic generation, sentiment analysis, etc. By applying machine learning, such noise can be removed in the curation pipeline! We can also use machine learning for outlier cleansing, with techniques such as LASSO/LARS, by identifying outliers that fall outside a modified linear model. Sometimes pre-classification may have to be done upstream to route certain content to certain algorithms of interest, which again can rely on algorithms such as Support Vector Machines run downstream ...
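One way to read the LASSO/LARS remark is as residual-based outlier flagging: fit a sparse linear model and treat points with unusually large residuals as outlier candidates. The sketch below uses synthetic data and scikit-learn's LassoCV purely for illustration; it is not the post's specific method.

```python
# Hypothetical sketch: regression-based outlier cleansing with a LASSO fit.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)
y[:5] += 15  # inject a few gross outliers into the synthetic data

model = LassoCV(cv=5).fit(X, y)
residuals = y - model.predict(X)

# Flag observations whose residual is more than ~3 robust standard deviations out
med = np.median(residuals)
robust_sd = 1.4826 * np.median(np.abs(residuals - med))
outliers = np.abs(residuals - med) > 3 * robust_sd
print(f"flagged {outliers.sum()} candidate outliers")
```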

What about Curation in Discovery and Operationalization scenarios?

Watch this video, in which Ron Bodkin, Founder & President of Think Big (a division of Teradata), explains the magic of curation with open source tools and how data scientists can use it.

Unlike data cleansing, curation is common to both discovery and operationalization (production) scenarios. Curation is all about collecting, tagging, and annotating data, and about governance, so things are easy to find right from the beginning. There is also a great blog post on a similar note by Stephen Brobst, CTO of Teradata: "Is your Data Lake destined to be useless?"

Hope you enjoyed reading the blog post ...