I hate to say this, but we build models and discover business insights off of faulty data. We like to think that the data we work with is pristine, flawless, perfect, but often it's just plain dirty. The exceptions to this rule are often similar to the Iris Data Set: complete and probably perfect, but minuscule in comparison to the size of modern-day data spaces.
There are three main things that make data dirty:
We can often identify the first two during data exploration. Nulls and instances like a work tenure of 2107 years jump out fairly easily. Sometimes we're able to update these with a ground truth from elsewhere in the data space. Other times we do data cleansing by removing rows or replacing values as appropriate.
It is much harder to identify or fix the third type of dirty data: data that is non-observably incorrect. This may be fine in the day to day life of or data but is especially bad when we think about supervised learning problems.
Imagine that we poll 100 people, ask them them "Do you think you're happier than the average person?" and get the below distribution:
We have 71 hypothetical people that see themselves as more than averagely happy and 29 who see them selves as less than averagely happy. But how many of those 71 would actually just being uncomfortable saying they were unhappy in a survey? And are there any of our less-than average happy people who are just humble and would be uncomfortable saying they were better than average in a survey?
|What They Felt|
|What They Said||Yes||?||?||71|
We could build a model on this data set that is perfect in prediction and has a huge F-Score. But depending on how deceptive our poll-takers were feeling we may end up with a model that is bad at actually predicting happiness. Perhaps too existential of an example. But in our businesses do we want to predict the real answers or the answers skewed by false data?
This gives us something to think about in all of our predictive models. When we have a model with recall of 70 percent, is it that we are missing 30% of the category or that 30% of the category isn't actually that category? I hypothesize that it's likely somewhere in between.
So what can we do?
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.