Let's start with a simple example. We want to build a model to predict which 9th grade students are most likely to get an A in math class, based on historical data. We have each student's grades in every subject and class since 6th grade. We also have other "profile" information on each student: their parents' level of education, how many years the student has been in the same school, and so on. Let's also throw in weather, distance from school, etc.
Before jumping into advanced machine learning, deep learning and the like, there is an important step a data scientist performs. It's called Feature Engineering: deciding which variables should be considered in a model. Give this problem to three data scientists, and they will invariably come up with three different feature engineering solutions. See below for what different data scientists might decide to use as inputs to the model.
Data Scientist 1: All math grades since 6th grade, plus a simple yes/no flag for each parent on whether they went to graduate school. Add distance from school, but as a percentile measure. Ignore weather.
Data Scientist 2: Only math and science grades of B or above, and only since 8th grade. Codify each parent's education into four categories - high school, undergrad, graduate, and postgrad and above. Add weather as just the average night temperature - who knows, students' effort may be correlated with the weather?
Data Scientist 3: Math, science and English grades only, using the highest grade in each subject, encoded as 1 if it is an A or above, else 0. Ignore weather and distance from school - not part of my hypothesis.
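To make these choices concrete, here is a minimal sketch of how Data Scientist 2's encoding might look in plain Python. The record layout, field names, and the "B or above" cutoff are illustrative assumptions, not a real schema:

```python
# Hypothetical four-level encoding of a parent's education
EDU_LEVELS = {"high school": 0, "undergrad": 1, "graduate": 2, "postgrad": 3}

def encode_student(record):
    """Turn one raw student record (a dict) into model-ready features."""
    features = {}
    # Keep only math/science grades of B or above (A=4, B=3)
    for subject, grade in record["grades"].items():
        if subject in ("math", "science") and grade in ("A", "B"):
            features[subject + "_grade"] = {"A": 4, "B": 3}[grade]
    # Codify each parent's education into one of four categories
    features["mother_edu"] = EDU_LEVELS[record["mother_education"]]
    features["father_edu"] = EDU_LEVELS[record["father_education"]]
    # Weather reduced to a single number: average night temperature
    features["avg_night_temp"] = record["avg_night_temp"]
    return features
```

The point is not this particular encoding but that every line above embodies a judgment call another data scientist might make differently.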
Surprised? This is exactly how different data scientists make decisions ahead of time about what the input should be. Each data scientist has a rationale or technique for choosing the inputs that 'encode' the information in the data, based on other projects, white papers, domain expertise, and sometimes basic profiling of the data.
The art of choosing how variables are selected, massaged and rejected is called "feature engineering."
Data Cleansing vs Feature Engineering:
We have heard many times that data scientists do janitorial work to clean data for analytics. That's partially true, but not the whole story. Data scientists spend much of their time "massaging" the data in a feature engineering pre-step, and in that process are forced to do cleansing to get where they want. Feature engineering is often overlooked by business folks in analytics settings, yet it is one of the most difficult tasks: deciding which features to keep, modify or drop before the algorithms can start working can be pretty arduous. When a model is not performing well, going back to fix the feature-level information to increase its accuracy is a very interesting problem.
Why Feature Engineering is important:
All machine learning algorithms require a problem to be represented as "features" or variables, which are basically a form of information representation. The better the information is encoded in the features, the more robust the resulting analytical models will be.
Feature Engineering is not an isolated separate step:
There is another misconception in the data science world that feature engineering or cleansing is a separate step, and once it is done, analytics can be performed. In reality, feature engineering bootstraps the data for analytics and is done iteratively with data science tasks: do a bit of feature engineering, build the models, validate; if the results are not good, fix the features and iterate on model building again.
Common Feature Engineering tasks:
Some of the most common feature engineering tasks include:
Find correlations between variables. For example, if height and weight are correlated, keep one instead of both.
Variable elimination - keep only the top N significant features, as ranked by some algorithm.
Do dimension reduction. If you start with 100 numerical features, run Principal Component Analysis to whittle them down to tens of derived features using linear transformations.
Use evolutionary programming techniques to find interesting ways to combine variables into complex derived features.
When using image inputs, convolution techniques are often used to reduce an area of pixels to a few values, stacked in layers to capture features like contours.
Run SAX (Symbolic Aggregate approXimation) to extract symbol strings from time series data.
Get n-grams from text data (unigrams, bigrams, trigrams and skip-grams).
Get PageRank, betweenness, local clustering coefficients, etc., for nodes in a connected graph.
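The correlation check from the task list can be sketched in a few lines of numpy. The 0.9 threshold and the greedy keep-the-first policy are illustrative choices:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Drop any column whose absolute Pearson correlation with an
    already-kept column exceeds the threshold (e.g. height vs. weight)."""
    corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlations
    keep = []
    for j in range(X.shape[1]):
        # keep column j only if it is not highly correlated with a kept one
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]
```

With a height/weight/age matrix, this keeps height and age and drops weight.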
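Top-N variable elimination can be done many ways; one simple numpy-only sketch (real pipelines often use a library selector) scores each feature by its absolute correlation with the target:

```python
import numpy as np

def top_n_features(X, y, n):
    """Score each column by |correlation with the target| and keep the top n."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:n]   # indices of the n best scores
    return np.sort(top)                  # returned in original column order
```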
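The SAX step can be sketched as: z-normalize the series, average it into equal-width segments (PAA), then map each mean to a letter. The breakpoints below are the standard-normal quartiles for a 4-letter alphabet; this is a simplified sketch of the published algorithm:

```python
import numpy as np

# Breakpoints that cut the standard normal into 4 equal-probability bins
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax(series, n_segments, alphabet="abcd"):
    """Symbolic Aggregate approXimation: series -> short symbol string."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                  # z-normalize
    segments = np.array_split(x, n_segments)      # piecewise aggregation
    means = [seg.mean() for seg in segments]
    return "".join(alphabet[np.searchsorted(BREAKPOINTS, m)] for m in means)
```

A long numeric series becomes a short string like "abddca" that text-style algorithms (or the n-gram extraction below) can consume.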
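N-gram extraction from text is a few lines of plain Python:

```python
def ngrams(text, n):
    """Word n-grams of a string (n=1: unigrams, n=2: bigrams, ...)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def skipgrams(text, skip=1):
    """Word pairs that skip `skip` intervening words (k-skip bigrams)."""
    words = text.lower().split()
    return [(words[i], words[i + skip + 1])
            for i in range(len(words) - skip - 1)]
```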
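Graph features like PageRank are normally computed with a graph library; as a sketch of the idea, here is a minimal power-iteration version over a dict of outgoing links (it assumes every node has at least one out-link):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict: node -> list of out-links."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # everyone starts each round with the random-jump share
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for m in outs:                  # spread n's rank over its links
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank
```

Each node's score then becomes one more column in the feature matrix.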
Can we eliminate the Feature Engineering step entirely?
Deep Learning techniques often claim to eliminate the Feature Engineering step. If you are doing image recognition, a Deep Learning stack will use convolution techniques to grab features from images instead of working at the pixel level. However, there is still work involved in tuning the convolution parameters and stack to get things right. In general, with techniques like Deep Learning, a feature engineering problem often morphs into an architecture/configuration problem, which also requires multiple iterations to get right or to tune to the desired effect.
Multi-Genre (TM) in Feature Engineering:
This is unique to a platform like Teradata Aster, where predictive model accuracy can be increased consistently by mixing rich features. Features extracted from text, graphs and time series coefficients can be mixed easily and seamlessly to build much richer models. The features can be engineered further for use with any algorithm within Aster or on other open source platforms.