We at Teradata are focused on using data to help achieve better business outcomes for our customers. We empower businesses to better understand their data so that they can listen and act in accordance with the needs of their customers. For our data science group at Think Big Analytics, this means using data to help businesses improve, grow, and succeed.
The Use Case
Imagine ACME Corp. rolls out a new product and has been receiving hundreds of thousands of reviews about it. It is extremely important to ACME that any issues with the product are addressed as soon as possible so that the company doesn't lose the trust of its customers. What do we do? The company knows that some reviews should be handled differently than others: some reviews or complaints, if not addressed, could cost the company greatly because of legal ramifications or the value of the complaining customer. These reviews may need to be given more care and monetary or non-monetary relief. Many businesses classify these reviews manually, but wouldn't it be nice to use machine learning to do the initial classification? The story that follows is a simple real-world example of how we could build a foundational base model to address this business need.
The Consumer Financial Protection Bureau (CFPB) collects consumer complaints about financial products. These products range from credit cards to mortgages to student loans to debt collection. The complaints have been stored in a database and have been openly accessible to the public since June 2012. One of the most interesting features of this dataset is the “Consumer Complaint Narrative”. The narratives are the scrubbed complaints from consumers that explain how they have been adversely impacted by a financial services product. With many complaints and other interesting accompanying features, there is an opportunity to build some machine learning models. Specific to the business case proposed above, there is an attribute that designates whether the complaint resulted in the company providing some kind of relief. This sounds perfect for our use case. A sample of the data from the website's interface:
The data can be downloaded as a CSV file from the CFPB website: CFPB Complaints
What follows is the workflow to classify complaints into one of two groups. Procedurally, we will take the CFPB data and clean it, train and test a model, and evaluate a model that helps classify a complaint as one where the consumer either receives relief, or does not receive relief. Hopefully by the end, we will have created something that can make a strong prediction, and will be useful to a business leveraging this data or data like it.
It has been said that data science and data analysis are 80% data prep and 20% model building, so we will keep that in mind as we start our analysis, as the initial work may feel slow. As a note, this work utilizes Teradata Aster SQL and SQL MapReduce as the tool, but this process could be used with any machine learning technology and language (Python, R, Scikit-learn…etc.).
Clean the data
We skip the process of loading the data into an analytical environment, and assume that the data from the CFPB site is stored in a relational database (in this case we stored it in Teradata Aster).
While it is the job of the classification algorithm to separate our two cases as accurately as possible, an algorithm is only going to perform as well as the information it is given. If we know that some components of the data will not be helpful, and are abundant, then we should do ourselves a favor and cleanse out the extra data. This is where human interaction and intuition still play a role in getting accurate predictions from a simple machine learning model; our base model is not going to build itself.
Classifying the consumer complaint language associated with receiving relief, or not receiving relief, is what is of interest. If there is no language there is no information to leverage, so we can exclude non-responses from the base data set:
CREATE VIEW complaint.no_null_cfpb_2018_ AS
SELECT *
FROM complaint.cfpb_2018_
WHERE consumer_complaint_narrative IS NOT NULL;
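Since the post notes this workflow could run in Python just as well, the same null filter might look like the following pandas sketch (the tiny DataFrame and its values are made up for illustration; real data comes from the CFPB CSV):

```python
import pandas as pd

# Hypothetical sample rows; column names mirror the SQL view above.
cfpb = pd.DataFrame({
    "complaint_id": [1, 2, 3],
    "consumer_complaint_narrative": [
        "Charged twice for one payment.",
        None,
        "Loan servicer lost my records.",
    ],
})

# Equivalent of WHERE consumer_complaint_narrative IS NOT NULL
no_null = cfpb[cfpb["consumer_complaint_narrative"].notna()]
```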
Transform the data
It is common practice in Natural Language Processing to remove “stop words” like “and”, “or”, “any”, and “are”. Removing them leaves the words that have higher specificity to each class of complaints. Removing stop words also helps reduce the size of our data by a non-trivial amount, which is helpful considering we will be taking a “sparse” approach to executing the model down the road, with all documents (complaints) split and pivoted in a row-wise fashion. To do all of this we split words on spaces and common punctuation like “!” and “,”:
CREATE TABLE complaint.ng1_nostop_cfpb_2018_
DISTRIBUTE BY HASH(complaint_ID) AS
SELECT * FROM Text_Parser (
    ON complaint.no_null_cfpb_2018_
    TextColumn ('consumer_complaint_narrative')
    ToLowerCase ('true')
    Stemming ('false')
    TokenColumn ('ngram')
    Punctuation ('\[.,?\!\]')
    ListPositions ('true')
    StopWords ('stopwords.txt')
    RemoveStopWords ('true')
    Accumulate ('complaint_ID')
);
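For readers following along outside Aster, a rough Python equivalent of this parsing step might look like the sketch below. The stop word list and sample sentence are illustrative assumptions, not the real stopwords.txt:

```python
import re

# Tiny illustrative stop word list; a real one (e.g. stopwords.txt) is much longer.
STOP_WORDS = {"and", "or", "any", "are", "the", "a", "i", "was", "for", "my"}

def parse_text(complaint_id, narrative):
    """Roughly mirrors Text_Parser: lowercase, split on spaces and common
    punctuation, drop stop words, and keep each token's original position.
    Returns one (complaint_id, token, position) row per surviving token."""
    tokens = re.split(r"[ .,?!]+", narrative.lower())
    return [(complaint_id, tok, pos)
            for pos, tok in enumerate(tokens)
            if tok and tok not in STOP_WORDS]

rows = parse_text(42, "The servicer lost my payment, and charged a late fee!")
```

As in the SQL version, each complaint then occupies n rows, one per retained token.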
At this point, each complaint document has n rows, where n is the number of one-word “tokens” in the complaint.
Before proceeding, another task to consider is whether there are derived variables or features that should be created that are not currently in the data. The hypothesis developed for this work flow would need to have a complaint labeled as having received relief, or not. A simple label will help build out and test the hypothesis:
SELECT *,
       CASE WHEN company_response_to_consumer ILIKE ('%relief%') THEN 1 ELSE 0 END AS Complaint_Category
FROM complaint.svn_ng1_nostop_cfpb_2018_;
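The same label could be derived in pandas with a case-insensitive substring match (the sample responses below are illustrative values in the style of the CFPB field):

```python
import pandas as pd

# Hypothetical company responses; the label mirrors the SQL
# CASE WHEN ... ILIKE '%relief%' THEN 1 ELSE 0 expression.
df = pd.DataFrame({
    "complaint_id": [1, 2, 3],
    "company_response_to_consumer": [
        "Closed with monetary relief",
        "Closed with explanation",
        "Closed with non-monetary relief",
    ],
})
df["complaint_category"] = (
    df["company_response_to_consumer"]
    .str.contains("relief", case=False)
    .astype(int)
)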
Split up the data
Data science and machine learning practitioners likely know that it is best practice to split data into a training set and a test set. For those that haven’t heard: to construct a model that isn’t biased, and to know whether it performs well beyond the data it has already seen, we should hold out some data for prediction so that we can base our final accuracy metrics on it with more confidence. There are more thorough validation techniques we can implement as we create more complex models, but a training (80%) / testing (20%) split of the dataset is a good start. Plus, it is straightforward to implement in SQL too:
CREATE TABLE complaint.svn_split_ng1_nostop_cfpb_2018_
DISTRIBUTE BY HASH(ngram_complaint_id) AS
SELECT ngram.complaint_ID AS ngram_complaint_id,
       ngram.*,
       complaint_ids.split
FROM complaint.ng1_nostop_cfpb_2018_ AS ngram
LEFT JOIN (
    SELECT *,
           CASE WHEN (complaint_ID % 10) < 8 THEN 'train' ELSE 'test' END AS split
    FROM (SELECT DISTINCT complaint_ID
          FROM complaint.ng1_nostop_cfpb_2018_) AS distinct_ngram
) AS complaint_ids
ON ngram.complaint_ID = complaint_ids.complaint_ID;

CREATE TABLE complaint.svn_split_merge_ng1_nostop_cfpb_2018_
DISTRIBUTE BY HASH(ngram_complaint_id) AS
SELECT a.*, b.company_response_to_consumer
FROM complaint.svn_split_ng1_nostop_cfpb_2018_ a
JOIN complaint.no_null_cfpb_2018_ b
ON a.complaint_ID = b.complaint_ID;

CREATE VIEW complaint.svn_train_ng1_nostop_cfpb_2018_ AS
SELECT * FROM complaint.svn_split_merge_ng1_nostop_cfpb_2018_ WHERE split = 'train';

CREATE VIEW complaint.svn_test_ng1_nostop_cfpb_2018_ AS
SELECT * FROM complaint.svn_split_merge_ng1_nostop_cfpb_2018_ WHERE split = 'test';
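The split rule itself is just modulo arithmetic on the complaint ID, which could be sketched in Python as follows. Note this is deterministic rather than random: every row of a given complaint lands on the same side, and the 80/20 ratio holds only if IDs are roughly uniformly distributed over their last digit:

```python
# Deterministic 80/20 split keyed on complaint_id,
# mirroring CASE WHEN (complaint_ID % 10) < 8 THEN 'train' ELSE 'test'.
def assign_split(complaint_id):
    return "train" if complaint_id % 10 < 8 else "test"

# Illustrative run over 20 sequential IDs.
splits = {cid: assign_split(cid) for cid in range(20)}
train_ids = [cid for cid, s in splits.items() if s == "train"]
test_ids = [cid for cid, s in splits.items() if s == "test"]
```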
*A colleague of mine shared an excellent article with a more rigorous investigation of the “why” ML practitioners use the train/test split for those interested: Why-Data-Scientists-Split-Data-into-Train-and-Test
Fit and model the data
If we are keeping score at home, we should be 80% of the way through this initial Support Vector Machine (SVM) model workflow, since we now only need to train and test the algorithm. Because the data have been pivoted so that complaints are sequenced word-by-word in a row-wise fashion, the SparseSVMTrainer function is appropriate to use.
The training table contains:
- each attribute, ngram, with its corresponding value, frequency
- the binary classification of whether relief was received, complaint_category

We output the model to a model table in the database called complaint.svn_model_bravo_ng1_nostop_cfpb_2018_:
SELECT * FROM SparseSVMTrainer (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('complaint.svn_train_label_merge_ng1_nostop_cfpb_2018_')
    ModelTable ('complaint.svn_model_bravo_ng1_nostop_cfpb_2018_')
    SampleIDColumn ('complaint_ID')
    AttributeColumn ('ngram')
    ValueColumn ('frequency')
    LabelColumn ('Complaint_Category')
    Hash ('true')
);
Using the model built from training (which took only about 5 minutes), we can run predictions on the previously created test dataset.
SELECT * FROM SparseSVMPredictor (
    ON complaint.svn_test_label_merge_ng1_nostop_cfpb_2018_ AS input PARTITION BY complaint_ID
    ON complaint.svn_model_bravo_ng1_nostop_cfpb_2018_ AS model DIMENSION
    SampleIDColumn ('complaint_ID')
    AttributeColumn ('ngram')
    ValueColumn ('frequency')
    AccumulateLabel ('Complaint_Category')
);
Sample predictions are output below:
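As the post notes, this process is not tied to Aster. A compressed scikit-learn sketch of the same train-then-predict flow appears below; the four-document corpus and its labels are invented for illustration, and CountVectorizer plays the role of the earlier parse-and-pivot steps by building a sparse term-frequency matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative corpus; real narratives come from the CFPB CSV.
train_texts = [
    "bank refunded the duplicate charge",          # relief (1)
    "received a credit for the billing error",     # relief (1)
    "company closed my complaint with no action",  # no relief (0)
    "they only sent a form letter explanation",    # no relief (0)
]
train_labels = [1, 1, 0, 0]

# Sparse (document, token, frequency) representation, as in the Aster tables.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Linear SVM on term frequencies, standing in for SparseSVMTrainer.
model = LinearSVC()
model.fit(X_train, train_labels)

# Standing in for SparseSVMPredictor on held-out text.
X_test = vectorizer.transform(["duplicate charge was refunded"])
pred = model.predict(X_test)
```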
The results from constructing a confusion/error matrix are shown below. There are three dimensions of output here to help us make sense of the model that is predicting resolution of relief vs non-relief. Here is the breakdown:
The “Table of Confusion” shows the correct true negatives, or a prediction of non-relief when there was no relief for the consumer, as well as the true positives, where the prediction of relief for the consumer was correct. The Table of Confusion also shows incorrect classifications. The top right quadrant shows where there was a prediction of no relief when in fact there was relief, which is a false positive (from the standpoint of predicting non-relief). The false negative case (again from the standpoint of predicting non-relief) is in the bottom left quadrant, where the model predicted relief when it should not have.
The next table is a statistical summary of the prediction and a test of the accuracy statistic. Those that have taken a course or two in statistics may notice that the P-value of the accuracy test is 0. For those that haven’t taken much stats: this means there is extremely strong evidence that the SVM-based predictions are better than just guessing (flipping a coin) whether a consumer will receive relief based on their complaint. The fact that the 95% confidence interval does not contain 0.5 also supports the statistical significance of the model predictions.
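The accuracy test above can be reproduced by hand with a one-sample proportion z-test against the coin-flip null of 0.5. The test-set size below is an assumption chosen purely for illustration; only the 83.43% accuracy comes from the post:

```python
import math

# Assumed counts for illustration: n test complaints at the reported 83.43% accuracy.
n = 10000
correct = 8343
p_hat = correct / n

# z-statistic against the coin-flip null p0 = 0.5.
p0 = 0.5
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# 95% confidence interval for the accuracy (normal approximation).
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)
```

With a z-statistic this large the p-value underflows to 0, and the interval sits well above 0.5, matching the summary table's conclusion.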
We found in the previous summary table that the testing accuracy of our model is 83.43%. That seems decent enough, but the goal is to support the goals of the business. Maximizing raw accuracy risks building the best model for a sub-optimal problem. For example, too often misclassifying as non-relief a complaint that historically would have received relief could prove costly to the business long term, hurting customer satisfaction and driving churn, and the misrouted complaints create inefficiencies. In the following table, derivative statistics from the Table of Confusion are provided to better understand the context of the accurate and inaccurate predictions. For the example of expediting relief to complaint narratives that may eventually receive it, we focus on the specificity for the relief case (1) of 94.88%. If the goal of the business is to provide relief when it should, this is more informative than the overall accuracy metric. Of course, we would have to determine the business implications of each classification, correct or incorrect, before proceeding.
If curious about the confusion metrics derivatives, refer to the Wikipedia page: Confusion Matrix
Like the last table of confusion-matrix derivatives, the F-measure calculation builds on the accuracy measures of precision and recall. Precision and recall are common metrics in binary classification: precision is the percent of true positives of a class out of all cases we classified as belonging to that class, and recall is the percent of true occurrences of the class that we classified correctly. Both metrics correspond to some of the computations in the final confusion metrics table… can you find them all for each classification case? The F-measure is the harmonic mean of the two, and gives an informative single metric of classification accuracy for each class of consumer complaint. Looking at the F-measure calculations, we can see that the model is generally much better at classifying the case where the complaint won’t lead to relief (class 0).
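The three metrics fall out directly from the confusion-matrix counts. The counts below are made-up placeholders, since the actual Table of Confusion lives in the figure above:

```python
# Hypothetical confusion-matrix counts for the relief class (1).
tp, fp, fn = 820, 150, 210

# Of everything we predicted as "relief", how much was actually relief?
precision = tp / (tp + fp)

# Of all true "relief" cases, how many did we catch?
recall = tp / (tp + fn)

# F-measure: the harmonic mean of precision and recall.
f_measure = 2 * precision * recall / (precision + recall)
```

The harmonic mean always lands between precision and recall, and is dragged toward the smaller of the two, which is why it is a useful single summary.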
Final F-measure for the SVM classifier: 83.43%
Where to next?
One of the exciting things about data science is that there is always an opportunity to improve our work. Perhaps the goal was to achieve an F-score of 80% or greater, in which case our model achieved this (83.43%) on an 80/20 train/test data split, satisfying the requirements of the business. Still, there is certainly room to improve our model. There are myriad ways we can try to improve model predictions, like getting more data, trying new algorithms, and so on.
One thing we can do is try to enrich the data that is used to predict whether a complaint will be classified as leading to non-relief or not. We used the term frequency as the value for each attribute (word/1-gram). Alternatively, we could try using a weighted score for each word per document like Term Frequency - Inverse Document Frequency, or we can include the position of the word in the document, or maybe we can include the part of speech (POS) of the word and re-run SVM with one-hot encoding. The point is that there are ways to leverage more information to make our predictions more accurate and the model more powerful as we try new things.
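To make the TF-IDF idea concrete, here is a minimal from-scratch sketch over a toy corpus (the three "complaints" are invented; a real implementation would more likely use a library vectorizer):

```python
import math
from collections import Counter

# Toy corpus of three tokenized "complaints".
docs = [
    ["late", "fee", "charged", "fee"],
    ["late", "payment", "reported"],
    ["account", "closed", "late"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(doc):
    """Term frequency times inverse document frequency for one document.
    Words appearing in every document get weight 0 (log of 1)."""
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}

weights = tf_idf(docs[0])
```

Here "late" occurs in all three documents, so its weight collapses to 0, while "fee" outweighs "charged" because it occurs twice in the document; that downweighting of ubiquitous terms is exactly the enrichment over raw frequency.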
Another endeavor that might help improve the accuracy of the model are the values we use as initial parameters for the Support Vector Machine algorithm training. For the above workflow we used the default parameters and let the software choose, but the more we know about our data the more we can tune the model. We can alter the gamma, degree, bias, epsilon, etc. based on the data profile.
The simple analysis in this post serves as a demonstration of how straightforward answering a business use case can be with Machine Learning in Teradata. This approach used Teradata Aster Machine Learning, but our team at Think Big Analytics is technology agnostic, and we could use other tools together to get the job done, whatever the customer wants and needs. We believe that the business needs come before the technology, as we strive to live by our company slogan of being “Business-Outcome Led and Technology Enabled”.
Feel free to reach out with feedback on this article, or questions about Teradata and what we do: