Data Science - Putting Machine Learning/AI Research Papers to Practice with R ...

Teradata Employee

If you come from the BI world like me and are moving into data science, you have probably run into the hurdle of interpreting Machine Learning white papers. Probably a colleague forwarded one to you. Your eyes glaze over the math, and soon you find yourself reading the Conclusion section to see how the science worked and what the results meant. And then you draw a blank! Nothing actionable here! How do I translate this into something that I can use?

So why bother?

Well, if you have been playing with Machine Learning and solving customer problems, it's easy to wear a practitioner's hat. If you are using a platform like Aster, that is probably the best way to start (at least that's how I got here). You get in with less friction and solve problems with data in a matter of minutes with very little distraction. If you are familiar with Spark MLlib, that's great too. There are design patterns a practitioner uses repeatedly, no matter which ML algorithm is involved. That experience is not a bad place to start understanding research white papers and putting them to use.

Evolutionary vs Groundbreaking

Most research papers on Machine Learning/AI take a well-known algorithm or idea and apply a twist to it (evolutionary). I say most, because relatively few white papers actually invent something, or bring out a hidden gem that nobody knew about, and put a dent in the algorithmic universe. Some awesome examples of work that is really, really different, as opposed to a close cousin of popular work, are:

  • Rabiner's tutorial paper on Hidden Markov Models and their applications in Speech Recognition
  • Charles Sutton & Andrew McCallum's "An Introduction to Conditional Random Fields" paper
  • Eamonn Keogh & Jessica Lin's Symbolic Aggregate approXimation (SAX)

Things to look for in ML research papers

If we just focus on AI/Machine Learning, you'll find that it usually starts with a specific use case or problem like Click Stream Prediction, Text Disambiguation, Image Recognition and whatnot. You google your problem and land on a white paper from one of a couple of sources - Cornell's arXiv is my favorite. Some archives let you get PDFs, and some also point you to GitHub for the source code referenced in the white paper - not bad! I think this is an important recent development.

Most ML papers repeat a well-trodden technique but offer modifications to it - which is the point of the paper. They usually also show ROC (Receiver Operating Characteristic) charts, which plot the true positive rate against the false positive rate as the decision threshold changes, so you can see how many false alarms you pay for each correct detection. Most of them also report AUC (Area Under the Curve), a single number between 0 and 1 that summarizes the ROC chart and makes it easy to compare the technique with other methods (0.5 is no better than random guessing, 1.0 is perfect). The ROC and AUC are the easiest measures of any AI/ML technique to read, IMHO. They connect the data to the predictive ability of the algorithm described in the white paper and in fact hold the proof statement that the technique was tested with good results! If you can locate them quickly, you have found the core finding of the research paper ...
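
To make the ROC and AUC idea concrete, here is a minimal sketch in plain Python (chosen only to keep the example dependency-free; the same steps can be written in R). It uses the standard definition of ROC, true positive rate against false positive rate as the threshold is swept, and the trapezoid rule for the area. The function names roc_curve_points and auc, and the toy labels and scores, are made up for illustration and do not come from any particular paper or package.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model scores, higher meaning "more likely positive".
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data, invented for this example.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_curve_points(toy_labels, toy_scores))
```

When a paper reports an AUC for its method next to the AUC of a baseline, this is the number being compared: the closer to 1.0, the better the scores separate the two classes.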

Pay no attention to the Math - just kidding!

This is the trickiest part of reading a research paper when you are digging for nuggets for your data science problem! Sometimes even someone with a PhD would have a hard time understanding some of the math in a fairly well developed white paper! This is probably the most disheartening piece of the exercise. You do need a fair bit of calculus and probability theory to interpret the equations and the language that describes them. The authors rarely make it easier - understandably, because white papers are meant to convince academic peers, advisers and the research community of the work, not practitioners! What I've tried doing is to look for the 'motive' behind the math section. Just ask 'what is the author trying to accomplish with this?'. This requires a bit of patience ... Look for the input variables involved, the outputs, and where the author claims success. Add a bit of Wikipedia, blogs and of course advice from smart people you work with, and you've cracked the white paper!

Putting the science to work with R

Practitioners love downloadable code, and that's not easy to come by. But there is hope. This is where I see the language R playing a big role. As the academic community tests its hypotheses on a problem, R is more within their reach as a tool for playing with statistics than any other language. That's why there are 4000+ R packages published by university researchers that you can leverage big time in your projects.

It turns out R is the bridge between the ML academic community and practice! R on its own is largely limited to running on a desktop today because of memory constraints. However, platforms like Teradata Aster make it easier to point R code at big data with a few changes, or to actually run R inside the cluster itself!

You also don't have to run everything in parallel initially. Get the code working in a big memory server environment and start modeling in a single thread. Scoring BIG DATA in parallel with R is relatively easier, because the data can be partitioned easily and each partition scored independently.
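
The "model in a single thread, score in parallel" pattern can be sketched roughly as follows. This is a plain Python illustration under stated assumptions, not Aster or R code: the model is a toy one-variable least-squares line, a local thread pool stands in for the cluster workers, and every name here (fit_line, partition, score_partition, score_in_parallel) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by least squares, in a single thread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def partition(rows, n_parts):
    """Split rows into at most n_parts contiguous chunks, preserving order."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division, never zero
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def score_partition(model, rows):
    """Score one chunk; it needs only the fitted model and its own rows."""
    slope, intercept = model
    return [slope * x + intercept for x in rows]


def score_in_parallel(model, rows, n_parts=4):
    """Score all rows chunk by chunk on a pool of workers."""
    chunks = partition(rows, n_parts)
    # Assumption: threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        scored = list(pool.map(lambda chunk: score_partition(model, chunk), chunks))
    # pool.map returns results in input order, so the chunks line up again.
    return [value for chunk in scored for value in chunk]


# Train once on a small sample, then score a larger data set in chunks.
model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
predictions = score_in_parallel(model, list(range(10)), n_parts=3)
```

The point of the design is that training sees the data once in one place, while scoring only needs the small fitted model shipped to wherever each partition of the data lives.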

Hence, locating the R package(s) that tie to the paper could be hugely valuable and is a good first step.

Conclusion

Great YouTube video below by John Thuma, Amy with Prof. Diego Klabjan of Northwestern Univ. describing his experience using R with Aster.

https://aster-community.teradata.com/community/learn-aster/blog/2016/02/12/webinar-become-an-r-power...

If you are a Teradata customer, you can also download all the R-related code + packages + documentation (thanks, Roger Fried!) from the Aster Community Portal:

https://aster-community.teradata.com/groups/aster-client-advisory-board/content?filterID=contentstat...

If you are not familiar with R at all, I recommend taking an online course. DataCamp is a really good place to start ...