Data Science - How to use Machine Learning in your data science projects - Part 1

Teradata Employee

This is probably one of the biggest asks that everyone has in this area. You signed up for a data science project. How do you put together a conceptual solution from the building blocks, betting your time and resources? Searching Google for your problem yields hundreds of blogs and white papers that describe an ML methodology and how it solved a problem. Each white paper may also cite another 100 white papers in the area. One may wish to read all of them, skipping over the math. Even if we could comprehend a tested solution, there is the further problem of translating it into a technology that you are familiar with. Which libraries would you use if it's open source or R? You can take existing R code from there, but what was the ML library used again?

With tools like Aster, you can easily watch a YouTube video, try something out in a few minutes, and fail or succeed fast ...

Yet how does one know which algorithm(s) or methodology to start with to get decent results in a short time? Why algorithm or methodology X rather than Y? How do you go top-down on this, given a business problem?

Learning the Machine Learning (ML) Speak

Today, there is no silver bullet for a data science problem that involves Machine Learning. Even the easiest problems pose challenges in peculiar ways, and one needs the right knowledge and expertise to work around them. This is where I find most wannabe data scientists get discouraged and become gun-shy. Something worked for someone on some type of data, but I'm not sure how to map that to the business problem I have on hand. Will it even work?!

However, if one understands ML at the right level of abstraction, in terms of what it can do for you, that is probably the best starting point for figuring out what help or training is required. At a minimum, it helps one frame questions for an expert. Hopefully this blog post and subsequent posts will provide some insight ...

More on ML Speak:

While it's interesting to learn how ML works under the hood, it's more important to learn the terminology at the right level so you can actually use it for your business application. That's why learning the ML Speak is very important ...

I want to start with a few terms, and then we can map them to some problems later.

  • Distribution -> A way to explain how data is spread across time, customers, other ...
  • Model -> A mathematical structure that has information on what's inside some historical data and how they are all related to each other. An algorithm puts this together by analyzing some historical data.
  • ML Training -> The process of creating a 'model' from historical data.
  • ML Prediction -> Algorithms evaluate new data using a trained model and generate a 'score' or a 'label' that says this is a cat, a dog, etc.
  • Classification -> The process of sorting the input data into 2 or more known category buckets based on some trained model.
  • Classifier -> An Algorithm that can do Classification
  • Clustering -> Move "similar" data around into multiple groups so that the members of each group share something in common. Sometimes you say how many groups you want, sometimes you let the algorithm figure it out for you.
  • Regression -> Trying to fit a line to a bunch of numeric data.
  • False positive -> You thought your algorithm caught something during prediction, but it was incorrect.
  • False negative -> Something important that your algorithm failed to catch.
  • True positive -> You were looking for something important and the algorithm caught it. Nice catch!
  • True negative -> You didn't care about X and the algorithm didn't catch it either ...
  • Sequence -> Ordered data - logs with time-stamps, money transactions, etc. Both the content & the sequence of events could matter!
  • Bag of Words -> Un-ordered data - like a retail basket. No one cares what order you put stuff in your basket at a retail store, but the contents are important. Online baskets are different - I don't want to go there now.
  • Discriminative Models -> You train a model from data and teach it "cat" or "dog". During prediction on new data, it tells you "cat" or "dog" with a probability.
  • Generative Models -> You train a model and tell it that the videos contain 5 animals. During prediction on new data, it gives you a number from 1 to 5 for each video, because the differences it spotted taught it how to separate animal videos 5 ways.
  • Features -> Columns in your data. Age, Height, Zipcode, etc.
  • Categorical, Text, Numeric, Boolean data -> This is a no-brainer. Categorical means your column takes its values from a limited set of labels, like colors - 'red', 'blue', 'green', etc. Text data is usually sentences, paragraphs, and sometimes an entire blog post, patent, email, etc. Numeric is numeric. It can be an integer, a float, etc. Boolean -> one of two values: true or false, male or female, etc.
  • Dimension Reduction -> Reduce the number of columns that we need to process. Helps your algorithm train and predict faster, and can even give you good results!
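
To make a few of these terms concrete, here is a minimal sketch in plain Python that ties together Features, ML Training, Model, ML Prediction, Classifier, and the four true/false positive/negative outcomes. Everything in it is an assumption made up for illustration: the toy weight/height numbers, the function names train, predict and confusion_counts, and the choice of a very simple "closest average" classifier. It is not how Aster or any particular library does it.

```python
# Toy illustration only: the data and all helper names are hypothetical.
from collections import defaultdict
from math import dist


def train(rows, labels):
    """ML Training: build a 'model' (one average point per label) from historical data."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    # The model is just the average of each feature (column) per label.
    return {
        label: tuple(sum(col) / len(col) for col in zip(*members))
        for label, members in grouped.items()
    }


def predict(model, row):
    """ML Prediction: label new data with the closest average point in the model."""
    return min(model, key=lambda label: dist(model[label], row))


def confusion_counts(actual, predicted, positive):
    """Count true/false positives/negatives, treating `positive` as what we look for."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Features (columns): weight in kg, height in cm. Made-up historical data.
history = [(4.0, 25.0), (5.0, 23.0), (3.5, 24.0),
           (20.0, 55.0), (30.0, 60.0), (25.0, 58.0)]
history_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = train(history, history_labels)

# New data, including a small dog and a large cat that fool the classifier.
new_rows = [(4.5, 26.0), (28.0, 57.0), (9.0, 35.0), (12.0, 45.0)]
actual = ["cat", "dog", "dog", "cat"]

predicted = [predict(model, row) for row in new_rows]
counts = confusion_counts(actual, predicted, positive="cat")

print(predicted)  # ['cat', 'dog', 'cat', 'dog']
print(counts)     # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 1}
```

If "cat" is the thing we are looking for, the small dog labeled "cat" is the false positive and the large cat labeled "dog" is the false negative. A real classifier would be more sophisticated, but the vocabulary stays the same.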

There are 1000+ more terms & ideas built on top of these that address all the different nuances. However, I can safely say that the entire world of ML algorithms revolves around the basic terms above. If you can explain your business application in the above ML speak, you are halfway to deciding which algorithms to use in your application.

More in Part 2 of the blog (will appear in the Teradata Aster Community Portal soon) ...