Data Scientist - Machine Learning for SQL Aficionados ...

Learn Data Science
Teradata Employee
I live and breathe SQL each day and want to do machine learning - Can I do it without jumping through too many hoops ?
Can I do machine learning in SQL like 1,2,3 ?  I got anywhere between 0.5 to 100 TB of data on tables either on an RDBMS or a data lake.

Day in a life example on how to do machine learning with RandomForest algorithm (example taken from Aster Foundation Guide).

A table has specs on different cars with a price_class category as the last column. The price_class value basically puts each car in a different price bucket.

You can apply machine learning to this and build a model that learns the implicit rules that decides the price_class (last column). Note that some of the input columns are categorical in nature and some are numeric. But that's ok! The algorithm will try to figure out the rules from the inputs, in a smart way to map to the output. Here's the SQL for it:

The Random Forest model is now stored in a SQL table called "my_model".

Now given the same table as input above, let's try to predict the price_class like it's not there. We will use the learnt model "my_model". Ideally speaking the original price_class and the predicted price_class should be the same - right ?

The output looks like this:

So how many were accurate ?

Here's an inner join to match the above prediction to the original table with the price_class column.

and the output is:

We got 17 out of 27 correct

Wait - Is this really SQL ?

The SQL syntax above is a vendor extension of SQL called SQL/MR that is available with Teradata Aste...

Here's the link to learn more ...

The accuracy is kind of low. I'd like to get 95% and above:

You bet. We only have 27 rows to start with. With more data you can get good accuracy. Algorithms work well with large set of data. However even with this small data set, you can still apply techniques like dimension reduction (more SQL like invocation), variable elimination, hyper parameter tuning and trying other algorithms. Also known as feature engineering, you can stay in a SQL studio and get your accuracy high as high as 95% by doing some iterations - not covered in this blog post.