Using Aster Data's Naive Bayes functions



Naïve Bayes is a set of functions for training a classification model.  A training data set for which we know the outcome (the Predictor column) along with the input variable columns is used to generate the model.

We then run the model against a set of input variables for which we do not know the Predictor to see what the model predicts.  It’s quite similar to a Decision Tree with one big exception: Naïve Bayes assumes the input variables are independent of one another.  This is a strong assumption, but it makes the computation of the model extremely simple.

So let’s first look at a generic example to make sense of all this before we write the code.

Suppose you haven’t been feeling well so you go to the doctor and she diagnoses you with the flu.  So you take the flu test to confirm this.  For someone who really has the flu, the probability the test returns positive is 90%.  If someone doesn’t have the flu, it returns positive 9% of the time.  The test returns Positive for you.  That’s not good news.

But what is the true probability you really have the flu?  Hmmm, that’s a good question.  Maybe 90% - 9% = 81% sounds like a good guess, but actually that’s way off.  Let’s use Bayes’ rule, the idea behind Naïve Bayes, to get the real probability you have the flu.

The first thing we need is a training data set.  In other words, we need to know how common the flu is in the population.  After some research you discover that only 1% of the people in the US have the flu.  This is your base rate, and that’s what we work from.  From here, it’s easy.  We take the two possible cases (have flu, don’t have flu) and run each through the test probabilities.  Once we have these numbers, we plug them into an equation to get the true probability you have the flu.

As the graphic below shows, you only have about a 9% chance of having the flu.  That’s a lot different than the 81% chance we originally guessed.
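The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch of Bayes' rule applied to the numbers quoted above (1% base rate, 90% positive rate with the flu, 9% positive rate without it); the function and parameter names are made up for illustration and have nothing to do with the Aster functions.

```python
def posterior(prior, p_pos_given_flu, p_pos_given_no_flu):
    """Return P(flu | positive test) using Bayes' rule."""
    true_pos = p_pos_given_flu * prior                # has the flu AND tests positive
    false_pos = p_pos_given_no_flu * (1.0 - prior)    # no flu AND tests positive
    return true_pos / (true_pos + false_pos)


# Numbers from the flu example: 1% base rate, 90% and 9% positive rates.
p = posterior(prior=0.01, p_pos_given_flu=0.90, p_pos_given_no_flu=0.09)
print(f"P(flu | positive) = {p:.1%}")
```

The false positives among the 99% of healthy people far outnumber the true positives among the 1% who are sick, which is why the answer lands near 9% rather than 81%.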

So that’s the Big Picture.  Now let’s turn our attention to an example of Naïve Bayes using the 3 pre-built functions in Aster: naiveBayesMap, naiveBayesReduce, and naiveBayesPredict.

To build the model from the training data set, we will use naiveBayesReduce and naiveBayesMap together in the code.  Our known data set is the following table, named nb_samples_stolenCars.

Our Predictor column = Stolen.  The Input variables will be Year, Color, Type, and Origin.  Basically we will run the code against this data and it will create a model of which cars are candidates for being stolen based on the 4 input variables.  We can then run the model against an entirely new set of input criteria for a car that is not in the model and it will predict if it is a candidate for being stolen.

So let's get started.  Here’s the initial code using the 2 functions:

CREATE TABLE nb_stolenCars_model (PA

The first 4 lines of code are generic as is the last line using PARTITION so there’s not much to talk about there.  Let’s go over the other keywords:

  • The ON clause points to the known data in the table shown in the earlier screenshot
  • RESPONSE points to the Predictor column; in this case, the Stolen column
  • NUMERICINPUTS and CATEGORICALINPUTS point to the input variable columns (in our case, Year, Color, Type, Origin).  Note these are broken out by data type, with Year being numeric and the other 3 (Color, Type, Origin) lumped into categorical since they are text-based.

  That’s about it.  Once you run this code, you have your model as shown below.
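To give a feel for what such a model contains, here is a rough Python sketch of the kind of per-class summary a Naïve Bayes trainer typically keeps. It is not Aster's implementation: the function name, the dictionary layout, the choice to summarize numeric columns by mean and variance, and the four sample rows are all assumptions made up for illustration (the rows are not from nb_samples_stolenCars).

```python
from collections import Counter, defaultdict
from statistics import mean, pvariance


def train_naive_bayes(rows, response, numeric_inputs, categorical_inputs):
    """Summarize the training rows separately for each value of the response."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[response]].append(row)

    model = {}
    for label, members in by_class.items():
        model[label] = {
            # How many training rows fall in this class (gives the class prior).
            "count": len(members),
            # Numeric inputs: (mean, population variance) per column.
            "numeric": {
                col: (mean([r[col] for r in members]),
                      pvariance([r[col] for r in members]))
                for col in numeric_inputs
            },
            # Categorical inputs: how often each value occurs in this class.
            "categorical": {
                col: Counter(r[col] for r in members)
                for col in categorical_inputs
            },
        }
    return model


# Hypothetical rows, invented for this sketch only.
rows = [
    {"year": 2, "color": "Red", "type": "Sports", "origin": "Domestic", "stolen": True},
    {"year": 4, "color": "Red", "type": "Sports", "origin": "Imported", "stolen": True},
    {"year": 6, "color": "Yellow", "type": "SUV", "origin": "Imported", "stolen": False},
    {"year": 8, "color": "Yellow", "type": "SUV", "origin": "Domestic", "stolen": False},
]
model = train_naive_bayes(rows, "stolen", ["year"], ["color", "type", "origin"])
print(model[True]["categorical"]["color"])
```

Because each input column is summarized on its own, training is a single pass of counting and averaging, which is where the simplicity promised by the independence assumption comes from.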

At this point, you can run the naiveBayesPredict function against this model and point it to the new rows for which you wish to predict whether the car will be stolen.

Suppose you are thinking about buying a new vehicle and the car dealer says he has a special on all Red SUV Domestics between 1 and 7 years old.  You are concerned about the probability of theft, so you look at the known data (from nb_samples_stolenCars), but there's no data for those vehicles.  At this point, I would insert these 7 rows into a table named CarTypeCandidate.

insert into CarTypeCandidate values (11, 1, 'Red', 'SUV', 'Domestic');

insert into CarTypeCandidate values (12, 2, 'Red', 'SUV', 'Domestic');

insert into CarTypeCandidate values (13, 3, 'Red', 'SUV', 'Domestic');

insert into CarTypeCandidate values (14, 4, 'Red', 'SUV', 'Domestic');

insert into CarTypeCandidate values (15, 5, 'Red', 'SUV', 'Domestic');

insert into CarTypeCandidate values (16, 6, 'Red', 'SUV', 'Domestic');

insert into CarTypeCandidate values (17, 7, 'Red', 'SUV', 'Domestic');

I then point to this table and run it through the model using the naiveBayesPredict function.  Here's the code and the result set of the query:

It looks like just about every one of the Red SUV Domestics has a good chance of being stolen except for the 7-year-old model.  Note that whichever of the YES and NO values is higher determines the prediction.
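The "higher number wins" rule can be sketched in Python as follows. This assumes each class score is the log of the class prior plus the sum of the logs of the per-column conditional probabilities, which is the usual Naïve Bayes formulation; whether naiveBayesPredict computes its output exactly this way is not confirmed here. The probabilities below are hypothetical, and the numeric Year column is left out to keep the sketch short.

```python
import math


def log_score(prior, conditionals, candidate):
    """log P(class) + sum of log P(value | class), one term per input column."""
    score = math.log(prior)
    for column, value in candidate.items():
        # Independence assumption: each column contributes its own term.
        score += math.log(conditionals[column][value])
    return score


# Hypothetical probabilities, NOT taken from nb_samples_stolenCars.
model = {
    "yes": {"prior": 0.5,
            "cond": {"color": {"Red": 0.6, "Yellow": 0.4},
                     "type": {"SUV": 0.2, "Sports": 0.8}}},
    "no": {"prior": 0.5,
           "cond": {"color": {"Red": 0.4, "Yellow": 0.6},
                    "type": {"SUV": 0.6, "Sports": 0.4}}},
}

candidate = {"color": "Red", "type": "SUV"}
scores = {label: log_score(m["prior"], m["cond"], candidate)
          for label, m in model.items()}
prediction = max(scores, key=scores.get)   # the class with the higher score wins
print(scores, prediction)
```

Working in logs turns the product of many small probabilities into a sum, which avoids numeric underflow when there are many input columns.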

So there you have it.  Keep your insurance rates low by buying the 7-year-old vehicle.

In conclusion, Naïve Bayes creates a model that can then be used to predict the outcome of future observations, based on their input variables.

9 REPLIES
Teradata Employee

Re: Using Aster Data's Naive Bayes functions

Thank you, Mark, for this great explanation of Naive Bayes usage. The results are tremendous.
I've tried to reproduce the use case step by step, and apparently the naiveBayesPredict function is not included in the provided Aster Express VM (v4.6.2).

Re: Using Aster Data's Naive Bayes functions

Hi, I'm using Release: 4.6 Build: 4.6.2-r27284.
You can find install guide here: http://developer.teradata.com/aster/articles/aster-unleashed-installing-the-analytic-libraries
N/A

Re: Using Aster Data's Naive Bayes functions

Thank you, Mark, for the great stuff. I just started using Aster Express. I went through the steps and ran into some difficulty in the model generation step. I am getting the following error:

ERROR:  SQL-MR function NAIVEBAYESMAP failed: RESPONSE column must be of type varchar, int, short, long, or boolean

Here the Response column is 'Stolen', which I have defined as Varchar(3). Please help.

Thanks.

Teradata Employee

Re: Using Aster Data's Naive Bayes functions

The CREATE TABLE must have the column STOLEN defined as a Boolean data type. 

N/A

Re: Using Aster Data's Naive Bayes functions

Great! It's working!! Thanks for the quick reply.

N/A

Re: Using Aster Data's Naive Bayes functions

Hi, I am back once again. I have followed the steps in "Aster Unleashed - Installing the Analytic Libraries" at http://developer.teradata.com/aster/articles/aster-unleashed-installing-the-analytic-libraries but unfortunately the ‘naiveBayesPredict’ function is either not installed or not included in the 'Analytics_Foundation.zip' file. Is there any way that I can get the file?

Thanks.

Teradata Employee

Re: Using Aster Data's Naive Bayes functions

These would have to be obtained through Aster Engineering as I am not authorized to hand them out.

Re: Using Aster Data's Naive Bayes functions

hs734, you can just copy one of the naivebayes jar files and rename it to naivebayespredict.jar; they are exactly the same.

I've done that and it executes, but there appears to be some error with the JDBC connection in the prediction SQL-MR function.

I keep getting: SQL-MR function NAIVEBAYESPREDICT failed: Connection with the database could not be established.

The part of the jar file that is blowing up is below. I coded a simple Java function and connecting through JDBC works fine, so I'm not sure why the connection is coming back null.

  private NaiveBayesModel GrabModel() {
    NaiveBayesModel nbm = new NaiveBayesModel();

    String url = "jdbc:ncluster://" + domain_ + "/" + database_;
    Connection con = getJDBCConnection(url, userid_, password_);
    if (con == null)
    {
      throw new IllegalUsageException("Connection with the database could not be established");
    }
    try
    {
      ResultSet rs = null;
      Statement stmt = con.createStatement();
      String query = "select * from " + modelTableName_ + ";";
      rs = stmt.executeQuery(query);
      nbm.populate(rs, smoothing_);
      con.close();
    } catch (SQLException ex) {
      throw new IllegalUsageException("Could not grab model. " + ex.getMessage());
    }
    return nbm;
  }

  private static Connection getJDBCConnection(String url, String userid, String password)
  {
    Connection con = null;
    try
    {
      Class.forName("com.asterdata.ncluster.Driver");
    } catch (ClassNotFoundException e) {
      System.err.print("ClassNotFoundException: ");
      System.err.print(e.getMessage());
    }
    try
    {
      con = DriverManager.getConnection(url, userid, password);
    } catch (SQLException ex) {
      System.err.println("SQLException: " + ex.getMessage());
    }
    return con;
  }

Re: Using Aster Data's Naive Bayes functions

Thanks for the post! However, I would like to know how to determine actual class membership probability (i.e. the probability that my car is stolen given its attributes). How can I use the last table with the loglike values to do that?
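
One possible approach, sketched in Python: if the loglike values are the per-class joint log scores (log prior plus the summed log conditionals), then exponentiating them and normalizing so they sum to 1 gives the class membership probabilities. That reading of the output is an assumption and has not been checked against naiveBayesPredict; if the prior is not already included, the log prior would need to be added to each value first. The two input values below are hypothetical.

```python
import math


def class_probabilities(log_scores):
    """Turn per-class log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the largest score before exponentiating avoids underflow.
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Hypothetical loglike values for a single candidate row.
probs = class_probabilities({"yes": -12.3, "no": -13.4})
print(probs)
```

Keep in mind that these probabilities inherit the independence assumption, so they tend to be more extreme than the true probabilities even when the predicted class is right.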