Naïve Bayes is a set of functions for training a classification model. A training data set for which we know the outcome (Predictor column) for each row of input variable columns is used to generate the model.
We then run the model against a set of input variables for which we do not know the Predictor to see what the model says. It’s quite similar to a Decision Tree with one big exception: the input variables are assumed to be independent of one another. This is a strong assumption, but it makes the computation of the model extremely simple.
Suppose you haven’t been feeling well so you go to the doctor and she diagnoses you with the flu. So you take the flu test to confirm this. For someone who really has the flu, the probability the test returns positive is 90%. If someone doesn’t have the flu, it returns positive 9% of the time. The test returns Positive for you. That’s not good news.
But what is the true probability you really have the flu? Hmmm, that’s a good question. Maybe 90% - 9% = 81% sounds like a good guess. But actually that’s way off. Let’s use Naïve Bayes to get the real probability you have the flu.
The first thing we need is a training data set. In other words, we need to know the probability of having the flu in the general population. After some research you discover that only 1% of the people in the US have the flu. This is your base rate, and that’s what we work from. From here, it’s easy. We look at the input variables (have flu, don’t have flu) and run them through the probabilities. Once we get these numbers, we plug them into an equation to get the true probability you have the flu.
As the graphic below shows, you only have a 9% chance of having the flu. That’s a lot different than the 81% chance we originally guessed.
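For readers who want to check the arithmetic, here is a minimal Python sketch of the equation (Bayes' theorem) applied to the numbers above. It is not part of Aster; the function and parameter names are made up for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(flu | positive test) using Bayes' theorem."""
    # People who have the flu AND test positive.
    true_positives = sensitivity * prior
    # People who do NOT have the flu but still test positive.
    false_positives = false_positive_rate * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Numbers from the flu example: 1% base rate, 90% and 9% positive rates.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.09)
print(f"P(flu | positive) = {p:.3f}")  # prints 0.092, i.e. about 9%
```

The base rate does most of the work here: because only 1 person in 100 has the flu, the false positives from the other 99 outnumber the true positives by roughly ten to one.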
To build the model from the training data set we will be using naiveBayesReduce and naiveBayesMap together in the code. Our known data set consists of the following table named nb_samples_stolenCars.
Our Predictor column is Stolen. The input variables will be Year, Color, Type, and Origin. Basically we will run the code against this data and it will create a model of which cars are candidates for being stolen based on the 4 input variables. We can then run the model against an entirely new set of input values for a car that is not in the training data, and it will predict whether that car is a candidate for being stolen.
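To give a feel for what such a model contains, here is a rough Python sketch that counts how often each input value occurs with each outcome. This is only an illustration of the idea, not how naiveBayesMap and naiveBayesReduce are implemented. The rows are made up (they are not the contents of nb_samples_stolenCars), and Year is treated as a plain category to keep the sketch short.

```python
from collections import Counter, defaultdict

# Hypothetical rows for illustration only:
# (year, color, type, origin, stolen)
rows = [
    (2, "Red", "Sports", "Domestic", True),
    (5, "Red", "Sports", "Domestic", False),
    (3, "Yellow", "SUV", "Imported", True),
    (7, "Yellow", "SUV", "Domestic", False),
    (4, "Red", "SUV", "Imported", False),
    (1, "Red", "Sports", "Imported", True),
]
features = ["year", "color", "type", "origin"]


def train(rows):
    """Count outcomes, and count each input value per outcome."""
    class_counts = Counter()
    # Maps (feature name, outcome) to a Counter of observed values.
    value_counts = defaultdict(Counter)
    for *inputs, stolen in rows:
        class_counts[stolen] += 1
        for name, value in zip(features, inputs):
            value_counts[(name, stolen)][value] += 1
    return class_counts, value_counts


class_counts, value_counts = train(rows)
print(class_counts[True], class_counts[False])
print(value_counts[("color", True)]["Red"])
```

Because each input variable is counted on its own, independently of the others, training is a single pass over the data. That is the payoff of the independence assumption.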
So let's get started. Here’s the initial code using the 2 functions:
CREATE TABLE nb_stolenCars_model (PA
The first 4 lines of code are generic as is the last line using PARTITION so there’s not much to talk about there. Let’s go over the other keywords:
That’s about it. Once you run this code, you have your model as shown below.
Suppose you are thinking about buying a new vehicle and the car dealer says he has a special on all Red SUV Domestics between 1 and 7 years old. You are concerned about the probability of theft, so you look at the known data (from nb_samples_StolenCars), but there's no data for those vehicles. At this point, I would insert these 7 rows into a table named CarTypeCandidate.
insert into CarTypeCandidate values (11, 1, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (12, 2, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (13, 3, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (14, 4, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (15, 5, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (16, 6, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (17, 7, 'Red', 'SUV', 'Domestic');
I then point to this table and run it through the model using the Predict function as shown below:
Here's the code and the result set of the query:
It looks like just about every one of the Red SUV Domestics has a good chance of being stolen except for the 7-year-old model. Note that the prediction is whichever of the YES and NO values is higher.
So there you have it. Keep your insurance rates low and buy the 7-year-old vehicle.
In conclusion, Naïve Bayes creates a model that can then be used to predict the outcome of future observations based on their input variables.
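The "higher number wins" rule can be sketched in a few lines of Python. This sketch assumes the YES and NO values are natural-log scores that already include the class prior, which the article does not confirm. Under that assumption, normalizing the two scores also gives a class membership probability. The function name and the scores in the usage line are invented for illustration.

```python
import math


def predict(loglik_yes, loglik_no):
    """Pick the class with the higher log score and estimate P(YES)."""
    label = "YES" if loglik_yes > loglik_no else "NO"
    # Log-sum-exp: subtract the max before exponentiating so that very
    # negative scores do not underflow to zero.
    m = max(loglik_yes, loglik_no)
    log_total = m + math.log(math.exp(loglik_yes - m) + math.exp(loglik_no - m))
    p_yes = math.exp(loglik_yes - log_total)
    return label, p_yes


# Made-up scores, not values from the result set in the article.
label, p_yes = predict(-3.5, -4.6)
print(label, round(p_yes, 3))
```

If the scores do not include the prior, or use a different log base, the normalized value would not be a true posterior probability, so treat it as a rough guide.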
Thank you, Mark, for the great stuff. I just started using Aster Express. I went through the steps and ran into a problem in the model generation step. I am getting the following error:
ERROR: SQL-MR function NAIVEBAYESMAP failed: RESPONSE column must be of type varchar, int, short, long, or boolean
Here the Response column is 'Stolen', which I have defined as Varchar(3). Please help.
Thanks.
The CREATE TABLE must have the column STOLEN defined as a Boolean data type.
Great! It's working!! Thanks for the quick reply.
Hi, I am back once again. I have followed the steps in "Aster Unleashed - Installing the Analytic Libraries" from the following link: http://developer.teradata.com/aster/articles/aster-unleashed-installing-the-analytic-libraries. Unfortunately, the ‘naiveBayesPredict’ function is either not installed or not included in the 'Analytics_Foundation.zip' file. Is there any way that I can get the file?
Thanks.
These would have to be obtained through Aster Engineering as I am not authorized to hand them out.
hs734, you can just copy one of the naivebayes jar files and rename it to naivebayespredict.jar; they are exactly the same.
I've done that and it executes, but there appears to be some error with the JDBC connection in the prediction SQL-MR function.
I keep getting: SQL-MR function NAIVEBAYESPREDICT failed: Connection with the database could not be established.
The part of the jar file that is blowing up is below. I coded a simple Java function and connecting through JDBC works fine, so I'm not sure why the connection is coming back null.
    private NaiveBayesModel GrabModel() {
        NaiveBayesModel nbm = new NaiveBayesModel();
        String url = "jdbc:ncluster://" + domain_ + "/" + database_;
        Connection con = getJDBCConnection(url, userid_, password_);
        if (con == null)
        {
            throw new IllegalUsageException("Connection with the database could not be established");
        }
        try
        {
            ResultSet rs = null;
            Statement stmt = con.createStatement();
            String query = "select * from " + modelTableName_ + ";";
            rs = stmt.executeQuery(query);
            nbm.populate(rs, smoothing_);
            con.close();
        } catch (SQLException ex) {
            throw new IllegalUsageException("Could not grab model. " + ex.getMessage());
        }
        return nbm;
    }

    private static Connection getJDBCConnection(String url, String userid, String password)
    {
        Connection con = null;
        try
        {
            Class.forName("com.asterdata.ncluster.Driver");
        } catch (ClassNotFoundException e) {
            System.err.print("ClassNotFoundException: ");
            System.err.print(e.getMessage());
        }
        try
        {
            con = DriverManager.getConnection(url, userid, password);
        } catch (SQLException ex) {
            System.err.println("SQLException: " + ex.getMessage());
        }
        return con;
    }
Thanks for the post! However, I would like to know how to determine actual class membership probability (i.e. the probability that my car is stolen given its attributes). How can I use the last table with the loglike values to do that?