Regression is a method of doing analysis. Basically it helps you predict the future. Businesses use these models to help explain customer behavior which can make them more profitable.
But wait a minute, who started this whole Regression thing? The guy who invented Regression, Sir Francis Galton, was studying how the height of fathers predicted the height of their sons. He showed on average that short fathers had taller sons and tall fathers had shorter sons. He called this condition ‘regression to mediocrity’ and the term stuck.
I’ll be talking about both Linear Regression (Use X to estimate Y, using a line) and Logistic Regression (Use X to estimate probability of success of Y). After discussing the basics of how each work, we'll move on to doing some coding in Aster to solve some Use Case scenarios. Let's get started.
Linear Regression is used when want to compare one variable Y to another variable X where there’s a constant variance across X. Y is considered the dependent variable (what we’re trying to predict) and X is the independent variable (the known data and sometimes called the Predictor variable). Note that Predictor variables can be either Continuous (numeric) or Categorical (text) and multiples are allowed. Linear Regression is appropriate when Y has a linear relationship to X (this occurs when there’s a correlation between 2 variables).
To confirm a strong coefficient between 2 variables, a Scatterplot is useful as shown below. Scatterplots are used to display the relationship between 2 quantitative variables. Sure enough this looks like we can draw a straight line through the points to build a model to estimate a User’s weight based on their Height. So these 2 variables are good candidates for Linear Regression. The Y axis is Weight (what we are trying to predict based on previously known data and the X axis is Height. If the points roughly trace a line, there’s a strong correlation. If not, then you may want to investigate a different algorithm instead of Linear Regression.
Of course, we’ll first need a dataset of known X and Y variables to run our algorithm against (An Algorithm is a procedure that outputs a calculation). Once we have this output we can then run this against a dataset of known X variables that have no Y variables in an effort to predict Y. We call the known data for both X and Y as the Training dataset. We call the known data for X but unknown data for Y the Prediction dataset.
Of course there’s mathematical equation involved with finding the straight line through those points in the graph and it’s pretty complicated with lots of fancy terminology. For example, Linear Regression uses something called Least Squares Regression Line (LSRL). Its job is to attempts to find the best fit line through the Training dataset by minimizing the sum of squared differences between observed values and predicted values. It ends up determining the Y-intercept and the Slope of the regression line. Remember these terms as you’ll see this once again when we run the Aster function. Here's an example of best-fit line:
It also uses something called Coefficient of Determination to figure out the variance in the dependent variable (Y) predicted form the independent variable (X). It’s formula is r raised to the power of 2.. Here’s a quick example. You want to predict heating bills (Y) based on home size (X). The correlation between predicted bills and home size = .70. So the Coefficient of Determination is .7 raised to 2nd power which equals .49. Therefore 49% of the variability in heating bills can be explained by home size. That’s all we need to know for now for how Linear Regression gets calculated.
Logistic Regression is different than Linear in a number of ways. First off, Y is considered to be a Boolean value (ie: typically 0 or 1). Examples would include Got Job (1) or Didn’t get Job (0), Male (1) or Female (0). Another difference is Logistic will compute the base rate of X value for the NULL model. In other words, suppose we attempt to predict IQ (Y) based on number of years of education (X). Logistic Regression will compute the IQ of a person with no years of education in its calculation.
Logistic Regression is like Linear Regression in that it uses independent variable(s) X to predict for dependent Y. And they use Coefficients also. Suppose the dependent variable is : Loan. If Loan Approved, then 1. If Loan Denied, then 0. The Independent variable is ROA (return on assets). All the probabilities must fall within a range between 0 and 1. Since it is technically possible to fall outside 0-1 range we use something called Logit to force predicted Y values between 0-1. This creates the S-shaped line rather than the straight line we see in Linear Regression.
By bending the curve to make sure we don’t go above 1 or below 0, our coefficient changes as we get near these 2 edges. So unlike Linear Regression where we could say that a 1% increase in ROA increased loan probability by constant 2.2 percent (Slope), Logistic Regression says that at .5 on Y axis the likelihood of getting a loan increases by 2.75 % for each 1% increase in ROA. But near the edges (0 and 1), the probability changes by only about 1%.
One more thing about Logistic Regression. It is concerned with Odds and Probability to make its predictions. For example, suppose the Probability of getting a promotion is 33%. Then your Odds are calculated as Probability / 1 – Probability). So your Odds of getting that Promotion = 50%. Based on Odds and Probability, the algorithm then assigns a Weight to each X variable. You'll see the Weight calculation when we run the Aster function shortly.
Now that we have a basic understanding of Linear and Logistic Regression, it’s time to quit talking about it and start using these functions to answer some hypothetical queries.
Goal: We want to predict if a Car will be Stolen (Y) based on 4 variables (X).
These 4 X-variables are Year, Color, Type and Origin. We have some historical data that in our table as follows:
It’s important to stop right now and go over some caveats that are not evident at first blush. For starters, Aster’s Logistic Regression only works with a Y-column of BOOLEAN and X-columns of either BOOLEAN or REAL or INTEGER. So make sure your CREATE TABLE statement have these data types. That’s why you see a bunch of numbers and T/F values in the above table. So don’t even think of using text-based data types for your variables if you want to run Logistic Regression.
Here’s the legend so you know what the 4-X variables represent:
There will be 2 sets of code to run. The first statement is for the Training dataset of know values.
SELECT * FROM log_regression (ON (select 1) PARTITION BY 1
COLUMNNAMES('stolen', 'year', 'color', 'type', 'origin'));
It is important to note here that COLUMNNAMES must lead with the Y-variable (in our case, STOLEN), followed by all the X-variables. The INPUTTABLE points to the Training dataset table (known values) and the OUTPUTABLE is the table that will be created automatically when the code is run (you do not have to do a CREATE TABLE statement). If you wish to re-run the code however, you will have to either provide a different OUTPUT name or DROP the old one.
The Output from the code will look something like this. The one of the left is from the dataset above. The one of the right is from a different dataset where Origin was changed to reflect all Origin = Domestic as Stolen and all Origin = Foreign as Not Stolen.
As you can see for each X-variable, it assigns a Weight. A Weight closer to .5 means a small correlation. A Weight farther out from .5 means stronger correlation. Refer to the Analytics Users Guide for optional arguments and their details.
Now that we have this Output table, we can now run the Prediction dataset against to see what it says about a Cars chance of being Stolen or not. Note there is no Y-column in this dataset. We want to Predict if RED SUV DOMESTIC (which is not in the known dataset) based on any number of Years will be candidate to be stolen.
The prediction code is simple. I just point to the table above and the ‘Weigh’ table I created using the previous code.
FROM log_predict (ON LogRes_CarTypeCandidate
WEIGHTSTABLE('logres_model')) order by 1;
Here's the result set:
As you can see, Logistic Regression predicts 1-5 year old Red SUV Domestics have TRUE chance of being Stolen. However 6-7 year old cars are predicted not to be Stolen. Since I don’t want my insurance rates to be sky high, I think I’ll settle for the older cars on the dealer lot when it comes time to purchase. That's it for Logistic; time to try our hand at Linear.
Goal: Want to predict IQ of people based on their Caffeine intake. (By the way, this is fictitious data)
Obviously the Y-variable is IQ and the X-variable is Caffeine. Here’s the Training dataset of known values.
Here’s the code for the Training set:
SELECT * FROM LINREG
PARTITION BY 1);
Here are some observations. We are PARTITIONing by 1 since we need all the work to be submitted to one v-Worker. So that is why that is there. Notice I did not specify the X or the Y variables. It is assumed that the Y-variable is the last column in the Training dataset. So it’s important to make sure your CREATE TABLE is modeled correctly. Note I could have had more X-variables if I wanted, but decided to keep it simple. The Output is as follows:
0 is the Y-intercept and 1 is the Slope. As mentioned previously, the Y-intercept is the Predicted IQ of person with No caffeine. In other words, the value of Y where the line touches Y when X = 0
Slope represents for each additional unit of Caffeine, IQ will be incremented by this amount (.134).
The next step is to run Prediction on the Prediction dataset. Before we do, here’s the dataset we’ll be running against. As you can see, I know the X-variable, but don’t have a Y-variable in the table.
Unlike Logistic Regression, there is no function for Prediction. The user must write the code manually using the above 2 values from Y-intercept and Slope. The formula to predict IQ is:
Y = a + bX
where Y = value to be predicted, X =known value, a = Y-intercept and b = Slope
Here’s the code and the Ouput.
SELECT id, caffeine, 91.3 + .13 * caffeine as PredictedIQ
FROM LinearReg_Predict order by 1;
The first 2 columns are just a reiteration of what was in the Prediction dataset. The 3rd column is the predicted IQ for each id.
Notice the Predicted IQ for ID 0 who has 0 caffeine intake. This value matches the Y-intercept (which it should).
Looking over the data, there is a correlation. The more caffeine you drink, the smarter. Hmmm, I’m getting thirsty. Think I’ll have for a Mountain Dew. Maybe 2, actually.
Regression is a many splendor thing. Which one you use (Logistic or Linear) depends on a number of factors. The overriding one however is this. If I want a Boolean answer set, Logistic Regression. If have data that is not Boolean and has linear relationship, I’d go with Linear.
Please can anyone help me ?
I transformed the required columns of data into BOOLEAN type and applied the Logistic Regression Equation .Still I am receiving the following error,
“ERROR: SQL-MR function LOG_REGRESSION failed: ERROR: SQL-MR function LOG_REGRESSION_ROW failed: Response can only take Boolean values”
Under COLUMNNAMES argument, the first column entered in the list must be the Y variable, followed by the X variables. Did you put them in this order?
Yes i followed the order which you mentioned.
Please find below executed query,
select * from log_regression (ON (select 1) PARTITION BY 1
Here target is my Y variable which contains BOOLEAN data ( 0 or 1).
Here I listed the detailed view of each column data,
Y Variable :
analytics=> select distinct target from traininglogistic;
X Variables :
analytics=> select distinct cust_sex_M,cust_sex_F from traininglogistic;
cust_sex_m | cust_sex_f
0 | 1
1 | 0
analytics=> select distinct mode_01,mode_06,mode_12,mode_90 from traininglogistic;
mode_01 | mode_06 | mode_12 | mode_90
0 | 1 | 0 | 0
1 | 0 | 0 | 0
0 | 0 | 0 | 1
0 | 0 | 1 | 0
analytics=> select distinct agt1_gyohai_0,agt1_gyohai_9 from traininglogistic;
agt1_gyohai_0 | agt1_gyohai_9
0 | 1
1 | 0
analytics=> select distinct med_0,med_1,med_2,med_3,med_4,med_5,med_6,med_8,med_9,med_A from traininglogistic;
med_0 | med_1 | med_2 | med_3 | med_4 | med_5 | med_6 | med_8 | med_9 | med_a
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
analytics=> select distinct form_0,form_1,form_6,form_8 from traininglogistic;
form_0 | form_1 | form_6 | form_8
0 | 1 | 0 | 0
1 | 0 | 0 | 0
0 | 0 | 0 | 1
0 | 0 | 1 | 0
Instead of using LinReg, use GLM function
GLM will show you a number of stats (estimate, std_error, z_score, p_value and significance.
You can find syntax in the AAF User Guide.