Introduction:
In this blog post, the new step argument of the Aster GLM function is used to perform dimension reduction by feature selection.
In machine learning and statistics, dimension reduction is the process of reducing the number of variables under consideration. It can be divided into feature selection and feature extraction. Feature extraction algorithms, such as principal component analysis, create new features from functions of the original features. In contrast, feature (or variable) selection techniques remove multicollinear variables and keep the ones that are most strongly predictive of the outcome being modeled.
These methods simplify models, making them easier to interpret. In big data problems, where there are often large numbers of features, they are essential for reducing the computational load and model training times. Similarly, they are used to avoid the curse of dimensionality in domains with many features and comparatively few data points. They also improve model generalization by helping to reduce over-fitting.
The driving premise is that the data may contain many features that are either redundant or irrelevant, and that these can be disregarded without incurring much loss of information. Redundant and irrelevant are two distinct notions: a relevant feature may still be redundant in the presence of another relevant feature with which it is strongly correlated.
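To make the idea of redundancy concrete, here is a minimal Python sketch that flags pairs of features whose Pearson correlation exceeds a threshold. The function names, the 0.95 threshold, and the sample values are all illustrative assumptions; they are not taken from the Freddie Mac data or from any Aster function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def redundant_pairs(features, threshold=0.95):
    """Return feature-name pairs whose absolute correlation is >= threshold."""
    names = sorted(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) >= threshold
    ]

# Made-up values for illustration only.
sample = {
    "orig_ltv": [80, 75, 95, 60, 90],
    "orig_cltv": [80, 75, 97, 60, 95],
    "credit_score": [720, 650, 700, 780, 610],
}
print(redundant_pairs(sample))
```

In this toy sample the two loan-to-value columns move almost in lockstep, so one of them adds little information once the other is in the model, even though both are relevant to the outcome.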
One common method is to fit a regression model to every possible subset of the candidate features and select the model with the best fit. The new step argument of the Aster GLM SQL-MR function automates this kind of search.
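The subset search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Aster function is implemented. For brevity it swaps in ordinary least squares with a Gaussian AIC (up to an additive constant) in place of a logistic GLM, and the data are synthetic. All function names are hypothetical.

```python
import itertools
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def gaussian_aic(rows, y, cols):
    """AIC (up to a constant) of a least-squares fit with intercept on the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1  # intercept plus one slope per selected column
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    n = len(y)
    return n * math.log(rss / n) + 2 * p

def all_subsets(n_features):
    """Every non-empty subset of column indices 0..n_features-1."""
    return [
        cols
        for size in range(1, n_features + 1)
        for cols in itertools.combinations(range(n_features), size)
    ]

def best_subset(rows, y, n_features):
    """Fit every subset and return the one with the smallest AIC."""
    return min(all_subsets(n_features), key=lambda cols: gaussian_aic(rows, y, cols))

# Synthetic data: y depends on column 0 only; columns 1 and 2 are unrelated.
rows = [[i / 10.0, math.sin(i), math.cos(3 * i)] for i in range(40)]
y = [1.0 + 2.0 * r[0] + 0.1 * math.sin(7 * i) for i, r in enumerate(rows)]
print(best_subset(rows, y, 3))
print(len(all_subsets(10)))
```

The second print shows why automation matters: ten candidate predictors already give 1023 non-empty subsets, and the count doubles with every additional feature.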
Input Data:
The data considered here is the publicly available Freddie Mac fixed-rate 30-year single-family data set on loan origination and monthly performance. This data is often used in building, testing, and benchmarking risk and foreclosure prediction models. There are a number of numerical and categorical variables in the data set. Some of the variables can be eliminated by inspection. For example, there are a number of fields, like MSA and zip codes, that are obviously correlated. All of the numerical variables were scaled to be of the same order of magnitude. Time is expressed in years of epoch time, and initial loan balances are in units of $10,000.
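The scaling described above might look like the following Python sketch. It assumes that "epoch time" means the Unix epoch (1970-01-01 UTC) and that a year is 365.25 days; neither detail is stated in the post, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_since_epoch(year, month, day=1):
    """Convert a calendar date to fractional years since 1970-01-01 UTC."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return moment.timestamp() / SECONDS_PER_YEAR

def scale_balance(upb_dollars):
    """Express an unpaid principal balance in units of $10,000."""
    return upb_dollars / 1e4

# A first payment date of Jan 2012 and a $250,000 loan end up on similar scales.
print(years_since_epoch(2012, 1))
print(scale_balance(250000))
```

After this kind of rescaling, dates and balances both fall in the tens, which keeps the inputs on a comparable footing with ratios such as LTV and DTI.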
The training data, which was also used for the initial GLM step test, comes from the year prior to Jan 1, 2012.
Table "fm_demo"."gb_loans_glm_input"
Column             | Type
-------------------+-----------------------
loan_seq_num       | character varying(12)
first_pay_dt       | numeric
mature_dt          | numeric
zero_bal_dt        | numeric
credit_score       | numeric
orig_dti           | numeric
orig_ltv           | numeric
orig_cltv          | numeric
orig_upb           | numeric
orig_interest_rt   | numeric
firstime_buyer_flg | numeric
loan_purpose_cd    | character varying(1)
channel_cd         | character varying
ppm_flg            | character varying
product_type_cd    | character varying(5)
property_type_cd   | character varying(2)
property_state_cd  | character varying(2)
postal_cd          | character varying(5)
msa_cd             | character varying(5)
seller             | character varying(20)
servicer           | character varying(20)
occ_status_cd      | character varying
foreclose_flg      | character varying(1)
orig_loan_term     | numeric
num_borrowers      | numeric
Running the Aster GLM function with step:
For the demo, only the numerical variables have been included as input to the Aster GLM stepwise function. Below is the sample code for running the GLM function.
-- Remove any output table left over from a previous run.
drop table if exists fm_demo.gb_loans_glm_step_num;

-- Fit a logistic GLM with the step argument enabled.
-- foreclose_flag is the response; the remaining columns are the candidate predictors.
select * from glm (
    on (select 1)
    partition by 1
    inputtable('fm_demo.gb_loans_glm_input')
    outputtable('fm_demo.gb_loans_glm_step_num')
    columnnames('foreclose_flag'
               ,'credit_score'
               ,'orig_dti'
               ,'orig_ltv'
               ,'first_pay_dt'
               ,'mortgage_insr_pct'
               ,'num_units'
               ,'orig_upb'
               ,'orig_interest_rt'
               ,'orig_loan_term'
               ,'num_borrowers')
    family('logistic')
    link('logit')
    maxiternum('50')
    step('true')
);
Interpreting the Results:
The results of each fit are printed to the screen. To determine which set of variables forms the best model, we look at the Akaike information criterion (AIC), a useful measure of the relative quality of a model for a given set of data. It is a function of the number of variables fit, k, and the maximized value of the likelihood function, L:
$\mathrm{AIC} = 2k - 2\ln(L)$
Smaller values of AIC are better. Notice that, because of the 2k penalty term, the Akaike measure favors models with fewer variables when the fits are otherwise comparable.
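As a quick worked example of the trade-off, the following Python snippet evaluates the standard definition AIC = 2k - 2 ln(L) for two hypothetical models. The log-likelihood values are invented for illustration and do not come from the Freddie Mac fits.

```python
def aic(k, log_likelihood):
    """Akaike information criterion from parameter count k and maximized log-likelihood ln(L)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical models: the larger one fits slightly better but uses two more parameters.
small_model = aic(k=3, log_likelihood=-100.0)
large_model = aic(k=5, log_likelihood=-99.5)
print(small_model, large_model)
```

Here the larger model improves the log-likelihood by only 0.5, which is not enough to offset the penalty for its two extra parameters, so the smaller model has the lower AIC and would be preferred.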
Below is a snippet of the output for the run that gave the smallest AIC value, 3349.84.
GLM step Output For the Numerical Features of the Freddie Mac Loan Data Set
Conclusion:
In this blog post, we have shown an example of how to use the new step argument of the Aster GLM function as an aid for feature selection, in order to generate the most reasonable model for the Freddie Mac foreclosure data. From the table above, the best model based on the AIC value is the one that includes the following variables:
credit_score
orig_dti
orig_ltv
first_pay_dt
orig_upb
orig_interest_rt
num_borrowers
The following variables, which do not appear in this fit, can be disregarded:
orig_loan_term
mortgage_insr_pct
num_units
This will simplify the predictive models for foreclosures that we will now train (on data before 2012), test (on accounts still open at the start of 2012), and validate. The relevant variables from the feature selection can be used in a number of binary classification models, including, for example, logistic regression, support vector machines, and naïve Bayes.