Using the New Aster GLM step Argument for Model Feature Selection

Introduction:

In this blog, we demonstrate how the new step argument of the Aster GLM function can be used to perform dimension reduction by feature selection.

In machine learning and statistics, dimension reduction is the process of reducing the number of variables under consideration, and it can be divided into feature selection and feature extraction.  Feature extraction algorithms, such as principal component analysis, create new features from functions of the original features.  In contrast, feature (or variable) selection techniques remove multi-collinear variables and retain the ones that are most strongly predictive of the outcome being modeled.

These methods simplify models, making them easier to interpret. In big data problems, where there are often large numbers of features, they are essential for reducing the computational load and decreasing model training times.  Similarly, they are used to avoid the curse of dimensionality in domains where there are many features and comparatively few data points. They also improve model generalization by helping to reduce over-fitting.

The driving premise is that the data may contain many features that are either redundant or irrelevant and can be disregarded without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.

One common method is to perform a regression analysis many times, once for each possible subset of the candidate variables, and select the model with the best fit; a conceptual sketch of this brute-force approach is shown below.  The new step argument of the Aster GLM SQL-MR function allows one to do just that in-database.
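To make the idea concrete, here is a minimal Python sketch of the same exhaustive-subset approach, scored by AIC. It is an illustration only, not the Aster implementation: the pandas DataFrame df, the shortened candidate list, and a 0/1-coded foreclose_flg response are assumptions.

from itertools import combinations
import statsmodels.api as sm

response = "foreclose_flg"              # assumed coded 0/1
candidates = ["credit_score", "orig_dti", "orig_ltv", "orig_upb"]

best_aic, best_subset = float("inf"), None
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        X = sm.add_constant(df[list(subset)])    # design matrix with intercept
        fit = sm.Logit(df[response], X).fit(disp=0)
        if fit.aic < best_aic:                   # smaller AIC is better
            best_aic, best_subset = fit.aic, subset

print(best_subset, best_aic)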

Input Data:

The data considered here is from the publicly available Freddie Mac fixed-rate, 30-year, single-family data set on loan origination and monthly performance.  This data is often used in building, testing, and bench-marking risk and foreclosure prediction models.   There are a number of numerical and categorical variables in the data set. Some of the variables can be eliminated visually; for example, there are fields like MSA and zip codes that are obviously correlated.  All of the numerical variables were scaled to be of the same order of magnitude.  Time is taken as years in epoch time, and initial loan balances are in units of $10,000 (1x10^4 $).

The training data, which was also used for the initial GLM step test, was the data from the year prior to Jan 1, 2012; a hypothetical sketch of this preparation is shown below.
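As a rough illustration of this preparation, here is a hypothetical pandas sketch; the DataFrame raw and its column types are assumptions, not part of the published data set.

import pandas as pd

df = raw.copy()
# Express dates as fractional years since the Unix epoch.
seconds_per_year = 365.25 * 24 * 3600
df["first_pay_dt"] = df["first_pay_dt"].astype("int64") / 1e9 / seconds_per_year
# Express the original unpaid balance in units of $10,000.
df["orig_upb"] = df["orig_upb"] / 1e4
# Keep the training window (here, simply everything before Jan 1, 2012).
train = df[raw["first_pay_dt"] < pd.Timestamp("2012-01-01")]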

Table "fm_demo"."gb_loans_glm_input"

Column

Type

loan_seq_num

character varying(12)

first_pay_dt

numeric

mature_dt

numeric

zero_bal_dt

numeric

credit_score

numeric

orig_dti

numeric

orig_ltv

numeric

orig_cltv

numeric

orig_upb

numeric

orig_interest_rt

numeric

firstime_buyer_flg

numeric

loan_purpose_cd

character varying(1)

channel_cd

character varying

ppm_flg

character varying

product_type_cd

character varying(5)

property_type_cd

character varying(2)

property_state_cd

character varying(2)

postal_cd

character varying(5)

msa_cd

character varying(5)

seller

character varying(20)

servicer

character varying(20)

occ_status_cd

character varying

foreclose_flg

character varying(1)

orig_loan_term

numeric

num_borrowers

numeric

Running the Aster GLM function with step:

For the demo, only the numerical variables have been included as input for the Aster GLM stepwise run.  Below is sample code for running the GLM function.

drop table if exists fm_demo.gb_loans_glm_step_num;

select * from glm (
  on (select 1)
  partition by 1
  inputtable('fm_demo.gb_loans_glm_input')
  outputtable('fm_demo.gb_loans_glm_step_num')
  columnnames('foreclose_flg'   -- dependent variable, followed by the candidate predictors
              ,'credit_score'
              ,'orig_dti'
              ,'orig_ltv'
              ,'first_pay_dt'
              ,'mortgage_insr_pct'
              ,'num_units'
              ,'orig_upb'
              ,'orig_interest_rt'
              ,'orig_loan_term'
              ,'num_borrowers')
  family('logistic')
  link('logit')
  maxiternum('50')
  step('true')
);

Interpreting the Results:

The results of each fit are printed to the screen.  To determine which set of variables forms the best model, we look at the Akaike information criterion (AIC).  This is one useful measure of the relative quality of a model for a given set of data; it is a function of the number of variables fit, k, and the maximized value of the likelihood function, L:

AIC = 2k - 2 ln(L)

Smaller values of AIC are better.  Notice that, through the 2k term, the Akaike measure favors models with fewer variables.
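As a sanity check on the formula, the AIC can be computed by hand from a fitted model; this assumes the statsmodels fit object from the earlier subset-selection sketch.

k = fit.df_model + 1        # number of parameters fit, including the intercept
log_l = fit.llf             # maximized log-likelihood, ln(L)
aic = 2 * k - 2 * log_l
print(aic, fit.aic)         # the two values agree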

Below is a snippet of the output for the run that gave the smallest AIC value, 3349.84.

[Output image: GLM step output for the numerical features of the Freddie Mac loan data set]

Conclusion:

In this blog, we have shown an example of how to use the new step argument of the Aster GLM function as an aid for feature selection, in order to generate the most reasonable model for the Freddie Mac foreclosure data.  From the output above, the best model based on the AIC value is the one that includes the following variables:

credit_score

orig_dti

orig_ltv

first_pay_dt

orig_upb

orig_interest_rt    

num_borrowers

The following variables, which do not appear in this fit, can be disregarded:

orig_loan_term

mortgage_insr_pct

num_units

This will simplify the predictive models for foreclosures that we will now train (on data before 2012), test (on accounts still open at the start of 2012), and validate.  The relevant variables from the feature selection can be used in a number of binary classification models, including, for example, logistic regression, support vector machines, and naïve Bayes, as in the sketch below.
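As a closing illustration, here is a minimal scikit-learn sketch of fitting one such classifier (logistic regression) on the AIC-selected features; the train and test DataFrames and the 0/1-coded foreclose_flg label are assumptions carried over from the earlier sketches.

from sklearn.linear_model import LogisticRegression

selected = ["credit_score", "orig_dti", "orig_ltv", "first_pay_dt",
            "orig_upb", "orig_interest_rt", "num_borrowers"]

clf = LogisticRegression(max_iter=1000)
clf.fit(train[selected], train["foreclose_flg"])
print(clf.score(test[selected], test["foreclose_flg"]))   # hold-out accuracy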