Decision Trees with Teradata Aster

Decision trees are among the simplest yet most widely used supervised machine learning algorithms. The objective is to create a model that predicts the value of a target variable from a set of input variables. In general there are two categories of decision trees, depending on the type of the target variable. As seen below, the resulting model is one of the easiest forms of decision representation to interpret.

Classification Tree

If the target variable being predicted takes only a discrete set of values, the decision tree is called a classification tree. Below is a classification tree of survival conditions from the Titanic disaster dataset. In this example the target variable takes two values: survived and died. The hierarchical structure defines how the variables interact with one another in the decision-making process. Each internal node of the tree is defined by a criterion on which a decision is made. The leaf nodes (terminal nodes) of the tree hold the predicted values of the target variable. Sadly, the chances of survival are low (61% died) for male passengers older than 9.5 years.

[Figure: CART_tree_titanic_survivors.png — classification tree of Titanic survivors]

Regression Tree

On the other hand, if the target variable being predicted takes continuous values (for example, a passenger's age or fare rather than survival), the decision tree is called a regression tree.

Time for Hands-On Data Science

Now let's get started with some hands-on work using the standard iris dataset (attached). Use the script below to create the table, and then use ncluster_loader to load the data into it.

Create Script

create table demo.iris_table
(
    sno integer,
    sepal_length double,
    sepal_width double,
    petal_length double,
    petal_width double,
    species varchar
)
distribute by replication;
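
A side note on the distribute clause: replication copies the full table to every worker, which is fine for a tiny table like iris. For larger tables you would normally hash-distribute on a key instead; a hypothetical variant (the table name and choice of sno as the distribution key are illustrative):

create table demo.iris_table_hashed
(
    sno integer,
    sepal_length double,
    sepal_width double,
    petal_length double,
    petal_width double,
    species varchar
)
distribute by hash(sno);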

Loading the Data

ncluster_loader -h 192.168.100.220 -U db_superuser -w db_superuser -c demo.iris_table iris_data.csv
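
Once the load completes, a quick sanity check is worthwhile; the standard iris dataset has 150 rows:

-- Verify the load; the standard iris dataset has 150 rows
select count(*) from demo.iris_table;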

Problem Definition

We will build a classification tree to predict the value of the target variable "species" using the input variables sepal_length, sepal_width, petal_length and petal_width. The target variable takes three values: 'setosa', 'versicolor' and 'virginica'.
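
A quick look at the class distribution confirms the three target values; the standard iris dataset has 50 observations per species:

-- Check the distribution of the target variable
select species, count(*) as cnt
from demo.iris_table
group by species
order by species;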


Data Preparation

We will use the single decision tree function (Single_Tree_Drive), which requires an attribute table and a response table to be set up.

-- Creating the attribute table
create table demo.iris_attr_table
(
    pid integer,
    attribute varchar,
    attrvalue numeric
) distribute by replication;

-- Creating the response table
create table demo.iris_response_table
(
    pid integer,
    response varchar
) distribute by replication;

Now that we have the two required tables in place, let's transpose the original data and populate them.

--Inserting petal length
insert into demo.iris_attr_table
select sno as pid, 'petal_length' as attribute, petal_length as attrvalue
from demo.iris_table;

--Inserting petal width
insert into demo.iris_attr_table
select sno as pid, 'petal_width' as attribute, petal_width as attrvalue
from demo.iris_table;

--Inserting sepal length
insert into demo.iris_attr_table
select sno as pid, 'sepal_length' as attribute, sepal_length as attrvalue
from demo.iris_table;

--Inserting sepal width
insert into demo.iris_attr_table
select sno as pid, 'sepal_width' as attribute, sepal_width as attrvalue
from demo.iris_table;

--Inserting into the response table
insert into demo.iris_response_table
select sno as pid, species as response
from demo.iris_table;

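Before running the model, it is worth verifying the transpose: the attribute table should hold one row per observation per attribute, and the response table one row per observation:

-- Expect 600 rows (150 observations x 4 attributes)
select count(*) from demo.iris_attr_table;

-- Expect 150 rows (one response per observation)
select count(*) from demo.iris_response_table;
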
Single Tree Drive

The Single_Tree_Drive function requires the AttributeTableName and ResponseTableName parameters to be specified; we will use the demo.iris_attr_table and demo.iris_response_table tables and store the result in demo.iris_output.

SELECT * FROM Single_Tree_Drive (
    ON (SELECT 1) PARTITION BY 1
    AttributeTableName ('demo.iris_attr_table')
    ResponseTableName ('demo.iris_response_table')
    OutputTableName ('demo.iris_output')
    Max_Depth (10)
    ImpurityMeasurement ('gini')
    AttributeNameColumns ('attribute')
    AttributeValueColumn ('attrvalue')
    ResponseColumn ('response')
    IDColumns ('pid')
);
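
When the function completes, the learned splits can be read back from the output table (its column names appear in the results below):

-- Inspect the fitted tree; each row describes one split node
select * from demo.iris_output;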

Examining Results

The output table reveals that the petal_width variable is the major deciding factor in identifying the species of the flower.

node_size  node_label  attribute    split_value  left_label  right_label  left_size  right_size
150        setosa      petal_width  0.6          setosa      versicolor   49         101
101        versicolor  petal_width  1.7          versicolor  virginica    53         48

Let us visualize the results in the form of a hierarchical tree. The split_value column in the result indicates the threshold value of each decision; here the values are 0.6 and 1.7 respectively.

[Figure: decision_tree_visual.png — the fitted decision tree, with splits on petal_width at 0.6 and 1.7]
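
To make the splits concrete, the fitted rules can be replayed by hand as a CASE expression over the original table; a minimal sketch using the split values 0.6 and 1.7 from the output (whether a value equal to a threshold goes left or right depends on the function's convention):

-- Replay the tree's two splits as hand-written rules
select sno,
       species,
       case
           when petal_width < 0.6 then 'setosa'
           when petal_width < 1.7 then 'versicolor'
           else 'virginica'
       end as predicted_species
from demo.iris_table;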

Does it make sense?

Let us take a closer look at the data to see whether the split values from the results are really accurate. Below is a boxplot with the species variable on the x-axis and the petal_width variable on the y-axis. It is clear from the plot that setosa flowers always have a petal_width of at most 0.6.

[Figure: decision_tree_r_visual.png — boxplot of petal_width by species]
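
The same check can be made in SQL by looking at the per-species range of petal_width:

-- Per-species range of petal_width; setosa should sit at or below the 0.6 split
select species,
       min(petal_width) as min_petal_width,
       max(petal_width) as max_petal_width
from demo.iris_table
group by species
order by species;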

We can conclude that the fitted decision tree found the right split values, which make it possible to identify a flower's species from its petal_width alone. Hope you had fun learning decision trees with Teradata Aster.