single_tree_drive

Aster
Teradata Aster is an analytic platform that embeds MapReduce analytic processing with data stores. •Embedded MapReduce analytic processing and unique SQL-MapReduce® framework •Massively parallel data store for multistructured data •Intuitive tools and SQL-MapReduce libraries for rapid analytic development
Highlighted
Enthusiast

single_tree_drive

Hi,

I'm searching to build a decision tree with a binary target variable with both numerical and categorical input variables using the single_tree_predict function but I have a problem with the sintax that I have to use.

 

I didn't undestand how I can manage categorical variables.

 

Someone could help me?

 

 

7 REPLIES
Enthusiast

Re: single_tree_drive

Sorry,

I'm using single_tree_drive... The single_tree_predict will be the next step...

 

Thanks in advance,

Marco.

Teradata Employee

Re: single_tree_drive

 

Marco, 

 

To include categorical values you need to 

  1. Have one attributes table that holds numeric and categorical values, this means you will need to cast the numbers to varchar so it can all live in the same column
  2. Have a table that lists all columns that are categories, all other values will be cast back to numeric

 

See the below example that trains and predicts.

Hope this helps!

Michelle

 

 

/*Input data set, fake data*/
drop table if exists mt186048.single_tree_input_raw;
create table mt186048.single_tree_input_raw
	distribute by hash(id) compress low as 
select 
	generate_series as id
	,random() as var_decimal
	,(random() * 10)::int as var_integer
	,case when random() < 0.5 then 'A' else 'B' end as var_category
	,case when random() < 0.3 then 1 else 0 end as response
from generate_series(1,20);

select * from mt186048.single_tree_input_raw limit 5;
--id	var_decimal		var_integer	var_category	response
--4	0.0268494836054742	2		A		0
--12	0.328261905815452	8		B		0
--8	0.620423850603402	1		A		0
--16	0.512540435884148	1		B		0
--13	0.277574931271374	5		B		0

/*Need the value column to be varchar to hold numeric and category values*/
drop table if exists mt186048.single_tree_inputs;
create table mt186048.single_tree_inputs distribute by hash(id) compress low as 
select id, 'var_category' as attribute, var_category as value from mt186048.single_tree_input_raw
union all
select id, 'var_decimal', var_decimal::varchar as value from mt186048.single_tree_input_raw
union all
select id, 'var_integer', var_integer::varchar from mt186048.single_tree_input_raw;

select * from mt186048.single_tree_input_numeric limit 5;
--id	attribute	value
--1	var_decimal	0.847423478960991
--9	var_decimal	0.130329802632332
--17	var_decimal	0.178869534283876
--1	var_integer	1.0
--9	var_integer	2.0

/*list of all attributes that are categories*/
drop table if exists mt186048.single_tree_input_category;
create table mt186048.single_tree_input_category
	distribute by replication compress low as 
select 'var_category'::varchar as attribute;

select * from mt186048.single_tree_input_category limit 5;
--attribute
--var_category

create table mt186048.single_tree_input_response
	distribute by hash(id) compress low as 
select id, response from mt186048.single_tree_input_raw;

select * from mt186048.single_tree_input_response limit 5;
--id	response
--1	1
--9	0
--17	1
--20	0
--4	1

drop table if exists mt186048.single_tree_model;

select * 
from single_tree_drive ( 
	ON  (select 1)  
	PARTITION BY 1
	
	IDColumns('id') 
	attributeTableName('mt186048.single_tree_inputs')
	attributeNameColumns('attribute') 
	attributeValueColumn('value') 
	responseTableName('mt186048.single_tree_input_response') 
	responseColumn('response') 
	outputTableName('mt186048.single_tree_model') 

	categoricalAttributeTableName('mt186048.single_tree_input_category')

	minNodeSize('3')
	max_depth('10') 
	
	impurityMeasurement('Gini')
)

--message
--Input tables:"mt186048"."single_tree_input_numeric", "mt186048"."single_tree_input_category", "mt186048"."single_tree_input_response"
--Output model table: "mt186048"."single_tree_model"
--Depth of the tree is:4

select * from mt186048.single_tree_model;

select 
	pred.id, pred_label, response
	,case when pred_label = response::varchar then 1 else 0 end as correct
from single_tree_predict( 
	ON mt186048.single_tree_inputs as attribute_table 
	partition by id
	ON mt186048.single_tree_model as model_table DIMENSION 
	
	AttrTable_GroupbyColumns('attribute')
		
	AttrTable_pidColumns('id') 
	ModelTable_attrColumns('attribute') 
	AttrTable_valColumn('value') 
	
	ModelTable_nodeColumn('node_id') 
	ModelTable_sizeColumn('node_size') 
	ModelTable_leftSizeColumn('left_size') 
	ModelTable_rightSizeColumn('right_size') 
	
	ModelTable_splitColumn('split_value') 
	ModelTable_labelColumn('node_label') 
	ModelTable_leftLabelColumn('left_label') 
	ModelTable_rightLabelColumn('right_label')
	ModelTable_leftBucketColumn('left_bucket')
	ModelTable_rightBucketColumn('right_bucket')
) as pred
join mt186048.single_tree_input_response as expected
	on pred.id = expected.id;
	
--suspiciously good for fake data :) 	
--id	pred_label	response correct
--1		1	1	1
--2		0	0	1
--3		1	1	1
--4		1	1	1
--5		1	1	1
--...

 

Enthusiast

Re: single_tree_drive

Ok,

one more question.

 

I don't understand when you created the table 

mt186048.single_tree_input_numeric

and also it seams that you dont't use the table

mt186048.single_tree_inputs

that holds both numeric and categorical in the model but only the first one.

 

Thanks in advance

Enthusiast

Re: single_tree_drive

I builded the same tree that you made starting with

mt186048.single_tree_input_numeric

 But trying to run the single_tree_predict(, I have the error below:

STATEMENT 1: Select Statement failed.  Failed [34 : HY000] [AsterData][ASTERJDBCDSII](34) ERROR: SQL-MR function SINGLE_TREE_PREDICT does not use argument clauses: MODELTABLE_ATTRCOLUMNS, MODELTABLE_LABELCOLUMN, MODELTABLE_LEFTBUCKETCOLUMN, MODELTABLE_LEFTLABELCOLUMN, MODELTABLE_LEFTSIZECOLUMN, MODELTABLE_NODECOLUMN, MODELTABLE_RIGHTBUCKETCOLUMN, MODELTABLE_RIGHTLABELCOLUMN, MODELTABLE_RIGHTSIZECOLUMN, MODELTABLE_SIZECOLUMN, MODELTABLE_SPLITCOLUMN ()

Teradata Employee

Re: single_tree_drive

You're correct, there was a typo from previous testing. I've updated the above code to show the appropriate attributes table. 

 

This was built in Aster Analytics version 6.21. You're error message implies you are running a different version. You can get the appropriate parameter names to use by running the below code or looking in the right documentation version.

 

select funchelp_usagesyntax
from nc_all_functions
where funcname = 'single_tree_drive'
Enthusiast

Re: single_tree_drive

Ok,

I'm running the 6.20.1.01 version.

I tried to test the script you wrote, but I obtained the below error:

 

Executed as Single statement.  Failed [34 : HY000] [AsterData][ASTERJDBCDSII](34) ERROR: permission denied for relation "nc_system"."nc_all_functions" for user "S540157" ()
Elapsed time = 00:00:00.117
 
STATEMENT 1: Select Statement failed.

 

Could you post me the right parameters for that version?

 

 

Enthusiast

Re: single_tree_drive

I'm wrong or the same typo you corrected in the single_tree_drive script is present also in the tree_predict script? It's possible to run a tree_predict using a tree model based on categorical attributes?

Thanks,
Marco