Column order in table definition when collecting multi-column statistics

Database
Enthusiast

Column order in table definition when collecting multi-column statistics

I know that the first 16 bytes of a column(s) are what is used in the histogram when collecting stats on a table. I also know that when collecting stats on 2 columns, and the first column is more than 16 bytes, it will not present an accurate picture of the true demographics. I would like to collect stats on the following hypothetical columns of a table defined in this order:

ColA char(16)
ColB char(5)

I would issue the following statement to make sure the small column was in front of the big column to more accurately reflect the demographics as a result of the first 16 byte restriction:

Collect stats on tablea column (ColB,ColA);

I have read in a third party book entitled "Teradata Toolbox" that when collecting multi-column stats, that the order of collection is determined based on the order they are defined in the table. They recommend that larger columns be defined toward the end of the column list in the CREATE TABLE statement (Chapter 9, pg. 126). If that is true, in the statement issued above, the column order would be reversed based on the order in which the columns are defined in the table. In this case ColA is defined before ColB and would therefore be the lead column when collecting the stats, producing a misleading histogram. Could I get some clarification from other members in this forum on how multi-column stats are collected and how the order of the columns in the CREATE TABLE list is relevant? We are in the process of defining a new set of EDW tables and the architects would rather not shuffle the definitions unless absolutely necessary.

Thanks!
2 REPLIES
Enthusiast

Re: Column order in table definition when collecting multi-column statistics

I received an official response from Teradata regarding my question concerning column order in a table and multi-column stats collection. For those that care, here is the response:

"The system ignores user-specified column ordering for multicolumn PARTITION statistics.
This is consistent with non-PARTITION multicolumn statistics and multicolumn indexes.
Instead, the columns are ordered based on their internal field id. Because the system-derived
PARTITION column has field id value of 0, it always takes the first position in multicolumn
statistics.
Fan

Re: Column order in table definition when collecting multi-column statistics

I have a large table with a 2-column NUPI and 2 levels of MLPPI.  The data types and cardinalities are as follows:

NUPI_col_1 = integer, ~100 distinct values

NUPI_col_2 = varchar(20), ~1 billion distinct values

PPI_L1_col = byteint, 4 distinct values

PPI_L2_col = date, several thousand of distinct values

Normally, these are the stats I would collect:

COLLECT STATS tablename COLUMN ( PARTITION )

COLLECT STATS tablename COLUMN ( NUPI_col_1 )

COLLECT STATS tablename COLUMN ( NUPI_col_1 , NUPI_col_2 )

COLLECT STATS tablename COLUMN ( PPI_L1_col )

COLLECT STATS tablename COLUMN ( PPI_L2_col )

COLLECT STATS tablename COLUMN ( PPI_L1_col, PPI_L2_col )

Now, DIAGNOSTIC HELPSTATS also recommends:

COLLECT STATISTICS tablename COLUMN ( PARTITION , NUPI_col_1 , NUPI_col_2 )

COLLECT STATISTICS tablename COLUMN ( PARTITION , NUPI_col_1 , NUPI_col_2 , PPI_L1_col, PPI_L2_col );

Question: Are these additional stats necessary or are they redundant?