Values with NULLs and Intervals in the Statistics Histogram

Blog
The best minds from Teradata, our partners, and customers blog about relevant topics and features.
Teradata Employee

I came across an interesting behavior concerning statistics histograms that has to do with NULL values. This posting explains and illustrates this behavior. If you observe this phenomena at your site, know that things are working as designed.

Consider this table, with these columns defined:

 

CREATE MULTISET TABLE TEST_TBL

( COL1 VARCHAR(10),

COL2 INTEGER,

COL3 VARCHAR(10)

) PRIMARY INDEX (COL1)

 

Assume these four rows are inserted into this table, each which contains a NULL value in COL3: 

 

INSERT INTO TEST_TBL VALUES ('ABC', 10, NULL);

INSERT INTO TEST_TBL VALUES ('DEF', 10, NULL);

 

INSERT INTO TEST_TBL VALUES ('GHI', 10, NULL);

INSERT INTO TEST_TBL VALUES ('JKL', 10, NULL);

 

In all cases, COL3 contains NULLs. Now collect a multicolumn statistic on all three columns:

 

COLLECT STATS COLUMN (COL1, COL2, COL3) ON TEST_TBL;

 

Now look at the contents of the histogram:

 

SHOW STATS VALUES COLUMN (COL1, COL2, COL3) ON TEST_TBL; 

 

/** SummaryInfo **/

 /* Data Type and Length: 'CV:10', 'I:4', 'CV:10' */

/* TimeStamp             */ TIMESTAMP '2018-07-06 07:12:41-00:00',

 /* Version               */ 6,

 /* OriginalVersion       */ 6,

 /* DBSVersion            */ '15.10.00.07',

 /* UsageType              */ 'D',

 /* ComplexStatInfo       */ 'ComplexStatInfo',

 /* NumOfBiasedValues     */ 0,     

 /* NumOfEHIntervals      */ 0,     

 /* NumOfHistoryRecords   */ 0,

 /* SamplePercent         */ 0.00,

/* NumOfNulls           */ 4,     

/* NumOfAllNulls         */ 0,        

/* NumOfPartialNullVals */ 4,     

/* PartialNullHMF        */ 1,

 /* AvgAmpRPV             */ 0.000000,

/* MinVal               */ 'Truncated’, 0, ‘Truncated’,            

/* MaxVal               */ 'Truncated’, 0, ‘Truncated’,     

/* ModeVal               */ 'Truncated’, 0, ‘Truncated’,      

/* HighModeFreq         */ 0,                             

/* NumOfDistinctVals     */ 0,                             

 /* NumOfRows             */ 4,                             

 /* CPUUsage              */ 0.000000,

 /* IOUsage               */ 0.000000,

 /* Reserved              */ 0,

 /* Reserved              */ 0,

 /* Reserved              */ 0.000000,

 /* Reserved              */ 0.000000,

 /* Reserved              */ 'T0000D00000S00000',

 /* StatsSkipCount        */ 0,

 /* SysInsertCnt          */ 0,

 /* SysDeleteCnt          */ 0,

 /* SysUpdateCnt          */ 0,

 /* SysInsDelLastResetTS  */ TIMESTAMP '9999-12-31 23:59:59-00:00',

 /* SysUpdLastResetTS     */ TIMESTAMP '9999-12-31 23:59:59-00:00',

 /* IsSampleFollowingTrend*/ 0

);

 

Here are a few things to note about this histogram output, paying particular attention to the fields that are in bold:

 

  • It correctly identifies the number of rows that contain NULLs (NumOfNulls = 4) and the number of rows that contain partial NULLs (NumOfPartialNullVals = 4)
  • However, no standard intervals or bias intervals get created for this statistic (NumOfBiasedValues  = 0, NumOfEHIntervals = 0)            
  • Since there are no histogram intervals, there are no MinVal, MaxVal, or ModeVal values listed. Min/Max/Mode value fields all report ('Truncated’, 0, ‘Truncated’) for their values.
  • While the number of distinct values is reported as zero (NumOfDistinctVals = 0), the number of rows is correctly identified as four (NumOfRows = 4)

Although it may seem odd, this is expected behavior when a histogram is created and all rows in the table contain one or more columns in the statistic that are NULL. Rows which have NULL on any target column are not considered for placement in a histogram interval, either in the standard or biased intervals. The actual value of a multicolumn statistic that includes NULLs is unknown, so statistics collection code cannot know which interval to place that particular value in.

 

Even though there is an attempt when building the histogram to identify the Min/Max/Mode values, all rows are eliminated from consideration as a min, max or a mode value in this case because they all contain one or more NULLs. The presence of NULLs will make it impossible to identify a max, min or a mode value, since NULLs are unknown values.

 

Let’s consider another example where the rows that have been inserted into a table carry these values across four columns:

 

Table 1 has these four columns (C1, C2, C3, C4). There are eight rows inserted into T1 with the following values:

 

R1: a,b,c,?

R2: d,e,?,?

R3: f,?,g,?

R4: ?,h,I,?

R5: j,?,?,?

R6: ?,k,?,?

R7: ?,?,l,?

R8: ?,?,?,?

 

Assume that the following statistics have been collected against T1:

 

COLLECT STATS COLUMN(C1,C2) ON t1;

Result: Only R1, R2 are considered for Min/Max/Mod.

 

COLLECT STATS COLUMN(C1,C2,C3) ON t1;

Result: Only R1 is considered for Min/Max/Mode.

 

COLLECT STATS COLUMN(C1,C2,C3,C4) ON t1;

Result: No row is considered for Min/Max/Mode, so that there will no min/max/mode values.

 

For the four-column case, in the output of SHOW STATS VALUES, Min/Max/Mod will show “empty value symbols”. Empty symbol values are 0 for numeric data types, ‘truncated’ for CHAR/VARCHAR data types, ‘9999-12-31’ for date type, and so on. In this context ‘truncated’ does not mean a truncation has taken place, Rather, it indicates no value qualifies.

 

If NULLs only appear in some of the multicolumn statistic, the values that do not contain NULLs will appear in one or the other of the intervals. The values with NULL in one or more columns will be ignored.

 

The important thing to remember if you see empty value symbols being used is that NULL values are unknowable, so this may result in things that look odd in the histogram.

2 Comments
Tourist

Hi Carrie,

 

The above scenario of showing of Truncated for Null values will not cause Optimizer to generate a Bad plan if we have Multicolumn statistics present against those Null Combinations?

 

Regards,

Shivraj

Teradata Employee

Shivraj,

 

I would not expect those stats to have a negative impact on query plans. Better not to drop multicolumn stats solely for the reason that you are seeing truncation in the value columns, as some of the other information can still be useful to the optimizer.

 

If you are in doubt, or if you believe the statistic is causing bad plans, please open an incident with the support center.

 

Thanks, -Carrie