Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Teradata Employee

Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

This article delves into the potential excessive use of Multi-Column (MC) Statistics and a process for identifying those that are not providing benefit and should be removed. This will reduce complexity and save resources on the re-collection of these redundant statistics.

Statistics and indexes are one (if not the most) discussed topics in the Teradata database tuning world. How many to add, of what type, recollection frequency, etc. In this article, I want to focus specifically on the potential for removal of MC statistics. You can view this article as a continuation of the multi-column statistics discussion started in Carrie’s blog on the dropping of Multi-Column Statistics, dated 20 July 2009. Here is the URL, and for those that have not read it, I would go read it first, as she does an excellent job of providing background, and then come back to this article: http://developer.teradata.com/blog/carrie/2009/07/should-you-drop-all-multicolumn-statistics-that-ar...

For a specific engagement, we found ourselves looking at a database where developers had “over-embraced” the addition of MC statistics on tables, to the point where in some cases the recollection of the statistics took an inordinate percentage of the overall resources. So our goal was to try to identify those MC statistics that were potentially not adding any benefit and could be removed.

Based on the working hypothesis that any MC combination whose first column, or combined length exceeded the 16 byte limit would probably not be giving additional benefit by the addition of other columns, we took at look at MC statistic combinations that also had statistics collected on the first column of the MC statistics combination.

In looking at the help stats command for the below table, you will see an interesting pattern in the three scenarios I am going to discuss. Notice, that for those MC statistic combinations starting with attribute_a or attribute_b the unique values are the same across all combinations, which was what we expected. However, this trend is not confirmed by the combinations starting with attribute_c, which was a surprise. 

HELP STATISTICS TARGETDB.TARGETTABLE;

Date Time Unique Values Column Names
======== ======== ==================== =============================================================================

09/10/13 13:14:41 1,509,232 ATTRIBUTE_A
09/10/13 13:13:20 1,509,232 ATTRIBUTE_A,COLUMN_A
09/10/13 13:13:44 1,509,232 ATTRIBUTE_A,ATTRIBUTE_C,COLUMN_A,COLUMN_C,COLUMN_D,COLUMN_E,COLUMN_F,COLUMN_G
09/10/13 13:14:19 1,509,232 ATTRIBUTE_A,COLUMN_I
09/10/13 13:13:27 1,509,232 ATTRIBUTE_A,COLUMN_H
09/10/13 13:13:50 1,509,232 ATTRIBUTE_A,COLUMN_C

09/10/13 13:14:13 1,509,232 ATTRIBUTE_B
09/10/13 13:15:12 1,509,232 ATTRIBUTE_B,ATTRIBUTE_A
09/10/13 13:14:22 1,509,232 ATTRIBUTE_B,COLUMN_B
09/10/13 13:14:28 1,509,232 ATTRIBUTE_B,ATTRIBUTE_C
09/10/13 13:14:59 1,509,232 ATTRIBUTE_B,ATTRIBUTE_C,COLUMN_A

09/10/13 13:14:52 1,370 ATTRIBUTE_C
09/10/13 13:15:14 95,661 ATTRIBUTE_C,COLUMN_J
09/10/13 13:14:36 1,373 ATTRIBUTE_C,COLUMN_B
09/10/13 13:14:54 77,343 ATTRIBUTE_C,COLUMN_A
09/10/13 13:15:16 77,434 ATTRIBUTE_C,COLUMN_A,COLUMN_C
09/10/13 13:14:34 1,628 ATTRIBUTE_C,COLUMN_I
09/10/13 13:15:22 1,527 ATTRIBUTE_C,COLUMN_H
09/10/13 13:13:09 1,428 ATTRIBUTE_C,COLUMN_C

To confirm the accuracy of the Unique Values shown in the help statistics, we performed an analysis consisting of the following steps:

1) Ran an explain on a query that consists of a “select (distinct(columnlist)) from table” to see how close the explain estimates matched the unique values, and what confidence was given.

 2) Ran the actual query, to compare the actual results to the explain estimates

3) We then removed the multi-column statistics, leaving only the statistics on the first column of the MC statistics group, and repeated steps 1 and 2

Here are the before removal results from steps 1 & 2. The columns are:

  • HELP STATS UNIQUE VALUES (same as from the HELP STATS)
  • EXPLAIN are the values coming from running the explain plan on the “select (distinct(columnlist)) from table” combinations
  • ACTUAL are the numbers coming from running the actual query
  • CONF - the explain confidence from the explain

As you can see the unique values are supported by the explains, which all had HIGH confidence and are also supported by the actual run of the queries.

Next, we removed the statistics from all of the MC Combinations. You will notice that for the first two scenarios, everything remained the same: the unique values are supported by the explains and actual runs, and had HIGH confidence.

However, for the third scenario, the explains all dropped to LOW confidence, and the explain numbers and actual run numbers deviated greatly from the first column values.

 

What do you think would cause scenario three to be different? Well, if you read Carrie’s Blog carefully, in the third paragraph she mentions differentiation in the first 16 bytes. This is key, because, it is not necessarily the length of the columns concatenated together that matters (in many of scenario #3’s column combinations the first two fields exceed 16 bytes) but rather the length of the data content of those columns concatenated together.

So, what we see happening in scenario #3 is that in all of the MC combination cases, enough of the data from the first two columns fits into the 16 bytes to greatly differentiate the “values”, and hence give us different demographics. This told us that our working hypothesis did not apply to all MC Statistic situations, and that you could remove MC Statistics where the MC unique values equaled the first columns unique values, but you should not necessarily remove MC Statistics where the unique values differed from the unique value of the first column, as that would lead to less confidence. This gave us a working rule set, depicted below:

 

And so, armed with this set of rules we developed a query that will look for this situation, and generate DROP STATISTICS statements for all MC Statistic combinations that fit the REMOVE criteria.  The query uses a version of Dieter Noeth's Stats_Info view (if you do not already have this, I'm sure a web search will turn it up: try http://www.teradataforum.com/attach.htm, or ask your friendly on-site Teradata PS Consultant).

Here is the query, with sample results for this table. If you want to see the intermediate results, simple run the DT select part. Give it a run and see what turns up. Let me and the DevX community know what your results are, especially if you find any substantial savings. We all want to hear about your success. 

SELECT 'DROP STATISTICS ON ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) ||
' COLUMN (' || TRIM(ColumnList) || ');' REMOVE_MC_STATS
FROM
(

SELECT SINGLE.DatabaseName
,SINGLE.TableName
,SINGLE.ColumnName (CHAR(30))
,SINGLE.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') FirstColumnValues

,CASE WHEN FirstColumnValues = MultiColumnValues
THEN 'REMOVE' ELSE 'KEEP ' END Recomendation

,MULTI.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') MultiColumnValues
,MULTI.ColumnName ColumnList

FROM
(
SELECT DatabaseName
,TableName
,ColumnName (CHAR(30))
,StatsType
,CollectDate
,SampleSize
,NumOfRows
,NumOfValues

FROM TOOLSDB.Stats_Info

WHERE statstype = 'C'
AND NumOfValues > 0
) SINGLE,

(
SELECT DatabaseName
,TableName
,ColumnName (CHAR(120))
,StatsType
,CollectDate
,SampleSize
,NumOfRows
,NumOfValues

FROM TOOLSDB.Stats_Info

WHERE statstype = 'M'
AND NumOfValues > 0
AND COLUMNNAME <> 'PARTITION'
) MULTI

WHERE SINGLE.DATABASENAME = MULTI.DATABASENAME
AND SINGLE.TABLENAME = MULTI.TABLENAME
AND SINGLE.COLUMNNAME =
SUBSTR(MULTI.COLUMNNAME,1,POSITION(',' IN MULTI.COLUMNNAME) - 1)

AND RECOMENDATION = 'REMOVE'

AND SINGLE.DATABASENAME = 'TARGETDB'

-- ORDER BY 1,2,3

) DT

ORDER BY 1
;

REMOVE_MC_STATS
=============================================================================================================

DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_A);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,ATTRIBUTE_C,COLUMN_A,COLUMN_C,COLUMN_D,COLUMN_E);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_I);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_H);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_C);

DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_A);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,COLUMN_B);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_C);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_C,COLUMN_A);

And now for extra credit (or more homework, however you want to look at it :) the following query looks for tables that have MC Statistic combinations, but do NOT have statistics on the first column of the MC Statistics. You may find that you can add the single column statistics on the first column, get good results and be able to remove the MC Statistics. 

SELECT 'COLLECT STATISTICS ON ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) ||
' COLUMN (' || TRIM(SUBSTR(ColumnList,1,POSITION(',' IN ColumnList) - 1)) || ');' ADD_SINGLE_STATS
FROM
(
SELECT MULTI.DatabaseName
,MULTI.TableName
,SINGLE.ColumnName (CHAR(30))
,CASE WHEN MULTI.NumOfRows = MULTI.NumOfValues
THEN 'ADD SINGLE COLUMN' ELSE ' ' END Recomendation
,MULTI.NumOfRows
,MULTI.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') MultiColumnValues
,MULTI.ColumnName ColumnList
FROM
(
SELECT DatabaseName
,TableName
,ColumnName (CHAR(120))
,StatsType
,CollectDate
,SampleSize
,NumOfRows
,NumOfValues

FROM TOOLSDB.Stats_Info

WHERE statstype = 'M'
AND NumOfValues > 0
AND COLUMNNAME NOT LIKE 'PARTITION%'
AND POSITION(',' IN COLUMNNAME) > 0
) MULTI

LEFT OUTER JOIN
(
SELECT DatabaseName
,TableName
,ColumnName (CHAR(30))
,StatsType
,CollectDate
,SampleSize
,NumOfRows
,NumOfValues

FROM TOOLSDB.Stats_Info

WHERE statstype = 'C'
AND NumOfValues > 0
) SINGLE

ON SINGLE.DATABASENAME = MULTI.DATABASENAME
AND SINGLE.TABLENAME = MULTI.TABLENAME
AND SINGLE.COLUMNNAME =
SUBSTR(MULTI.COLUMNNAME,1,POSITION(',' IN MULTI.COLUMNNAME) - 1)

WHERE MULTI.DATABASENAME = 'TARGETDB'

AND SINGLE.ColumnName IS NULL
AND Recomendation > ' '

-- ORDER BY 1,2,3
) DT
GROUP BY 1
ORDER BY 1
;

Good Luck!

Dave

22 REPLIES
Junior Contributor

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

Hi Dave,
good article, this triggers some possible enhancements to my stats query :-)

Hmmm, where's that "FieldCount" column in HELP STATS COLUMN calculated from?
A new column comparing the number of columns in the create statement vs. the actual number of columns in the stored stats would be quite usefull.

Btw, it's "Dieter Noeth" instead of "Dieter North", the "oe" is actually a german "o umlaut".
The stats query can be found at the Teradata Forum in the attachement area:
http://www.teradataforum.com/attach.htm

Dieter
Teradata Employee

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics


Interesting approach. Thanks for posting, Dave.

I am guessing here, but I think that the first column in the multicolumn stats where the distinct values did not change were both unique columns (that would be attribute_A and attribute_B). When the first value in a multicolumn stat is unique, that statistic will report the same number of distinct values as just that single leading column does in its single-column stat, because you can't get more distinct than the one first column alone already is. I'm thinking that that is also why you get high confidence after dropping the multicolumn stats for the query that references all the columns in an aggregation. The optimizer sees the first column is unique in the aggregation, and therefore knows exactly how many rows the query will return. It doesn't need information about the other columns.

I tried this both with and without the leading column being unique, and got the same results as you did. A unique leading column gave me the same results as you got for attribute_A and attribute_B; non-unique gave me the same result as with attribute_C.

The 16-byte limit doesn't really come into play when number of distinct values are being calculated. The number of distinct values are calculated before data is placed in the detailed statistics intervals, and that calculation uses all columns in the multi-column stat. Only when the combined values are placed in the detailed intervals does the the truncation take place. The fields that carry the values in the detailed intervals are only 16 bytes long, so with several columns in your multicolumn stat, some truncation may take place there.

There is a different size limitation that comes into play during the calculation of number of distinct values, but that is applied to each CHAR, VARCHAR or BYTE column separately. Only the first 32 bytes of each such column will be included in that calculation, but all the columns will be represented.

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

Hi Dave ,
Useful information, wish I was here couple of days ago.
I came across similar scenario where MC stats was negatively impacting the query and caused spool space errors. By checking the explain plan, I have dropped MC stats and got the query working.
For future analysis, I will leverage your code and run it on my tables and share the feedback.
Teradata Employee

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

Glad you like it. And I sympathize; don't know how many times I have done sometihing, only to find out someone else had a script or process :)
Enthusiast

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

Very useful information on Stats. Does that mean that if a PI/UPI is having X unique value and PI/UPI + MVC is also having identical value of X , then MVC can be taken off ?
Irrespective of 16 byte logic?
Teradata Employee

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

First, we are not really looking at PI's here, just the first field of a Multi Column statistics combination. That given, yes, in a nutshell we are saying that if the values for the first field are fairly unique, then you most probably will not get much gain from multi column combinations. Remember to always test before and after.

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

Dave - appreciate the content - This is exactly what I needed recently

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

HI, this stuff is really cool. we have lot of multi columns stats with more than 16 bytes and i started this exercise.But got aproblem. We have colelcted stats on four columns (col1,col2,col3,col4) and col1 separately.

Here is the info

Help stats on col1: 3,684,013 unique values
Help stats on col1,col2,col3,col4 : 205,474,830
Actual count of distinct col1,col2,col3,col4 :205,474,830

I did a explain plan for select distinct col1,col2,col3,col4 from table. It is showing estimated row count as 3,684,013 (col1) instead of 205,474,830. Is this a strange behavior or can we correct it? Your help is greatly appreciated and we can reduce lot of CPU and IO. thanks in advance
Teradata Employee

Re: Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

So remember, the premise here is that if the help stats number is the same for the first column (col1) as it is for the combination of columns(col1-col4) then that is a "candidate" for removal. In this case that is not true as the single versus multi-column help stats are different.

Are you saying that you ran the script provided and it recommended removing this multi-column stat?