How to decode the binary statistics stored in dbc tables

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
N/A

Collected statistics are a valuable source of information in Teradata, not only for the optimizer, but also for developers or DBAs. Spotting outdated statistics (which might lead to bad optimizer plans) and implementing/monitoring a stats recollection process are common tasks, which need the current statistics as input.


The HELP STATS function returns information about stats on a table or column level, but there's no built-in way to extract info about all stats on all tables in a single query. The DBC views ColumnsStatistics, MultiColumnStats, IndexStats expose the stats data, but those statistics are stored in a binary format.

This internal format is documented in the "SQL Request and Transaction Processing" manual, although the exact details and the differences between 32- and 64-bit are somewhat vague. However utilizing that knowledge to decode the binary stats is limited by a restriction in Teradata SQL: BYTEs can't be casted to any other type. You know those 8 bytes represent a float, but you can't convert them to a readable value. Therefore the most important step to implement my stats query was discovering how to cast bytes to numeric values using the HASHBUCKET function.

I started to develop a query to decode Teradata's binary stats way back in 2003 on V2R4, over the following years I enhanced and fixed it several times (with a lot of Trial & Error) to accomplish new internal stats versions and the changes from 32 to 64-bit. This old stats query is available in the attachment area of John Hall's TeradataForum (btw still the best independent source of Teradata knowledge) and is used at many customer sites.

The latest release posted here extracts even more details without much additional overhead and now includes info about an index or a column from the table's partitioning expression without collected statistics. This StatsInfo view provides following columns, an asterisk "*" indicates information similar to HELP STATS output and "+" an enhanced or new column:

 

DatabaseName

"ColumnName" is a comma-separated list of the names of the columns for which statistics are reported.

To remove those double quotes do a global Search & Replace within the source code replacing '"' with '', but this will also remove the double quotes from "CollectStatement" and "HelpStatement"

 

TableName

*

ColumnName

+

IndexName Name of the index, if it exists
*

ColumnCount

The number of columns in the stats or index definition. The "FieldCount" of HELP STATS is not exactly the same, it might be less, as this is the number of columns actually stored in the stats.

By default only 16 columns are supported although Teradata allows up to 64. Hopefully there are no indexes or stats with more than 16 columns, but additional columns are already in the source code, just commented out. Search for "uncomment" when you need to modify it.

+

MissingStats

Indicates an index or a column from the table's partitioning expression without collected statistics: Y/N
+ DateIncluded

"ColumnName" includes a DATE or TIMESTAMP: Y/N

+

PartitionColumn

Stats include a column which is used in the table's partitioning expression: Y/N

+

PartitionLevels

Number of levels in the table's partitioning expression, zero means not partitioned

+

PartitionsDefined

The number of partitions defined within that table

Only calculated for single level partitioning and TD12, previous releases will always show 65535 instead of the correct number. Might be 65535 on current releases, too, due to (yet) unknown reasons sometimes the TableCheck column in dbc.IndexConstraints isn't created correctly.

+

TableType

TempTbl --> Global Temporary Table

Tbl     --> Table

JoinIdx --> Join Index

HashIdx --> Hash Index

NoPITbl --> No Primary Index Table

+

StatsType

Stats collected on:

UPI  --> Unique Primary Index

NUPI --> Non-Unique Primary Index

USI  --> Unique Secundary Index

NUSI --> Non-Unique Secondary Index

VOSI --> Value Ordered NUSI

Part --> Pseudo column PARTITION

Col  --> Single non-indexed column

MCol --> Multiple non-indexed columns

*

CollectDate

Date and time when statistics were last collected
*

CollectTime

 

CollectTimestamp

 

CollectDuration

How long did the Collect Stats run.

This is not always correct as the calculation is based on the LastAlterTimestamp, which is not only set when stats are collected, but also during a ALTER TABLE modifying the column attributes (this should be a rare case, but can't be fixed).

Additionally on some systems there's a fixed deviation (due to some unknown reason, Time zone settings on TD vs. OS level?), which can easily be spotted when comparing "LastAlterTimestamp" and "CollectTimestamp". Search for "TimeAdjust" to add or subtract that fixed value.

Old statistics might generate another deviation when the system time zone was changed since collecting. Recollecting those stats should fix it.

*

NumericStats

Indicates single column stats on a numeric or date column: Y/N
*

SampleSize

Sample size used for collect stats, NULL if not sampled

*

StatsVersion

Internal version of statistics:

1 --> pre-V2R5

2 --> V2R5

3 --> TD12

4 --> TD12.0.3.1/TD13.0.0.21

* NumAMPs

The number of AMPs from which statistics were collected, usually the number of AMPs in the system, 1 for an empty table

*

NumIntervals

The number of intervals in the frequency distribution histogram containing the column statistics.

*

AvgAmpRPV

Overall average of the average rows per value from each AMP, only for NUSIs, otherwise zero

*

OneAMPSampleEst

The estimated cardinality of the table based on a single-AMP sample

*

AllAmpSampleEst

The estimated cardinality of the table based on an all-AMP sample

*

NumNulls

The number of rows containing NULLs for one or more columns in "ColumnName"

*

NumAllNulls

The number of rows containing nulls for all columns in "ColumnName"

*

NumRows

The cardinality of the table, i.e. the number of rows
*

NumValues

The number of unique values

*

ModeFreq

The frequency of the most common value
*+

ModalValue

The most common value, only for single column numeric or date stats
*+

MinValue

The minimum value, only for single column numeric or date stats

*+

MaxValue

The maximum value, only for single column numeric or date stats

+

DataSize

The size in bytes needed to store the statistics.

If greater than 16 only partial data is stored.

For single column stats the actual size for any numeric column is 8, as it's always stored as float.

+

FieldType

The datayte of a single column stats, NULL for multiple columns
+

LastAlterTimestamp

The last time when statistics have been collected (or this column/index was modified by an ALTER TABLE)

+

LastAccessTimestamp

The last time and the number of times this column/index was used in queries, the same info is found in dbc.TablesV and dbc.Indices

Of course this is only available when it has been enabled via dbscontrol ObjectUseCountCollectRate

+

AccessCount

+

TableId

To facilitate additional joins to other system tables

+

IndexNumber

From dbc.Indexes - 1..128: Internal index number, >128: pseudo index = multicolumn stats
+

FieldId

From dbc.TVFields - NULL for multi-column or PARTITION stats
+

CollectStatement

COLLECT STATS statement to (re-)collect the stats

I got several request for syntax changes:

- use the legacy syntax "COLLECT STATS ON table COLUMN ..." (deprecated in TD14) instead of the recommended "COLLECT STATS COLUMN ... ON table"

- use an existing "IndexName" instead of "ColumnList"

Finally i included all four possible variations in the source code, now it's your choice, just uncomment the preferred syntax:

- Recommended syntax with column names (default)

- Recommended syntax with index or column names

- Legacy syntax with column names

- Legacy syntax with index or column names

+

HelpStatement

HELP STATS statement to get the stats details on column or index level

TD14 stores statistics in a more readable format in a system table (dbc.StatsTbl already exists in TD13, but wasn't populated). The new dbc.StatsV returns much of the information i extracted here, but i like to get more details, so i wrote a new version for TD14

Please report any issues as this version has not been tested that much, i.e. you're my beta testers :-)

I'm open for enhancement requests, too, but I don't think there's much to add to this query without joining additional tables. You can do that on your own, that's why there's the TableId/FieldId/IndexNumber  column.


2011-12-28:

  Already fixed the first bug, zero PartitionLevels for PARTITION stats.

  Added PartitionsDefined

2011-12-31:

  Fixed 'Invalid Date' error for empty tables

2012-02-01:

  Added IndexName

  Additional syntax variations for "CollectStatement"

2012-10-03:

  Fixed wrong calculation of ColumnCount/DataSize/DateIncluded for some stats involving PARTITION

  Added IndexNumber and FieldId

2012-10-23

  Fixed a wrong calculation of PartitionColumn when a double quoted ColumnName is used in the partition definition

2012-11-24

  Fixed: Wrong calculation of PartitionColumn when a ColumnName is used in different upper/lower case in the partitioning expression (dbc.TableConstrains.TableCheck is defined as CASESPECIFIC)


Attached files:

teradata_statistics.pdf

A short description of the binary format

stats_32bit.sql

For 32-bit systems: MP-RAS and Teradata Express for Windows

stats_64bit.sql

For 64-bit systems: SLES, Windows Server and Teradata Express VMWare

stats_64bit_udf.sql

For 64-bit systems running TD13.10: SLES, Windows Server and Teradata Express VMWare

Replaces the ugly HASHBUCKET calculations with SQL UDFs, the source code looks much cleaner now

Needs a Search & Replace to change '<dbname>' with a valid database name:

SQL UDFs must be fully qualified (unless they are created in the "syslib" database) because the database name is NOT resolved when the view is created.

35 Comments
N/A
Thank for your effort to decode the stats info!
I used the older versions regularly in my daily work and and will test the latest verison as well.
It gives really good information and helps also to indentify stats which might not be usefull but are resoucre intensive in the collection process.
I looked into the scripts and got lost. Can you please help me understand which one should I be running for a TD13?
N/A
Assuming a 64-bit TD13: stats_64bit.sql

Dieter
Thanks for the update(s) Dieter. When v13 came out and we moved to 64-bit Linux, I also tailored the old script to recognize bitness and version to decode properly. I did this all in one script, but it did require a function to be created which provided the bitness. Let me know if anyone wants that single code stream.

As you mention, this all goes away in TDv14. I beta tested that and love the changes!
I have used your stats sql extensively over the years. We base our stats collection process on this view. I look forward to using this latest update.

Thank you very much for your continued support over the years in this area. I am glad to see that Teradata has taken this up and is introducing their own version in TD14.
dnoeth,

!!!!YOU ARE THE BEST!!!

I tweaked your query and used it for quick compression analysis.

I used the ratio of ModeFreq to NumRows, to do compression analysis.

Also, your query is really helpful to clone the stats across the environments.

Thanks a ton!!!!!
Teradata Employee
TD14 introduced a new syntax SHOW STATISTICS, which will present you the detailed stats. Still, appreciate your great efforts!
Hi Diether,
I just tried one code, i saw in some of your post and tweaked a little to macth my requirement. Its gives the same result as you statsinfo view..just wanted to share..comments welcome

SELECT c.databasenamei AS databasename , b.tvmnamei AS tablename
FROM dbc.tvfields a , dbc.tvm b , dbc.dbase c
WHERE a.tableid = b.tvmid
AND b.databaseid = c.databaseid
AND a.fieldstatistics IS null
AND databasename = 'databasename'
-- and b.tablekind='T'
and (databasename,tablename)
not in (
SELECT c.databasenamei AS databasename , b.tvmnamei AS tablename
FROM dbc.tvfields a , dbc.tvm b , dbc.dbase c
WHERE a.tableid = b.tvmid
AND b.databaseid = c.databaseid
AND a.fieldstatistics IS not null
AND databasename = 'databasename'
-- and b.tablekind='T'

union

sel tp.databasename,tp.tablename
from
(
SELECT d.databasenamei AS databasename , t.tvmnamei AS tablename,
MAX(SUBSTR(i.IndexStatistics, 1, 80)) AS STATS
FROM
dbc.dbase d
JOIN dbc.tvm t
ON d.databaseid = t.databaseid
JOIN dbc.tvfields c
ON t.tvmid = c.tableid
JOIN dbc.Indexes i
ON c.tableid = i.tableid
AND c.fieldid = i.fieldid
/*** Add list of DBs to exclude***/
WHERE d.DatabaseName IN ('databasename')

AND
(i.IndexType IN ('M','S','K','U','H','O','V','P','Q'))

-- and t.tablekind='T'
group by 1,2
HAVING STATS IS not NULL
) tp
)
group by 1,2

Thank for your effort to decode the stats info!
I used it once.
N/A
I posted a new version of my stats query fixing some bugs.
N/A
I posted a new version of my stats query fixing another bug.

Hi Dnoeth,

This is the best information regarding stats, so far I can seen.

-Thanks

Teradata Employee

Thanks, this is really helpful

Teradata Employee

Just installed on 14.10.00.00. So far, so good! I'll let you know of any issues.

Teradata Employee

This is a very helpful article.

I want add the intervals (basically this portion from SHOW STATS  /** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/ ). What are the tables/views that I should look at?

The goal is to scan the dbc tables/views to get the detailed histogram so it can be used on TD14 as well as an earlier version when necessary.

Thank you!

N/A

The interval details are still stored in a binary format in TD14.

This new format is much more complicated, a BLOB instead of VARBYTE, i wouldn't try to reverse-engineer it (at least not in SQL).

You can get the main info from dbc.StatsV or my StatsInfo and then use HELP/SHOW STATS for the details.

Dieter

Teradata Employee

Got it, thank you for the reply!

Teradata Employee

Hi Dnoeth,

I was trying to select from the view and I was seeing the following error.

2616-Numeric overflow occured during computation. I isolated the problem to the following piece of code

CAST( (-1**(NumNullsw1 / 32768))

* (2**((NumNullsw1/16 MOD 2048) - 1023))

* (1 + ((NumNullsw1 MOD 16) * 2**-4) + (NumNullsw2 * 2**-20)

+ (NumNullsw3 * 2**-36) + (NumNullsw4 * 2**-52)) AS FLOAT),

I have tried casting it to FLOAT and DECIMAL but the error persists.

Please advise.

N/A

Which TD release? 32 or 64bit?

I usually encountered that error when the new statistics version 4 was introduced.

Could you check if there's a really old statistic with version 2 or 3?

Are you 100% shure you're running the latest version? (Sorry to ask, but this was the cause for most problems like this)

Could you show some rows (mainly the STATS binary data) from running the nested Derived Table (before the HASHROW...)?

Dieter

Teradata Employee

Hi Dnoeth,

Apologies , There was a small mistake from my side as I was running the 32bit script on a 64 bit operating system.

I executed the 64 Bit script and its working great now!

Thanks for your quick response!!

Hi Dnoeth,

I have few questions on the o/p of stats view :

1. What is the meaning of Col in status type column ?

2. If i have a index on col1 and col2 combination and my stats is on col1,col2 combination and i dont have on individual columns col1 and col2, it shows as status type=col. Secondly, do i need to stats take stats on individual columns or a combination would suffice ?

3. If i have a col say col1 , which is PPI in the table. Then do i need to take stats on col1 or a stats on "partition" is enough.

Thanks !

N/A

#1: should indicate statistics on a single non-indexed column

#2: If it's a multi-column stats it should be "MCol" instead of "Col", seems to be a buggy calculation, could you show the DDL and the COLLECTs for this table?

#3: You should collect stats on both, column and PARTITION. Check Carrie Ballinger's blogs for details:

http://developer.teradata.com/blog/carrie/2012/04/teradata-13-10-statistics-collection-recommendatio...

Dieter

N/A

Hi Dieter

I need to drop the table statistics and then rebuild them.

To find out what statistics to rebuild I am using the view StatsInfo.

1. Selecting rows where MissingStats='N', would I only get the

statistics that already exist?

2. When I create a simple table (create, table tab( col varchar(32)))

and collect statistics (COLLECT, STATS USING SAMPLE COLUMN(col) ON tab),

the view returns (COLLECT, STATS INDEX("col") ON "user"."tab").

How can I get the information about 'using sample'?

Thank you!

N/A

#1: yes

#2: This is by purpose.

You don't need to specify SAMPLE anymore in TD14, the collect stats process automatically remembers previous Sample/MaxIntervals/MaxValueLength settings.

I just want to log on for saying thank you. This script help me in an urgence. Thanks a lot!

N/A

Thank You Dieter Nöth. I hope there comes a day when I can return the favor. Robert Turner

I would love to use this code. However I am not allowed to access the DBC tables, only the DBC.views.

Teradata Managed services seem extremely reluctant to give me any access to the tables.

So is there any guide how I could update this code to use information that is stored in the views only?

N/A

@Hibellm:

You only need this query if you're on a release lower than TD14.

Since TD14 there's a builtin view returning similar information, dbc.StatsV, try this instead, every user (=PUBLIC) should have access rights.

Hi Dieter,

 

We are migrating from appliance 2700 to 2800. Current DBS version is 14.10 (64 bit) and it will be same after migration. 

Does your statsinfo is enough to take backup of all current STATS (collected) available in system, so that we can refresh them on 2800 after migration ??

Teradata Employee

Why would you not just do a show statistics on all the tables and capture the actual statements for recollection?

 

SELECT 'SHOW STATISTICS ON "' || TRIM(s.DATABASENAME) || '"."' || TRIM(s.TABLENAME) || '";' Table_Stats
  FROM DBC.StatsV s
 WHERE s.DATABASENAME LIKE ANY ('xxxx%')
 GROUP BY 1
 ORDER BY 1
;

Hi David,

Thanks for reply. But it will not generate collect stats statements readily.

 

@dnoeth

we neeed to generate collect stats statements as single statement per table in below latest syntax :

COLLECT STATISTICS
             COLUMN ( **bleep**, abc, xyz ) ,
             COLUMN ( **bleep**) , ....... .......,COLUMN ( abc) 
ON dbname.tablename ;

 

How can we achieve this if current stats collected on that table is in different syntax? 

 

N/A

@kiranzope: Of course you need to run the SHOWs created by David's approach to get the final COLLECT statements, but this the easiest way.

 

Btw, when you migrate to new hardware (done by TD Customer Service?) recollecting stats is usually part of the migration.

IMHO there's no immediate need to recollect everything (only stats on NUSIs include some info based on the number of AMPs).

Teradata Employee

I guess it depends on what you mean by "readily" :)

 

If you are just using a query tool and looking for a directly runnable output then you will have issues with just the shows. Most of us get around this by running a two step BTEQ process: first generate the show statements, then run those to create an executable file of the stats. You should really look into do this as more and more functionality is being kept directly inside the database stats process.

 

As for what to execute, as Dieter said you do not have to do all of this for upgrades. In addition to the SI's, I like to add partitions and partitioning columns and then a paranoid select summary stats on every table

 

thanks

 

Dave

 

Teradata Employee

Dave, you're not paranoid to recollect summary stats after h/w upgrade.   Engineering recommends doing that as well, as summary stats carry single and all AMP random AMP samples, which if you've added AMPs, may have changed.  RAS numbers are used during extrapolations if there are any stale stats.  A good safety-net and very quick to perform.

Thanks @dnoeth  @David_Roth  @carrie for suggestions..