Collect Stats - How Performance gets improved?

Database

Collect Stats - How Performance gets improved?

From Books:

When doing collect stats on fields/indexes , system collects the information like: total row counts of the table, how many distinct values are there in the column, how many rows per value, is the column indexed, if so unique or non unique etc

The above information are known as statistics.
___________________________________________________________________________
Example:
Table "Emp_Details" has the primary index "Emp_Name". I use this index "Emp_Name" in my joining condition/Where clause.

case1: I dont have collect stats on "Emp_Details" table.

sel * from Emp_Details
where Emp_Name = 1000;

The above query performs good as i used the primary index in where clause.

case2: I have collected stats on "Emp_Details" table on the index "Emp_Name".

sel * from Emp_Details
where Emp_Name = 1000;

Most of us say the above query performs better than that in the case 1.
____________________________________________________________________________
Questions:

1. Where are these statistics stored?

2. How does the Parsing Engine use the statistics for the better performance of a query?

3. Why do i need to collect stats on the indexed columns , despite the fact
using indexed columns in where clause/joins itself will give better performance?
18 REPLIES
N/A

Re: Collect Stats - How Performance gets improved?

1. Where are these statistics stored?

Statistics are stored in DBC views like ColumnStats,IndexStats and MultiColumnStats

2. How does the Parsing Engine use the statistics for the better performance of a query?

Statistics provide detailed information, and include an exact row count as of the time that the statistics were gathered. Therefore, when the Optimizer (Parsing Engine) finds that there are statistics available for a referenced table, it will use those statistics instead of using estimated table cardinality or estimated unique index value counts (that is, it will trust the statistics)
With such information, optimizer is better prepared to decide the fastest way of query execution.This results in improvised query output which is closer to EXPLAIN suggestions.

3. Why do i need to collect stats on the indexed columns , despite the fact
using indexed columns in where clause/joins itself will give better performance?

Depends , few of the recommendation are as follow:

a)You should always collect stats on NUSIs. The optimizer has little chance of properly using your indexes without stats.

b) The UPI of tables with less than 1000 rows per Amp.

As said , read them as recommendation rather than hard and fast rule.

Re: Collect Stats - How Performance gets improved?

Karam, Thanks for your answers. Still i have the following questions
-------
"1. Where are these statistics stored?

Statistics are stored in DBC views like ColumnStats,IndexStats and MultiColumnStats "

Yes , the statistics informations like column/index, unique values, date & time of collect stats are stored in the DBC's tables as specified by you.
-------
where does the following informations are stored?
1.how many rows per value,
2.is the column indexed, if so unique or non unique etc

Does Parsing engine refers to DBC tables for getting stats information??
------

"when the Optimizer (Parsing Engine) finds that there are statistics available for a referenced table, it will use those statistics instead of using estimated table cardinality or estimated unique index value counts (that is, it will trust the statistics)
With such information, optimizer is better prepared to decide the fastest way of query execution."
------
I have a question on the above answer

for instance,If i use primary index in join/where clause , the optimizer passes the query through hasing algorithm which is the way of finding the best path to data.
How optimizer selects the best path with the help of collect stats informations like number of unique values, how many rows per value and all.?
---------
3. Why do i need to collect stats on the indexed columns , despite the fact
using indexed columns in where clause/joins itself will give better performance?

Depends , few of the recommendation are as follow:

a)You should always collect stats on NUSIs. The optimizer has little chance of properly using your indexes without stats.

b) The UPI of tables with less than 1000 rows per Amp.

As said , read them as recommendation rather than hard and fast rule.
------
Could you please explain the above answers for my 3rd question in detail.??

Thanks in advance..
N/A

Re: Collect Stats - How Performance gets improved?

where does the following informations are stored?
1.how many rows per value,
2.is the column indexed, if so unique or non unique etc
Does Parsing engine refers to DBC tables for getting stats information??
----------
Ans - Data demographic information is stored in DBC tables. Refer -Teradata SQL Reference, Statement and Transaction Processing pdf for details.
For getting stats info , PE read either cache or disk depending on table's usage in recent queries.
For your another question on hashing - my take is :
Hashing is purely used for data storage with dependence on PI value uniqueness.It can point to the exact AMP used for storing a particular PI Where as statistics provide precise data demographic information to the cost based optimiser , such as:
Number of rows in the table
Number of rows per value
Row size
Column demographics
Index demographics

Hope it helps.

Re: Collect Stats - How Performance gets improved?

Karam,Thanks for your answers. can you please assist me in the follwing
.
. I have the follwing query.
.
.SELECT * FROM
.
.FROM DB.A_Table TDR
.
.LEFT OUTER JOIN
. (
. SEL B_Field1,
. B_Field2,
. B_Field3
. FROM DB.B_Table
. WHERE
. B_Field4 = 11005
. AND B_Field3 = '1'
. ) LOAN
.
.ON
. TDR.A_Field1 = LOAN.B_Field1 AND
. TDR.A_Field2 = LOAN.B_Field2
.
.LEFT OUTER JOIN
. (
. SEL C_Field1,
. C_Field2,
. C_Field3
. FROM DB.C_Table
. WHERE
. C_Field4 = 11005
. AND C_Field3 = '1'
. ) LOANH
.
.ON
. TDR.A_Field1 = LOAN.C_Field1 AND
. TDR.A_Field2 = LOAN.C_Field2
.
.WHERE
. TDR.B_Field4 = 11005
. AND LOAN.B_Field3 IS NULL
. AND LOANH.C_Field3 IS NULL;
__________________________________________
.
Informations of the the tables
.DB.A_Table :
.------------
.Total Record Count : 1,000,000
.Indexs :
. UNIQUE PRIMARY INDEX PMY_IDX (A_Field5,A_Field6,A_Field1,A_Field4)
. PARTITION BY RANGE_N on A_Field4
.
.Number of Unique values present in the fields:
.
.A_Field5 34
.A_Field6 5
.A_Field1 200,000
.A_Field4 20
.A_Field2 24
.
.
.DB.B_Table :
.------------
.
.Total Record Count : 800,000,000
.Indexs :
. UNIQUE PRIMARY INDEX PMY_IDX (B_Field1,B_Field4)
. PARTITION BY RANGE_N on B_Field4
.
.Number of Unique values present in the fields:
.
.B_Field1 30,000,000
.B_Field2 1
.B_Field3 3
.b_Field4 128
.
.
.DB.C_Table
.----------
.
.Total Record Count : 90,000,000
.Indexs :
. UNIQUE PRIMARY INDEX PMY_IDX (C_Field1,C_Field4)
. PARTITION BY RANGE_N on C_Field4
.
.Number of Unique values present in the fields:
.
.C_Field1 3,000,000
.C_Field2 1
.C_Field3 3
.C_Field4 128
-----------------------------------------------------------------------
.
.Please guide me on the following
.
1. I need to collect stats on the above 3 tables . How to choose the fiedls for collecting stats. Please recomand me the list of collect stats those needs to be run on the 3 tables.
.
2. On what basis you choose the columns for collecting stats?
.
.Thanks in advance

Re: Collect Stats - How Performance gets improved?

Stats generally should be collected on all indexes, join columns, partitioning columns, the PARTITION column for tables with multi-level partitioning, and columns that exist in the WHERE clause as selection criteria. Carrie Ballinger has a series of blog posts on the Dev Exchanges relating to statistics with some valuable tips found in the comments. (Like collecting sample stats on a unique primary index instead of full stats.)

Stats should be recollected when 10% of the data has changed for the column(s) with collected stats or the stats have aged say 90-180 days. (Your mileage may vary)

You can submit DIAGNOSTIC HELPSTATS ON FOR SESSION and then subsequent EXPLAIN results will included suggested stats to collect that do not exist that *MAY OR MAY NOT* improve the query plan. Collecting stats to satisfy the plan of one query can affect the plan of other queries and should be regression tested.

Find Carrie's discussion regarding mulit-column statistics and the impact of the uniqueness on the first 16 bytes in the histogram. This impact does not affect the number of unique values but does impact other values capture in the histogram.

Hope this helps.

Re: Collect Stats - How Performance gets improved?

Hi,

I don't have collect statistics on my table or any of the columns in table.. For my SQL query, In my explain plan i have 'NO CONFIDENCE'..........
Is there is any possibility to improve the query performance with out taking collect statistics???
Please advise..........................
Teradata Employee

Re: Collect Stats - How Performance gets improved?

Worth mentioning, collecting statistics is a key component of TASM/workload management. Stats collection results in better optimizer estimates which TASM then uses for pre-processing decisions and classification. For instance, putting a groups longer running queries in their own workload definition (WD) or maybe a groups queries with estimated row returns of more than 1M rows. All these preprocessing decisions depend on optimizer estimates.

Re: Collect Stats - How Performance gets improved?

Hi,

Plz help in the below posted query :

I have a table called SYS_PUR having the following definition :

CREATE MULTISET TABLE SYS_PUR ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
FOLIO_KEY INTEGER TITLE 'FOLIO_KEY' NOT NULL,
SCHEME_KEY INTEGER TITLE 'SCHEME_KEY' NOT NULL,
TIME_KEY INTEGER TITLE 'TIME_KEY' NOT NULL,
TRANSACTION_NO DECIMAL(16,0) TITLE 'TRANSACTION_NO' NOT NULL,
TRANSACTION_TYPE VARCHAR(7) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'TRANSACTION_TYPE' NOT NULL,
FROM_DATE DATE FORMAT 'YY/MM/DD' TITLE 'FROM_DATE' NOT NULL,
TO_DATE DATE FORMAT 'YY/MM/DD' TITLE 'TO_DATE' NOT NULL,
SUB_BROKER_KEY INTEGER TITLE 'SUB_BROKER_KEY' NOT NULL,
BRANCH_KEY INTEGER TITLE 'BRANCH_KEY' NOT NULL,
APPLNO VARCHAR(15) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'APPLNO',
ENTRY_DATE DATE FORMAT 'YY/MM/DD' TITLE 'ENTRY_DATE',
CEASED_DATE DATE FORMAT 'YY/MM/DD' TITLE 'CEASED_DATE',
AMOUNT DECIMAL(25,8) TITLE 'AMOUNT',
ACTFLAG CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'ACTFLAG',
FREQUENCY VARCHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FREQUENCY',
PAYMECH VARCHAR(7) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'PAYMECH',
SIPDATE VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'SIPDATE',
BANK VARCHAR(40) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'BANK',
ACTYPE VARCHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'ACTYPE',
INSTRMNO VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'INSTRMNO',
ACNAME VARCHAR(70) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'ACNAME',
MICRNO VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'MICRNO',
BROKCODE VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC,
PURCHASE_REMARKS VARCHAR(31) CHARACTER SET LATIN NOT CASESPECIFIC,
BATCH_NUM INTEGER)
PRIMARY INDEX ( FOLIO_KEY ,SCHEME_KEY ,TIME_KEY ,TRANSACTION_NO ,
TRANSACTION_TYPE ,FROM_DATE ,TO_DATE ,SUB_BROKER_KEY ,BRANCH_KEY ) ;

I have another table called TIME_DIM having definition as follows :

CREATE SET TABLE TIME_DIM ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
TIME_KEY INTEGER TITLE 'TIME_KEY' NOT NULL,
CALENDAR_DATE DATE FORMAT 'YY/MM/DD' TITLE 'CALENDAR_DATE' NOT NULL,
CALENDAR_WEEK_N INTEGER TITLE 'CALENDAR_WEEK_N' NOT NULL,
CALENDAR_WEEK_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'CALENDAR_WEEK_W' NOT NULL,
CALENDAR_MONTH_N INTEGER TITLE 'CALENDAR_MONTH_N' NOT NULL,
CALENDAR_MONTH_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'CALENDAR_MONTH_W' NOT NULL,
CALENDAR_QUARTER_N INTEGER TITLE 'CALENDAR_QUARTER_N' NOT NULL,
CALENDAR_QUARTER_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'CALENDAR_QUARTER_W' NOT NULL,
CALENDAR_YEAR INTEGER TITLE 'CALENDAR_YEAR' NOT NULL,
FINANCIAL_WEEK_N INTEGER TITLE 'FINANCIAL_WEEK_N' NOT NULL,
FINANCIAL_WEEK_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FINANCIAL_WEEK_W' NOT NULL,
FINANCIAL_MONTH_N INTEGER TITLE 'FINANCIAL_MONTH_N' NOT NULL,
FINANCIAL_QUARTER_N INTEGER TITLE 'FINANCIAL_QUARTER_N' NOT NULL,
FINANCIAL_QUARTER_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FINANCIAL_QUARTER_W' NOT NULL,
FINANCIAL_YEAR VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FINANCIAL_YEAR' NOT NULL,
DATE_OFFSET INTEGER TITLE 'DATE_OFFSET',
WEEK_OFFSET INTEGER TITLE 'WEEK_OFFSET',
MONTH_OFFSET INTEGER TITLE 'MONTH_OFFSET',
QUARTER_OFFSET INTEGER TITLE 'QUARTER_OFFSET',
YEAR_OFFSET INTEGER TITLE 'YEAR_OFFSET',
FINANCIAL_YEAR_OFFSET INTEGER TITLE 'FINANCIAL_YEAR_OFFSET',
BATCH_NUM INTEGER)
UNIQUE PRIMARY INDEX ( TIME_KEY )
UNIQUE INDEX TIME_DIM_IDX ( CALENDAR_DATE );

I am trying to update a column of SYS_PUR TABLE USING NON primary key joins as follows :

update SYS_PUR
from
(
select table1.entry_date entry_date,time_dim.time_key time_key from SYS_PUR table1,
(select * from time_dim) time_dim
where time_dim.calendar_date = table1.entry_date
) table2,
(
select table1.ceased_date cease_date,time_dim.time_key time_key from SYS_PUR table1,
(select * from time_dim) time_dim
where time_dim.calendar_date = table1.ceased_date
) table3
set cease_time_key = table3.time_key,
entry_time_key = table2.time_key
where
SYS_PUR.entry_date = table2.entry_date and
SYS_PUR.ceased_date = table3.cease_date

The above query says its going to take some 1000 hours to finish the same in explain plan. Please advice if there is
any work around of doing the same in lesser time

Regards,
Sanjeev

Re: Collect Stats - How Performance gets improved?

Hey Sanjeev.

You shouldn't alias your table as the same table name. This will force a product join, which will give you that high estimated completion time.
Try:

update SYS_PUR
from
(
select table1.entry_date entry_date, td.time_key time_key from SYS_PUR table1,
(select * from time_dim) td
where td.calendar_date = table1.entry_date
) table2,
(
select table1.ceased_date cease_date,td.time_key time_key from SYS_PUR table1,
(select * from time_dim) td
where td.calendar_date = table1.ceased_date
) table3
set cease_time_key = table3.time_key,
entry_time_key = table2.time_key
where
SYS_PUR.entry_date = table2.entry_date and
SYS_PUR.ceased_date = table3.cease_date