Database

The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.

10-28-2009
01:50 PM

I gave a presentation at the Teradata Partners Conference last week on the fundamentals of collecting statistics, where I touched briefly on the rules behind confidence levels. I’m using this article to go into further depth, and offer you more examples of how these confidence levels come about. If you look at query plans much, I’m *confident* you’ll find something of interest in this article.

Confidence levels are determined by the optimizer at the time a query plan is built, as an indication of how comfortable the optimizer is with the estimates it comes up with for a given step. High Confidence is the greatest comfort level, and No Confidence the least.

Some steps don’t give you a confidence level because optimizing the step is a no-brainer. One such case is single-AMP primary index access. Since access by primary index is automatic when the query provides a primary index value, no costing decisions need to be made and confidence levels are immaterial.

Here is an example of an all-AMP access step from a query plan that illustrates a confidence level typical of an all-AMP step:

```sql
EXPLAIN
SELECT *
FROM customer
WHERE c_nationkey = 12;
```

```
3) We do an all-AMPs RETRIEVE step from CAB30G.customer by way of an
   all-rows scan with a condition of ("CAB30G.customer.C_NATIONKEY =
   12") into Spool 1 (group_amps), which is built locally on the AMPs.
   The size of Spool 1 is estimated with high confidence to be
   179,308 rows (20,620,420 bytes). The estimated time for this step
   is 0.53 seconds.
```

Here is an example of a primary index access step with a missing confidence level:

```sql
EXPLAIN
SELECT *
FROM customer
WHERE c_custkey = 73454;
```

```
1) First, we do a single-AMP RETRIEVE step from CAB30G.customer by
   way of the unique primary index "CAB30G.customer.C_CUSTKEY =
   73454" with no residual conditions. The estimated time for this
   step is 0.00 seconds.
```

There are four confidence levels for the optimizer to select from. The following graphic illustrates the prerequisites for receiving each of these levels:

As shown above, primary index statistics must have been collected on the table being accessed in the step in order to receive High Confidence for that step. Further, there must be only a single predicate (single-column or multicolumn) with statistics collected on that predicate. Two examples below illustrate this requirement.

For the two following examples, statistics were collected on the following:

- T_Txnkey (primary index)
- T_Quantity (column)
- T_Shipdate (column)
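
Statistics like those listed above would have been collected with statements of this general form (a sketch; the table and column names come from the article's examples, and exact syntax varies slightly by release):

```sql
-- Collect statistics on the primary index and the two single columns
-- used in the examples that follow.
COLLECT STATISTICS ON txn INDEX (T_Txnkey);
COLLECT STATISTICS ON txn COLUMN (T_Quantity);
COLLECT STATISTICS ON txn COLUMN (T_Shipdate);
```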

The following explain shows that High Confidence results when primary index stats have been collected AND there is a single predicate:

```sql
EXPLAIN
SELECT * FROM txn
WHERE t_quantity = 5;
```

```
3) We do an all-AMPs RETRIEVE…with a condition of ("CAB.txn.
   T_QUANTITY = 5.00") into Spool 1…The size of Spool 1 is
   estimated with high confidence to be 199,878 rows…
```

This next example illustrates that even with statistics collected on all referenced columns, Low Confidence is returned if there is more than one predicate on the table:

```sql
EXPLAIN
SELECT * FROM txn
WHERE t_quantity = 5
AND t_shipdate = 980101;
```

```
3) We do an all-AMPs RETRIEVE … with a condition of
   ("(CAB.txn.T_SHIPDATE = DATE '1998-01-01') AND
   (CAB.txn.T_QUANTITY = 5.00)") into Spool 1…The size of
   Spool 1 is estimated with low confidence to be 3,106 rows. . .
```

If multicolumn stats had been collected on (T_Quantity, T_Shipdate), then the step would have qualified for High Confidence, since a multicolumn statistic is considered a single predicate. Note that for a multicolumn statistic to be used by the optimizer, all participating columns must be expressed in equality conditions within the query.
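
The multicolumn statistic described above could be collected like this (a sketch using the article's example table):

```sql
-- A multicolumn statistic is treated as a single predicate by the
-- optimizer, but only when every participating column appears in an
-- equality condition in the query.
COLLECT STATISTICS ON txn COLUMN (T_Quantity, T_Shipdate);
```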

For the vast majority of cases, the optimizer will cost a Low Confidence step the same as a High Confidence step, so there is little, if any, value in going to extreme lengths to achieve High Confidence steps. By the way, statistics are considered “collected” whether they are fully collected or collected taking advantage of the USING SAMPLE option. I just ran the explain of the first query in this article after re-collecting statistics on c_nationkey specifying USING SAMPLE with a 2% sampling rate, and I still received High Confidence for the step.
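
The sampled re-collection mentioned above looks roughly like this (a sketch; on releases from this era the sampling percentage, 2% in my test, is set at the system level rather than in the statement itself):

```sql
-- Sampled statistics read a subset of the table's rows instead of
-- doing a full scan, yet the step still qualifies as "collected".
COLLECT STATISTICS USING SAMPLE ON customer COLUMN (c_nationkey);
```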

Practically all join steps will give you Low Confidence. This is because the optimizer is not aware of what the match will be between the tables being joined. Below is a graphic that illustrates this point.

There is an exception to this. Under special circumstances a join step can receive High Confidence. The requirement is that a unique primary index is on one side of the join with stats collected, and there is a predicate in the WHERE clause that passes a single value for this unique primary index. An example follows:

```sql
EXPLAIN
SELECT * FROM txn,part
WHERE t_partkey = p_partkey  <== UPI with stats, 1 row guaranteed
AND p_partkey = 88;
```

```
3) We do an all-AMPs RETRIEVE step from CAB.txn by way of index # 4
   "CAB.txn.T_PARTKEY = 88". . . Spool 2 is estimated with high
   confidence to be 3 rows. . .
4) We do a single-AMP JOIN step from CAB.part by way of the
   unique primary index "CAB.part.P_PARTKEY = 88". . .joined to
   Spool 2 (Last Use). . . The size of Spool 1 is estimated with
   high confidence to be 3 rows. . .
```

If selection on the table is by NUSI, Low Confidence will result even though no statistics have been collected on the NUSI. However, there must be only a single NUSI used as a predicate; if multiple NUSIs are referenced, No Confidence is returned. Below are two examples that illustrate this:

```sql
EXPLAIN
SELECT * FROM parttbl
WHERE p_size = 5;  <== NUSI but no statistics
```

```
3) We do an all-AMPs RETRIEVE step. . . by way of
   index # 4 "CAB.parttbl.P_SIZE = 5" . . . Spool 1 is estimated
   with low confidence to be 103,487 rows.
```

```sql
EXPLAIN SELECT * FROM parttbl
WHERE p_size = 5                   <== NUSI but no statistics
AND p_type NE 'small plated tin';  <== NUSI but no statistics
```

```
3) We do an all-AMPs RETRIEVE … by way of
   index # 4 "CAB.parttbl.P_SIZE = 5" with a residual condition
   of ("CAB.parttbl.P_TYPE <> 'small plated tin'") …
   Spool 1 is estimated with no confidence to be 92,342 rows.
```

The reasoning here is that random AMP samples provide basic demographic information about all NUSIs on a table. Since that information is available even without statistics collection on the NUSI, a Low Confidence is acceptable for such single NUSI access.

If you’ve got complex, multi-step plans, chances are that when you look at the steps at the bottom of the plan, most of them display No Confidence. One contributing factor is that lower confidence levels coming from previous steps will dominate the confidence level of the steps they feed into.

Below is a query and its explain that illustrate this point:

In this example, full statistics on all indexes and referenced columns have been collected on both the Lineitem and the Parttbl tables. However, no statistics at all have been collected on the Supplier table.

The following table illustrates the confidence levels at each step of this 3 table join. Note that even though the Parttbl has all possible stats collected, the step that joins the Parttbl to the spool created by the join of the Lineitem and Supplier table shows No Confidence. That is because the step feeding into it has No Confidence because no stats had been collected on Supplier.

Here is the entire explain from which information in the above table was drawn.

```sql
EXPLAIN SELECT *
FROM lineitem,parttbl,supplier
WHERE p_partkey = l_partkey
AND s_suppkey = l_suppkey
AND p_size = 3
AND l_shipdate = 950101
AND s_nationkey = 6;
```

```
1) First, we lock a distinct CAB."pseudo table" for read on a
   RowHash to prevent global deadlock for CAB.supplier.
2) Next, we lock a distinct CAB."pseudo table" for read on a
   RowHash to prevent global deadlock for CAB.parttbl.
3) We lock a distinct CAB."pseudo table" for read on a RowHash to
   prevent global deadlock for CAB.lineitem.
4) We lock CAB.supplier for read, we lock CAB.parttbl for
   read, and we lock CAB.lineitem for read.
5) We do an all-AMPs RETRIEVE step from CAB.lineitem by way of an
   all-rows scan with a condition of ("CAB.lineitem.L_SHIPDATE =
   DATE '1995-01-01'") into Spool 2 (all_amps) (compressed columns
   allowed), which is redistributed by the hash code of (
   CAB.lineitem.L_SUPPKEY) to all AMPs. Then we do a SORT to
   order Spool 2 by row hash. The size of Spool 2 is estimated with
   high confidence to be 74,730 rows (9,415,980 bytes). The
   estimated time for this step is 16.61 seconds.
6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
   RowHash match scan, which is joined to CAB.supplier by way of
   a RowHash match scan with a condition of (
   "CAB.supplier.S_NATIONKEY = 6"). Spool 2 and CAB.supplier
   are joined using a merge join, with a join condition of (
   "CAB.supplier.S_SUPPKEY = L_SUPPKEY"). The result goes into
   Spool 3 (all_amps) (compressed columns allowed), which is
   redistributed by the hash code of (CAB.lineitem.L_PARTKEY) to
   all AMPs. Then we do a SORT to order Spool 3 by row hash. The
   size of Spool 3 is estimated with no confidence to be 13,105 rows
   (2,738,945 bytes). The estimated time for this step is 0.06
   seconds.
7) We do an all-AMPs JOIN step from CAB.parttbl by way of a
   RowHash match scan with a condition of ("CAB.parttbl.P_SIZE =
   3"), which is joined to Spool 3 (Last Use) by way of a RowHash
   match scan. CAB.parttbl and Spool 3 are joined using a merge
   join, with a join condition of ("CAB.parttbl.P_PARTKEY =
   L_PARTKEY"). The result goes into Spool 1 (group_amps), which is
   built locally on the AMPs. The size of Spool 1 is estimated with
   no confidence to be 13,105 rows (3,826,660 bytes). The estimated
   time for this step is 0.47 seconds.
8) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
```

Confidence levels are associated with query plan steps as an assessment of the confidence the optimizer has in the accuracy of the demographic information available for that step. For the most part, High Confidence and Low Confidence steps will be treated the same by the optimizer when it comes to making plan decisions. Even with good statistics collected, there will be cases where Low Confidence is delivered, especially for steps near the bottom of a complex query plan.

6 REPLIES

11-07-2009
11:02 AM

Dear Carrie

Allow me to be a bit selfish and ask you an off-the-track but related question.

While collecting statistics on a table, I got an "out of spool space" error, so much so that it resulted in a mini-cylpack alert. I killed the session, but what is the solution for such a situation? Why did I get a spool error while collecting stats?

Does that mean I cannot collect stats on that table at all?

Looking forward to your valued insight.

11-09-2009
01:56 PM

Statistics collection has always been resource-intensive. A full table scan is done on the table and there are several aggregate operations performed for each statistic in order to calculate this type of information:

- Distinct values
- Maximum rows/value
- Rows/null

These aggregations involve spool. Running out of spool space may be a result of collecting statistics on columns/indexes whose values are skewed, or just not having a large enough spool allocation. (You could run a query/macro to calculate PeakSpool to determine if skew is a factor.)
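
One way to run such a check is against the DBC.DiskSpace dictionary view, which records per-AMP peak spool (a sketch; the view and column names may vary by release, and the user name is a placeholder):

```sql
-- Compare peak spool across AMPs for the stats-collecting user; a large
-- gap between MAX and AVG suggests skewed spool usage.
SELECT DatabaseName,
       MAX(PeakSpool) AS MaxPeakSpool,
       AVG(PeakSpool) AS AvgPeakSpool,
       MAX(PeakSpool) / NULLIFZERO(AVG(PeakSpool)) AS SkewRatio
FROM DBC.DiskSpace
WHERE DatabaseName = 'stats_user'  -- placeholder user name
GROUP BY DatabaseName;
```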

You could increase the spool allocation for the collect statistics user, to make sure enough spool space is always available. However, if you are getting mini-cylpacks while collecting stats, that is a sign that there is not enough physical space to accommodate the work being done. You could eliminate old or unwanted tables as a temporary fix, run the packdisk utility more regularly, or longer-term consider adding more disk drives.
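
Increasing the spool allocation is a one-statement change (a sketch; the user name and size are placeholders to adapt to your site):

```sql
-- Raise the spool ceiling for the user that runs statistics collection.
MODIFY USER stats_user AS SPOOL = 200e9 BYTES;
```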

It is important to address the mini-cylpack problem, as getting mini-cylpacks will cause the statistics collection process (and other work) to run longer.

11-10-2009
11:47 AM

Many thanks for the direction...

12-31-2009
01:44 PM

Hello Carrie,

Could you please explain what the confidence levels mean for the recommended statistics (enabled by DIAGNOSTIC HELPSTATS ... )?

03-17-2010
11:48 AM

Hi Carrie,

Three simple questions:

1. Is the confidence level also dependent on the sampling rate? Most databases only sample table data to generate statistics, which is especially useful for a large table. For a small table, statistics may be accurate.

2. If I am looking for a non-indexed, non-partitioned value, the only way Teradata can get that data is by a full table scan. Does Teradata indicate statistics with high confidence in this case, since it knows that an FTS is needed and it knows how long that takes?

3. I guess the time that the explain plan indicates depends on the load on the system at that point in time. In other words, for the same SQL, the explain plan differs in its estimation based on what resources are available. How does this affect an AJI that is being refreshed, which may be used by the SQL?

04-13-2010
03:42 AM

Hi, I am Raghavender from India.

If anyone has a script generator that will automatically generate a generic mload, fload, or fexp template, could you please send it to me?
