Can We Speak Confidentially? Exposing Explain Confidence Levels

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Teradata Employee

Can We Speak Confidentially? Exposing Explain Confidence Levels

I gave a presentation at the Teradata Partners Conference last week on the fundamentals of collecting statistics, where I touched briefly on the rules behind confidence levels. I’m using this article to go into further depth, and offer you more examples of how these confidence levels come about. If you look at query plans much, I’m confident you’ll find something of interest in this article.

The Mandatory Background Section (Skip if you’re Teradata-Knowledgeable)

Confidence levels are determined by the optimizer at the time a query plan is built as an indication of how comfortable the optimizer is by the estimates it comes up with for a given step. High Confidence being the greatest comfort level, and No Confidence being the least.

Some steps don’t give you a confidence level because optimizing the step is a no-brainer. Below is an example of single-AMP using primary index access. Since access by primary index is automatic when the query provides a primary index value, no costing decisions need to be made and confidence levels are immaterial.

Here is an example of an all-AMP access step from a query plan that illustrate a confidence level typical of an all-AMP step:

EXPLAIN 
SELECT *
FROM customer
WHERE c_nationkey = 12;

    3) We do an all-AMPs RETRIEVE step from CAB30G.customer by way of an

        all-rows scan with a condition of ("CAB30G.customer.C_NATIONKEY =

        12") into Spool 1 (group_amps), which is built locally on the AMPs.

        The size of Spool 1 is estimated with high confidence to be

        179,308 rows (20,620,420 bytes). The estimated time for this step

        is 0.53 seconds.

Here is an example of a primary index access step with a missing confidence level:

EXPLAIN
SELECT *
FROM customer
WHERE c_custkey = 73454;

    1) First, we do a single-AMP RETRIEVE step from CAB30G.customer by

        way of the unique primary index "CAB30G.customer.C_CUSTKEY =

        73454" with no residual conditions. The estimated time for this

        step is 0.00 seconds. 

The Confidence Levels

There are 4 confidence levels for the optimizer to select from. The following graphic illustrates the pre-requisites for receiving each of these levels:

 

High Confidence

As shown above, primary index statistics must have been collected on the table being accessed in the step in order to receive High Confidence for that step. Further, there must only be a single predicate (single or multicolumn) with statistics collected on that predicate. There are two example examples below that illustrate this requirement.

For the two following examples, statistics were collected on the following:

  • T_Txnkey (primary index)
  • T_Quantity (column)
  • T_Shipdate (column)

The following explain shows that High Confidence results when primary index stats have been collected AND there is a single predicate:

EXPLAIN
SELECT * FROM txn
WHERE t_quantity = 5;

     3) We do an all-AMPs RETRIEVE…with a condition of ("CAB.txn.

        T_QUANTITY = 5.00") into Spool 1…The size of Spool 1 is 

        estimated with high confidence to be 199,878 rows… 

This next example illustrates that even with statistics collected on all referenced columns, Low Confidence is returned if there is more than one predicate on the table:

EXPLAIN
SELECT * FROM txn
WHERE t_quantity = 5
AND t_shipdate = 980101

     3) We do an all-AMPs RETRIEVE … with a condition of

        ("(CAB.txn.T_SHIPDATE = DATE '1998-01-01') AND

        (CAB.txn.T_QUANTITY = 5.00)") into Spool 1…The size of

        Spool 1 is estimated with low confidence to be 3,106 rows. . . 

If multicolumn stats had been collected on (T_Quantity,T_Shipdate), then the step would have qualified for High Confidence, since a multicolumn statistic is considered a single predicate. Note that for a multicolumn statistics to be used by the optimizer, all participating columns must be expressed in equality conditions within the query.

Low Confidence

For the vast majority of cases, the optimizer will cost a Low Confidence step the same as a High Confidence step. So there is little, if any, value in going to extreme lengths to achieve High Confidence steps. By the way, statistics are considered “collected” whether they are fully collected are or you have collected them taking advantage of the USING SAMPLE option. I just now ran the explain of the first query in this article after re-collecting statistics on c_nationkey specifying USING SAMPLE with a 2% sampling rate. I still received High Confidence for the step.

Low Confidence and Join Steps

Practically all join steps will give you Low Confidence. This is because the optimizer is not aware of what the match will be between the tables being joined. Below is a graphic that illustrates this point.

 

There is an exception to this. Under special circumstances a join step can receive High Confidence. The requirement is that a unique primary index is on one side of the join with stats collected, and there is a predicate in the WHERE clause that passes a single value for this unique primary index. An example follows:

EXPLAIN
SELECT * FROM txn,part
WHERE t_partkey = p_partkey <== UPI with stats, 1 row guaranteed
AND p_partkey = 88;

    3) We do an all-AMPs RETRIEVE step from CAB.txn by way of index # 4

        "CAB.txn.T_PARTKEY = 88". . . Spool 2 is estimated with high confidence

        to be 3 rows. . .

    4) We do a single-AMP JOIN step from CAB.part by way of the

        unique primary index "CAB.part.P_PARTKEY = 88“. . .joined to

        Spool 2 (Last Use). . . The size of Spool 1 is estimated with

        high confidence to be 3 rows. . .

Low Confidence and NUSI Selection with No Stats

If selection on the table is by NUSI, even though there are no statistics collected on the NUSI, Low Confidence will result . However, there must only be a single NUSI used as a predicate. If there are multiple NUSI’s referenced, No Confidence is returned. Below are two examples that illustrate this:

EXPLAIN
SELECT * FROM parttbl
WHERE p_size = 5 ; <== NUSI but no statistics

    3) We do an all-AMPs RETRIEVE step. . . by way of

        index # 4 "CAB.parttbl.P_SIZE = 5" . . . Spool 1 is estimated with low 

        confidence
to be 103,487 rows. 

EXPLAIN SELECT * FROM parttbl
WHERE p_size = 5 <== NUSI but no statistics
AND p_type NE 'small plated tin‘; <== NUSI but no statistics

     3) We do an all-AMPs RETRIEVE … by way of

        index # 4 "CAB.parttbl.P_SIZE = 5" with a residual condition

        of ("CAB.parttbl.P_TYPE <> 'small plated tin'") …

        Spool 1 is estimated with no confidence to be 92,342 rows .

The reasoning here is that random AMP samples provide basic demographic information about all NUSIs on a table. Since that information is available even without statistics collection on the NUSI, a Low Confidence is acceptable for such single NUSI access.

Trickle-Down Effect of No Confidence

If you’ve got complex, multi-step plans, chances are when you look at the steps at the bottom of the plan, most of them display No Confidence. One contributing factor is that lower confidence levels coming from previous steps will dominate the confidence level of the steps they feeds into.

Below is a query and its explain that illustrate this point:

In this example, full statistics on all indexes and referenced columns have been collected on both the Lineitem and the Parttbl tables. However no statistics at all have been collected on the Supplier table.

The following table illustrates the confidence levels at each step of this 3 table join.  Note that even though the Parttbl has all possible stats collected, the step that joins the Parttbl to the spool created by the join of the Lineitem and Supplier table shows No Confidence.  That is because the step feeding into it has No Confidence because no stats had been collected on Supplier.

 

Here is the entire explain from which information in the above table was drawn.

EXPLAIN SELECT * 
FROM lineitem,parttbl,supplier
WHERE p_partkey = l_partkey
AND s_suppkey = l_suppkey
AND p_size = 3
AND l_shipdate = 950101
AND s_nationkey = 6;

1) First, we lock a distinct CAB."pseudo table" for read on a

    RowHash to prevent global deadlock for CAB.supplier.

2) Next, we lock a distinct CAB."pseudo table" for read on a

    RowHash to prevent global deadlock for CAB.parttbl.

3) We lock a distinct CAB."pseudo table" for read on a RowHash to

    prevent global deadlock for CAB.lineitem.

4) We lock CAB.supplier for read, we lock CAB.parttbl for

    read, and we lock CAB.lineitem for read.

5) We do an all-AMPs RETRIEVE step from CAB.lineitem by way of an

    all-rows scan with a condition of ("CAB.lineitem.L_SHIPDATE =

    DATE '1995-01-01'") into Spool 2 (all_amps) (compressed columns

    allowed), which is redistributed by the hash code of (

    CAB.lineitem.L_SUPPKEY) to all AMPs. Then we do a SORT to

    order Spool 2 by row hash. The size of Spool 2 is estimated with

    high confidence to be 74,730 rows (9,415,980 bytes). The

    estimated time for this step is 16.61 seconds.

6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a

    RowHash match scan, which is joined to CAB.supplier by way of

    a RowHash match scan with a condition of (

    "CAB.supplier.S_NATIONKEY = 6"). Spool 2 and CAB.supplier

    are joined using a merge join, with a join condition of (

    "CAB.supplier.S_SUPPKEY = L_SUPPKEY"). The result goes into

    Spool 3 (all_amps) (compressed columns allowed), which is

    redistributed by the hash code of (CAB.lineitem.L_PARTKEY) to

    all AMPs. Then we do a SORT to order Spool 3 by row hash. The

    size of Spool 3 is estimated with no confidence to be 13,105 rows

    (2,738,945 bytes). The estimated time for this step is 0.06

    seconds.

7) We do an all-AMPs JOIN step from CAB.parttbl by way of a

    RowHash match scan with a condition of ("CAB.parttbl.P_SIZE =

3"), which is joined to Spool 3 (Last Use) by way of a RowHash

    match scan. CAB.parttbl and Spool 3 are joined using a merge

    join, with a join condition of ("CAB.parttbl.P_PARTKEY =

    L_PARTKEY"). The result goes into Spool 1 (group_amps), which is

    built locally on the AMPs. The size of Spool 1 is estimated with

    no confidence to be 13,105 rows (3,826,660 bytes). The estimated

    time for this step is 0.47 seconds.

8) Finally, we send out an END TRANSACTION step to all AMPs involved

    in processing the request.

Conclusion

Confidence Levels are associated to query plan steps in an effort to assess the confidence the optimizer has in the accuracy of the demographic information available for thatstep.  For the most part, High Confidence and Low Confidence steps will be treated the same by the optimizer when it comes to making plan decisions.  Even with good statistics collected, there will be cases where Low Confidence is delivered, especially for steps near the bottom of a complex query plan.

6 REPLIES
Enthusiast

Re: Can We Speak Confidentially? Exposing Explain Confidence Levels

Dear Carrie

Allow me to be bit selfish in asking you an off-the-track but a related question.

While collecting statistics on a table,I got "out of spool space " error,so much so that it resulted in mini cyl-pack alert. I took off the session but what's the solution for such a situation .Why I got spool error while collecting stats?
Does that mean I cannot collect stats even on that table?
Looking for your valued insight.
Teradata Employee

Re: Can We Speak Confidentially? Exposing Explain Confidence Levels

Statistics collection has always been resource-intensive. A full table scan is done on the table and there are several aggregate operations performed for each statistic in order to calculate this type of information:
Distinct values
Maximum rows/value
Rows/null

These aggregations involve spool. Running out of spool space may be a result of collecting statistics on columns/indexes whose values are skewed, or just not having a large enough spool allocation. (You could run a query/macro to calculate PeakSpool to determine if skew is a factor.)

You could increase the spool allocation for the collect statistics user, to make sure enough spool space is always available. However, if you are getting mini-cylpacks while collecting stats, that is a sign that there is not enough physical space to accommodate the work being done. You could eliminate old or unwanted tables as a temporary fix, run the packdisk utility more regularly, or longer-term consider adding more disk drives.

It is important to address the mini-cylpack problem, as getting mini-cylpacks will cause the statistics collection process (and other work) to run longer.

Enthusiast

Re: Can We Speak Confidentially? Exposing Explain Confidence Levels

Many thanks for the direction...
Teradata Employee

Re: Can We Speak Confidentially? Exposing Explain Confidence Levels

Hello Carrie,

Could you please explain what does the confidence levels mean by the
recommended statistics (enabled by diagnostic helpstats ... ).
Enthusiast

Re: Can We Speak Confidentially? Exposing Explain Confidence Levels

Hi Carrie,
Three simple questions:
Is confidence level also dependent on sample rate. Most databases, only sample table data to generate statistics. This is especially useful for a large table. If it is a small table, statistics may be accurate.
If I am looking for a non indexed & non partitioned value, only way Teradata gets that data is by full table scans. Does Teradata indicate statistics with high confidence, since it knows that FTS is needed in this case and it knows how long that takes.
I guess time that explain plan indicates is dependent on load on the system at that point in time. In other words, for the same SQL, explain plan differs in its estimation based on what resources are available. How does this affect AJI that is getting refreshed, which may be used by SQL.
Fan

Re: Can We Speak Confidentially? Exposing Explain Confidence Levels

Hi Iam Raghavender from India,
If anyone has a script generator which will automatically generate an mload, fload, or fexp generic template, can you please send me