I gave a presentation at the Teradata Partners Conference last week on the fundamentals of collecting statistics, where I touched briefly on the rules behind confidence levels. I’m using this article to go into further depth, and offer you more examples of how these confidence levels come about. If you look at query plans much, I’m confident you’ll find something of interest in this article.
Confidence levels are determined by the optimizer at the time a query plan is built as an indication of how comfortable the optimizer is by the estimates it comes up with for a given step. High Confidence being the greatest comfort level, and No Confidence being the least.
Some steps don’t give you a confidence level because optimizing the step is a no-brainer. Below is an example of single-AMP using primary index access. Since access by primary index is automatic when the query provides a primary index value, no costing decisions need to be made and confidence levels are immaterial.
Here is an example of an all-AMP access step from a query plan that illustrate a confidence level typical of an all-AMP step:
WHERE c_nationkey = 12;
3) We do an all-AMPs RETRIEVE step from CAB30G.customer by way of an
all-rows scan with a condition of ("CAB30G.customer.C_NATIONKEY =
12") into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with high confidence to be
179,308 rows (20,620,420 bytes). The estimated time for this step
is 0.53 seconds.
Here is an example of a primary index access step with a missing confidence level:
WHERE c_custkey = 73454;
1) First, we do a single-AMP RETRIEVE step from CAB30G.customer by
way of the unique primary index "CAB30G.customer.C_CUSTKEY =
73454" with no residual conditions. The estimated time for this
step is 0.00 seconds.
There are 4 confidence levels for the optimizer to select from. The following graphic illustrates the pre-requisites for receiving each of these levels:
As shown above, primary index statistics must have been collected on the table being accessed in the step in order to receive High Confidence for that step. Further, there must only be a single predicate (single or multicolumn) with statistics collected on that predicate. There are two example examples below that illustrate this requirement.
For the two following examples, statistics were collected on the following:
The following explain shows that High Confidence results when primary index stats have been collected AND there is a single predicate:
SELECT * FROM txn
WHERE t_quantity = 5;
3) We do an all-AMPs RETRIEVE…with a condition of ("CAB.txn.
T_QUANTITY = 5.00") into Spool 1…The size of Spool 1 is
estimated with high confidence to be 199,878 rows…
This next example illustrates that even with statistics collected on all referenced columns, Low Confidence is returned if there is more than one predicate on the table:
SELECT * FROM txn
WHERE t_quantity = 5
AND t_shipdate = 980101
3) We do an all-AMPs RETRIEVE … with a condition of
("(CAB.txn.T_SHIPDATE = DATE '1998-01-01') AND
(CAB.txn.T_QUANTITY = 5.00)") into Spool 1…The size of
Spool 1 is estimated with low confidence to be 3,106 rows. . .
If multicolumn stats had been collected on (T_Quantity,T_Shipdate), then the step would have qualified for High Confidence, since a multicolumn statistic is considered a single predicate. Note that for a multicolumn statistics to be used by the optimizer, all participating columns must be expressed in equality conditions within the query.
For the vast majority of cases, the optimizer will cost a Low Confidence step the same as a High Confidence step. So there is little, if any, value in going to extreme lengths to achieve High Confidence steps. By the way, statistics are considered “collected” whether they are fully collected are or you have collected them taking advantage of the USING SAMPLE option. I just now ran the explain of the first query in this article after re-collecting statistics on c_nationkey specifying USING SAMPLE with a 2% sampling rate. I still received High Confidence for the step.
Practically all join steps will give you Low Confidence. This is because the optimizer is not aware of what the match will be between the tables being joined. Below is a graphic that illustrates this point.
There is an exception to this. Under special circumstances a join step can receive High Confidence. The requirement is that a unique primary index is on one side of the join with stats collected, and there is a predicate in the WHERE clause that passes a single value for this unique primary index. An example follows:
SELECT * FROM txn,part
WHERE t_partkey = p_partkey <== UPI with stats, 1 row guaranteed
AND p_partkey = 88;
3) We do an all-AMPs RETRIEVE step from CAB.txn by way of index # 4
"CAB.txn.T_PARTKEY = 88". . . Spool 2 is estimated with high confidence
to be 3 rows. . .
4) We do a single-AMP JOIN step from CAB.part by way of the
unique primary index "CAB.part.P_PARTKEY = 88“. . .joined to
Spool 2 (Last Use). . . The size of Spool 1 is estimated with
high confidence to be 3 rows. . .
If selection on the table is by NUSI, even though there are no statistics collected on the NUSI, Low Confidence will result . However, there must only be a single NUSI used as a predicate. If there are multiple NUSI’s referenced, No Confidence is returned. Below are two examples that illustrate this:
SELECT * FROM parttbl
WHERE p_size = 5 ; <== NUSI but no statistics
3) We do an all-AMPs RETRIEVE step. . . by way of
index # 4 "CAB.parttbl.P_SIZE = 5" . . . Spool 1 is estimated with low
confidence to be 103,487 rows.
EXPLAIN SELECT * FROM parttbl
WHERE p_size = 5 <== NUSI but no statistics
AND p_type NE 'small plated tin‘; <== NUSI but no statistics
3) We do an all-AMPs RETRIEVE … by way of
index # 4 "CAB.parttbl.P_SIZE = 5" with a residual condition
of ("CAB.parttbl.P_TYPE <> 'small plated tin'") …
Spool 1 is estimated with no confidence to be 92,342 rows .
The reasoning here is that random AMP samples provide basic demographic information about all NUSIs on a table. Since that information is available even without statistics collection on the NUSI, a Low Confidence is acceptable for such single NUSI access.
If you’ve got complex, multi-step plans, chances are when you look at the steps at the bottom of the plan, most of them display No Confidence. One contributing factor is that lower confidence levels coming from previous steps will dominate the confidence level of the steps they feeds into.
Below is a query and its explain that illustrate this point:
In this example, full statistics on all indexes and referenced columns have been collected on both the Lineitem and the Parttbl tables. However no statistics at all have been collected on the Supplier table.
The following table illustrates the confidence levels at each step of this 3 table join. Note that even though the Parttbl has all possible stats collected, the step that joins the Parttbl to the spool created by the join of the Lineitem and Supplier table shows No Confidence. That is because the step feeding into it has No Confidence because no stats had been collected on Supplier.
Here is the entire explain from which information in the above table was drawn.
EXPLAIN SELECT *
WHERE p_partkey = l_partkey
AND s_suppkey = l_suppkey
AND p_size = 3
AND l_shipdate = 950101
AND s_nationkey = 6;
1) First, we lock a distinct CAB."pseudo table" for read on a
RowHash to prevent global deadlock for CAB.supplier.
2) Next, we lock a distinct CAB."pseudo table" for read on a
RowHash to prevent global deadlock for CAB.parttbl.
3) We lock a distinct CAB."pseudo table" for read on a RowHash to
prevent global deadlock for CAB.lineitem.
4) We lock CAB.supplier for read, we lock CAB.parttbl for
read, and we lock CAB.lineitem for read.
5) We do an all-AMPs RETRIEVE step from CAB.lineitem by way of an
all-rows scan with a condition of ("CAB.lineitem.L_SHIPDATE =
DATE '1995-01-01'") into Spool 2 (all_amps) (compressed columns
allowed), which is redistributed by the hash code of (
CAB.lineitem.L_SUPPKEY) to all AMPs. Then we do a SORT to
order Spool 2 by row hash. The size of Spool 2 is estimated with
high confidence to be 74,730 rows (9,415,980 bytes). The
estimated time for this step is 16.61 seconds.
6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
RowHash match scan, which is joined to CAB.supplier by way of
a RowHash match scan with a condition of (
"CAB.supplier.S_NATIONKEY = 6"). Spool 2 and CAB.supplier
are joined using a merge join, with a join condition of (
"CAB.supplier.S_SUPPKEY = L_SUPPKEY"). The result goes into
Spool 3 (all_amps) (compressed columns allowed), which is
redistributed by the hash code of (CAB.lineitem.L_PARTKEY) to
all AMPs. Then we do a SORT to order Spool 3 by row hash. The
size of Spool 3 is estimated with no confidence to be 13,105 rows
(2,738,945 bytes). The estimated time for this step is 0.06
7) We do an all-AMPs JOIN step from CAB.parttbl by way of a
RowHash match scan with a condition of ("CAB.parttbl.P_SIZE =
3"), which is joined to Spool 3 (Last Use) by way of a RowHash
match scan. CAB.parttbl and Spool 3 are joined using a merge
join, with a join condition of ("CAB.parttbl.P_PARTKEY =
L_PARTKEY"). The result goes into Spool 1 (group_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
no confidence to be 13,105 rows (3,826,660 bytes). The estimated
time for this step is 0.47 seconds.
8) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
Confidence Levels are associated to query plan steps in an effort to assess the confidence the optimizer has in the accuracy of the demographic information available for thatstep. For the most part, High Confidence and Low Confidence steps will be treated the same by the optimizer when it comes to making plan decisions. Even with good statistics collected, there will be cases where Low Confidence is delivered, especially for steps near the bottom of a complex query plan.