Stats will help in pointing out the data demographics of the tables involved. This will ensure that the least data is brought in for processing
for example : if there is a join on a column what and there is a qualifier existing on it as well ie where clause to filter data based on the same column, the stats will determine the table on which the stats are applied based on where the maximum data will be eliminated.
similarly if a copy of a table is to be created across all amps for joining, stats will help in choosing the table to be replicated based on the data demographics.
Thankx for the reply Anadi... But i am not exactly looking for that My question was what really is use of stats collected on set selection non-index column ,even in case of JOIN it will be used to eliminate rows that will participate in JOIN but that is also related to estimation of spool file size.
I don't think we have stats based on per AMP , so that we can have a 1 AMP operation sort of thing if optimizer knows that this particular value exists on which AMP......
So why do we really need stats on non-index column having predicate....
Prakhar, may I refer you to an article by Carrie Ballinger on the Teradata Developer Network:
The reason for collecting single column and multi-column stats where appropriate is to provide the optimizer with the most accurate histogram of the data being used to satisfy a particular query. (This includes join columns, WHERE conditions, etc.) Statistics alone can take a poorly performing query plan and drastically improve the query plan without the need and overhead associated with a NUSI.
If you haven't done so already, consider using the following DIAGNOSTIC hint:
DIAGNOSTIC HELPSTATS ON FOR SESSION;
Then take a few queries that are performing below your expectations and run an EXPLAIN. At the end of the EXPLAIN will be suggestions on which statistics may help the optimizer improve the query plan. Focus your attention on those that should yield a high confidence. Capture the query plan before and after your collect the stats to see what impact was made. It should be noted there is a bug in Teradata 12 (depending on your current patch release) where the recommended stats to be collected are shown even if the stats exist. It should also be noted that you may not need to collect all the recommended stats in order to yield the best possible query plan.
Once you collect stats, you also need to maintain them. Stale statistics can be dangerous if the underlying data changes in volume or the demographics of the data shifts drastically.
Hi Rob, Thankx for your reply... As you have said "The reason for collecting single column and multi-column stats where appropriate is to provide the optimizer with the most accurate histogram of the data being used to satisfy a particular query" But ultimately even if correct stats are presented for a query like select * from table where column=300 Full table scan will happen...... Using stats optimizer can find out how many rows are expected to have value as 300 and build estimate spool size... Apart from that is there anything else that helps optimizer in this case...
For a simple query, you are correct that stats on non-index columns merely give you a better cardinality estimate (and don't change the query plan). But it may be important to have that more accurate cardinality if you are doing some initial workload management classification based on the estimated query cost.
For more complex queries, an improved estimate for a subquery may be important in coming up with the optimal plan for the outer query. Or if the non-indexed column is used as a join predicate, it may be important for the optimizer to know the distribution & skew of values (i.e. the histogram or demographics) to pick a reasonable "join geography" / hash distribution for spool.