Collecting Stats on PARTITION for NON-PARTITIONED tables

Teradata Employee

Please note: This posting is relevant only to software releases prior to Teradata Database 14.0. Ignore this recommendation if you are on 14.0 or later software.

You may have already heard the recommendation on collecting statistics on the system-derived column PARTITION on partitioned primary index (PPI) tables. What I want to address in this blog may sound strange, but there are times when collecting statistics on PARTITION makes sense for a NON-PARTITIONED primary index (NPPI) table as well.

I’ve supplied some examples below of why you might want to do that, and what you’ll see if you do. By the way, an NPPI table is treated as if it were just one big partition with a partition number of 0.

Here are a couple of reasons to consider collecting PARTITION stats on NPPI tables:

Reason 1:

Prior to Teradata 13.0 (which improves SELECT COUNT(*) efficiency), collecting stats on PARTITION and then accessing the histogram detail is often faster than issuing a SELECT COUNT(*) against the non-PPI table. This can be helpful if you are in a hurry and want to determine the current number of rows in a big table.
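In its simplest form, the comparison is between timing these two statements (shown here with the pre-14.0 COLLECT STATISTICS syntax, and using the item table from the examples below):

COLLECT STATISTICS ON item COLUMN PARTITION;

select count(*) from item;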

On my Teradata 12 test system, depending on the number of rows in the table, collecting statistics on PARTITION was quite a bit faster than doing the SELECT COUNT(*).

Of course, just collecting statistics does not in and of itself give you the table’s row count. You have to look at the actual histogram counts by issuing this statement:

HELP STATS item COLUMN PARTITION;

After collecting statistics on PARTITION on the Item table, you will get two rows back in the histogram (only a subset of the columns is of interest here).

The field labeled “Number of Rows” provides an accurate row count of this table at the time the statistics were collected.

Here is the SELECT COUNT(*) result:

select count(*) from item;

Count(*)

179998372

Reason 2: 

Starting in Teradata 12.0.3 and Teradata 13.10 (and possibly Teradata 13.0.1), there are new enhancements in the optimizer around the determination of stale statistics. As in previous releases, statistics extrapolation will happen only if statistics have been determined to be stale, the table is not a small table, and there is no significant skew present.

With this enhancement, the table growth detection logic has improved. As part of it, the table row count in the PARTITION statistics histogram can be conditionally used to help determine whether the table has grown, and by how much. PARTITION stats come into play only if PARTITION statistics have actually been collected on the table, and only if they offer the most recent row count demographic. This means that frequent recollection on the system-derived column PARTITION for growing tables can help the database provide better statistics extrapolation, while still relying on single-AMP random AMP sampling.
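Since a recollection reuses the existing statistics definition, keeping that row count demographic current can be as simple as re-running the same statement (here against the item table from the earlier example) on whatever schedule matches the table’s growth:

COLLECT STATISTICS ON item COLUMN PARTITION;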

Conclusion

Since collecting statistics on PARTITION is so quick, even for larger tables, it is a good technique for giving the optimizer additional information that it can compare against random AMP samples to determine whether statistics are stale. This collection could be done several times a day, if that is helpful.

And in case you’ve forgotten why collecting stats on PARTITION is so fast, let me remind you: the table’s cylinder indexes are scanned instead of its data blocks and rows. Reading the cylinder indexes may involve accessing a few data blocks when partitions span blocks, but otherwise only cylinder index structures are examined.

11 Comments
Enthusiast
Thanks for such a wonderful tip, Carrie!
Enthusiast
Carrie, suppose I have a table partitioned on a date column called log_date, which is also part of the UPI. I collected stats on both the system-derived column PARTITION and the partitioning column log_date itself, and both succeeded. Do we really need both stats? If not, which one is more significant?
Enthusiast
Hello Carrie,
I have been looking for answers to a basic question on table partitioning in Teradata. Can you please shed more light? Here it is:

If we do range partitioning on a table based on numeric values, and then different jobs load data corresponding to different partition ranges in parallel, do we get into a resource contention scenario, or does each parallel job hit its specific partition and acquire a lock only on that partition? In short, the question is: while loading data into a partitioned table [in this case range partitioned], is the lock acquired at the partition level, so that multiple programs can run in parallel, each locking an individual partition, or is the lock acquired at the whole-table level, so that the parallel jobs get into contention?

Ex: an error table where each program loads the specific error information it encounters, based on error code. If the table is range partitioned on error codes in ranges of 1000, will the individual programs, when run in parallel, each hit their specific partition and acquire a lock only on it, or will they acquire a complete table lock and release it for subsequent jobs when their error load is done?

Please advise.
Teradata Employee
Hi Bodhi,

When you say "different load jobs," do you mean different TPump jobs? Locking conflicts will differ depending on the type of load approach you intend to use.

Only TPump (or other homegrown row-at-a-time processing jobs) will allow multiple jobs to load into the same table at the same time, since TPump does row hash level locking. MultiLoad and FastLoad use table level locking.

There is no partition-level locking in Teradata, so if you want to run multiple TPump jobs in parallel to load the same table the potential for locking contention will NOT be worse because the table is partitioned.

Under some conditions, locking contention will be less when loading with TPump. This would be the case if you had NUPIs defined on the table being loaded, and you were loading multiple rows with the same NUPI value across different sessions or different load jobs at the same time. With a non-partitioned table, all NUPIs would share the same row hash, and could conflict. But with a partitioned table, it might be that all rows with the same PI value had different partitioning column values and ended up in different partitions. Then they would not conflict with each other, because their row hash values (which include the partition) would be different.
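As a quick sketch of that case (the table and column names below are invented purely for illustration):

CREATE TABLE err_log
  ( err_code  INTEGER NOT NULL,
    sess_id   INTEGER NOT NULL,
    err_text  VARCHAR(256) )
PRIMARY INDEX ( sess_id )
PARTITION BY RANGE_N(err_code BETWEEN 1 AND 10000 EACH 1000);

Two rows with the same sess_id value share a row hash, but if their err_code values fall into different ranges, they land in different partitions, and their internal row identifiers (row hash plus partition) no longer collide.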

Thanks, -Carrie
Enthusiast
Thanks for your detailed reply, Carrie.

We are not using the TPump functionality for our loads. We are performing loads into the table in question via Informatica using a relational connection, which I believe performs row-by-row operations. We do have a NUPI defined on this table, but the partitioning isn't based on the NUPI; it is on a different numeric column.

By "different jobs", I meant the different Informatica jobs that try to access this table for row insertion at the same time. The table is used across the warehouse for loading error-related data, and when different Informatica sessions try locking the table at the same time, they get into contention/deadlock. We are considering partitioning the table based on the numeric value of error codes, but in this case, as I understand it, since the jobs are using a relational connection, they might acquire a WRITE lock on the whole table, and thus the table might not be available to other jobs hitting it at the same time. Hence, partitioning might not be a solution here.

Please let me know your thoughts on this.

Thanks, Carrie.
Enthusiast
Here are additional details:

I partitioned the target table [DBNM.ERROR_TARGET] based on the numeric partition key of error code, and collected the Explain plan of an Insert query that inserts records into the target table from another, similar table [DBNM.ERROR_SOURCE].

1) First, we lock a distinct DBNM."pseudo table" for write on a
RowHash to prevent global deadlock for DBNM.ERROR_TARGET.
2) Next, we lock DBNM.ERROR_TARGET for write, and we lock
DBNM.ERROR_SOURCE in view DBVW.ERROR_SOURCE for access.

We see that the lock on the partitioned target table is a "WRITE LOCK" based on "ROW HASH". In this case, I believe the requestor gets exclusive rights to the locked resource, except that SELECT operations are still allowed. Also, since it is row hash based, it would apply the lock to all rows with the same hash code. The row hash would be derived from the NUPI value, which is not the partitioning key here. So, if a parallel Informatica program tried loading data into the target table, is there a possibility of contention?
Teradata Employee
The target table receives a table-level lock, based on your explain information above. The pseudo table lock, which is applied at the row hash level, is something different; it is not a base table lock. For an explanation of pseudo table locking, see the Teradata manuals. You can find a good explanation in the Release 12 Performance Management Guide, p. 177.

If a load job places a table-level write lock, only one such job will be able to run at a time. Partitioning the table will not allow more jobs to run at once (each in a different partition), since there is no ability to lock a partition. You will still have the table-level lock contention even if you introduce partitioning into the table.
Enthusiast
I get it, Carrie. Thanks a lot for clearing up my doubt! Thanks again.
Carrie

Is there a reason, from an internals perspective, why we couldn't get the same thing from any column in a table where stats already exist (if collected daily, with the table loaded by batch once daily) by doing a HELP STATS on a column and looking at the Number of Rows value returned for the intervals? I'm just wondering if there is a difference between what the PARTITION histogram would provide versus the column-level histogram if collected at the same frequency.

Is there something I am missing here?

I would also think that the usefulness of either method would be dependent upon the regularity with which the stats are collected.

I could see that if we were doing 20% sample collection on the PIs of tables and a full collection on PARTITION for non-PPI tables daily, this would become very useful.

Thanks

Kevin
Teradata Employee
Hi Kevin,

You bring up a good point.

As mentioned above, from 12.0.3 onwards, PARTITION statistics on a table (if they exist) are used to estimate table cardinality for good extrapolation, even for tables that are non-PPI. While it is possible to get the same information from the latest histogram of any column, it could take a lot of query parsing time to probe all the histograms to figure out which one is latest. To avoid this overhead, it was determined that the optimizer would use the primary index and/or PARTITION histograms to estimate the table cardinality. The fact that collecting statistics on PARTITION is generally very fast is an additional factor in favor of this approach.
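If you ever want to check the recency of a table's statistics yourself, the last collection date and time for each statistic appear in the summary help output, for example (using the item table from the blog):

HELP STATISTICS item;

Each row returned shows the Date, Time, Unique Values, and Column Names for one collected statistic.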
Thanks, Carrie. I was just making sure I understood the concept :o)

At our current site, we will be implementing a stats collection routine to do sample statistics based upon some recommendations you have made elsewhere, and I was trying to see how this approach might fit.