Blog

The best minds from Teradata, our partners, and customers blog about relevant topics and features.

turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

04-29-2010
10:40 AM

- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

04-29-2010
10:40 AM

Collecting full statistics involves scanning the base table and performing a sort to compute the number of occurrences for each distinct value. For some users, the time and resources required to adequately collect statistics and keep them up-to-date is a challenge, particularly on large tables. Collecting statistics on a sample of the data reduces the resources required and the time to perform statistics collection. This blog gives you some tips on using sampled stats.

Sampled stats, like full stats, are only collected when an explicit command is issued. That COLLECT STATISTICS command must include the phrase USING SAMPLE. The output of this sampling is kept in the data dictionary, and histograms are built, in the same manner as when full collection is done. The optimizer uses the sampled stats just as though they were collected in the non-sampled manner. So it is important that you vaildate that the output of USING SAMPLE produces just as good plans as do full statistics.

When USING SAMPLE is specified, the system will automatically determine the correct percentage of the data to scan, which by default is 2% of the rows. The percent of the table that is read is carried in a DBS Control parameter called CollectStatsSample. The setting can be increased, and once increased, all statements that collect sampled statistics will use the new percentage value going forward.

For most columns, USING SAMPLE only reads the percent of the table that is specified in that DBS Control field, and skips the rest. It then makes caculations as to how many distinct values the full table has. Teradata 12 uses the original approach to sampling that came out in Teradata V2R5, while Teradata 13 uses a different approach, that I have found to be more accurate. If you decided not to use sampled stats in the earlier releases, because it was less accurate than you liked, give it another try when you get to Teradata 13.

Using the Teradata 12 sampling functionality, the best choices for USING SAMPLE are columns or indexes that are unique or nearly-unique. Nearly-unique refers to columns or combinations of columns which are over 95% unique.

If statistics were collected with USING SAMPLE you will be able to validate that by looking at the histogram that is produced. Below are a subset of the fields contained within Interval zero of a statistics histogram from Teradata 12, with full statistics collected and then with USING SAMPLE collected at 2%. Notice that the “Number of Unqiues” field is similar in both, making USING SAMPLE a good choice for this column. Also notice that the field “Sampled” contains a “1” when sampling was used.

You might want to use USING SAMPLE at 2% for your nearly-unique columns, but collect at a somewhat higher percentage, 20% or 30% for example, for columns with fewer distinct values. If you are not happy having all sampled statistics collected at the same percentage, there is a diagnostic command that allows you to change the sampling percent at the session level. This command will temporarily override what is in DBS control. Be aware, however, if you recollect stats at the table level at a later time, the sampling carried in DBS Control will be used automatically.

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=n" ON FOR SESSION;

Whatever number you place where 'n' is, that's your percent sample size. After stats are collected, I suggest you check the histogram in the "sampled percent" field for verification that this percent was used.

Here's a couple of examples I’ve used:

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=20" ON FOR SESSION;

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=50" ON FOR SESSION;

Whatever percent you collect sampled stats at, it’s a good idea validate that the number of distinct values that sampling produced is approximately the same as the number of distinct values with full statistics collected. You can do this by first collecting full statistics, followed by a HELP STATs command to see the number of distinct values that was produced. Remember that number. Then do your sampled stats collection and look at the HELP STATs output again, and compare the two.

And as with everything having to do with statistics, getting good query plans is the best validation when you make a change.

29 Comments

You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.