All of our tables have effective start and end date columns (EFF_STRT_DT, EFF_END_DT). We want to create date partitions on some of our larger tables, and our first instinct is to partition on EFF_END_DT, because we set that date to '9999-12-31' to represent "active" rows in our data warehouse.
Most of the time, we will actually be doing a BETWEEN comparison similar to the following:
select * from fpdw_bkup.pty_pi
where '2014-03-31' between EFF_STRT_DT and EFF_END_DT;
The table in the code snippet is defined with a partition statement of
PARTITION BY RANGE_N(EFF_END_DT BETWEEN DATE '2012-12-31' AND DATE '9999-12-31' EACH INTERVAL '7' DAY , NO RANGE, UNKNOWN)
Thanks for your reply Raja, but I don't think that's what I'm looking for.
I guess I'm trying to get some assurance that if the optimizer uses a partition for access when it builds the EXPLAIN plan instead of an all-rows scan, I can trust it will be the most efficient access method, regardless of the number of partitions it uses for retrieval.
The large number of partitions I saw in the EXPLAIN took me by surprise, and that is why I'm asking about this.
You defined 416,741 partitions, and even if there are PARTITION stats, the optimizer must still include all possible partitions in the plan.
You should limit the number of partitions, e.g.
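Dieter's exact syntax isn't quoted in this thread, but a bounded definition along these lines would limit the partition count; the end date and interval below are illustrative assumptions, not his actual suggestion:

```sql
-- Hypothetical sketch: weekly partitions only over the range that will
-- actually hold data. Rows with EFF_END_DT = '9999-12-31' fall into the
-- NO RANGE partition instead of forcing hundreds of thousands of empty
-- weekly ranges between the data and the "high date".
PARTITION BY RANGE_N(
    EFF_END_DT BETWEEN DATE '2012-12-31' AND DATE '2020-12-31'
               EACH INTERVAL '7' DAY,
    NO RANGE, UNKNOWN)
```

The upper bound would need to be extended (via ALTER TABLE ... MODIFY ... ADD RANGE) before real end dates pass it.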
I think I understand what you are getting at. Here's another issue, though...that '9999-12-31' date is overwhelmingly the most frequently occurring date that we have, so of course when I was testing the partition for one of our large Accounts Receivable tables (where that was the date value for well over 80% of the rows), it made things worse for my query...bad enough to run out of spool.
So, this leads me to believe that the partitions that are defined have to be somewhat balanced...is that true? Should I find another common JOIN column to combine with our date to define a multi-level partition? Or is there another technique I should look at?
- If the optimizer chooses to use partitions as in your initial example, it will be less costly than an all-rows scan. There is no separate overhead for scanning individual partitions; think of it as a starting (and/or ending) position for a scan that then covers a subset of the table.
- Dieter pointed out the reason why your plan showed so many partitions. Generally it is not advised to define a huge number of empty partitions, especially when they will never be used. A large number of empty partitions interferes with query optimization and with some maintenance operations on the table. It is better to be more specific in the partitioning expression, like the version suggested by Dieter. Then your query will show a much lower partition count because there will no longer be thousands of empty partitions sitting between current dates and your "high date".
- Your partitions do not have to be balanced. No requirement for that. However, if the bulk of your data is in one or more partitions that are included in your condition, then that data needs to be processed to answer the query and you must have spool space for that data as it is being processed. Unless there is some way to further qualify that data, having spool space for it is unavoidable.
- Another strategy that some have used with a time-based table like this, especially when the bulk of the queries are going to be accessing the "current" or "open" records, is to define just two partitions. One has just the high date and the other has all other valid dates. This makes a two-partition table that segregates the current rows from the historical rows and allows the queries that ask about just current rows to skip all the historical rows. However, the query then needs to be coded to ask for "= high date" rather than "> some date" to specifically access the current data partition.
- Or you could use Teradata Temporal....
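The two-partition current/history idea above can be sketched roughly as follows (a hedged illustration using CASE_N; the table and query are assumptions, not tested against the original schema):

```sql
-- Hypothetical two-partition layout: "open" rows in one partition,
-- all historical rows in the other.
PARTITION BY CASE_N(
    EFF_END_DT = DATE '9999-12-31',  -- partition 1: current rows
    NO CASE)                         -- partition 2: everything else

-- Queries must then name the high date explicitly so the optimizer
-- can eliminate the entire historical partition:
SELECT *
FROM fpdw_bkup.pty_pi
WHERE EFF_END_DT = DATE '9999-12-31';
```

A range predicate like EFF_END_DT > CURRENT_DATE would not get the same single-partition access, which is why the equality form matters.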
I didn't think about empty partitions! I'll do some testing with Dieter's suggested syntax, and I'll also test your suggestion for current vs. historical.
Teradata temporal is a whole 'nother situation.
Thanks so much for the information.