Partition effectiveness

Database
Enthusiast

Partition effectiveness

All of our tables have effective start and end date columns (EFF_STRT_DT, EFF_END_DT).  We want to create date partitions on some of our larger tables, and our fist instinct is to create the partition on EFF_END_DT, because we set that date to '9999-21-31' to represent "active" rows in our data warehouse.

Most of the time, we will actually be doing a BETWEEN comparison similar to the following

select * from fpdw_bkup.pty_pi
where '2014-03-31' between EFF_STRT_DT and EFF_END_DT;

The table in the code snippet is defined with a partition statement of

PARTITION BY RANGE_N(EFF_END_DT  BETWEEN DATE '2012-12-31' AND DATE '9999-12-31' EACH INTERVAL '7' DAY , NO RANGE, UNKNOWN)


and the SQL in the snippet shows this in the EXPLAIN plan:

  3) We do an all-AMPs RETRIEVE step from 416678 partitions of

     fpdw_bkup.pty_pi with a condition of (

     "(fpdw_bkup.pty_pi.EFF_STRT_DT <= DATE '2014-03-31') AND

     (fpdw_bkup.pty_pi.EFF_END_DT >= DATE '2014-03-31')") into Spool 1

     (group_amps), which is built locally on the AMPs. 


What I'm trying to determine is whether the partition I had in mind is worth it, given that we don't often use equi-joins with these date columns.  Or is there a different way to do the partition or the SQL statement that would help me?

Thanks,

Robin

Tags (2)
8 REPLIES
Enthusiast

Re: Partition effectiveness


This is my opinion. I make partition to achieve speed and a goal: I will count the partitions. This is just example:

SELECT RANGE_N( EFF_END_DT  BETWEEN DATE '2012-12-31' AND DATE '9999-12-31' EACH INTERVAL '7' DAY , NO RANGE, UNKNOWN) partitions

COUNT(*) AS cnt,

FROM table1

GROUP BY partitions

;

Grouping older ranges with less finer say

PARTITION BY RANGE_N (order_date BETWEEN DATE '2012-12-31' 

AND DATE '2013-12-31'

EACH INTERVAL '3' MONTH,

DATE '2013-12-31'

AND DATE '2014-06-30'

EACH INTERVAL '7' DAY,

.........  

SELECT *

FROM table1

WHERE your_date BETWEEN DATE '.....aaaa-12-31...' AND DATE '.....bbbb-12-31';

Enthusiast

Re: Partition effectiveness

Thanks for your reply Raja, but I don't think that's what I'm looking for.

I guess I'm trying to get some assurance that if the optimizer uses a partition for access when it builds the EXPLAIN plan instead of an all-rows scan, I can trust it will be the most efficient access method, regardless of the number of partitions it uses for retrieval.

The large number of partitions I saw in the EXPLAIN took me by surprise, and that is why I'm asking about this.

Robin

Junior Contributor

Re: Partition effectiveness

Hi Robin,

you defined 416.741 partitions and even if there a PARTITION stats the optimizer must still include all possible partitions in the plan.

You should limit the number of partitions, e.g.

PARTITION BY RANGE_N(EFF_END_DT  BETWEEN DATE '2012-12-31' AND DATE '2030-12-31'EACH INTERVAL '7' DAY , DATE '9999-12-31' AND DATE '9999-12-31' EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN)

The NO RANGE will also be included in the scan, this doesn't matter if the number of rows is small.

You might also include the max date in NO RANGE:

PARTITION BY RANGE_N(EFF_END_DT  BETWEEN DATE '2012-12-31' AND DATE '2030-12-31' EACH INTERVAL '7' DAY ,NO RANGE, UNKNOWN)

As a side effect you'll save a lot of disk space (6 bytes per rows * number of indices) as your current schema results in 8 byte partition numbers and reducing to less than 64K partitions only needs 2 bytes.

Enthusiast

Re: Partition effectiveness

Dieter,

I think I understand what you are getting at.  Here's another issue, though...that '9999-12-31' date is overwhelmingly the most frequently occurring date that we have, so of course when I was testing the partition for one of our large Accounts Receivable tables (where that was the date value for well over 80% of the rows), it made things worse for my query...bad enough to run out of spool.

So, this leads me to believe that the paritions that are defined have to be somewhat balanced...is that true?  Should I find another common JOIN column to combine with our date to define a multi-level partition?  Or is there another technique I should look at?

thanks,

Robin

Enthusiast

Re: Partition effectiveness

Robin,

Did you collect stats on the partition?

Enthusiast

Re: Partition effectiveness

Yep.

Teradata Employee

Re: Partition effectiveness

- If the optimizer chooses to use partitions as in your initial example, it will be less costly than a full file scan. There is no separate overhead for scanning individual partitions, think of it as a starting (and/or ending) position for a scan that then scans a subset of the table.

- Dieter pointed out the reason why your plan showed so many partitions. Generally it is not advised to define a huge number of empty partitions, especially when they will never be used. The large numbers of empty partitions interfere with query optimization and some maintenance operations on the table. It is better to be more specific in the partitioning expression like the version suggested by Dieter. Then your query will show a much lower number of partitions because all the empty ones won't be in between now and your "high date".

- Your partitions do not have to be balanced. No requirement for that. However, if the bulk of your data is in one or more partitions that are included in your condition, then that data needs to be processed to answer the query and you must have spool space for that data as it is being processed. Unless there is some way to further qualify that data, having spool space for it is unavoidable.

- Another strategy that some have used with a time based table like this, especially when the bulk of the queries are going to be accessing the "current" or "open" records is to define just two partitions. One has just the high date and the other has all other valid dates. This makes a two partition table that segregates the current rows from the historical rows and allows the queries that ask about just current rows to skip all the historical rows. However, the query then needs to be coded to ask for "= high date" rather than "> some date" to specifically access the current data partition.

- Or you could use Teradata Temporal....

Enthusiast

Re: Partition effectiveness

I didn't think about empty partitions!   I'll do some testing with Dieter's suggested syntax, and I'll also test your suggestion for current vs. historical.

Teradata temporal is a whole 'nother situation.

Thanks so much for the information.