Partitioning Data in a Rolling 12 month window

Analytics
Enthusiast

Partitioning Data in a Rolling 12 month window

I have a million + row data set of dates that I need to partition into rollowing 12 month windows. The rollowing 12 months can be different for each truck. Can this be done in an analytics window? If not can it be done by recursion? Can you provide an example?

 

Here is what the data looks like:

 

TruckStsSts_Dt
AAG191D19/02/2016
AAG191D8/05/2016
AAG191D1/09/2016
AAG191D2/03/2017
AAG191D12/06/2017
AAG191D20/08/2017
AAG191D22/02/2018

 

Here is what I am trying to get to:

 

TruckStsSts_DtEnd Date
AAG191D19/02/201619/02/2017
AAG191D8/05/201619/02/2017
AAG191D1/09/201619/02/2017
AAG191D2/03/20172/03/2018
AAG191D12/06/20172/03/2018
AAG191D20/08/20172/03/2018
AAG191D22/02/20182/03/2018

Accepted Solutions
Junior Supporter

Re: Partitioning Data in a Rolling 12 month window

I thought I'd reply here rather than on your post with the longer set of code.  The condition with end date minus sts_dt <= 366 is actually operating on more than 1 12 month window.  And my results did roll forward to the following year.  My results had matched your sample results exactly (except for the date formats.)  So I thought I'd repost with some added detail in case I wasn't totally clear.

 

I post my results first, then the sql with comments inserted about each of the steps.  I did add truck to one of the joins - it surely better for the optimizer when looking at your actual data - your sample all had the same truck id.

So these are my results from the tables I had created in the earlier post:

truck sts sts_dt end_date

AAG191D2016-02-192017-02-19
AAG191D2016-05-082017-02-19
AAG191D2016-09-012017-02-19
AAG191D2017-03-022018-03-02
AAG191D2017-06-122018-03-02
AAG191D2017-08-202018-03-02
AAG191D2018-02-222018-03-02

Which seems to have worked for all 3 status years.

So here's my commented code:

 

select truck
     ,sts
     ,sts_dt
     ,end_date
from (select x.truck
           ,x.sts
           ,x.sts_dt
           ,b.end_date
           ,b.end_date - x.sts_dt as diff
        from rs150000.truck_dates as x
       -- a comma without the keyword 'join' is alternate syntax
       -- and requires I use 'where' instead of 'on' for the join
       ,(select truck
               ,add_months(first_dt_yr,12) as end_date

        -- in derived table b first_dt_yr has taken 3 distinct values from dervied a
         -- 12 months are added to those 3 different values and the name changed to end_date
         from (select truck
                     ,sts
                     ,extract(year from sts_dt) as sts_yr
                     ,min(sts_dt) as first_dt_yr
                 from rs150000.truck_dates
                 group by 1,2,3 ) as a
              -- derived table a generates the minimum date per year for each year
              -- so with sts_dt covering 3 different years, there are 3 distinct values for first_dt_yr
         ) as b
         where b.truck = x.truck
        -- even though all sample truck IDs are the same, the truck condition is definitely wiser
        -- we still get a partial cartesian where all end dates will appear with all status dates
        -- which is why the date tests are needed
        and b.end_date - x.sts_dt <= 366
        -- this subtraction is actually a test on mulitple 12 month ranges
        -- with negatives removed in the next condition, only 1 sts_dt has 2 end dates
        -- the 2018 date still returns 2 rows, one with a difference of 8 days & one with 365
        -- the extra row for 2018 gets removed with the qualify in the outer condition
        and b.end_date - x.sts_dt > 0
        -- because of the partial cartesian some end dates fall before the status date
        -- the greater than 0 condition removes those  ) as c
qualify rank() over(partition by sts_dt order by diff) = 1
order by 3,4;

 

I'll also show you some results with some of the above conditions removed:

This is the output with the date condtions removed:

truck sts sts_dt end_date diff

AAG191D2016-02-192017-02-19366
AAG191D2016-05-082017-02-19287
AAG191D2016-09-012017-02-19171
AAG191D2017-03-022018-03-02365
AAG191D2017-03-022017-02-19-11
AAG191D2017-06-122018-03-02263
AAG191D2017-06-122017-02-19-113
AAG191D2017-08-202018-03-02194
AAG191D2017-08-202017-02-19-182
AAG191D2018-02-222019-02-22365
AAG191D2018-02-222018-03-028
AAG191D2018-02-222017-02-19-368

Here's the output with the >0 condtion added: 

 

truck sts sts_dt end_date diff

AAG191D2016-02-192017-02-19366
AAG191D2016-05-082017-02-19287
AAG191D2016-09-012017-02-19171
AAG191D2017-03-022018-03-02365
AAG191D2017-06-122018-03-02263
AAG191D2017-08-202018-03-02194
AAG191D2018-02-222018-03-028
AAG191D2018-02-222019-02-22365

So the 2019 end date for the 2018 status ends up getting removed with the qualify.

I hope this works for you.

 

1 ACCEPTED SOLUTION
11 REPLIES
Junior Supporter

Re: Partitioning Data in a Rolling 12 month window

I'm not sure if this will get you what you want - your rolling 12 month description doesn't quite line up with your example.  I'll show you a possible solution based on what I think you want and then ask some follow ups.  I first created a table and loaded your sample data.  My solution is actually pretty simple but we'll see.

create table rs150000.truck_dates

truck varchar(6),

   sts char(1),

   sts_dt date FORMAT 'DD/MM/YYYY'

 ) primary index(truck);

insert into rs150000.truck_dates values('AAG191','D','19/02/2016');

insert into rs150000.truck_dates values('AAG191','D','08/05/2016');

insert into rs150000.truck_dates values('AAG191','D','01/09/2016');

insert into rs150000.truck_dates values('AAG191','D','02/03/2017');

insert into rs150000.truck_dates values('AAG191','D','12/06/2017');

insert into rs150000.truck_dates values('AAG191','D','19/02/2016');

insert into rs150000.truck_dates values('AAG191','D','20/08/2017');

insert into rs150000.truck_dates values('AAG191','D','22/02/2018');

 

select truck

           ,sts

          ,sts_dt

          ,add_months(sts_dt,12) as end_date

from rs150000.truck_dates;

And the results:

truck      sts   sts_dt        end_date

------       ---   ----------       ----------

AAG191 D 2016-02-19 2017-02-19

AAG191 D 2017-06-12 2018-06-12

AAG191 D 2017-08-20 2018-08-20

AAG191 D 2018-02-22 2019-02-22

AAG191 D 2016-05-08 2017-05-08

AAG191 D 2016-09-01 2017-09-01

AAG191 D 2017-03-02 2018-03-02

I don't have the same date formats you're showing, but I suspect you have a different system default.

My results themselves, however, are a little off from your example, but your example either doesn't actually reflect a rolling 12 month window or I'm misinterpreting something.

Your 1st row covers February 19, 2016 to February 19, 2017.  Your next 2 rows both begin with dates in 2016 and end with February 19, 2017.  If your 12 month roll is keying off the earliest date within a calendar year and then calculating 12 month off of that, then your example makes sense. 

If my example works, great, but if it's now quite what you need, can clarify what you mean by the rolling 12 month?  If it's what I think it might be, your desired result is a little more complicated, but still doesn't require recursion or an analytics window.

Let me know.

 

Enthusiast

Re: Partitioning Data in a Rolling 12 month window

Sorry you missed the point. The first date in the series fixes the 12 month window till the date of the rows below passes it. When this happens that date becomes the begining period for the next 12 month period and the add_months function is applied to it. It took me a while to figure it out but I have a solution. Its not elegant but it works.

Junior Supporter

Re: Partitioning Data in a Rolling 12 month window

Did it look anything like this?

select truck

         ,sts

         ,sts_dt

        ,end_date

from (select x.truck

       ,x.sts

       ,x.sts_dt

       ,b.end_date

       ,b.end_date - x.sts_dt as diff

  from rs150000.truck_dates as x

       ,(select add_months(first_dt_yr,12) as end_date

               from (select truck

                                   ,sts

                                 ,extract(year from sts_dt) as sts_yr

                                 ,min(sts_dt) as first_dt_yr

                             from rs150000.truck_dates

                           group by 1,2,3 ) as a

                       ) as b

           where b.end_date - x.sts_dt <= 366

               and b.end_date - x.sts_dt > 0 ) as c

  qualify rank() over(partition by sts_dt order by diff) = 1

order by 3,4;

 ;

 

Enthusiast

Re: Partitioning Data in a Rolling 12 month window

Thanks rs150000    that looks a more elegant solution than what I have.  I had to break it down into several steps with volatile tables. I'll check your solution out against mine later today during business hours. Thanls again fo the post.                         

Enthusiast

Re: Partitioning Data in a Rolling 12 month window

Hi rs150000 your figures reconcile to mine for the first year but unfortunately it won't roll forward to future periods because its only calculating 1 single 12 month period with  ' WHERE b.end_date - x.sts_dt <= 366 '. I also added 'truck' into the derived table  C so that I could join onto it rather than cross joining (reduce resouce use on the DWH).

 

 

 

SELECT truck

         ,sts

         ,sts_dt

        ,end_date

FROM (SELECT x.truck

       ,x.sts

       ,x.sts_dt

       ,b.end_date

       ,b.end_date - x.sts_dt AS diff

  FROM rs150000.truck_dates AS x

     JOIN (SELECT truck -- added for join
       
  , Add_Months(first_dt_yr,12) AS end_date

               FROM (SELECT truck

                                   ,sts

                                 ,Extract(YEAR From sts_dt) AS sts_yr

                                 ,Min(sts_dt) AS first_dt_yr

                             FROM rs150000.truck_dates

                           GROUP BY 1,2,3 ) AS a

                       ) AS b
   ON x.truck =b.truck
   
           WHERE b.end_date - x.sts_dt <= 366 -- Single 12 month period

               AND b.end_date - x.sts_dt > 0 ) AS c

  QUALIFY Rank() Over(PARTITION BY sts_dt ORDER BY diff) = 1

ORDER BY 3,4;

 ;

 
Enthusiast

Re: Partitioning Data in a Rolling 12 month window

This is what I have so far:

 

---------------------------------------------------------------------------------------------------
-- Step 3 Work out the rolling 12 month periods per departure per truck
---------------------------------------------------------------------------------------------------
 
 
 -- get all the departure dates for each truck pass  start date
 
-- DROP  TABLE VT_Depart;
CREATE  MULTISET VOLATILE TABLE VT_Depart
AS(
 SEL
 A.Truck 
 ,Sum(1)
 Over(
 PARTITION BY A.Truck 
 ORDER BY A.TruckSts_Dt ASC,A.TruckSts_Tm ASC
 ROWS Unbounded Preceding
 ) AS rw
 ,Max ( TruckSts_Dt)  
 Over (
  PARTITION BY B.Truck
  ORDER BY  TruckSts_Dt
     ,TruckSts_Tm ASC
  ROWS BETWEEN 1 Preceding AND 1 Preceding
   ) AS  Prev_Sts_Dt  
 ,A.TruckSts AS Cur_Sts
 ,A.TruckSts_Dt  AS Cur_Sts_Dt
 ,A.TruckSts_Tm  AS Cur_Sts_Tm
 FROM DB.TruckMvments AS A
  
 WHERE A.TruckSts  ='D'
 AND A.TruckSts_Dt>=DATE'2016-01-01'
  
)WITH DATA
 PRIMARY INDEX(Truck,Cur_Sts,Cur_Sts_Tm)
ON COMMIT PRESERVE ROWS;
 
 
 
 -- work out the first departure data and sum all the days between departures
 -- the divide by 365 to find the start and end periods for each rolling 12 months
 -- SEL * FROM VT_Perd_12Mth ORDER by 1,2;
 -- DROP TABLE VT_Perd_12Mth;
 CREATE  MULTISET VOLATILE TABLE VT_Perd_12Mth
AS(
 SEL
dt.*
,Sum_Dys/(365.00)  AS Perd_prec
,0.5 AS rnd
, CASE
 WHEN dt.rw = 1
  AND Prev_Sts_Dt IS NULL THEN 1
 ELSE Cast((Perd_prec +rnd) AS DECIMAL(3,0) )
 END AS Perd_12Mth
FROM (
  SEL
  Truck                       
  ,rw                           
  ,Prev_Sts_Dt                  
  ,Cur_Sts                      
  ,Cur_Sts_Dt                   
  ,Cur_Sts_Tm
  ,Cur_Sts_Dt - Coalesce(Prev_Sts_Dt,Cur_Sts_Dt) AS Dys
  ,Sum(Dys )
  Over(
   PARTITION BY Truck
   ORDER BY rw ASC
   ROWS Unbounded Preceding
  ) AS Sum_Dys
  FROM VT_Depart
) AS dt
 
)WITH DATA
PRIMARY INDEX(Truck,Cur_Sts,Cur_Sts_Tm)
ON COMMIT PRESERVE ROWS;
 
 

 --help table DB.TruckMvments
 -- SEL * FROM VT_Acnt_Perd order by 1,2;
 -- DROP TABLE VT_Acnt_Perd;
--SEL * FROM VT_Acnt_Perd;
--DROP TABLE VT_Acnt_Perd;
CREATE MULTISET VOLATILE TABLE VT_Acnt_Perd
AS(
 
SEL
  dt.Truck      
  ,dt.Perd_12Mth 
  ,dt.Perd_Bg_Dt
  ,Least(dt.Add_Months_Dt, Current_Date) AS Perd_End_Dt 
FROM( 
  SEL
  Truck                       
  ,rw                           
  ,Perd_12Mth  
  ,Min(Cur_Sts_Dt)
  Over (
 PARTITION BY Truck,Perd_12Mth

 ) AS Perd_Bg_Dt
  ,Min(Add_Months(Cur_Sts_Dt,12))
  Over (
   PARTITION BY Truck,Perd_12Mth

      ) AS Add_Months_Dt  
  FROM  VT_Perd_12Mth
)dt 
 
GROUP BY
  dt.Truck      
  ,dt.Perd_12Mth 
  ,dt.Perd_Bg_Dt
  ,Perd_End_Dt 
 
)WITH DATA
PRIMARY INDEX( Truck,Perd_Bg_Dt, Perd_End_Dt)
ON COMMIT PRESERVE ROWS;
 
 

Junior Supporter

Re: Partitioning Data in a Rolling 12 month window

I thought I'd reply here rather than on your post with the longer set of code.  The condition with end date minus sts_dt <= 366 is actually operating on more than 1 12 month window.  And my results did roll forward to the following year.  My results had matched your sample results exactly (except for the date formats.)  So I thought I'd repost with some added detail in case I wasn't totally clear.

 

I post my results first, then the sql with comments inserted about each of the steps.  I did add truck to one of the joins - it surely better for the optimizer when looking at your actual data - your sample all had the same truck id.

So these are my results from the tables I had created in the earlier post:

truck sts sts_dt end_date

AAG191D2016-02-192017-02-19
AAG191D2016-05-082017-02-19
AAG191D2016-09-012017-02-19
AAG191D2017-03-022018-03-02
AAG191D2017-06-122018-03-02
AAG191D2017-08-202018-03-02
AAG191D2018-02-222018-03-02

Which seems to have worked for all 3 status years.

So here's my commented code:

 

select truck
     ,sts
     ,sts_dt
     ,end_date
from (select x.truck
           ,x.sts
           ,x.sts_dt
           ,b.end_date
           ,b.end_date - x.sts_dt as diff
        from rs150000.truck_dates as x
       -- a comma without the keyword 'join' is alternate syntax
       -- and requires I use 'where' instead of 'on' for the join
       ,(select truck
               ,add_months(first_dt_yr,12) as end_date

        -- in derived table b first_dt_yr has taken 3 distinct values from dervied a
         -- 12 months are added to those 3 different values and the name changed to end_date
         from (select truck
                     ,sts
                     ,extract(year from sts_dt) as sts_yr
                     ,min(sts_dt) as first_dt_yr
                 from rs150000.truck_dates
                 group by 1,2,3 ) as a
              -- derived table a generates the minimum date per year for each year
              -- so with sts_dt covering 3 different years, there are 3 distinct values for first_dt_yr
         ) as b
         where b.truck = x.truck
        -- even though all sample truck IDs are the same, the truck condition is definitely wiser
        -- we still get a partial cartesian where all end dates will appear with all status dates
        -- which is why the date tests are needed
        and b.end_date - x.sts_dt <= 366
        -- this subtraction is actually a test on mulitple 12 month ranges
        -- with negatives removed in the next condition, only 1 sts_dt has 2 end dates
        -- the 2018 date still returns 2 rows, one with a difference of 8 days & one with 365
        -- the extra row for 2018 gets removed with the qualify in the outer condition
        and b.end_date - x.sts_dt > 0
        -- because of the partial cartesian some end dates fall before the status date
        -- the greater than 0 condition removes those  ) as c
qualify rank() over(partition by sts_dt order by diff) = 1
order by 3,4;

 

I'll also show you some results with some of the above conditions removed:

This is the output with the date condtions removed:

truck sts sts_dt end_date diff

AAG191D2016-02-192017-02-19366
AAG191D2016-05-082017-02-19287
AAG191D2016-09-012017-02-19171
AAG191D2017-03-022018-03-02365
AAG191D2017-03-022017-02-19-11
AAG191D2017-06-122018-03-02263
AAG191D2017-06-122017-02-19-113
AAG191D2017-08-202018-03-02194
AAG191D2017-08-202017-02-19-182
AAG191D2018-02-222019-02-22365
AAG191D2018-02-222018-03-028
AAG191D2018-02-222017-02-19-368

Here's the output with the >0 condtion added: 

 

truck sts sts_dt end_date diff

AAG191D2016-02-192017-02-19366
AAG191D2016-05-082017-02-19287
AAG191D2016-09-012017-02-19171
AAG191D2017-03-022018-03-02365
AAG191D2017-06-122018-03-02263
AAG191D2017-08-202018-03-02194
AAG191D2018-02-222018-03-028
AAG191D2018-02-222019-02-22365

So the 2019 end date for the 2018 status ends up getting removed with the qualify.

I hope this works for you.

 

Junior Contributor

Re: Partitioning Data in a Rolling 12 month window

@rs150000:

There's no way to get this result without some kind of loop, e.g. WITH RECURSIVE or a WHILE in a Stored Procedure. 

Simply add INSERT INTO truck_dates VALUES('AAG191','D','2015-04-19'); 

 

@newb1:

Do you need a single Select (with a horrible performance) or can you use a Volatile Table or an SP?

How many rows per truck and how many years exists in average/max?

Enthusiast

Re: Partitioning Data in a Rolling 12 month window

Hi

 

Try your code with these dates (real data):

 

TruckStsSts_Dt
AAG015D14/09/2016
AAG015D17/11/2017
AAG015D9/12/2017

 

I get this:

TruckSts_DtPerd_End_Dt
AAG01514/09/201614/09/2017
AAG01517/11/201717/11/2018
AAG0159/12/201717/11/2018