Update row classification based on previous column values

Database
Highlighted
Enthusiast

Update row classification based on previous column values

Hi! I need to group column values, by making them stable (without changes) over at least X records (this number is a parameter). As a first step, I would like to identify small length "islands" (by small I mean less than X records), and aggregate these to the previous one (by date). I'm using Teradata 16.10.

 

An example of the data structure (consider real data is 2 billions rows), is the following:

 

|    date    | part | value |
| 2017-09-11 |   A  |  600  |
| 2017-09-12 |   A  |  600  |
| 2017-09-13 |   A  |  550  |
| 2017-09-14 |   A  |  550  |
| 2017-09-15 |   A  |  550  |
| 2017-09-16 |   A  |  550  |
| 2017-09-17 |   A  |  520  |	
| 2017-09-18 |   A  |  520  |	
| 2017-09-19 |   A  |  510  |	
| 2017-09-20 |   A  |  530  |	
| 2017-09-21 |   A  |  550  |
| 2017-09-22 |   A  |  550  |
| 2017-09-23 |   A  |  560  |
| 2017-09-11 |   B  |  560  |
| 2017-09-12 |   B  |  560  |
| 2017-09-13 |   B  |  560  |
| 2017-09-14 |   B  |  560  |
| 2017-09-15 |   B  |  600  |
| 2017-09-16 |   B  |  550  |

 

I was able to classify groups after a change in value, given a partition and ordering by date. I'm using as group label, the row number of the first element in the group, so I can extract the group size from this column, like this:

 

|    date    | part | value | grp | 
| 2017-09-11 |   A  |  600  |  1  |
| 2017-09-12 |   A  |  600  |  1  |
| 2017-09-13 |   A  |  550  |  3  |
| 2017-09-14 |   A  |  550  |  3  |
| 2017-09-15 |   A  |  550  |  3  |
| 2017-09-16 |   A  |  550  |  3  |
| 2017-09-17 |   A  |  520  |  7  |
| 2017-09-18 |   A  |  520  |  7  |
| 2017-09-19 |   A  |  510  |  9  |
| 2017-09-20 |   A  |  530  | 10  |
| 2017-09-21 |   A  |  550  | 11  |
| 2017-09-22 |   A  |  550  | 11  |
| 2017-09-23 |   A  |  560  | 13  |
| 2017-09-11 |   B  |  550  |  1  |
| 2017-09-12 |   B  |  560  |  2  |
| 2017-09-13 |   B  |  560  |  2  |
| 2017-09-14 |   B  |  560  |  2  |
| 2017-09-15 |   B  |  600  |  5  |
| 2017-09-16 |   B  |  550  |  6  |

 

I would like to implement something like  (1): "Given a partition, and ordering by date, if difference between grp[i] and grp[i-1] is smaller than X, then set grp[i] = grp[i-1]". How could I implement it in a TD way?

 

Thanks!

 

update:

 

The result if I use X=3, would be:

 

|    date    | part | value | grp | newgrp |
| 2017-09-11 |   A  |  600  |  1  |    1   |
| 2017-09-12 |   A  |  600  |  1  |    1   |
| 2017-09-13 |   A  |  550  |  3  |    1   |
| 2017-09-14 |   A  |  550  |  3  |    1   |
| 2017-09-15 |   A  |  550  |  3  |    1   |
| 2017-09-16 |   A  |  550  |  3  |    1   |
| 2017-09-17 |   A  |  520  |  7  |    7   |
| 2017-09-18 |   A  |  520  |  7  |    7   |
| 2017-09-19 |   A  |  510  |  9  |    7   |
| 2017-09-20 |   A  |  530  | 10  |   10   |
| 2017-09-21 |   A  |  550  | 11  |   10   |
| 2017-09-22 |   A  |  550  | 11  |   10   |
| 2017-09-23 |   A  |  560  | 13  |   10*  | 
| 2017-09-11 |   B  |  550  |  1  |    1   |
| 2017-09-12 |   B  |  560  |  2  |    1   |
| 2017-09-13 |   B  |  560  |  2  |    1   |
| 2017-09-14 |   B  |  560  |  2  |    1   |
| 2017-09-15 |   B  |  600  |  5  |    5   |
| 2017-09-16 |   B  |  550  |  6  |    5   |

If I apply (1), I get newgrp=13, and then, a group smaller than 3 elements, this always happens if the last group is smaller than X. So still (1) is not the final solution,

 

 

I am adding also the following code which creates table 2 (a reproducible example):

 

create multiset volatile table example_tbl, no log
( dt        date
, part      char(1)
, val       integer
, grp       integer
)
primary index (part)
on commit preserve rows;

insert into example_tbl (dt, part, val, grp) values (date '2017-09-11', 'A', 600, 1);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-12', 'A', 600, 1);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-13', 'A', 550, 3);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-14', 'A', 550, 3);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-15', 'A', 550, 3);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-16', 'A', 550, 3);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-17', 'A', 520, 7);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-18', 'A', 520, 7);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-19', 'A', 510, 9);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-20', 'A', 530, 10);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-21', 'A', 550, 11);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-22', 'A', 550, 11);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-23', 'A', 560, 13);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-11', 'B', 550, 1);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-12', 'B', 560, 2);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-13', 'B', 560, 2);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-14', 'B', 560, 2);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-15', 'B', 600, 5);
insert into example_tbl (dt, part, val, grp) values (date '2017-09-16', 'B', 550, 6);

 


Accepted Solutions
Teradata Employee

Re: Update row classification based on previous column values

 

I get newgrp=13, and then, a group smaller than 3 elements, this always happens if the last group is smaller than X 

So about part B, the last two rows should also be in newgrp 1 ?

 

For just the rule (1), you can use this recursive query :

with recursive cte_recurs (dt, part, val, grp, newgrp) as
(
  select dt, part, val, grp, grp
    from example_tbl
 qualify row_number() over(partition by part order by dt asc) = 1
   union all
  select tbl.dt
       , tbl.part
       , tbl.val
       , tbl.grp
       , case when tbl.grp - cte.newgrp < 3 then cte.newgrp else tbl.grp end
    from cte_recurs  as cte
    join example_tbl as tbl  on tbl.part = cte.part
                            and tbl.dt   = cte.dt + 1
)
  select dt, part, val, grp
       , newgrp
    from cte_recurs
order by part, dt;

dt          part  val  grp  newgrp
----------  ----  ---  ---  ------
2017-09-11  A     600    1       1
2017-09-12  A     600    1       1
2017-09-13  A     550    3       1
2017-09-14  A     550    3       1
2017-09-15  A     550    3       1
2017-09-16  A     550    3       1
2017-09-17  A     520    7       7
2017-09-18  A     520    7       7
2017-09-19  A     510    9       7
2017-09-20  A     530   10      10
2017-09-21  A     550   11      10
2017-09-22  A     550   11      10
2017-09-23  A     560   13      13
2017-09-11  B     550    1       1
2017-09-12  B     560    2       1
2017-09-13  B     560    2       1
2017-09-14  B     560    2       1
2017-09-15  B     600    5       5
2017-09-16  B     550    6       5
1 ACCEPTED SOLUTION
7 REPLIES
Teradata Employee

Re: Update row classification based on previous column values

Hi storres,

 

You can try something like this.

The datas :

 

create multiset volatile table mvt_data, no log
( dt        date
, part      char(1)
, val       integer
)
primary index (part)
on commit preserve rows;

insert into mvt_data (dt, part, val) values (date '2017-09-11', 'A', 600);
insert into mvt_data (dt, part, val) values (date '2017-09-12', 'A', 600);
insert into mvt_data (dt, part, val) values (date '2017-09-13', 'A', 550);
insert into mvt_data (dt, part, val) values (date '2017-09-14', 'A', 550);
insert into mvt_data (dt, part, val) values (date '2017-09-15', 'A', 550);
insert into mvt_data (dt, part, val) values (date '2017-09-11', 'B', 550);
insert into mvt_data (dt, part, val) values (date '2017-09-12', 'B', 560);
insert into mvt_data (dt, part, val) values (date '2017-09-13', 'B', 560);
insert into mvt_data (dt, part, val) values (date '2017-09-14', 'B', 560);
insert into mvt_data (dt, part, val) values (date '2017-09-15', 'B', 600);
insert into mvt_data (dt, part, val) values (date '2017-09-16', 'B', 650);
insert into mvt_data (dt, part, val) values (date '2017-09-17', 'B', 650);

The query :

with cte_data (dt, part, val, newgrp) as
(
  select dt
       , part
       , val
       , abs(val - lag(val, 1, val) over(partition by part order by dt asc)) as val_leap
       , case
           when val_leap >= 15
           then 1
         end as newgrp
    from mvt_data
)
  select dt
       , part
       , val
       , count(newgrp) over(partition by part order by dt asc rows unbounded preceding) as grp
    from cte_data
order by part, dt;

 

 

 

Enthusiast

Re: Update row classification based on previous column values

Hi Waldar, thanks!

I'm having trouble with the function "lag", if I simply do:

select dt
       , part
       , lag(val, 1, val) over(partition by part order by dt asc) as val_leap
 from mvt_data

I get "Syntax error: Data Type "val" does not match a Defined Type name." Is this function available just in 16.10? 

Tags (1)
Teradata Employee

Re: Update row classification based on previous column values

Yes it appeared in 16.10.

Go for this formula then :

coalesce(min(val) over(partition by part order by dt asc rows between 1 preceding and 1 preceding), val)
Enthusiast

Re: Update row classification based on previous column values

I see, I have edited my question with more information, and a desired outcome. I wasn't so clear in the first try!

Teradata Employee

Re: Update row classification based on previous column values

 

I get newgrp=13, and then, a group smaller than 3 elements, this always happens if the last group is smaller than X 

So about part B, the last two rows should also be in newgrp 1 ?

 

For just the rule (1), you can use this recursive query :

with recursive cte_recurs (dt, part, val, grp, newgrp) as
(
  select dt, part, val, grp, grp
    from example_tbl
 qualify row_number() over(partition by part order by dt asc) = 1
   union all
  select tbl.dt
       , tbl.part
       , tbl.val
       , tbl.grp
       , case when tbl.grp - cte.newgrp < 3 then cte.newgrp else tbl.grp end
    from cte_recurs  as cte
    join example_tbl as tbl  on tbl.part = cte.part
                            and tbl.dt   = cte.dt + 1
)
  select dt, part, val, grp
       , newgrp
    from cte_recurs
order by part, dt;

dt          part  val  grp  newgrp
----------  ----  ---  ---  ------
2017-09-11  A     600    1       1
2017-09-12  A     600    1       1
2017-09-13  A     550    3       1
2017-09-14  A     550    3       1
2017-09-15  A     550    3       1
2017-09-16  A     550    3       1
2017-09-17  A     520    7       7
2017-09-18  A     520    7       7
2017-09-19  A     510    9       7
2017-09-20  A     530   10      10
2017-09-21  A     550   11      10
2017-09-22  A     550   11      10
2017-09-23  A     560   13      13
2017-09-11  B     550    1       1
2017-09-12  B     560    2       1
2017-09-13  B     560    2       1
2017-09-14  B     560    2       1
2017-09-15  B     600    5       5
2017-09-16  B     550    6       5
Enthusiast

Re: Update row classification based on previous column values

Thanks for the query! Do you think it would run faster using procedures or functions?

 

So about part B, the last two rows should also be in newgrp 1 ?

You are right about the last two rows of part B, they should be newgrp=1 based on rule (1).

Teradata Employee

Re: Update row classification based on previous column values

Well, with 2 billions rows it's hard to say, you will have to try it.

Recursive queries aren't the fastest thing done by RDBMS.

 

Maybe you can check the performance using one part, then 10 maybe to see how it scales.

 

For the second rule, I've coded another recursive query to achieve it.

I don't think they are mergeable into one query, so materialize the inbetween step.

 

-- Same query as before, I've rename newgrp to intgrp and put the results into a volatile table

create multiset volatile table example_tbl_intgrp, no log
as
(
with recursive cte_recurs (dt, part, val, grp, intgrp) as
(
  select dt, part, val, grp, grp
    from example_tbl
 qualify row_number() over(partition by part order by dt asc) = 1
   union all
  select tbl.dt
       , tbl.part
       , tbl.val
       , tbl.grp
       , case when tbl.grp - cte.intgrp < 3 then cte.intgrp else tbl.grp end
    from cte_recurs as cte
    join example_tbl as tbl  on tbl.part = cte.part
                            and tbl.dt   = cte.dt + 1
)
select dt, part, val, grp, intgrp
  from cte_recurs
)
with data
unique primary index (part, dt)
on commit preserve rows;

-- Second recursion
with recursive cte_recurs (part, intgrp, ordgrp, cntgrp, newgrp) as
(
select part, intgrp, ordgrp, cntgrp, intgrp
  from cte_compute
 where ordgrp = 1
 union all
select ctc.part
     , ctc.intgrp
     , ctc.ordgrp
     , ctc.cntgrp
     , case when ctc.cntgrp < 3 then ctr.newgrp else ctc.intgrp end
  from cte_recurs  as ctr
  join cte_compute as ctc  on ctc.part   = ctr.part
                          and ctc.ordgrp = ctr.ordgrp + 1
)
  ,  cte_compute (part, intgrp, ordgrp, cntgrp) as
(
  select part
       , intgrp
       , row_number() over(partition by part order by intgrp asc)
       , count(*)
    from example_tbl_intgrp
group by part
       , intgrp
)
  select tbl.dt, tbl.part, tbl.val, tbl.grp
       , cte.newgrp
    from example_tbl_intgrp as tbl
    join cte_recurs         as cte  on cte.part   = tbl.part
                                   and cte.intgrp = tbl.intgrp
order by tbl.part asc
       , tbl.dt   asc;

dt          part  val  grp  newgrp
----------  ----  ---  ---  ------
2017-09-11  A     600    1       1
2017-09-12  A     600    1       1
2017-09-13  A     550    3       1
2017-09-14  A     550    3       1
2017-09-15  A     550    3       1
2017-09-16  A     550    3       1
2017-09-17  A     520    7       7
2017-09-18  A     520    7       7
2017-09-19  A     510    9       7
2017-09-20  A     530   10      10
2017-09-21  A     550   11      10
2017-09-22  A     550   11      10
2017-09-23  A     560   13      10
2017-09-11  B     550    1       1
2017-09-12  B     560    2       1
2017-09-13  B     560    2       1
2017-09-14  B     560    2       1
2017-09-15  B     600    5       1
2017-09-16  B     550    6       1