SQL Question

Tools & Utilities

SQL Question

I am working with a VERY large table (appx 15 Billion rows) and I want to pull back only those rows when conditions based on the previous row are met.

For example, if the table contained the following rows:

Customer Event_Time Event_Type
1 1 A
1 6 B
1 7 A
1 9 C
2 2 A
1 4 B
1 8 D
2 3 B

I would only want to return these two rows:

Customer Event_Time Event_Type
1 4 B
1 6 B

-- in essence, I want to return rows that after being sorted by Customer and Event_Time have the same value in Event_Type. Also, there can be more than two rows consecutive rows with the same Event_Type; in those cases, I need ALL the rows (e.g. if there were another row with Customer=1 and Event_Time=5 and Event_Type=B I would want that row as well).

My current approach is to create a table which contains a portion of the data (roughly 1B rows) in sort order and then use SAS to determine which consecutive rows to keep. I know it would be much faster if I could do the row selection in TD rather than bringing it down and processing in SAS.

Thanks in advance for any advice.
3 REPLIES
SN
Enthusiast

Re: SQL Question

hi Andrew,

here's a way to do it in TD:

1. sort the records based on cust & event_tm and assign a incremental sequence number( say SEQ_NUM) for each row

** since you may have 15 billion recs, define the SEQ_NUM column as data type 'LONG VARCHAR'

create table tab1(
cust ***,ev_tm **** ,ev_ty ****,SEQ_NUM LONG VARCHAR) unique primary index(seq_num);

2. Insert into this table with the records sorted and sequence number assigned for each row:

Ins tab1
sel cust,event_tm,event_type,csum(1,cust,ev_tm) as R_NUM
from

3. the select query should give you the required answerset:

sel a.cust,a.event_tm,a.event_type
from
tab1 a
,(sel * from tab1)b
where a.ev_ty=b.ev_ty and
(a.flag=b.flag-1 or a.flag=b.flag+1) ;

hope this helps!

SN
Enthusiast

Re: SQL Question

a typo correction in myearlier reply:

sel a.cust,a.event_tm,a.event_type
from
tab1 a
,(sel * from tab1)b
where a.ev_ty=b.ev_ty and
(a.flag=b.flag-1 or a.flag=b.flag+1) ;

*** replace the field name 'flag' with seq_num
Enthusiast

Re: SQL Question

Just a thought but what about the use of RANK() ?

RANK

RANK is the most straightforward of the OLAP extensions. It not only sorts a result set but also identifies the numeric rank of each row in the result. RANK has only the sort column(s) as its argument and returns an integer that represents the rank of each row in the result. For example, here’s how the RANK function lets you sort employees alphabetically and identify their level of seniority in the company:

SELECT EmployeeName, (HireDate - DATE) AS ServiceDays,
RANK( ServiceDays ) as Seniority
FROM Employee
ORDER BY EmployeeName;

EmployeeName Service Days Seniority
Robyn Baker 9931 2
Nick Garrison 9931 2
Kyle McVicker 9408 5
Eva O’Malley 10248 1
Norma Powers 9409 4

And the following example operates on the result of a derived table and join to sort items by category and descending overall rank of revenue:

SELECT Category, Item, Revenue, RANK( Revenue ) AS ItemRank
FROM ItemCategory,
(SELECT Item, Sum( Sales ) as Revenue
FROM DailySales
GROUP BY Item) AS ItemSales
WHERE ItemCategory.Item = ItemSales.Item
ORDER BY Category, ItemRank DESC;

Category Item Revenue ItemRank
Hot Cereal Regular Oatmeal 39112.00 4
Hot Cereal Instant Oatmeal 44918.00 3
Hot Cereal Regular COW 59813.00 2
Hot Cereal Instant COW 75411.00 1