Outlier Treatment in Teradata

Analytics

Outlier Treatment in Teradata

Hi guys, 

I have a table at customer level and their corresponding payments data( Two columns: Cust_id and payment)  containing 20 million rows. I am trying to find outliers(0.1% top and bottom) based on payments.One method is using rank function and giving percentile values to each customer , but rank fuction does not work for such a large dataset. Is their any other way to do it ?

5 REPLIES
Senior Apprentice

Re: Outlier Treatment in Teradata

Hi Abhinav,

why do you think that rank doesn't work for a 20 million row dataset? I wouldn't consider this big, I can easily run a similar query on VMWare version...

SELECT * FROM cust_table
QUALIFY -- maybe RANK instead of ROW_NUMBER
ROW_NUMBER() OVER (ORDER BY payment) <= COUNT(*) OVER () * 0.001
OR
ROW_NUMBER() OVER (ORDER BY payment) >= COUNT(*) OVER () * 0.999

or using the deprecated QUANTILE function:

QUALIFY QUANTILE(1000, payment) in (0,999)

Re: Outlier Treatment in Teradata

Hey

Actually, I have 2 segments across which I have to find outliers. So, code I have used is something like this and it is spooiling out.

        select 
cust_id,
offer_name,
contact_status,
tpv
from table_name
Qualify Row_number() over(partition by offer_name,contact_status order by tpv) < count(*) * 0.001
group by 1,2,3,4;
Senior Apprentice

Re: Outlier Treatment in Teradata

Hi Abhinav,

this is not going to work, you must change the count to COUNT(*) OVER (). And you can probably remove the GROUP BY, too.

What are the datatypes of those columns? 

Re: Outlier Treatment in Teradata

Thanks for the quick response

Data types are

      cust_id CHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
CONTACT_STATUS INTEGER,
OFFER_NAME VARCHAR(130) CHARACTER SET LATIN NOT CASESPECIFIC,
tpv DECIMAL(18,4)

and even with COUNT(*) over() and removing group by , it is spooling out 

Senior Apprentice

Re: Outlier Treatment in Teradata

Hi Abhinav,

seems like you have to talk to your DBA to get a higher spool limit.