I have a table at customer level with their corresponding payments data (two columns: Cust_id and payment) containing 20 million rows. I am trying to find outliers (top and bottom 0.1%) based on payments. One method is using the rank function and assigning a percentile value to each customer, but the rank function does not work for such a large dataset. Is there any other way to do it?
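For reference, the rank-based query I have in mind is roughly like this (just a sketch; cust_table is a stand-in for my real table name):

SELECT cust_id, payment
FROM cust_table
-- percentile = rank position divided by total row count
QUALIFY RANK() OVER (ORDER BY payment) * 1.0 / COUNT(*) OVER () <= 0.001
     OR RANK() OVER (ORDER BY payment) * 1.0 / COUNT(*) OVER () >= 0.999;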
Why do you think that RANK doesn't work for a 20-million-row dataset? I wouldn't consider this big; I can easily run a similar query on the VMware version...
SELECT * FROM cust_table
QUALIFY -- maybe RANK instead of ROW_NUMBER
      ROW_NUMBER() OVER (ORDER BY payment) <= COUNT(*) OVER () * 0.001
   OR ROW_NUMBER() OVER (ORDER BY payment) >= COUNT(*) OVER () * 0.999;

-- or, alternatively, with QUANTILE:
SELECT * FROM cust_table
QUALIFY QUANTILE(1000, payment) IN (0, 999);
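QUANTILE(1000, payment) assigns each row to one of 1,000 buckets numbered 0 through 999 based on its payment rank, so bucket 0 is roughly the bottom 0.1% and bucket 999 roughly the top 0.1% (ties can make the buckets slightly uneven).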
Actually, I have 2 segments across which I have to find outliers. So the code I have used is something like this, and it is spooling out.
Qualify Row_number() over(partition by offer_name,contact_status order by tpv) < count(*) * 0.001
group by 1,2,3,4;
This is not going to work; you must change the count to COUNT(*) OVER (). And you can probably remove the GROUP BY, too.
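Something like this, roughly (only a sketch reusing the column names from your snippet; cust_payments is a placeholder table name, and the COUNT is partitioned the same way as the ROW_NUMBER so the 0.1% threshold applies within each segment):

SELECT cust_id, offer_name, contact_status, tpv
FROM cust_payments
-- bottom 0.1% and top 0.1% by tpv within each (offer_name, contact_status) segment
QUALIFY ROW_NUMBER() OVER (PARTITION BY offer_name, contact_status ORDER BY tpv)
        <= COUNT(*) OVER (PARTITION BY offer_name, contact_status) * 0.001
     OR ROW_NUMBER() OVER (PARTITION BY offer_name, contact_status ORDER BY tpv)
        >= COUNT(*) OVER (PARTITION BY offer_name, contact_status) * 0.999;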
What are the datatypes of those columns?
Thanks for the quick response
The data types are:
cust_id CHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
OFFER_NAME VARCHAR(130) CHARACTER SET LATIN NOT CASESPECIFIC,
and even with COUNT(*) OVER () and removing the GROUP BY, it is still spooling out.