Getting a dynamic Sample

Database
Enthusiast

Getting a dynamic Sample

I am attempting  to extract Control Groups in our Automated Batch. The table in the batch I hope to extract from contains a Communication_ID and a Customer_id. Each night the no of COMMUNICATION_IDs and CUSTOMER_IDs will vary. How can I write some dynamic SQL that will select a random sample (10%) of customers per COMMUNICATION_ID and can I set a MIN or MAX on this also?

I am using SQL Assistant V13..

Any help would be greatly appreciated.

Many Thanks

Tags (1)
8 REPLIES
Senior Apprentice

Re: Getting a dynamic Sample

Is the number of COMMUNICATION_IDs known and small:

SAMPLE
WHEN COMMUNICATION_ID = x1 THEN 0.1
WHEN COMMUNICATION_ID = x2 THEN 0.1
WHEN COMMUNICATION_ID = x3 THEN 0.1
WHEN COMMUNICATION_ID = x4 THEN 0.1
ELSE 0.1
END

This could be created dynamically based on the actual data, but i don't know if there's a limit for the number of WHENs.

Or use "WHERE RANDOM(1,10) = 1", but this is not exactly 10% and might miss an id with a small number of rows (this will also happen with SAMLPE when the number of rows is less than 5).

What do you mean by "set MIN and MAX"?

Dieter

Enthusiast

Re: Getting a dynamic Sample

Hi Dieter....thanks for the reply. The number of CommunicationIds will not be known. It too will vary each evening.

What I mean by Max and Min is: if there are only say 10 customers for one communication then I may not want to extract 10% (as the number of customers selected is so small). I would like to select a % between a Min and a Maximum ie select 10% but no less than 10 and no more than 100....unsure how to go about doing this...

Senior Apprentice

Re: Getting a dynamic Sample

How large is this table?

OLAP functions are more expensive than RANDOM or SAMPLE but you could do it like this:

SELECT * FROM tab
QUALIFY
ROW_NUMBER() OVER (PARTITION BY communication_id ORDER BY HASHROW(xxx)) <= 10 -- at least 10 rows
OR (ROW_NUMBER() OVER (PARTITION BY communication_id ORDER BY HASHROW(xxx)) <=
COUNT(*) OVER (PARTITION BY communication_id) / 10 -- 10 percent of the rows
AND ROW_NUMBER() OVER (PARTITION BY communication_id ORDER BY HASHROW(xxx)) <= 100) -- maximum of 100 rows

As RANDOM can't be used within an OLAP function xxx must be a column which is (close to) unique.

Instead of HASHROW(xxx) you could simply use "1", but this will probably result in skewed spool.

Dieter

Enthusiast

Re: Getting a dynamic Sample

Hi Dieter....that looks like it could be very useful. The table I will be selecting from will 99% of the time be between 30k=100k rows (not that large). Forgive my ignorance in terms of SQL I am a relatively new user and am self taught so have not had much in the qay of guidance but does the above provide a Random sample and would it be much more expensive than using a Random?

Many Thanks for the help...it is greatly appreciated...
Senior Apprentice

Re: Getting a dynamic Sample

This should be exactly the output you wanted, it's not a true statistically valid sample, but probably close enough.

When you need to get a better sample using RANDOM you must to nest it in a Derived Table:
SELECT * FROM
(SELECT t.*, RANDOM(1,1000000000) as xxx FROM tab AS t) AS tab
QUALIFY
ROW_NUMBER() OVER (PARTITION BY communication_id ORDER BY xxx) <= 10
...

As your table is small, you can easily do it.

Dieter

Enthusiast

Re: Getting a dynamic Sample

Hi Dieter...If I use your original syntax would I produce the same customers each time. What I mean by this is that as the year goes on the group of customers will be available for selection again and again. Does hashrow not place that data (customer_id) on the same amp each time meaning that they would be picked as samples everytime they pass through?
Senior Apprentice

Re: Getting a dynamic Sample

The same value will always result in the same HashRow, that's why is said "xxx must be a column which is (close to) unique".
If it's the logical Primary Key of this table you're not going to load the same row a second time.

But just go for the Derived Table with RANDOM, it's easier to write and there's a big advantage:
The size of the RANDOM is always 4 bytes while the PK might be much more bytes, thus requiring more spool space.

Dieter
Enthusiast

Re: Getting a dynamic Sample

Great stuff ...thanks Dieter