Random sampling - ORDER BY HASHROW versus ORDER BY NULL

Database

Random sampling - ORDER BY HASHROW versus ORDER BY NULL

I'm developing a query to return a pseudo-random sample with the following characteristics:

- The number of LOC_ID records returned in the sample should be proportional to the percent of that LOC_ID overall (for the specified date range)

- The returned records should be unique

- It should return different records each time it runs

By perusing the posts listed below, I've come up with this test SQL. It works, except it's returning the same set of records each time. If I replace the "ORDER BY HASHROW(ACCOUNT.ACCOUNT_ID)" bit with "ORDER BY NUL"L it does return a different set each time. However, I don't know if using NULL somehow skews the results.

I'd appreciate any advice on how best to meet my goals above.

WITH PARAMS (SAMPLE_SIZE,START_DATE) AS

    (SELECT

        ?INPUT_SAMPLE_SIZE AS SAMPLE_SIZE,

        ?INPUT_START_DATE AS START_DATE

    )

SELECT

    ACCOUNT.LOC_ID,

    ACCOUNT.ACCOUNT_ID,

    ACCOUNT.CONTACT_DATE_TIME,

    SAMPLING.LOC_SAMPLE_SIZE,

    PARAMS.START_DATE,

    PARAMS.SAMPLE_SIZE

QUALIFY ROW_NUMBER() OVER (PARTITION BY ACCOUNT.LOC_ID ORDER BY HASHROW(ACCOUNT.ACCOUNT_ID)) <= SAMPLING.LOC_SAMPLE_SIZE

FROM ACCOUNT

    INNER JOIN

        (SELECT

            ACCOUNT.LOC_ID,

            (COUNT(ACCOUNT.LOC_ID) * 1.00000000 / (SELECT COUNT(*) FROM ACCOUNT WHERE ACCOUNT.CONTACT_DATE_TIME >= PARAMS.START_DATE)) * PARAMS.SAMPLE_SIZE AS LOC_SAMPLE_SIZE

        FROM ACCOUNT

            CROSS JOIN PARAMS

        GROUP BY ACCOUNT.LOC_ID

        WHERE ACCOUNT.CONTACT_DATE_TIME >= PARAMS.START_DATE) AS SAMPLING

            ON ACCOUNT.LOC_ID = SAMPLING.LOC_ID

    CROSS JOIN PARAMS

WHERE

    ACCOUNT.CONTACT_DATE_TIME >= PARAMS.START_DATE

ORDER BY 1


http://stackoverflow.com/a/10952697

http://forums.teradata.com/forum/database/getting-a-dynamic-sample

https://forums.teradata.com/forum/database/sample-without-duplicates

3 REPLIES
Enthusiast

Re: Random sampling - ORDER BY HASHROW versus ORDER BY NULL

This might help.

 CREATE Volatile TABLE l1 (loc_id SMALLINT) on commit preserve rows;
CREATE Volatile TABLE l2 (loc_id SMALLINT) on commit preserve rows;
CREATE Volatile TABLE l3 (loc_id SMALLINT) on commit preserve rows;
CREATE Volatile TABLE l4 (loc_id SMALLINT) on commit preserve rows;

INSERT INTO l1
SEL loc_id FROM Locs SAMPLE .25;

INSERT INTO l2
SEL loc_id FROM Locs
WHERE loc_id NOT IN (SEL loc_id FROM l1) SAMPLE .33;

INSERT INTO l3
SEL loc_id FROM Locs
WHERE loc_id NOT IN (SEL loc_id FROM l1
UNION SEL loc_id FROM l2) SAMPLE .5;

INSERT INTO l4
SEL loc_id FROM Locs
WHERE loc_id NOT IN (SEL loc_id FROM l1
UNION SEL loc_id FROM l2
UNION SEL loc_id FROM l3);;
SEL loc_id FROM Locs
MINUS(
SEL loc_id FROM l1
UNION
SEL loc_id FROM l2
UNION
SEL loc_id FROM l3
UNION
SEL loc_id FROM l4);

Sel loc_id from l1;
Sel loc_id from l2;
Sel loc_id from l3;
Sel loc_id from l4;

Rglass

Senior Apprentice

Re: Random sampling - ORDER BY HASHROW versus ORDER BY NULL

The ACCOUNT_ID is probaby the PK of you table, so you order by the hash of a a unique column, of course this returns the same result over and over again.

But when you ORDER BY NULL (or 0 or 42) all rows of a partition got the same value. You get a more random result, but all rows from an LOC_ID are processed on a single AMP -> calculation will probably be skewed.

Besides ORDER BY NULL is probably not really random, so when you want a random result without skewed calculation you need to order by a random value. As you can't use RANDOM directly within an OLAP function you have to nest it.

Regarding your approach, it will not return an exact number of rows as specified in SAMPLE_SIZE, for small numbers it might be totally wrong (you probably know that).

And it's overly complicated :-)

You better use this:

WITH cte AS
(
SELECT
...
,LOC_ID AS part_col
,?INPUT_START_DATE AS START_DATE
,RANDOM(-2147483648,2147483647) AS rnd -- maximum range of random values
FROM ACCOUNT AS t
WHERE CONTACT_DATE_TIME > START_DATE
)
SELECT cte.*
,CAST(?INPUT_SAMPLE_SIZE AS DEC(38,0)) AS SAMPLE_SIZE
,COUNT(*) OVER (PARTITION BY part_col) AS part_cnt
,part_cnt * SAMPLE_SIZE / (SELECT CAST(COUNT(*) AS DEC(38,0)) FROM cte) AS LOC_SAMPLE_SIZE
FROM cte
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY part_col
ORDER BY rnd) <= LOC_SAMPLE_SIZE
;

In fact your try to get a modified PERCENT_RANK (but this is less efficient):

QUALIFY 
PERCENT_RANK()
OVER (PARTITION BY part_col
ORDER BY rnd) * (SELECT CAST(COUNT(*) AS DEC(38,0)) FROM cte)
< sample_size

Re: Random sampling - ORDER BY HASHROW versus ORDER BY NULL

Thank you Dieter, this looks great. I'll leave another comment when I've had a chance to try it, but you've answered my questions just as I'd hoped, and cleaned it up, as I expected :).

I do realize that it returns less rows than the specified SAMPLE_SIZE, for example 9982 for 10,000. That's okay though.