Data Distribution Skew

Database

Data Distribution Skew

I have 4000 rows in a table...
That table has a Unique primary index defined on one column. its datatype is varchar(19) datatype.
I have a 20 node system which has 200 AMPs. There is data distribution skew ranging from 90 to -50...

Could anyone help why this happens...is teradata parallelism works only with HUGE data ????

Thanks,
Pots.
10 REPLIES

Re: Data Distribution Skew

Teradata uses a hashing algorithm that somewhat mimics a "random" distribution. It's not completely random since the same primary index value will always go to the same AMP. The fewer rows that you have in your table compared to the number of AMPs in your system, the more uneven your distribution will be.

This is not any different than a true random distribution. If you had 200 rows in your table and you had 200 AMPs, if you randomly distributed the data across the AMPs, you would end up with no data on some of the AMPS and as many as 5 or more rows on other AMPs. As the number of rows in a table grows, the distribution of the table across AMPs will become more even.

You don't really have to worry about the uneven the distribution of small tables as long as your queries are not redistributing your large tables in order to join to your small tables. If this happens, your queries could become skewed to one or a few AMPs. Usually, if you have good statistics on the small table (and other tables involved in the query), the optimizer will instead choose to duplicate the small table across AMPs in order to join to larger tables. This would allow the query to be as evenly distributed as the large table.

Hope that helps.

Re: Data Distribution Skew

SELECT HASHAMP (HASHBUCKET (HASHROW (pi))), COUNT(*)
count(*)
from table
group by 1

you should first make sure you choose the right PI

Also , perhaps 4000 is too little for Teradata, perhaps you may use 1 million or more :-)

Regards:-)

Re: Data Distribution Skew

Hi All,
Please try all the options...still its not distributing data equally....Lets not work on DEMO VERSION OF TERADATA....
=========================================================
CREATE MULTISET TABLE TRAIN_DATA.Eql_Distribution ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
Column_1 INTEGER,
Column2 VARCHAR(9) CHARACTER SET LATIN CASESPECIFIC)
UNIQUE PRIMARY INDEX ( Column_1 );
=======================================================================
USE THE FOLLWOING to insert 2,00,000 rows and check the deviation using the below sql
===================
REPLACE PROCEDURE TRAIN_ETL_TARGET.Insert_rows()
BEGIN
DECLARE Cnt Integer DEFAULT 0;

LABEL_ONE:
WHILE Cnt <200000
DO

INSERT INTO TRAIN_DATA.EqualDistribution VALUES( :Cnt, 'ABC');
SET Cnt = Cnt + 1;
END WHILE LABEL_ONE;
END;

Then select whether there is even distribution across ALL AMPS, SOMEHOW I AM NOT CONVENSED with the UNIFORM DISTRIBUTION OF DATA
===============================================================
SELECT dt1.a (TITLE 'AMP')
,dt1.b (TITLE 'Rows')
,((dt1.b/dt2.x (FLOAT)) - 1.0)*100 (FORMAT'+++9%',TITLE 'Deviation')
FROM
(SELECT HASHAMP(HASHBUCKET(HASHROW(Column_1)))
,COUNT(*)
FROM TRAIN_DATA.TRY
GROUP BY 1
)dt1 (a,b)
,(SELECT (COUNT(*) / (HASHAMP()+1)(FLOAT))
FROM TRAIN_DATA.TRY
)dt2(x)
ORDER BY 2 DESC,1;

Re: Data Distribution Skew

Create a table with a million rows instead of 4000. Then you will be convinced....

Re: Data Distribution Skew

By the way, a much easier way to see if a table is distributed evenly is to simply check the currentperm used on each AMP. For example, the following query will tell you how much space is used on the table by AMP:

SELECT vproc
,currentperm
FROM dbc.tablesize
WHERE databasename = 'xxx'
AND tablename = 'yyy';

where xxx is your databasename and yyy is your table name. You can also compare the min to max as a good way to tell if the table is being distributed evenly.

SELECT max(currentperm) / min(currentperm)
FROM dbc.tablesize
WHERE databasename = 'xxx'
AND tablename = 'yyy';

This gives you the ratio of your largest space per AMP compared to the smallest.

It is very difficult not to have an evenly distributed table if you have a unique primary index and an adequate number of rows. You would literally have to know how the hashing algorithm works and purposely put in values that would cause it to be unevenly distributed.

The problem is that you do not have enough rows compared to the number of AMPs.

Re: Data Distribution Skew

Thanks Barry...

I tested with the UNIQUE PRIMARY INDEX for the following rows

ROWS SKEW AMPS
66000 6% 48
100000 4% 48
200000 1% 48
300000 2% 48
400000 1% 48
500000 1% 48
600000 0% 48

THanks for sharing knowledge...I was in the intention that if there is a UPI, then it will distribute data evenly across all AMPS.......:-(

As per the teradata manuals, if we have Unique Primary Index for a table, then the data will be distributed Avenly across all amps...

Thanks for throwing the LIGHT...

--Srini

Re: Data Distribution Skew

Thanks Barry...

I tested with the UNIQUE PRIMARY INDEX for the following rows

ROWS SKEW AMPS
66000 6% 48
100000 4% 48
200000 1% 48
300000 2% 48
400000 1% 48
500000 1% 48
600000 0% 48

THanks for sharing knowledge...I was in the intention that if there is a UPI, then it will distribute data evenly across all AMPS.......:-(

As per the teradata manuals, if we have Unique Primary Index for a table, then the data will be distributed Avenly across all amps...

Thanks for throwing the LIGHT...

--Pots

Re: Data Distribution Skew

Hi Experts,

If the Table is small and having the records which are less than the no.of amps , then in this case , then skewness or improper distribtuion is bound to happen  across the amps.

is there any way out that we can handle this ?

Cheers!

Nishant

Re: Data Distribution Skew

Hi Nishant,

if  a table is small then you do not need to worry about skew. It is unavaidable to have some skew and you do not need to worry about measuring it until you have at least several thousand rows per amp. The impact is minimal unless of course your PI is poorly chosen and all the data is on one amp.

--  Shelley