I have two tables (both tables have 4 integer columns) one table with 70k records and other with 1.1 million records. I need to join both of them without any condition based on three columns. I have primary index based on three columns. So the skew of these tables is not more than 7. When I run the query I get no spool space error.
I ran the query in Visual Explain, I see the query is estimated to take 7 seconds only. Whle the query is running, when i checked (select * from dbc.DiskSpace where DatabaseName =user). I see CurrentSpool for one AMP increases to the max spool value and I get the spool error. But the data distribution is almost even in both the tables, not sure why currentspool is increasing only for one Amp.
Any help will be appreciated.
What are the demographics of these tables .... e.g unique values per column? No more spool-space error occurs when it is running out of space on 1 AMP, so the way you write the SQL also matter.
Can you share your query?
Table1 has col1, col2, col3, index is compiste index based on these three columns.
Table2 has col1, col2, col3, here too index is compiste index based on these three columns.
select t1.col4, t2.col4 from Table1 t1
inner join Table2 t2 ON t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3
It is a simple inner join of these tables on these columns to get the result. Per column wise, the unique values are less, but if you combine three columns then there are many unique values. Skew is only 3 in each table.
sel hashamp(hashbucket(hashrow(col1,col2,col3))),count(*) from table1 group by 1 order by 2 desc;
sel hashamp(hashbucket(hashrow(col1,col2,col3))),count(*) from table1 group by 2 order by 2 desc
Please check from above if one of the amps is overloaded. i assume that it is a NUPI on both the tables so one of the amps would be heavily loaded. Also please collect the stats on the 2 tables for all the column combination.
i new to teradata, now i am working on to reduce skew. here is my details
initially my table has 2 PI having skew 60
now i create a table on tmp database with 3 PI which has skew 3.5 only
shall we go with this for reduction of skew or i need to check for col stats. can anyone tell me about the use col stats
As i mentioned in the other post below, skew should not be the only factor deciding the PI but the higher the skew the more the time consumed for working on the skewed data.
Collect Stats is primarily to indicate to the optimiser on the size of the tables, the number of values in the column etc. to make it take the most optimisex plan for executing a query. for example, let us take a million row table being joined to a 100 row table, there are many ways that this can be done.i shall take a couple of ways to explain. If the optimiser knows that one of them is huge(million rows) and the other is a very small table, it can go with either duplicating the small table on all amps and in some cases even a product join of the 2 tables can be done. if the optimiser knows that both the tables are considerably huge then it will join the table in the spool and then redistribute across all the amps etc., if the join is aPI based join then it will be building it in the amps and the like steps.
Collect stats provides a way to identify the demographics of the tables and get the best plan for the optimiser.
Skewfactor can also have a larger values if the data-set is small. How many AMPs you have on the system? And can you execute and share result of bellow SQL:
SELECT col1, col2, col3, COUNT(*) FROM table1 GROUP BY 1,2,3 HAVING COUNT(*) > 1;
You table is NUPI table. when you join two tables with NUPI your output will be multiple of your NUPI. for example
Table A is having 10 rows (3 columns (NUPI)) and table B is having 10 rows (3 columns (NUPI)). when you join Table A and Table B with NUPI, your output will be 100 rows.
Check the count for each unique value (NUPI).