I'm trying to join two tables, Left table "TableA" being (30 GB and 90 milion rows) with the another table "tableB" ( 2.5 GB and 21 million rows).
Following is the sql
Select * from TableA t1 left join TableB t2
and T1.ColumnX is not null;
CREATE SET TABLE TableA
column1 Integer Not null,
column2 BigInt Not null,
ColumnX Integer compress,
Unique Primary Index(Column1,Column2)
CREATE SET TABLE TableB
ColumnX Integer Not NULL,
Unique Primary Index (ColumnX);
The TableB is always duplicated across ALL AMps which is causing my code to run for long time and lot of CPU seconds. IS there Any way we can avoid this Duplication.
The alternative plan is to redistribute TableA which appears to be a fairly wide table, lots of bytes to redistribute. Are there lots of nulls in columnx in TableA? Are there lots of duplicate values in columnX in TableA?
The rest of the plan is not provided and evidence is of provided to show that the duplication is the costly part of the plan. Since columnx is not unique in TableA, a product join needs to be done within value of columnx in tableB, this join step is probably the costly one not the redistribution.