Thanks for that info.
So the Explain estimate is 14 hours. How long did the actual run take? Or how long did you allow it to run before cancelling the query?
Or are you basing your 'too slow' estimate on the estimated time shown in the Explain?
If the answer to my last question is 'Yes' then that is not a good idea. The times shown in an Explain plan usually have no relationship to the real elapsed time of the query.
For example, this time doesn't take into account machine loading, priority, etc. The estimated time should only be used as a comparison between two pieces of SQL that have the same functionality and run against the same data (basically two ways of writing the same SQL).
The Explain plan that you show is exactly what I'd expect to see for this processing.
The table has the default 65,535 partitions, with data in 143 of them.
Here are the input table stats:

Unique Values    Column Names
4,357,183,253    *
143              PARTITION
13,483,051       AR_ID
10               CR_DB_IND_CL
2,071            EFF_DT
1,911            END_DT
2                INT_APL_TP_CL
2                INT_RATE_CSZ_F
41               INT_TP_CL
143              LDB_ID
11               SEQ_ID
4,357,183,253    AR_ID,EFF_DT,INT_TP_CL,LDB_ID,SEQ_ID
13,483,051       PARTITION,AR_ID
85,255,729       AR_ID,INT_TP_CL,LDB_ID,SEQ_ID
13,599,979       AR_ID,LDB_ID
426              PPN_EV_ID
394,245          EFF_DT,END_DT
13,599,979       PARTITION,AR_ID,LDB_ID
74,917,253       AR_ID,LDB_ID,INT_TP_CL
74,917,253       AR_ID,INT_TP_CL
10,582           INT_BAL_MX_AMT
I cancelled the original CREATE TABLE AS query after two hours because of the DBC block.
I have just executed the INSERT/SELECT query and will leave it running for at least an hour to see if the Explain estimate is just way off.
I would expect this kind of FastPath INSERT/SELECT to be done within an hour; it's 'only' 280 GB.
Thanks. What you might also want to do (using another session) is:
- run "SELECT TIME,SUM(CURRENTPERM) FROM DBC.TableSizeV WHERE DATABASENAME = '**bleep**' AND TableName = 'yyy';"
- run "LOCKING TABLE yyy FOR ACCESS SELECT TIME,COUNT(*) FROM yyy;"
'xxx' is the name of your database and 'yyy' is the name of your table.
This will give an idea of the growth of data in your table from both a row count and a perm space perspective.
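If you also want to see which AMPs the new rows are landing on, a per-AMP variant of the perm space query (just a sketch, reusing the same 'xxx'/'yyy' placeholders) is:

SELECT Vproc, SUM(CurrentPerm) AS perm_bytes    /* perm space held by this table on each AMP */
FROM DBC.TableSizeV
WHERE DatabaseName = 'xxx'
  AND TableName = 'yyy'
GROUP BY Vproc
ORDER BY perm_bytes DESC;

If only a couple of AMPs show their figure growing from one run to the next, the rows are not being spread evenly.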
Looking at Viewpoint during the INSERT/SELECT, I am a bit surprised to see that only two AMPs are doing work.
This snapshot is from the query's halfway mark (0:32:00).
98.7% CPU and IO skew.
So if only 2 AMPs are doing any work that would explain why it is so slow.
This has to come down to the PI of the target table and the number of unique values. You might be REALLY unlucky and have lots of unique values which hash to only 2 AMPs, but that's unlikely...
However, from what you've shown earlier, column AR_ID is the PI, there are ~13.5M unique values, and your skew is ~1%. Only two AMPs doing the processing doesn't make sense (at least to me...).
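One way to rule out an unlucky hash distribution (just a sketch; source_table is a placeholder for the table you are selecting from) is to count how many source rows would hash to each AMP under an AR_ID primary index:

SELECT HASHAMP(HASHBUCKET(HASHROW(AR_ID))) AS amp_no,    /* AMP the row would land on */
       COUNT(*) AS row_cnt
FROM source_table
GROUP BY 1
ORDER BY row_cnt DESC;

If the row counts are roughly even across all AMPs, PI hashing isn't the culprit.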
Sorry to ask, but are you really sure about the skew of the table? Please try the following:
SELECT 100.00 * AVG(currentperm) / MAX(currentperm) AS par_eff FROM dbc.tablesizev WHERE databasename = 'x' AND TABLENAME = 'y';
If the skew is 1%, then this should come back with an answer of about 99%.
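As a made-up illustration of the arithmetic: if the average AMP holds 990 MB of the table and the fullest AMP holds 1,000 MB, then par_eff = 100.00 * 990 / 1,000 = 99.0, i.e. roughly 1% skew.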
I think you said that you're on TD15.x, so it isn't a 2-AMP hash map...
Unless there are row hash locks on every other AMP in the system, which would cause the processing on those AMPs to block, I'm struggling to understand why this might happen.
Just a thought (and I'm not sure if this is even possible): can you check for ARC locks on this table? Use Viewpoint / Remote Console / Query Session to display these.
The output of your skew query is 99.50, i.e. roughly 1% skew or less.
No ARC Locks on the table.
Here are 4 screenshots from Viewpoint at the 5min, 15min, 25min and 55min mark.
The query uses all AMPs for the first 10 minutes.
After that point it only uses 2 AMPs for the remainder, and the CPU and IO counts stop in their tracks.
This points to massive data skew but the skew number for the table just doesn't back that up??? I'm really puzzled.
Thanks for using your spare time on helping me out.
Puzzled? You and me both!
Any Join or Hash Indexes defined on the target table?
Any triggers defined on the target table?
Apart from that I think you might need to speak to TD CS about this.
Sorry I couldn't help any further.
A couple of paths to explore:
- The NUPI is fairly non-unique. Even though there is no skew in the table as a whole, are there AR_IDs that have a LOT more rows than the average of 323 rows per value? And are there many LDB_IDs for a single AR_ID, or is one AR_ID contained within one LDB_ID? (A quick way to check is sketched after this list.)
- Other workload. If there is other, very skewed work on the platform, then this INSERT/SELECT will be slowed down on the AMPs where that skewed work is running and will be stretched out there. The other AMPs will complete their share, and those few will have to keep running to finish.
- As Dave suggested, ask Teradata CS to health check the hardware. We have had cases of "slow luns" where a disk or controller was acting poorly without raising any fault indications. Or a memory chip that is beginning to fail could cause a performance issue in a node. Point them at the two nodes identified as the skewed ones.
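For the first bullet, a check along these lines might help (just a sketch; source_table is a placeholder for the table you are selecting from):

SELECT TOP 20
       AR_ID,
       COUNT(*) AS row_cnt,                   /* rows per AR_ID value */
       COUNT(DISTINCT LDB_ID) AS ldb_cnt      /* LDB_IDs seen for this AR_ID */
FROM source_table
GROUP BY AR_ID
ORDER BY row_cnt DESC;

If the biggest AR_IDs have row counts far above the ~323 average, those values will pile up on the few AMPs they hash to during the insert.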