I am trying to insert records (72 million+) from one table into a larger "master" table, which will be the final table once everything is complete. The master table currently has 330 million rows. The insert is a simple BTEQ script: INSERT INTO "master table" SELECT * FROM "subset table";
I have broken the subset table into smaller table segments (4-10 million rows each), thinking that would help ease the loading times. Even these smaller segments take hours, and in some cases days, to load.
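For reference, the load is essentially the following BTEQ script (a sketch; the table names here are placeholders for the real ones):

```sql
.LOGON tdpid/user,password;

-- One segment at a time; each segment is 4-10 million rows
INSERT INTO master_table
SELECT * FROM subset_table_segment_1;

.LOGOFF;
```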
The master table is defined as a SET table, and has a Primary Index and is partitioned by year. I have collected statistics on the key and the partitions, but this does not seem to have helped with the time to load.
I was thinking about the SET vs MultiSet definition of the table. Will this help at all?
Every year I will receive another 100 million records that need to be loaded to this table, so improving the processing is important.
Any suggestions you have would be greatly appreciated.
If the table has a NUPI with lots of records with the same PI value (eg an account number for transactions), then Multiset will definitely help. It will bypass the duplicate row check, which is almost certainly what is causing the long load times. If you do go to a multiset table though, you become responsible for ensuring there are no duplicate records!
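You cannot change SET to MULTISET on a populated table with a simple ALTER, so you would rebuild it. A rough sketch (table names are placeholders, and you would need to respecify your partitioning and secondary indexes on the new table):

```sql
-- Create an empty MULTISET copy of the master, then reload it.
-- CREATE TABLE ... AS ... WITH NO DATA copies the column definitions.
CREATE MULTISET TABLE master_table_ms AS master_table WITH NO DATA;

INSERT INTO master_table_ms
SELECT * FROM master_table;
```

After verifying row counts, you would drop the old table and rename the new one into place.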
You should take a look at your partitioning scheme. I presume that all (or the vast majority) of the records you load are for the most recent period. You do not say whether you load the 100M records once per year, or as approx. 9M per month. If you load 100M in January, 90M of these will go into last year's partition, adding to the 9-10M you loaded for last January. 10M will go into the new partition. The 10M will go in much quicker because they are going into an empty partition.
Try changing to monthly partitions instead of annual. The limit of roughly 65K partitions still gives you over 5,000 years of monthly partitions, so probably not a problem.
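Something along these lines, assuming you have a DATE column to partition on (Load_Dte here is a hypothetical column name):

```sql
-- Sketch: monthly instead of annual partitions
PARTITION BY RANGE_N(
  Load_Dte BETWEEN DATE '1980-01-01' AND DATE '2100-12-31'
    EACH INTERVAL '1' MONTH,
  UNKNOWN)
```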
Here is what I get when I do a Help Statistics on the main table:
Date       Time      Unique Values  Column Names
09/07/23   01:50:20  10,468,831     Irmf_Pyee_Id
09/07/23   01:50:24  6              PARTITION
Is this the info you were asking for?
Jim, the table is partitioned by YEAR. This is tax data I am dealing with, so I will get a new file once a year to load into this table. Unfortunately, I do not have a month that I could use for partitioning, only Tax Year.
Here are the key fields as defined in the DDL now:
PRIMARY INDEX XPKINFORMATION_RETURN ( Irmf_Pyee_Id )
PARTITION BY RANGE_N(CAST((Tax_Yr_Dte) AS INTEGER) BETWEEN 1980 AND 2100 EACH 1, UNKNOWN)
INDEX Pyertest1 ( Pyer_Adrs_Id )
INDEX Pyeetest1 ( Pyee_Adrs_Id )
INDEX PYerSec2 ( Irmf_Pyer_Id );
Should I collect statistics on all of these index fields, in addition to the PI?
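If so, I assume the statements would look something like this (using the column names from the DDL above):

```sql
COLLECT STATISTICS ON master_table COLUMN Irmf_Pyee_Id;
COLLECT STATISTICS ON master_table COLUMN Pyer_Adrs_Id;
COLLECT STATISTICS ON master_table COLUMN Pyee_Adrs_Id;
COLLECT STATISTICS ON master_table COLUMN Irmf_Pyer_Id;
COLLECT STATISTICS ON master_table COLUMN PARTITION;
```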
Hi Paul, this is approx. 33 rows per value on average, which should be ok (especially if they are spread across different years). But the maximum might be totally different, so please do a help stats only for the PI column: HELP STATS tablename COLUMN Irmf_Pyee_Id;
If QueryLog is enabled on your site you might also check it for CPU-usage of those ins/sel queries.
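For example, something like this against the standard DBQL view (a sketch; the user name and query-text filter are placeholders):

```sql
-- Find the CPU and IO consumed by the insert/select runs
SELECT StartTime, AMPCPUTime, TotalIOCount, QueryText
FROM DBC.QryLogV
WHERE UserName = 'your_user'
  AND QueryText LIKE 'INSERT INTO%'
ORDER BY StartTime DESC;
```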
I am going to ask a dumb question. When you say Help Statistics on the PI, what do you mean? In my post above I showed what I get when I do a Help Statistics on the table, and that field listed is the PI. I'm not sure what to do to get the Statistics on the PI.
Or are you asking about the skewness? That is 12.7
I agree...10 hrs to load 4.5 million rows is just unacceptable! That is why I am asking the forum for help.
I ran the Help Stats statement you provided (thanks), it produced a pretty long output record. I am not sure what all you need to see, but I took a guess. Here it is:
Number of Rows..........370692047
Number of Nulls.........0
Number of Intervals.....100
Number of Uniques.......10473786
Numeric.................N
Min Value...............0
Mode Value..............300027069
Mode Frequency..........5698859
Max Value...............300027069
Mode Value..............141688470
Mode Frequency..........4787856