I have a 3-billion-row table, which I staged into a NoPI table with BLC.
The table shows as 123.39 GB with a skew factor of 0.077.
When I insert/select that into a table whose PI is a column that is a unique key for the table (still with BLC), the table shows as 218.59 GB with a skew factor of 0.043.
I am trying to understand why the table is almost 100 GB larger when I inserted it into a PI table than when I TPT-loaded it into a NoPI table. I am going to try TPT-loading the PI table directly to see if that causes any change, and also insert/select it back to a NoPI... results from that to come.
Any thoughts on this major size discrepancy??
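For anyone reproducing this, the size and skew figures above can be pulled from the DBC.TableSizeV view. This is only a sketch: the database and table names are hypothetical, and the skew calculation is the common per-AMP formula, which may differ slightly from whatever tool produced the numbers above.

```sql
-- Hypothetical names (MyDB, Stage_NoPI); substitute your own.
SELECT  DatabaseName,
        TableName,
        SUM(CurrentPerm) / (1024.0 * 1024 * 1024)                    AS TotalGB,
        100 * (1 - AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm)))  AS SkewPct
FROM    DBC.TableSizeV
WHERE   DatabaseName = 'MyDB'
  AND   TableName    = 'Stage_NoPI'
GROUP BY DatabaseName, TableName;
```

CurrentPerm in DBC.TableSizeV is reported per AMP, which is why the sum gives total size and the AVG/MAX ratio gives a skew measure.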
I copied the table back to a NoPI and it resulted in a 219.77 GB table with 0.044 skew.
So now the question is why the TPT-loaded table is so much smaller than the insert/selected table. It will take several hours to TPT-load a table with a defined PI, but I will be doing that to see what the size of that table will be...
Have you inserted NoPI to NoPI, or PI to NoPI?
In both of those cases the inserts are AMP-local, so you should not see much difference in size between the source and target tables.
A NoPI-to-PI insert requires redistribution, and of course the index means some additional internal metadata is stored, which needs space. In that case you will see a size difference.
So the oddness continues.
Insert/select from NoPI to NoPI: size remains about 123 GB.
Insert/select from NoPI to PI: size grows to 218 GB. Insert/select from PI back to NoPI: size remains around 219 GB.
TPT directly into the PI table results in about 218 GB.
The only thing I can conclude is that the TPT load into NoPI is getting much better BLC. Redistributing the data by PI loses that compression advantage, and inserting back into a NoPI doesn't "unsort" the data, so you still have the same reduced BLC ratio?
Forcing BLC off turned the NoPI table into 386 GB, and the PI table into 386 GB as well, so I can only assume the difference is in the BLC compression ratio. Interestingly, TPT seems to get very good BLC, and once you sort the table via hashing, it gets worse :(
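For reference, "forcing no BLC" can be done per table with the BLOCKCOMPRESSION table option. A sketch with hypothetical names; the subquery CTAS form is used here so the copy can be declared NO PRIMARY INDEX, and the exact clause placement should be checked against your Teradata release:

```sql
-- Hypothetical uncompressed copy of the staging table.
CREATE TABLE MyDB.Stage_NoPI_NoBLC,
    BLOCKCOMPRESSION = NEVER
AS ( SELECT * FROM MyDB.Stage_NoPI )
WITH NO DATA
NO PRIMARY INDEX;

INSERT INTO MyDB.Stage_NoPI_NoBLC
SELECT * FROM MyDB.Stage_NoPI;
```

Comparing the sizes of the compressed and uncompressed copies gives the effective BLC ratio for each load path.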
From your test it looks like it matches what I said before, but your conclusion is not correct. If you inserted into the NoPI table over a relational (row-at-a-time) connection, even though it would run for years ;) your table size would not change.
Copying tables with INSERT...SELECT works differently. Even with BLC enabled, it does not re-evaluate compression every time, because that would be overhead.
When you do an INSERT...SELECT from PI to NoPI, it does not apply BLC again, because it assumes the source PI table is already compressed, and it does only AMP-local inserts. So the NoPI table's size remains the same as the PI table's.
If you would like to compress the NoPI table while inserting from the PI table, you can do that by avoiding AMP-local inserts. You need to check the explain plan, though. You can try joining the PI table to a dummy single row with a 1=1 condition and doing the INSERT into the NoPI table. I believe you will see the NoPI table size reduced compared to the PI table.
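The dummy-join workaround described above might look like the following. All names are hypothetical, and per the poster's advice the EXPLAIN should be checked to confirm the optimizer actually routes the rows through spool rather than doing AMP-local inserts:

```sql
-- Single-row dummy table used only to break AMP-local insert planning.
CREATE TABLE MyDB.Dummy1 (dummy_col BYTEINT NOT NULL) PRIMARY INDEX (dummy_col);
INSERT INTO MyDB.Dummy1 VALUES (1);

-- The always-true join forces the rows through spool, so the NoPI target
-- is rebuilt block by block and BLC can be applied again.
INSERT INTO MyDB.Target_NoPI
SELECT s.*
FROM   MyDB.Source_PI AS s
INNER JOIN MyDB.Dummy1 AS d
    ON 1 = 1;
```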
1. The PI table has a PI, therefore it has an 8-byte rowid added to each row. 8 bytes * 3 billion rows is 24 GB of rowids alone. And since, as you said, it is a unique PI, these bytes will not compress much.
2. The evidence you have provided suggests to me that the source data is clustered in some way by the source it came from. For instance, if it is transaction data sorted by date, store, customer, or product (or some combination of these), there will be a lot of similar data in close proximity. This clustering is preserved to some degree when the data is loaded into a NoPI table, because we take consecutive rows, pack them into a block, send that block to one AMP, and insert the records as a group. More occurrences of the same or similar data in a block make that block compress better.
When the same data is inserted into a PI table, it is distributed row by row by the hash of the PI and then inserted into the table in row-hash order. This removes any clustering from the source data set, effectively randomizing it by the hash of the PI. The common values are no longer clustered, and compression cannot remove as much commonality from each block.
3. When you copy PI to NoPI, the ordering of the PI is used as the default ordering. If you know the clustering of the source data, you can partially emulate it by using an ORDER BY clause on the insert/select. You cannot fully emulate it, because the data has already been distributed by the hash of the PI, and it would not be possible to put the rows back onto the same AMP together and make them adjacent.
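As a sketch of point 3, with hypothetical column names standing in for whatever the source extract was actually clustered on:

```sql
-- Partially restore the source clustering while copying PI -> NoPI.
-- txn_date and store_id are placeholders for the real clustering columns.
INSERT INTO MyDB.Target_NoPI
SELECT *
FROM   MyDB.Source_PI
ORDER BY txn_date, store_id;
```

As noted above, this can only partially restore the clustering: the sort re-groups similar rows within what each AMP receives, but the rows were already scattered across AMPs by the PI hash.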
That is pretty much the conclusion that I came to. Interestingly, as I added more months of data to the table, the compression ratio seems to have improved again. I am guessing that is because the "similar" data is coming up again: as I add many months of data for a subscriber (my new NUPI, as the UPI is not used often in joins), those subscriber rows are similar and are hashed together, hence better BLC.
I am too used to talking about columnar these days... My point 1 above is correct for columnar but not for a row-format NoPI table. A row-format NoPI table still has a rowid per row. Sorry if I confused anyone.