Why don't you use ALTER TABLE to alter a table?

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Junior Contributor

To add or drop a column or modify the list of compressed values of an existing table is a quite expensive operation. For a large table it might result in a huge amount of CPU and IO usage and a loooooooong runtime. This blog discusses the pros and cons of the different ways to do it.

Alter Table vs. Insert Select vs. Merge Into

As always in SQL one got multiple choices to reach the same goal: modify a table directly or move the data to a new table. The former is ALTER TABLE (Alter), the latter INSERT SELECT (InsSel) or its less well-known variation MERGE INTO (Merge).

Let's start with a list of pros and cons (red = negative, green = positive):

  ALTER TABLE INSERT SELECT MERGE INTO
Needs Transient Journal? no no no
ABORT possible? no yes (fast) yes (fast)
Rollback during system restart? no yes (fast) yes (fast)
LOCK on source table

exclusive

read

read

Spoolspace used

no

yes, same as source

no, when PI doesn't change

Additional Permspace used

low, 2 cylinders per AMP

high, same as source

high, same as source

Works on a table copy? no yes yes
Must Create/Drop/Rename Table?

no

yes

yes

Must recreate

Secondary/Hash/Join Indexes


Foreign Keys/Statistics/Comments

Access Rights?
no yes yes
Supports changing

Primary Index/Partitioning?
no yes yes

You can easily spot that InsSel and Merge are quite similar, but Alter is usually different.

The only common ground is the Transient Journal, all three don't use it (of course there are some entries indicating there some work going on, but the actual rows are not journaled). Due to that fact InsSel and Merge can be easily aborted and will rollback quite fast (just deleting all rows in the target table), but once Alter started it must finish, there's no way to abort it. Even a system shutdown can't stop it, it will simply continue after the restart. Some will consider this as positive others as negative :-)

The most important difference is the availability during the restructure process: Both InsSel and Merge apply a read lock allowing concurrent read access while Alter needs an exclusive lock blocking any access to the target table. That's the main reason why Alter is not used in most environments. Additionally before TD13 there was a table level write lock on dbc.AccessRights which was held throughout the whole process easily blocking other sessions. Yet in current releases this lock duration has been greatly reduced, now other requests will only be blocked for a short period. Some additional RowHash locks on system tables usually don't interfere with other requests, but might block backups.

Alter doesn't use Spool, it just moves blocks on a cylinder level, Merge directly merges the source rows into the target table (when source and target share the same [P]PI). But InsSel always needs to spool the source data, of course this is especially bad for large tables when explain shows "The result spool file will not be cached in memory".

Keeping a copy of the original table is often regarded as an advantage of Merge and InsSel ("just in case"), but when you're constraint on permspace you might prefer Alter's low overhead of a few megabytes per AMP.

However the biggest advantage of Alter is its simplicity, just submit "ALTER TABLE tab ADD new_col int, ADD existing_col COPRESS ('bla');", that's it.

Compare this to all the additional steps needed for InsSel or Merge. It's not only CREATE/DROP/RENAME, all those COMMENTs, GRANTs, COLLECT STATS must be scripted before and then reapplied, too. Maintaining Referential Integrity might be complicated when the table is referenced in a Foreign Key. And to speed up processing the target table will be created with the Primary Index only, any additional index must be recreated subsequently.

Resource usage and runtime

I'm not showing exact numbers because your mileage may vary, but for tables without secondary indexes the CPU/IO scoring is usually:

  1. Alter Table
  2. Merge Into
  3. Insert Select

In my test cases (no change of [P]PI) Merge needed almost twice the CPU and IO of an Alter and InsSel added another 20%.

When Secondary/Hash/Join indexes exist InsSel gets closer to Merge but the gap to Alter increases drastically: Alter still needs to modify only the base rows instead of re-building all the indexes.

Runtime differences should be similar to CPU/IO, but they will vary greatly amongst systems due to different bottlenecks and you should run some tests on your own system.

Conclusion

I would strongly recommend implementing Alter Table, at least start considering it. If you're concerned about availability you should bear in mind that this process will probably be scheduled out of business hours anyway.

When you want the safeness of a copy of the old table you should definitely prefer Merge Into over good ol' Insert Select, only if you need to change the PI or partitioning of a table there's not much difference between both.

10 Comments
Enthusiast

This is something we struggle with deciding how to roll changes into tables multiple terabytes and hundreds of columns wide. Often the business would like us to keep a backup of the table until the change has been verified so I typically elect to use MERGE. One thing that I stumbled across is the ability to copy statistics from one table to another.

So my typical approach to applying table changes to a large production table is as follows:

  1. CREATE TABLE {NewTable} AS {OriginalTable} WITH NO DATA AND STATS; -- Stats will exist but be empty (TD 13.10)
  2. COLLECT STATS ON {NewTable} FROM {OldTable}; -- Stats now populated.
  3. ALTER TABLE /* Apply DDL changes */
  4. MERGE INTO 
  5. Perform validation and cleanup steps.

For the most part I have the luxury that space to store a redundant copy of the table has not been an issue. It's always a gamble to determine if this table restatement is faster than simply applying the change(s) via ALTER but when a backup of the table is required it almost becomes a moot point.

Thanks for the write-up though because it helps clarify things and provide options on how to manage change within your data model. 

N/A

Thnk for the useful information. I was thinking about this in the past but i never had strong arguments to use it.

Junior Contributor

Hi Rob,

thanks for your supplementary notes.

Of course COLLECT STATS FROM avoids recollecting, it was one of the TD13 features i liked best, but maybe it's less well-know.

Dieter

Enthusiast

Dieter, are there any concerns about physical data fragmentation when perfoming an ALTER  on a large table? In past versions we have been advised against ALTER statements for this reason.

Enthusiast

Thanks a lot dieter & rob  for such a nice information.

Just want to clear one doubt in the steps mentioned by rob above-->

In case , we miss to perform the step 2  & simple execute  refresh stats script in the end.

Collect stats on new_tab ;

Is it equivalent or are we missing some stats info. here.?

Thanks in advance!

Cheers!

Nishant

Junior Contributor

Hi Nishant,

when you define NO DATA the stats definitions are copied, but the actual values are set to zero rows.

Now when you re-collect the stats you need resources to do it, while a COLLECT FROM simply copies the existing stats without any overhead. So you better do step #2 any time before you drop the original table :-)

Enthusiast

Hello Dieter,

When you run an alter statement on a very large non empty table to extend its current ppi date range, do you expect such an alter statement to run quickly without any transient journalling or rollback concerns(in case of an abort)?

The case in point is a 3TB table and I need to extend the current date range of this table(which is 31-dec-2015 )all the way to 31-12-3999.

Regards,

Suhail

Junior Contributor

Hi Suhail,

there might be one specific case where this is slow:

NO RANGE is defined and there are lots of rows in it. All rows matching the new partitioning will automatically be moved to the new partitions and deleted from NO RANGE using Transient Journal.

Otherwise it's just a simple modification of the system tables to update to the new definition.

Enthusiast

ok thanks for the confirmation Dieter. I reviewed a sample explain plan and verified that too.

-Suhail

I'm starting to use ALTER's more and more, particularly where the Column naturally resides at the end of the Table.

Interesting point about the Locking level required.

Oh by the way, for NOT NULL Columns I tend to use a Default value (Blank on Character based Columns) as part of the Alter Statement.