ETL Process questions

Database
Highlighted
Fan

ETL Process questions

We are looking at optimizing our DataWarehouse load process. Following is a core requirement for any table that we load:

 

  1. Data load must not impact existing users.
  2. New data loaded must be available to users only on a scheduled date.

Currently, we make copies of production tables, load/append new data and rename them on the scheduled date.

One drawback of this approach is that it requires additional space since we have to maintain multiple copies of tables. Many of our tables are very large – 5TB+

 

An approach I have taken in the past with Oracle (in a different project) is as follows:

Environment
==========

Database Engine: Oracle

Frequency of loading: Monthly

 

Approach
========

  1. Create a separate tables for each month. For example, FACT_[SUBJECT]_[201701], FACT_[SUBJECT]_[201702], …. FACT_[SUBJECT]_[201712]
  2. Create a view that involves UNION ALL between these tables
  3. Let’s say, we are loading data belonging to January, create a table FACT_[SUBJECT]_[201801]
  4. On the release date, include this table also as a part of the view.

 

This approach has couple of advantages:

  1. The space required is less since we don’t have to make copies of tables.
  2. If a query involves only data from specific base tables which are a part of the view, Oracle’s query optimizer was smart enough to go against just those tables (something similar to how optimizer works against table partitions)

Questions:

========

  1. Will this approach work with Teradata?
  2. Are there any other options?

 

We would love your feedback. Thanks.

 

3 REPLIES
Teradata Employee

Re: ETL Process questions

Fact tables of this size are commonly row-partiioned.  Partitioning doesn't mean the same thing in Teradata as in Oracle: rows in the same partition are stored together, but they are still spread over all the AMPs.  In other words, each parallel processor's storage has partitions within it.  This enables partition-level locking, the default locking level in Teradata DML on partitioned tables.

So if you defined a partitioning primary index on year-month, you could update a partition (perhaps in this case just insert rows to a predefined but empty partition) without locking the whole table.  There would be only one table and no need for a union view.  I'm just not sure if this lock would be escalated to a table lock during a mass insert such as insert-select or merge.  Maybe qualifying the insert with a predicate on the year-month would be enough to tell the optimizer you are just updating that partition, or maybe you would have to export the updates and use singleton inserts via TPT Stream.  There are others on this forum who could elaborate on that.

Fan

Re: ETL Process questions

Thank you. This might help with lock management. Another requirement we have is that users must not see the newly loaded data until it is officially released. Is there a way to disable and enable data in a given partition?

Junior Supporter

Re: ETL Process questions

Shielding the new data can easily be done by only providing access through a view, which includes a filter on the date.

Using the same Partitioning key for this filter makes it very efficient.

To make new data available, requires an update of the view.

Teradata Frank, Certified Master