Teradata Parallel Transporter Active and Batch Directory Scans


With traditional Teradata utilities such as FastLoad, MultiLoad, and TPump, multiple data files are usually processed in a serial manner. For example, if the data to be loaded into the data warehouse reside in several files, those files must either be concatenated into a single file before loading or processed sequentially, one file at a time, during the load.

In contrast, Teradata Parallel Transporter (TPT) provides a feature called “directory scan”, which allows the data files in a directory to be processed in a parallel and scalable manner as part of the loading process. In addition, if multiple directories are stored across multiple disks, a TPT feature called “UNION ALL” can be used to process these directories of files in parallel, achieving greater throughput through scalability and parallelism across disks.

Figure 1 illustrates an ETL scenario where data in multiple directories are extracted through the Data Connector operator, transformed using SQL SELECT, CASE, and WHERE logic, and loaded into Teradata tables through the Update (MultiLoad) operator.



Figure 1: Batch directory scan

Directory scans also provide an option for users to select files for processing based on file names, including wildcard specifications. The Data Connector operator provides scalable, parallel access to multiple files in a load-balancing manner. By “load balancing” we mean that the files are distributed as evenly as possible among the operator instances. Note that even though the distribution is at the file level, the Data Connector operator factors the file sizes into the distribution formula.
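
The degree of parallelism is requested where the operator is referenced in the APPLY statement. As a minimal sketch (the operator name, directory, and wildcard here are hypothetical; the full statements appear in the sample scripts below), the following asks for four Data Connector instances, which divide the matching files among themselves:

SELECT * FROM OPERATOR ( FILE_READER[4] /* four instances share the files matching the wildcard */
ATTR ( DirectoryPath = 'dir1', FileName = 'INVRCV.*' ) )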

There are two types of directory scan that TPT provides today: the "batch directory scan" and the "active directory scan". The batch directory scan is mainly used for traditional data warehousing, where files of data are collected from different locations or sources and stored in one or more directories. Once the so-called “batch window” opens, TPT jobs start processing these directories, and no new files are added to them while the jobs are running.

The active directory scan is similar to the batch directory scan except that it allows data files to be continuously added to a scanned directory while the job is running. This type of processing can provide 24 x 7 handling of data files, which usually represent daily transactions that occur in different timeframes from multiple sources or locations. There can be multiple scans of the directory based on the user-defined time interval specified via the "VigilWaitTime" attribute. For example, if the user specifies 5 seconds as the VigilWaitTime, then every 5 seconds the Data Connector operator wakes up and scans the directory for new files to process. If no file in the directory matches the FileName criteria during a scan, the Data Connector operator idles until the next interval (5 seconds) expires, then scans the directory again for new files, and so on until the job ends. Additionally, users can run the job for as long as they want using the VigilStartTime and VigilStopTime attributes: VigilStartTime defines the time to start processing and VigilStopTime defines the time to stop it.
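
For example, the scan interval and the overall duration can be set directly in the Data Connector operator definition. Below is a minimal sketch of the relevant attributes, with hypothetical values (the complete operator definition appears in the active directory scan sample script later in this article):

ATTRIBUTES
(
VARCHAR DirectoryPath = 'temp', /* directory to watch for incoming files */
INTEGER VigilWaitTime = 5, /* re-scan the directory every 5 seconds */
INTEGER VigilElapsedTime = 60 /* keep scanning for 60 minutes, then end the job */
);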

Figure 2 illustrates a typical active directory scan scenario where transactional data can be collected into a set of files before they are sent to one or more directories for data loading. During data loading, the Data Connector operator processes each of the records in the files that were collected within a time interval. Once the Data Connector operator finishes processing these files, it issues a checkpoint and then archives the files to a user-defined directory before the next scan of new files starts. As such, transactional data can be periodically committed to Teradata through checkpoints driven by the Data Connector operator.  



Figure 2: Active directory scan

Below is a summary of the differences between the active directory scan and batch directory scan:

Batch Directory Scan:

  • There is only one scan of the directory per job for picking up the files to be loaded based on the FileName criteria.
  • No new files should be added to the directory once the Data Connector operator starts scanning/processing the directory.
  • Interval checkpointing (i.e., the tbuild -z option) can be used to commit rows to Teradata periodically. That is, rows are committed based on a time interval, not at the row or file level (see the tbuild example after this list).
  • The "ArchiveDirectoryPath" attribute shouldn't be used, since there is only one scan of the directory. After that scan is done and the rows have been loaded into the Teradata tables, the job exits.
  • The batch directory scan is usually used for fast loading of files into staging tables using the Load operator (or sometimes the Update operator). Almost all TPT customers today use this feature for loading their files because of the throughput they gain through parallelism and scalability.
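
For instance, if the batch sample script shown later in this article were saved as batch_scan.tpt (a hypothetical file name), a 120-second checkpoint interval could be requested on the tbuild command line; -z takes the interval in seconds:

tbuild -f batch_scan.tpt -z 120 batch_scan_job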

Active Directory Scan:

  • There can be multiple scans of the directory based on the user-defined time interval specified via the VigilWaitTime attribute.
  • Users can request to run the job for as long as they want using the VigilStartTime and VigilStopTime attributes.
  • New files can be added to a directory continuously while the job is running.
  • Files are automatically archived once they have been processed and a checkpoint is taken. The checkpoint commits the rows from these files to Teradata before the scan for new files starts.
  • This type of directory scan is usually used for loading transactional files in near real time, typically with the Stream (TPump) operator performing INSERT/UPDATE/DELETE operations against one or more Teradata tables.

Batch Directory Scan Sample Script

DEFINE JOB BATCH_DIRECTORY_SCAN
DESCRIPTION 'BATCH DIRECTORY SCAN'
(

/*********************************** Schema Definition ****/

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
IN_RECV_DT VARCHAR(10),
IN_RTL_CO_NUM VARCHAR( 6),
IN_STR_NUM VARCHAR(11),
IN_RECV_TMSTMP VARCHAR(23),
IN_RECV_TYPE VARCHAR( 4),
IN_RECV_DOC VARCHAR(10),
IN_USER_ID VARCHAR(11),
IN_ITEM_UPC_CD VARCHAR(15),
IN_RECV_QTY VARCHAR(11)
);

/******************************* Update Operator Definition **/
DEFINE OPERATOR UPDATE_OPERATOR ()
DESCRIPTION 'TERADATA UPDATE (MULTILOAD) OPERATOR'
TYPE UPDATE
SCHEMA *
ATTRIBUTES
(
INTEGER ErrorLimit = 1,
INTEGER BufferSize = 64,
INTEGER TenacityHours = 2,
INTEGER TenacitySleep = 1,
INTEGER MaxSessions = 4,
INTEGER MinSessions = 1,
VARCHAR TargetTable = 'dtac_081',
VARCHAR PrivateLogName = 'dtac_081.updlog',
VARCHAR TdpId = @TdpId,
VARCHAR UserName = @UserName,
VARCHAR UserPassword = @UserPassword,
VARCHAR AccountID,
VARCHAR ErrorTable1 = 'dtac_081_e1',
VARCHAR ErrorTable2 = 'dtac_081_e2',
VARCHAR WorkTable = 'dtac_081_wt',
VARCHAR LogTable = 'dtac_081_log',
VARCHAR WorkingDatabase = @WorkingDatabase
);

/************************** Reader Operator Definition ****/

DEFINE OPERATOR FILE_READER ()
DESCRIPTION 'FILE READER'
TYPE DATACONNECTOR PRODUCER
SCHEMA PRODUCT_SOURCE_SCHEMA
ATTRIBUTES
(
VARCHAR PrivateLogName = 'dtac_081.dtaclog',
VARCHAR FileName = 'INVRCV.*',
VARCHAR OpenMode = 'Read',
VARCHAR DirectoryPath, /* See APPLY statement */
VARCHAR ArchiveDirectoryPath, /* See APPLY statement */
VARCHAR IndicatorMode,
VARCHAR Format = 'Delimited',
VARCHAR TraceLevel = 'NONE'
);

/********************************* THE APPLY STATEMENT ****/

APPLY
(

'INSERT INTO dtac_081 (

:IN_RECV_DT
,:IN_RTL_CO_NUM
,:IN_STR_NUM
,:IN_RECV_TMSTMP
,:IN_RECV_TYPE
,:IN_RECV_DOC
,:IN_USER_ID
,:IN_ITEM_UPC_CD
,:IN_RECV_QTY

);'

)

TO OPERATOR ( UPDATE_OPERATOR[1] )

SELECT * FROM OPERATOR ( FILE_READER[2] ATTR( DirectoryPath = 'dir1',
ArchiveDirectoryPath = 'archive1' ) )

UNION ALL

SELECT * FROM OPERATOR ( FILE_READER[2] ATTR( DirectoryPath = 'dir2',
ArchiveDirectoryPath = 'archive2' ) );
);
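
Note that the script reads its logon information from job variables (@TdpId, @UserName, @UserPassword, @WorkingDatabase) rather than hard-coding it. Below is a minimal sketch of a job variables file, with hypothetical values, that can be passed to tbuild through the -v option:

/* jobvars.txt */
TdpId = 'mysystem'
,UserName = 'myuser'
,UserPassword = 'mypassword'
,WorkingDatabase = 'mydatabase'

The job could then be launched with:

tbuild -f batch_scan.tpt -v jobvars.txt -z 120 batch_scan_job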

Active Directory Scan Sample Script

DEFINE JOB ACTIVE_DIRECTORY_SCAN
DESCRIPTION 'ACTIVE (CONTINUOUS) DIRECTORY SCAN'
(

/*********************************** Schema Definition ****/

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
IN_RECV_DT VARCHAR(10),
IN_RTL_CO_NUM VARCHAR( 6),
IN_STR_NUM VARCHAR(11),
IN_RECV_TMSTMP VARCHAR(23),
IN_RECV_TYPE VARCHAR( 4),
IN_RECV_DOC VARCHAR(10),
IN_USER_ID VARCHAR(11),
IN_ITEM_UPC_CD VARCHAR(15),
IN_RECV_QTY VARCHAR(11)
);

/******************************* Stream Operator Definition **/
DEFINE OPERATOR STREAM_OPERATOR ()
DESCRIPTION 'TERADATA STREAM (TPUMP) OPERATOR'
TYPE STREAM
SCHEMA *
ATTRIBUTES
(
INTEGER ErrorLimit = 1,
INTEGER BufferSize = 64,
INTEGER TenacityHours = 2,
INTEGER TenacitySleep = 1,
INTEGER MaxSessions = 4,
INTEGER MinSessions = 1,
INTEGER Pack,
INTEGER PackMaximum,
VARCHAR Robust,
VARCHAR PrivateLogName = 'dtac_081.strlog',
VARCHAR TdpId = @TdpId,
VARCHAR UserName = @UserName,
VARCHAR UserPassword = @UserPassword,
VARCHAR AccountID,
VARCHAR ErrorTable = 'dtac_081_et',
VARCHAR LogTable = 'dtac_081_log',
VARCHAR WorkingDatabase = @WorkingDatabase
);

/************************** Reader Operator Definition ****/

DEFINE OPERATOR FILE_READER ()
DESCRIPTION 'FILE READER'
TYPE DATACONNECTOR PRODUCER
SCHEMA PRODUCT_SOURCE_SCHEMA
ATTRIBUTES
(
VARCHAR PrivateLogName = 'dtac_081.dtaclog',
VARCHAR FileName = 'INVRCV.*',
VARCHAR OpenMode = 'Read',
VARCHAR DirectoryPath = 'temp',
VARCHAR ArchiveDirectoryPath = 'archive',
VARCHAR IndicatorMode,
VARCHAR Format = 'Delimited',
VARCHAR TraceLevel = 'NONE',
INTEGER VigilStartTime,
INTEGER VigilStopTime,
INTEGER VigilWaitTime = 5, /* scan interval in seconds */
INTEGER VigilElapsedTime = 1 /* Duration of the job in minutes (or use VigilStartTime and VigilStopTime) */

);

/********************************* THE APPLY STATEMENT ****/

APPLY
(

'INSERT INTO dtac_081 (

:IN_RECV_DT
,:IN_RTL_CO_NUM
,:IN_STR_NUM
,:IN_RECV_TMSTMP
,:IN_RECV_TYPE
,:IN_RECV_DOC
,:IN_USER_ID
,:IN_ITEM_UPC_CD
,:IN_RECV_QTY

);'
)

TO OPERATOR ( STREAM_OPERATOR[1] )
SELECT * FROM OPERATOR ( FILE_READER[2] );
);
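
Assuming this script were saved as active_scan.tpt (again a hypothetical name), it can be launched the same way as the batch job. With VigilWaitTime = 5 and VigilElapsedTime = 1, the job re-scans the 'temp' directory every 5 seconds for roughly one minute, checkpointing and archiving processed files along the way, and then terminates:

tbuild -f active_scan.tpt -v jobvars.txt active_scan_job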

1 REPLY

Re: Teradata Parallel Transporter Active and Batch Directory Scans

Hi,

I am using the TPT batch directory scan for loading. My question: is it possible to get a log for each individual file loaded via the bulk load? From what I can tell, the log file has a count of the total records sent to the RDBMS, not a per-file count.

Thanks.