Teradata Parallel Transporter #2 – Benefits of Scalable Performance

One of the biggest benefits of Parallel Transporter is its ability to scale the load process on the client load server to circumvent performance bottlenecks. This can offer substantial performance gains compared to the legacy Teradata stand-alone load tools such as FastLoad and MultiLoad.

Architecture

In figure 1, the stand-alone load tool has a single process that, at any given moment, is either reading data from a single data source or feeding the data it has read into buffers that are sent to the Teradata Database through multiple sessions.

In contrast, the Parallel Transporter job on the right uses multiple processes to read the data asynchronously and pass it to an in-memory buffer referred to as the data stream. At the same time, multiple processes (Load Operators) read the data from the data stream and fill buffers that are sent to the Teradata Database over multiple sessions.

The architectural advantages of Parallel Transporter are thus multiple processes running asynchronously (functional parallelism), used to scale around CPU bottlenecks, and parallel data flows (data parallelism), used to scale around I/O bottlenecks.
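For instance, with the predefined operator templates, the degree of parallelism on each side of the data stream is set by an instance count in brackets. Below is a minimal sketch, assuming the $FILE_READER and $LOAD templates and job variables supplying the connection and target table:

DEFINE JOB parallel_load
(
  /* Four producer instances read input files in parallel (data parallelism), */
  /* while two consumer instances fill buffers and send them to the database  */
  /* over their sessions (functional parallelism).                            */
  APPLY $INSERT TO OPERATOR ($LOAD[2])
  SELECT * FROM OPERATOR ($FILE_READER[4]);
);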

Scalability Benefits

The best architecture for performance is one that is scalable. Just as the Teradata Database allows for linear performance gains when a node is added, Parallel Transporter has a scalable architecture for getting around performance bottlenecks on the client load server. Whatever the bottleneck may be (and there is always one area that is the current bottleneck), one can scale to move it to a different point, until one either achieves the service level agreement load objectives or hits a limitation that cannot be overcome due to budget or other restrictions.

For example, typical bottlenecks in a load job include reading the input data files (I/O), processing the input data (CPU), sending the data over the network, and the Teradata Database itself. If the bottleneck is the Teradata Database, it can be addressed with performance tuning, workload management, or adding nodes to scale around it. If the bottleneck is the network, then the bandwidth between the load server and the Teradata Database must be increased; the number of sessions used in a Teradata load tool job helps scale the load across the network. If the bottleneck is reading the data files or processing the data, then Parallel Transporter (unlike the legacy stand-alone Teradata load tools) can be scaled for data and functional parallelism to overcome it. Again, scaling past any one of these bottlenecks just moves the bottleneck to another point, but this can be repeated until the load application meets its service level agreement for latency.
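For example, the session count is an attribute of the operator definition. Here is a minimal sketch of a Load operator, assuming a hypothetical staging table and job variables for the logon:

DEFINE OPERATOR LOAD_SALES
TYPE LOAD
SCHEMA *
ATTRIBUTES
(
  VARCHAR TdpId        = @TdpId,          /* Teradata system, from job variables */
  VARCHAR UserName     = @UserName,
  VARCHAR UserPassword = @UserPassword,
  VARCHAR TargetTable  = 'sales_stage',   /* hypothetical staging table */
  VARCHAR LogTable     = 'sales_stage_log',
  INTEGER MaxSessions  = 8,               /* raise or lower to scale network throughput */
  INTEGER MinSessions  = 1                /* run even when fewer sessions are available */
);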

Scalability & Performance Features of Parallel Transporter

Below are some of the features of Parallel Transporter that can be used to increase load throughput:

  • Parallel input file processing
      • Sometimes splitting the I/O processing across multiple input files helps overcome a performance bottleneck
      • Directory scan feature – a single Parallel Transporter script can have multiple processes read multiple files in multiple directories (see the sketch after this list)
      • Multiple processes reading the same input file – in some cases, having multiple read processes on the same file can increase data throughput
  • Multiple processes sending data to the Teradata Database – performance can be improved by splitting the CPU time across multiple processes that manipulate the data and fill buffers to be sent to the Teradata Database
  • Multiple processes exporting the same table
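
As an illustration of the directory scan feature mentioned above, here is a minimal sketch of a DataConnector producer; the schema name, directory, and file pattern are hypothetical, and the wildcard in FileName is what triggers the scan:

DEFINE OPERATOR READ_SALES_FILES
TYPE DATACONNECTOR PRODUCER
SCHEMA SALES_SCHEMA                           /* assumed defined elsewhere in the job */
ATTRIBUTES
(
  VARCHAR DirectoryPath = '/data/incoming/',  /* hypothetical directory to scan */
  VARCHAR FileName      = 'sales_*.csv',      /* wildcard makes the operator scan the directory */
  VARCHAR Format        = 'Delimited',
  VARCHAR TextDelimiter = ','
);

Referencing the operator as OPERATOR (READ_SALES_FILES[4]) in the APPLY statement then spreads the files across four reader processes; for the single-large-file case, the DataConnector also offers a MultipleReaders attribute.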

Performance with API & Script Interface

The scalable performance benefit of Parallel Transporter can be leveraged either through the script interface, as shown in figure 1, or through one of the major ETL tools on the market that use the TPT API shown in figure 3. The TPT API is a low-level interface created especially for ETL tools to access the Teradata load protocols programmatically.

Loading Performance Considerations

  • Considerations in meeting the load service level agreements:
      • Optimize to hit a peak load rate
      • Optimize load latency so a table is available to query by a certain time
      • Sometimes there are limits on the maximum capacity consumption allowed on the various load server systems
  • Load strategy considerations & tradeoffs:
      • Maximize throughput
      • Minimize resource consumption
      • Potential limitations in integrating with operational infrastructure
  • Bottleneck analysis:
      • Throughput is limited by the current effective bottleneck
      • Analyze the client load tool I/O, CPU consumption, network capacity, and the Teradata Database capacity and contention

Conclusion

Parallel Transporter offers many improvements over the Teradata stand-alone load utilities, and the potential load throughput gains are one of its major advantages. In combination, the architectures of Parallel Transporter and the Teradata Database allow one to scale past any data load bottleneck and meet the performance levels required of the data loading application.

3 REPLIES
Teradata Employee

Re: Teradata Parallel Transporter #2 – Benefits of Scalable Performance

Access to old MLOAD Variables from TPT UPDATE??

In our current MLOAD processes, we capture row count statistics using the available system variables and insert these stats into a table within the MLOAD job definition. We would like to know whether these variables can be captured after a TPT UPDATE process completes, so that we can populate a similar table with these stats. Some of the variables we use are below:

&SYSRCDCNT,
&SYSRJCTCNT,
&SYSAPLYCNT,
&SYSNOAPLYCNT,
&SYSINSCNT,
&SYSUPDCNT,
&SYSDELCNT,
&SYSETCNT,
&SYSUVCNT,
&SYSRC

The statement we use is a simple INSERT/SELECT that references the variables.

JK - Swift Transportation
Teradata Employee

Re: Teradata Parallel Transporter #2 – Benefits of Scalable Performance

To whom it may concern (Mark?) - We've reviewed the Reference guide for TPT and found that these variables are not supported for the TPT UPDATE operator - shame!!
Enthusiast

Re: Teradata Parallel Transporter #2 – Benefits of Scalable Performance

I just changed the select statement in the example file to my own table, and I get this error:

ubuntu@home:/opt/teradata/client/15.10/tbuild/sample/quickstart$ tbuild -f qstart2.txt -v jobvars2.txt -j qstart2
Teradata Parallel Transporter Version 15.10.00.04 64-Bit
Job log: /opt/teradata/client/15.10/tbuild/logs/qstart2-1.out
Job id is qstart2-1, running on spark-analytics
Teradata Parallel Transporter Export Operator Version 15.10.00.04
Teradata Parallel Transporter DataConnector Operator Version 15.10.00.04
$EXPORT: private log specified: exportprivate.log
$FILE_WRITER[1]: Instance 1 directing private log report to 'dtacop-ubuntu-14018-1'.
$FILE_WRITER[1]: DataConnector Consumer operator Instances: 1
$FILE_WRITER[1]: ECI operator ID: '$FILE_WRITER-14018'
$FILE_WRITER[1]: Operator instance 1 processing file 'flatfile2.dat'.
$EXPORT: connecting sessions
TPT_INFRA: TPT02639: Error: Conflicting data type for column(4) - "dt". Source column's data type (CHAR) Target column's data type (INTDATE).
$EXPORT: TPT12108: Output Schema does not match data from SELECT statement
$EXPORT: disconnecting sessions
$EXPORT: Total processor time used = '0.04642 Second(s)'
$EXPORT: Start : Sat Apr  9 00:57:08 2016
$EXPORT: End   : Sat Apr  9 00:57:11 2016
$FILE_WRITER[1]: Total files processed: 0.
Job step MAIN_STEP terminated (status 12)
Job qstart2 terminated (status 12)
Job start: Sat Apr  9 00:57:06 2016
Job end:   Sat Apr  9 00:57:11 2016

This is my qstart2.txt:

DEFINE JOB qstart2
(
  APPLY TO OPERATOR ($FILE_WRITER)
  SELECT * FROM OPERATOR($EXPORT);
);

This is my jobvars2.txt (with credentials replaced):

 SourceTdpId          = 'db.acmecorp.com'
,SourceUserName       = 'myusername'
,SourceUserPassword   = 'mypass'
,DDLPrivateLogName    = 'ddlprivate.log'
,ExportPrivateLogName = 'exportprivate.log'
,TargetErrorList      = ['3807']
,TargetFileName       = 'flatfile2.dat'
,TargetFormat         = 'delimited'
,TargetTextDelimiter  = ','
,TargetOpenMode       = 'write'
,SelectStmt           = 'SELECT * FROM rtl.base;'
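
A note on the error above: with TargetFormat = 'delimited', the DataConnector expects an all-VARCHAR output schema, so a DATE column such as "dt" (reported internally as INTDATE) conflicts with it. A minimal sketch of one possible fix, casting the column in SelectStmt (the other columns of rtl.base are unknown, so only dt is shown):

,SelectStmt           = 'SELECT CAST(dt AS VARCHAR(10)) AS dt FROM rtl.base;'

Alternatively, a non-delimited TargetFormat such as 'formatted' accepts non-character columns.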