One of the biggest benefits of Parallel Transporter is the ability to scale the load process on a client load server to circumvent performance bottlenecks. This can sometimes offer huge performance gains as compared to the legacy Teradata Stand-alone load tools such as FastLoad and MultiLoad.
In figure 1, the Stand-alone load tool has a single process that is either reading data from only a single data source or it is feeding the data it read into buffers that are sent to the Teradata Database through multiple sessions.
In contrast, the Parallel Transporter job on the right is using multiple processes to read the data asynchronously and pass the data to a buffer in memory referred to as the data stream. Asynchronously, multiple processes (Load Operators) are reading the data from the data stream and filling up buffers that are sent to the Teradata Database over multiple sessions.
The architectural advantages of Parallel Transporter are multiple processes running asynchronously (functional parallelism) used to scale around CPU bottlenecks and parallel data flows (data parallelism) used to scale around I/O bottlenecks.
The best architecture for performance is one that is scalable. Just as the Teradata Database allows for linear performance gains when a node is added, Parallel Transporter has a scalable architecture to get around performance bottlenecks on the client load server. This means that whatever the bottleneck may be (and there is always one area that is the current bottleneck) one can scale to move the bottleneck to a different point until one achieves either the service level agreement load objectives or hits some limitation that can not be overcome due to budget or other restrictions.
For example, some bottlenecks in a load job can be reading the input data files (I/O), processing the input data (CPU), sending the data over the network, and the Teradata Database. If the bottleneck is the Teradata Database, the problem can be solved by some performance tuning, workload management, or adding nodes to scale around the bottleneck. If the bottleneck is the network, then one has to increase the bandwidth between the load server and the Teradata Database. The number of sessions used in a Teradata load tool job is used to help scale the load across the network. If the bottleneck is reading the data files or processing the data, then Parallel Transporter (unlike the old legacy stand-alone Teradata load tools) can be scaled for data and functional parallelism to overcome the bottleneck. Again, scaling to avoid any one of these bottlenecks will just move the bottleneck to another point, but you can do this until you hit the service level agreement for the latency required in the load application.
Scalability & Performance Features of Parallel Transporter
Below are some of the features of Parallel Transporter that can be used to increase load throughput:
Performance with API & Script Interface
The scalable performance benefit with Parallel Transporter can be leveraged using either the script interface as shown in figure 1 or with one of the major ETL tools on the market which use the TPT API interface shown in figure 3. The TPT API interface is a low level interface created especially for ETL tools to access the Teradata load protocols through a programmatic interface.
Loading Performance Considerations
Parallel Transporter offers many improvements over the Teradata Stand-alone load utilities and the potential load throughput gains are one of the major advantages. In combination, the architecture of Parallel Transporter and the Teradata Database allows one to scale past any data load bottleneck and help one meet the performance levels of their data loading requirements.
I just changed the select statement in the example file to my own table, and I get this error:
Teradata Parallel Transporter Version 15.10.00.04 64-Bit
Job log: /opt/teradata/client/15.10/tbuild/logs/qstart2-1.o
Job id is qstart2-1, running on spark-analytics
Teradata Parallel Transporter Export Operator Version 15.10.00.04
Teradata Parallel Transporter DataConnector Operator Version 15.10.00.04
$EXPORT: private log specified: exportprivate.log
$FILE_WRITER: Instance 1 directing private log report to 'dtacop-ubuntu-14018-1'.
$FILE_WRITER: DataConnector Consumer operator Instances: 1
$FILE_WRITER: ECI operator ID: '$FILE_WRITER-14018'
$FILE_WRITER: Operator instance 1 processing file 'flatfile2.dat'.
$EXPORT: connecting sessions
TPT_INFRA: TPT02639: Error: Conflicting data type for column(4) - "dt". Source column's data type (CHAR) Target column's data type (INTDATE).
$EXPORT: TPT12108: Output Schema does not match data from SELECT statement
$EXPORT: disconnecting sessions
$EXPORT: Total processor time used = '0.04642 Second(s)'
$EXPORT: Start : Sat Apr 9 00:57:08 2016
$EXPORT: End : Sat Apr 9 00:57:11 2016
$FILE_WRITER: Total files processed: 0.
Job step MAIN_STEP terminated (status 12)
Job qstart2 terminated (status 12)
Job start: Sat Apr 9 00:57:06 2016
Job end: Sat Apr 9 00:57:11 2016
THis is my qstart2.txt:
DEFINE JOB qstart2
APPLY TO OPERATOR ($FILE_WRITER)
SELECT * FROM OPERATOR($EXPORT);
This is my jobvars2.txt (with credentials replaced):
SourceTdpId = 'db.acmecorp.com'
,SourceUserName = 'myusername'
,SourceUserPassword = 'mypass'
,DDLPrivateLogName = 'ddlprivate.log'
,ExportPrivateLogName = 'exportprivate.log'
,TargetErrorList = ['3807']
,TargetFileName = 'flatfile2.dat'
,TargetFormat = 'delimited'
,TargetTextDelimiter = ','
,TargetOpenMode = 'write'
,SelectStmt = 'SELECT * FROM rtl.base;'