Teradata Parallel Transporter (TPT) is a flexible, high-performance Data Warehouse loading tool specifically optimized for Teradata Database, which enables data extraction, transformation and loading. It incorporates an infrastructure, which provides a parallel execution environment for product components called “operators”, which integrate with the infrastructure in a "plug-in" fashion and are thus interoperable.
Teradata PT also consists of a separate component, the Teradata PT Application Programmer Interface (abbreviated as TPTAPI), which allows users to invoke some of the Teradata PT operators directly. However, the features presented in this article can only be accomplished through the scripting tool provided by Teradata PT, not the TPTAPI unless users chose to code the features themselves.
Operators provide access to external resources, e.g., files, DBMS tables, Messaging Middleware products, etc., and perform various filtering and transformation functions. The Teradata PT infrastructure incorporates a high performance data transfer mechanism called the Data Stream, which is used for interchanging data between the operators. As shown in Figure 1, a typical Data Stream connects two operators: one of them, known as the “producer”, writes the data to the Data Stream, and another one, known as the “consumer”, reads the data from the Data Stream. Note that operators connected by the Data Stream must share the same schema (i.e., data layout and format) for the data to be interchanged.
Figure 1: Operators and Data Streams
The Teradata PT infrastructure also makes it possible to launch multiple instances of a given operator. Each instance is typically assigned to a system process or thread, and therefore, can execute independently of, and concurrently with, the other instances. To enable workload balancing amongst multiple instances of a consumer operator, the input Data Stream can be split so that each operator instance receives a portion of the overall data. Conversely, output Data Streams produced by multiple instances of a producer operator can be merged to form a single Data Stream, which may also be split at some later point.
Figure 2 shows a typical database loading scenario: multiple instances of the producer operator collect data from multiple sources and write it to their respective output Data Streams. The latter are merged and then split across multiple instances of the consumer operator. It is important to note that in this case all instances of the consumer operator are loading data into a single database target, for example, the same DBMS server.
Figure 2: Multiple operator instances and Data Streams
Figure 2 also shows a set of Teradata PT-provided operators for users to carry out a variety of ETL operations:
• The Load Operator loads data into new tables using a high-performance, multi-session FastLoad protocol.
• The Update Operator loads data and applies updates to both new and existing tables by using the high-performance, multi-session Multiload protocol. The updates can be applied either conditionally or unconditionally based on user-defined rules.
• The Export Operator extracts data from Teradata tables using the high-performance, multi-session FastExport protocol.
• The Stream Operator loads data and applies updates to new and existing tables by using multiple SQL protocol sessions in a high-performance, workload-balancing manner. The updates can be applied either conditionally or unconditionally based on user-defined rules. This operator can be used for loading data from continuous data sources such as queuing systems (e.g., MQ Series), EAI products, etc.
• The SQL Inserter Operator loads data into new and existing tables using a single SQL protocol session.
• The SQL Selector Operator extracts data from existing tables using a single SQL protocol session.
• The ODBC Operator extracts data extracts data from external non-Terdatada ODBC sources. To extract from Teradata Database, use the Export Operator instead.
• The Data Connector Operator allows simultaneous, parallel reading of multiple data sources such as various types of files, queuing systems, etc. It also allows writing to external data sources, such as files.
Teradata PT capabilities can be divided into the following broad categories:
• Scalability and Parallelism
Through these capabilities, Teradata PT addresses many problems frequently encountered when introducing data to the Data Warehouse.
The use of multiple parallel instances of operators, as well as operator multi-session capability, enable Teradata PT jobs to be scaled-up to maximize the use of available CPU resources on their ETL client hosts. This results in shorter job run times, if either I/O or CPU is the bottleneck from the load tool processing.
The use of Data Streams instead of intermediate files, and the application of a single Data Stream to multiple data targets, can greatly reduce the amount of file I/O that would otherwise be required to accomplish many ETL objectives. The use of these features also increases throughput as mentioned above, and reduce or eliminate disk space requirements for intermediate files.
In traditional Teradata utilities, which rely on a single system process to perform data extraction and loading, the single process could reach a threshold beyond which throughput cannot increase. One of the most critical limiting factors in this case is CPU speed and availability.
Teradata PT, due to its parallel and scalable architecture, makes it possible to utilize additional CPU processing power, shorten the load process, and reduce the overall ETL execution time. The number of operator instances is specifiable by the user, affording control over the scalability and performance of the Data Warehouse loading process. In addition, Teradata PT allows data extraction and loading processes to run completely asynchronously to each other. This allows for broader parallelism that improves performance even further.
Teradata PT employs a single script language for specifying ETL operations to be performed by a job. It is based on Teradata Utility Script Languages and ANSI SQL, which emphasizes the goal rather than the method (“what” vs. “how”).
This language allows specification and seamless integration of ETL processes. It also utilizes extensible, reusable metadata objects such as virtual tables and attributes for mapping data sources and targets. This can significantly simplify the creation and administration of ETL processes. Finally, it supports direct control of scalability, which allows users to monitor and adjust performance based on system resources.
Teradata PT Jobs can be directed to take periodic checkpoints, or if periodic checkpoints are not specified, the Teradata PT takes two default checkpoints at the start and the end of data acquisition. If a Teradata PT job is interrupted because of a Teradata Database restart or errors that can be retried (e.g. deadlocks), it automatically restarts from the last checkpoint taken. Jobs interrupted for any other reason can be manually restarted from the last checkpoint taken.
Although Teradata PT can help simplify the data extraction and loading process, in many cases the original data cannot be loaded into the Data Warehouse directly. Instead, it must be pre-processed. The reasons include, but are not limited to, the following:
• The source contains more data than needs to be loaded
• The data needs to be validated
• The data needs to be transformed
• The format of the data is otherwise unsuitable for loading
A special data transformation tool or a user-written program typically performs the required pre-processing. In many cases this may lead to the creation of a separate staging file, rendering the load process more complex and requiring additional I/O.
Teradata PT can perform data modification that can be specified directly in the job script. The specific data modification capabilities are as follows:
• The source data can be filtered via the SQL-compatible WHERE condition. With this facility it is possible to ensure that the extracted data meets certain validation criteria. It is also possible to only select the data that needs to be loaded.
• The source data can be re-arranged using the SQL-compatible SELECT statement. This statement makes it possible to select only the data fields that need to be loaded and ignore the ones that don’t. This can significantly reduce the overall volume of data that is being loaded. Also, the SELECT statement allows specification of “derived values” (i.e., values that are not present in the original data source but instead are determined dynamically) in the process of loading. These derived values can be calculated either conditionally or unconditionally. Please refer to the Teradata Parallel Transporter Reference for more information on performing the above operations.
These modifications are performed in-line (i.e., the data doesn’t need to be landed to an intermediate file), thereby eliminating redundant I/O. The data is sent directly from the data extraction operator to the modification component that forwards it to the loading operator. The entire operation is performed asynchronously.
Seamless Integration of Various Operators within the Same Script
As shown in Figure 2, many of the operators can be seamlessly combined in various ways within the same script to provide solutions to a wide variety of ETL problems. Because operators are modular and reusable, they can be combined and interchanged with only minor adaptations to the script.
The following ETL operations are common:
• Loading data from multiple sources (e.g., multiple files or tape volumes) and into multiple targets (e.g. multiple Teradata databases)
• Combining data from dissimilar sources (e.g., ODBC sources, MQ Series, files with different physical and/or logical formats)
• Loading and updating large amounts of data in a restartable manner
• Transforming the data before loading
• Batch and active data extraction and loading in a single job
A Teradata PT job can contain multiple processing steps. Initialization activities, such as creating target and work tables, are specified at the beginning of the job. Termination activities, such as dropping work tables or handling error tables, are specified at the end of the job. A job can contain as many ETL steps as desired, either to implement complex ETL scenarios or simply to group related ETL processes into a single job.
Data extracted or exported from data sources (by "producer" operators) are fed directly to load and update operators ("consumer" operators) without the use of any intermediate files. This can greatly reduce the amount of file I/O and significantly improve performance due to shorter data path lengths, when compared to the same ETL process accomplished by a combination of the traditional Teradata utilities. When multiple operator instances are specified, producer operators are connected to consumer operators via a multiplexed network of Data Streams that the Teradata PT manages in a load-balancing fashion for maximum efficiency.
While multiple instances of operators can facilitate scalable access to data (e.g. files with the same schema), the Teradata PT’s UNION ALL operation allows the same data source to be accessed in “partitions” using different criteria and specifications for each of the partitions. For example, a relational table can be accessed by the ODBC operator with different “key ranges” so that row sets that satisfy such key ranges can be merged, as long as they have the same schema (i.e. are UNION-compatible).
Teradata PT also allows dissimilar sources to be read in parallel. For example, a Teradata PT job can simultaneously read data from different relational sources or different files from different media (e.g. disks and tapes). This is in contrast with the single-threaded utility, which usually requires additional steps for combining multiple sources into some intermediate file (the so-called “staging file”) before data can be loaded into the target table. Teradata PT makes it possible to avoid creation of staging files and not only simplifies the load process but also makes it faster by eliminating redundant I/O.
For those cases where the data modification requirements exceed those that are directly provided by the Teradata PT infrastructure, custom-written modification components can be integrated and invoked from job scripts. These user-written components are called “filter operators” and can contain arbitrary business logic. Please refer to the Teradata PT Operator API for information on how to build filter operators.
Some Teradata users have made significant investments in creating customized INMODs, OUTMODs, and Access Modules for use with Teradata utilities. These modules allow users to perform processing that falls outside the capabilities provided by the utilities. Rather than requiring these users to re-implement the business logic contained in such INMODs, OUTMODs, and Access Modules, Teradata PT provides an adapter that makes these modules usable within the Teradata PT framework. This preserves the users’ investments and simplifies the transition to Teradata PT.
As indicated in previous sections, Teradata PT provides very flexible, high-performance Load and Unload facilities that enable population of the Data Warehouse. Teradata PT improves ETL performance through parallel execution while providing a uniform interface to all its functions. It supports many functions of the traditional Teradata utilities and many new, advanced functions for scalability and parallelization. It also facilitates Active Data Warehousing (ADW) by allowing continuous access to data sources and data targets through the Data Connector and Stream operators, a feature similar to the Teradata TPump utility.
In the upcoming Teradata PT articles, more features will be discussed in more detail and examples of how they are applied in real-world scenarios will be included. All these features not only exploit the power of scalable performance and parallelism for batch-oriented ETL processes, but also lay the foundation for the parallel and scalable real-time transactional updates commonly found in Active Data Warehousing.