IBM InfoSphere Information Server is a unified and comprehensive information integration platform. It profiles, cleanses, and transforms data from heterogeneous data sources to deliver consistent and accurate business data. IBM InfoSphere Information Server is an ideal solution for integrating and synchronizing Teradata enterprise data with ERP systems and other enterprise applications.
This article illustrates and compares several integration approaches within IBM InfoSphere Information Server for working with Teradata data. It includes examples that demonstrate common business integration scenarios and serve as guides for solving typical data integration problems with Teradata databases.
As shown in Figure 1, the IBM InfoSphere Information Server can be viewed as having four logical tiers:
Figure 1. IBM InfoSphere Information Server logical tiers
Each tier defines a logical group of software modules that can be mapped to a physical piece of hardware. The tiers can be installed separately on different computers or together on the same computer. The Information Server supports deploying the engine tier on a symmetric multiprocessing (SMP) or massively parallel processing (MPP) computing platform to achieve high scalability and performance.
The InfoSphere Information Server requires the following components to support data integration with Teradata databases:
The Teradata Connector leverages newer Teradata features such as the Teradata Parallel Transporter (TPT) interface. It operates in either immediate access mode or bulk mode:
The Teradata Connector is available on IBM InfoSphere Information Server Version 8.0.1 and later. It is designed as a single solution to replace all the Teradata legacy stages.
The Teradata legacy stages include:
The Teradata legacy stages are available on Information Server Version 7.5 and later.
This article does not cover the stored procedure (STP) stage and the Open Database Connectivity (ODBC) stage in detail. The IBM InfoSphere Information Server uses these stages to provide support for many database types:
Table 1 shows the suggested options to select based on your use case. The concepts of normal and sparse lookups are fully explained in the Look up Teradata data section.
Table 1. Teradata integration options
Use case | Suggested solution | Legacy options | Limitations |
---|---|---|---|
Low-volume Teradata data read | Teradata Connector: immediate mode, SQL | Teradata API stage | |
Low-volume data insert/update/upsert/delete | Teradata Connector: immediate mode, SQL; bulk mode, stream driver | Teradata API stage; Teradata MultiLoad stage | |
Real-time data insert/update/upsert/delete | Teradata Connector: immediate mode, SQL; bulk mode, stream driver | Teradata API stage; Teradata MultiLoad stage | |
Bulk data load to empty table | Teradata Connector: bulk mode, load driver | Teradata Enterprise stage; Teradata Load stage | Target table: no secondary indexes, no referential integrity, no triggers, no multiset |
Bulk load/update to existing table; bulk data load to empty table with no unique secondary indexes | Teradata Connector: bulk mode, update driver | Teradata MultiLoad stage | Target table: no unique secondary indexes, no referential integrity, no triggers |
Bulk data export | Teradata Connector: bulk mode, export driver | Teradata Enterprise stage; Teradata MultiLoad stage | |
Normal lookup | Teradata Connector: bulk mode, export driver; immediate mode, SQL | Teradata Enterprise stage; Teradata MultiLoad stage; Teradata API stage | |
Sparse lookup | Teradata Connector: immediate mode, SQL | Teradata API stage | |
Multiple input links and transaction support | Teradata Connector: immediate mode | | |
Calling Teradata stored procedures, macros, scalar functions, and table functions | Stored Procedure (STP) stage | | |
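For the last use case in Table 1, the STP stage issues standard Teradata invocation statements. The following sketch shows the kinds of calls involved; all object names here are hypothetical:

-- Call a stored procedure (parameters supplied by the input link):
CALL SalesDb.UpdateOrderStatus (:OrderID, :Status);

-- Execute a macro:
EXEC SalesDb.MonthlyRollup;

-- Invoke a scalar function in a query:
SELECT SalesDb.OrderDiscount (OrderID) FROM Orders;

-- Invoke a table function:
SELECT * FROM TABLE (SalesDb.OrdersByRegion ('EMEA')) AS t;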
This section uses a sample ETL job to illustrate the steps to use the Teradata connector to load data into an empty Teradata table. Figure 2 shows the sample job. The job reads the orders from a flat file. It transforms and passes the source data to the Teradata connector named LoadDataUsingBulkLoad. The connector loads the data into an empty table named Orders using the Teradata parallel transporter load driver. Data records that violate the database constraints are rejected by the connector and forwarded to a flat file.
Figure 2. Load data into the Teradata Orders table
Figure 3 shows the sample source data.
Figure 3. Sample source data (5,000 rows)
Listing 1 shows the SQL for creating the Teradata database Orders table.
Listing 1. SQL to create the Orders table
CREATE SET TABLE Orders ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
OrderID INTEGER NOT NULL,
CustomerID VARCHAR(5) CHARACTER SET LATIN CASESPECIFIC,
EmployeeID INTEGER,
OrderDate TIMESTAMP(0),
RequiredDate TIMESTAMP(0),
ShippedDate TIMESTAMP(0),
ShipVia INTEGER,
ShipAddress VARCHAR(60) CHARACTER SET LATIN CASESPECIFIC,
ShipCity VARCHAR(15) CHARACTER SET LATIN CASESPECIFIC,
ShipRegion VARCHAR(15) CHARACTER SET LATIN CASESPECIFIC,
ShipPostalCode VARCHAR(10) CHARACTER SET LATIN CASESPECIFIC,
ShipCountry VARCHAR(15) CHARACTER SET LATIN CASESPECIFIC)
UNIQUE PRIMARY INDEX ( OrderID );
Two main steps are required to set up the LoadDataUsingBulkLoad Teradata connector for the bulk data load operation:
This section contains screenshots that illustrate the steps to import the definition of the Orders database table.
This section contains screenshots that illustrate the steps to define the bulk load operation using the Teradata Parallel Transporter load driver.
One special filter condition that is not used in this example is the Success condition. The Success filter is designed to forward successfully processed records to the next stage for further processing.
Figure 15. Set up the reject link
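After the job completes, a quick row-count query confirms the load; the count should match the 5,000 source rows shown in Figure 3, minus any rejected records:

SELECT COUNT(*) FROM Orders;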
This section uses a sample ETL job to illustrate the steps to extract data from the Teradata table named Orders. Figure 17 shows the sample job, which uses the immediate access mode. The job uses the Teradata connector named ExtractOrders to read the orders from the Orders database. The job transforms and passes the extracted data to the sequential file stage named SaveExtractedData.
The ExtractOrders connector uses the same table definition shown in Figure 12 and the same connection details shown in Figure 13.
Figure 17. Extract data from the Teradata Orders table
As shown in Figure 18, specify the following parameters for the data extraction operation:
Figure 18. Set up the data extraction operation
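For reference, the SELECT statement issued by the ExtractOrders connector would resemble the following sketch, with the column list taken from Listing 1 (the exact statement depends on whether the SQL is generated from the table definition or user-defined):

SELECT OrderID, CustomerID, EmployeeID, OrderDate, RequiredDate,
       ShippedDate, ShipVia, ShipAddress, ShipCity, ShipRegion,
       ShipPostalCode, ShipCountry
FROM Orders;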
This section uses two ETL jobs to illustrate the steps to look up Teradata data based on input records. The examples query order details based on the input order IDs. The following sections explain each of the two kinds of lookups that DataStage supports: normal and sparse.
For a normal lookup, all the reference data is retrieved once from the target database and cached in memory or on disk. For each input record, the cached reference data is searched to find the result.
Figure 19 shows the lookup stage and the Teradata connector of a sample job to perform a normal lookup. The Teradata connector performs a full table query on the Orders table and sends the query result to the lookup stage named NormalLookup. The lookup stage caches the query result and performs the lookup operations on the cached order details based on the order IDs from the OrderID input link. The results are sent to the output link named OrderDetails. This job requires one full table database query.
Figure 19. DataStage job to perform normal lookup
Two main steps are required to perform a normal lookup:
For a sparse lookup, a database query is generated for each input record and sent to the target database to retrieve the result.
Figure 22 shows the lookup stage and the Teradata connector of a sample job to perform a sparse lookup. For each order ID, the lookup stage named SparseLookup sends the order ID to the Teradata connector named GetOrderByID. The connector queries the order detail based on the order ID and returns the query result to the lookup stage. The lookup stage forwards the query result to the output link named OrderDetails. The job performs a database query for each order ID. Since there are four order IDs, the job performs four database queries.
Figure 22. DataStage job to perform sparse lookup
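To make the difference concrete, the following sketch contrasts the SQL behind the two lookup types. Column lists are abbreviated, and the ORCHESTRATE.OrderID placeholder for the input-link key column is an assumed connector convention that you should verify against your connector version:

-- Normal lookup: one full-table query; the result is cached by the Lookup stage:
SELECT OrderID, CustomerID, EmployeeID, OrderDate FROM Orders;

-- Sparse lookup: one query per input record, keyed on the incoming order ID:
SELECT OrderID, CustomerID, EmployeeID, OrderDate
FROM Orders
WHERE OrderID = ORCHESTRATE.OrderID;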
Two main steps are required to perform a sparse lookup:
The Teradata Enterprise (TDEE) stage is a legacy Teradata stage that has been available since DataStage Version 7.x. The TDEE stage is a native PX operator that provides the following features:
Figure 25 shows a sample ETL job that illustrates the TDEE data extract and load features. TDEE_Extract exports data from a Teradata database; TDEE supports exporting data from a table or via user-defined SQL. TDEE_Load loads data into a Teradata table; TDEE supports loading into a table only, and user-defined SQL is not supported for the data load operation.
Figure 25. Data extract and load using TDEE
Figure 26 illustrates how to set up the TDEE_LOAD stage. TDEE can perform the following pre-load operations based on the write mode selection:
Figure 26. Set up the TDEE data load operation
The Teradata Enterprise stage uses the FastLoad protocol to load data into a table. The FastLoad protocol supports loading only into empty tables. When the Append write mode is selected, the stage first inserts data into a temporary work table using the FastLoad protocol. After completing that load, the stage inserts the data into the target table using the following SQL:
insert into <target table> select * from <temporary work table>
The Teradata Enterprise stage operates in parallel mode. It supports the creation of multiple player processes running on multiple computing nodes for the data load or export operation. If the requestedsessions and sessionsperplayer properties are defined, they control the number of player processes spawned for the data operation. Otherwise, the number of player processes defaults to half the number of Teradata Access Module Processors (AMPs); for example, on a system with 8 AMPs, 4 player processes are spawned.
The multiple loading processes need to be synchronized at various points during the data operation. A terasync database table is created or reused for synchronization. A row is inserted into the table for each job run, and each player process updates the row to indicate its current status. The job aborts if the player processes cannot all be synchronized within the timeout period, which is 20 seconds by default. You can change the default by specifying synctimeout=<specified_value> (for example, synctimeout=120) in the Additional Connection Options property of the TDEE data load definition (see Figure 26).
The Teradata MultiLoad (TDMLoad) stage was originally designed for the DataStage server. The TDMLoad stage supports both data load and export. It internally uses the Teradata FastExport utility for export and the Teradata MultiLoad or TPump utility for load. The TDMLoad stage also works on DataStage PX; however, unlike the Teradata connector, it runs only in sequential mode. Running the TDMLoad stage in parallel mode is not supported.
Prior to the availability of the Teradata connector, the TDMLoad stage was mainly recommended for supporting the database update and upsert operations. The Teradata connector provides equivalent features via the update and stream drivers in bulk mode.
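As an illustration, the kind of atomic upsert DML executed against the Orders table might look like the following sketch (column list abbreviated; the parameter markers stand for values from the input link):

-- Update the row if the order exists; otherwise insert it:
UPDATE Orders
   SET ShippedDate = :ShippedDate
 WHERE OrderID = :OrderID
ELSE INSERT INTO Orders (OrderID, ShippedDate)
     VALUES (:OrderID, :ShippedDate);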
Figure 27 shows a sample ETL job that illustrates the TDMLOAD data export and load features. TDMLOAD_Export exports data from a Teradata database. TDMLOAD_Load_Update loads data into a Teradata table.
Figure 27. Data extract and load using TDMLOAD
Figure 28 shows how to set up the TDMLOAD data export operation. The following describes how the TDMLoad stage works for the data export operation:
Figure 28. Set up the TDMLOAD data export
Figure 29 shows how to set up the TDMLOAD data load operation. The following describes how the TDMLoad stage works for the data load operation:
Figure 29. Set up the TDMLOAD data load
The TDMLoad stage provides the option of writing the DataStage data to a data file in the Teradata FastLoad or VarText format. You can use the Teradata load utilities to load the data file outside of DataStage at a later time. The Teradata connector does not support this feature.
The Teradata API (TDAPI) stage was designed for the DataStage server. It provides functions for executing SQL select/insert/update/upsert/delete statements via the Teradata DBC/SQL partition. It works on DataStage PX in sequential mode; running the stage in parallel mode is not supported.
The Teradata API stage processes one data record at a time. It does not leverage the Teradata DML array operation feature, which sends many rows at once to the server. The stage was therefore recommended only for processing a small number of records. The immediate mode of the Teradata connector supports SQL execution via the DBC/SQL partition, and the connector also allows users to specify an array size to take advantage of the Teradata array operation feature.
Figure 30 shows the stage definition for the Teradata API stage being used to insert or update a database table.
Figure 30. Teradata API stage definition
The Teradata Load (terabulk) stage was also designed for the DataStage server. It uses the FastLoad utility to load bulk data into an empty database table. It works on DataStage PX in sequential mode; running the Teradata Load stage in parallel mode is not supported. The Teradata connector provides the equivalent load feature via the load driver in bulk mode.
The Teradata Load stage provides the option of writing the DataStage data to a data file in the Teradata FastLoad or VarText format. The Teradata connector does not support this feature.
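If you write the data to a VarText file, a standalone FastLoad script can load it later. The following is a minimal sketch, assuming a pipe-delimited file named orders.txt with only two columns for brevity; the TDPID, credentials, and table names are placeholders:

/* Load a pipe-delimited VarText file into a staging table */
LOGON mytdpid/myuser,mypassword;
BEGIN LOADING OrdersStage ERRORFILES OrdersStage_e1, OrdersStage_e2;
SET RECORD VARTEXT "|";
DEFINE OrderID (VARCHAR(10)),
       CustomerID (VARCHAR(5))
FILE = orders.txt;
INSERT INTO OrdersStage (OrderID, CustomerID)
VALUES (:OrderID, :CustomerID);
END LOADING;
LOGOFF;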
Figure 31 shows the stage definition for the Teradata Load stage when the stage is used to load data into a database table.
Figure 31. Teradata Load stage definition
As shown in Figure 32, you can invoke the legacy metadata import services via these menu items:
Figure 32. Legacy metadata import options
Selecting the Orchestrate Schema Definitions menu item starts the process of importing the file definition or database table schema using the PX operators. When the Teradata database type is selected, the process invokes the Teradata enterprise stage (TDEE) to import the selected Teradata table schema.
Selecting the Plug-in Meta Data Definitions menu item also starts the process of importing database table schema using the DataStage plug-in stages. When the Teradata database type is selected, the process invokes the Teradata API stage to import the selected Teradata table schema.
This article demonstrates how to integrate Teradata data with other data sources using the Teradata connectivity solutions within the IBM InfoSphere Information Server. It introduces the data loading, data extraction, and lookup features of the Teradata connector. It also explains the main features of the Teradata legacy stages. The Teradata connector provides a single solution to replace all the legacy stages. Many examples are given to illustrate the step-by-step design processes.
The IBM InfoSphere Information Server provides leading technology and integration solutions to address many critical data integration issues, including:
The authors would like to thank Tom O’Shea and Stewart Hanna for their reviews of this paper and Ravi Nandibhatla, Srinivas Mudigonda, and Vikram Vangari for their discussions on the STP and ODBC stages.