External Data formats supported by the TPT DataConnector Operator

Tools
Tools covers the tools and utilities you use to work with Teradata and its supporting ecosystem. You'll find information on everything from the Teradata Eclipse plug-in to load/extract tools.
Teradata Employee

External Data formats supported by the TPT DataConnector Operator

The TPT DataConnector Operator is the mechanism by which TPT obtains data from or sends data to external native operating system flat (sequential) files or attached access modules.

This article will address the data formats usable with the TPT DataConnector Operator. For each format, we will first deal with the concept of a record, which roughly maps to a DBS table row. Secondly, we will address how these data are resolved to row columns using the TPT data schema. When these two requirements are validated, we will have converted the external data record to a Teradata DBS row. If not, we have a fatal data error*.

There are several external file data formats supported. The examples here include ASCII encoded characters and big endian integer values. The TPT DataConnector operator is instructed as to the desired format via the Format attribute as follows;

Format = 'Formatted'

Each record is in a format traditionally known as FastLoad or Teradata format. The data is prefixed with the data length and followed by and end-of-record (EOR) marker. This is the most stringent and robust data format and will allow any TPT supported data type to be included.

Let’s consider a record containing 6 bytes of inferred text data "abcdef";

x'00 06 61 62 63 64 65 66 0a' and in character: "..abcdef."

The DataConnector interprets the first 2 bytes as the record length (6 bytes) and expects the EOR marker following the last byte of data. If the EOR is not found, a fatal data error will result.

Since there are no internal column indications in this record (as there are with delimited data – see below), the TPT schema must map the record contents to the row columns. E.g., this data could indicate a single column row. If so, the TPT schema would be;

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 CHAR(6)
);

Another possibility is 3, 2-byte columns for which the schema would be;

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 CHAR(2),
COL2 CHAR(2),
COL3 CHAR(2)
);

Which would define 3 columns; “ab”, “cd”, and “ef”.

The possible permutations of 6 bytes defined as character columns can be inferred.

As noted above, any data type(s) can be included between the length indicator (the first 2 bytes) and the EOR byte (x’0a’). To example the point, consider this schema;

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
( COL1 CHAR(2),
COL2 INTEGER
);

This would define 2 columns; “ab” and an integer 1667523942.

Format = 'Text'

There is no length indicator as with formatted data (above). Records are separated only by an end-of-record (EOR) marker. To insure data integrity, each record must be entirely character data. I.e., The TPT schema must contain all character types (e.g., CHAR and ANSIDATE). The EOR marker can be either a single-byte linefeed (X'0A') or a double-byte carriage-return/line-feed pair (X'0D0A'). Whether a single or double byte EOR is used is determined by the EOR marker encountered for the first record. If the first record has a single byte EOR and contains the text "abcd", it would look like this:

x'61 63 64 63 0a' and in character: "abcd."

Every subsequent record is expected to use the same single byte EOR.

If the first record has a double byte EOR and contains the text "acdc", it would look like this:

x'61 63 64 63 0d 0a' and in character: "abcd.."

Again, every subsequent record is expected to use the same double-byte EOR.

Single and double byte EORs cannot be mixed in the same data.

Exemplar schemas matching this data might be;

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 CHAR(4)
);

Or

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 CHAR(2),
COL2 CHAR(2)
);

The possible permutations of 4 bytes defined as character columns can be inferred.

Format = 'Delimited'

This format is an extension of the text format. It is similar because it must contain all character data and be terminated by an EOR marker. But it has a different restriction: all columns in the TPT schema must be VARCHAR** (as opposed to CHAR/ANSIDATE for simple text). The advantage of this format is that it supports an additional feature that allows the definition of individual fields (which map to DBS table row columns) separated by the special delimiter character. To successfully parse delimited data, the delimiter character can not appear within the data. Delimiter data embedded within a column will result in a data column count error.

The delimiter character is provided to the TPT DataConnector operator via it's TextDelimiter attribute. If not provided, the TextDelimiter attribute defaults to the "pipe" character ('|'). E.g., Using the colon (via TextDelimiter = ':' ) as the delimiter, a 2 field (text "1234" and "abcd") record would be;

x'31 32 33 34 3a 61 62 63 64 0a' and in character: "1234:abcd."

Notice that the record is defined exactly like a text formatted record.

In the TPT schema, all columns must be VARCHAR**. The indicated length in the schema is the maximum length that the corresponding column can consume. An example of a schema that would the describe the above data is;

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 VARCHAR(6),
COL2 VARCHAR(7)
);

Format = 'Binary'

This format is the same as formatted (see above) except that there is no trailing EOR byte. Each record contains a 2-byte integer data length, n, followed by n bytes of data. As with formatted data, all TPT data types are supported.

E.g., a record containing 4 bytes of text "abcd" would be;

x'00 06 61 62 63 64' and in character: "..abcd"

In stream data as on Unix platforms, the records are contiguous (in memory). For example, 3 sequential 4-byte records might appear as follows;

x'00 04 61 62 63 64 00 04 66 67 68 69 00 04 61 62 63 64'

In ASCII, the 3, 4-byte records are “abcd”, “fghi” and “abcd”.

An example of a TPT SCHEMA statement that would describe this data is

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 CHAR(4)
);

For the sake of interest, let’s look at a schema showing several typical data types in use. Consider this TPT SCHEMA;

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 CHAR(8),
COL2 VARCHAR(6),
COL3 INTEGER,
COL4 VARCHAR(20),
COL5 VARCHAR(4)
);* Excessive columns are acceptable if the AcceptExcessColumns attribute is set.

Here’s an example of a buffer containing 2 records which contain data for 2 rows acceptable to this schema;

---------- hexadecimal representation --------- ----- ASCII -----
00 21 61 62 63 64 65 66 67 68 00 05 64 65 66 67 “..abcdefgh..defg”
68 00 00 00 10 00 07 61 62 63 64 65 66 67 00 03 “h......abcdefg..”
68 69 70 00 26 68 67 66 65 64 63 62 61 00 06 64 “hij..hgfedcba..d”
65 66 67 68 69 00 00 11 11 00 0a 61 62 63 64 65 “efghi......abcde”
66 67 68 69 70 00 04 61 62 63 64 “fghij..abcd”

The length for the first record is x’0021’ = 33 bytes. The length of the 2nd record is x’0026’ = 38 bytes.

This data would appear as;

COL1 COL2 COL3 COL4 COL5

Row 1: abcdefgh defgh 16 abcdefg hij

Row 2: hgfedcba defghi 457 abcdefghij abcd

Format = 'Unformatted'

The data does not conform to any external format. There is neither a prepended length as with binary and formatted nor is there an EOR marker as with text and delimited. The data is entirely described by the specified TPT script SCHEMA statement. E.g., given the TPT schema;

DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
COL1 INTEGER,
COL2 CHAR(4),
COL3 VARCHAR(8)
);

a record containing the data: INTEGER 20, CHAR "1234", VARCHAR "abcde" would be;

x'00 00 00 14 31 32 33 34 00 05 61 62 63 64 65'

where

x'00 00 00 14' is the integer 20,

x'31 32 33 34' is the fixed length character "1234"

x'00 05' is the length of the VARCHAR field and

x'61 62 63 64 65' is the VARCHAR character data "abcde"

**In this case, VAR LOB data types are also supported.

4 REPLIES
Teradata Employee

Re: External Data formats supported by the TPT DataConnector Operator

It would be more than appreciated to finaly have complete CSV and XML support built in these tools....

Enthusiast

Re: External Data formats supported by the TPT DataConnector Operator

Well, CSV file representation is as good as having the Delimiter set to ",", and using the Delimited format. XML option is surely missing.
Q_
Enthusiast

Re: External Data formats supported by the TPT DataConnector Operator

Anyone aware of access module that can use with dataconnector to read unix .gz files? I could create a set of FIFOS and background some gzcat's and read it that way - but wanted to see if someone had tackled this.
Enthusiast

Re: External Data formats supported by the TPT DataConnector Operator

well, can anyone give me some tips about handling double-byte in TPT script? In my project, i use the "|" as delimiter, but some double-byte characters in the source file have the '7C' value here. The load operator can not do the right parsing. Any tips? Thanks!