TEXT and INDICATOR's / non-character data could cause data corruption

Tools
Tools covers the tools and utilities you use to work with Teradata and its supporting ecosystem. You'll find information on everything from the Teradata Eclipse plug-in to load/extract tools.
Teradata Employee

TEXT and INDICATOR's / non-character data could cause data corruption

The use of TEXT format and INDICATORS mode when Teradata load utilities are used to load non-character data can lead to problems. This article will discuss this issue in more detail and describe what has been (will be) done to strongly discourage this usage.

Background and Problem Description

The TEXT file format, supported by each of the Teradata standalone utilities FastLoad, FastExport, MultiLoad and TPump (via PIOM) and by TPT (via the Data Connector Operator), consists of zero or more lines ("records"). Each line contains zero or more characters followed by the platform-appropriate end-of -line indication:

  • On Unix, newline (NL)
  • x'0A' for all character sets except UTF-16
  • U+000A for UTF-16 (x'000A' for big-endian; x'0A00' for little-endian)
  • On Windows, carriage return/line feed (CRLF)
  • x'0D0A' for all character sets except UTF-16
  • U+000D U+000A for UTF-16 (x'000D000A' for big-endian; x'0D000A00' for little endian)
  • On mainframe, there is no special end-of-record character or sequence; the record boundary inherent in the underlying file system serves as the end-of-record indicator

If a file has INDICATORS, then at the start of each record there are one or more indicator bytes, containing one bit per field, assigned left-to-right (most significant to least significant). The number of indicator bytes is equal to the (number of fields + 7)/8. An indicator bit is present for each field, whether or not a particular field is nullable.

If we posit a Unix TEXT input file with seven fields and INDICATORS, then a problem will occur if the 5th and 7th fields are NULL. This is because the indicator byte for such a record is binary 00001010 (5th and 7th bits set), or x'0A'. Similarly, if we posit a Windows TEXT input file with fifteen fields and INDICATORS, then a problem will occur if the 5th, 6th, 8th, 13th, and 15th fields are NULL (indicator bytes 0000110100001010, or x'0D0A'). The previous examples assume a non-UTF-16 network character set; similar examples for UTF-16 should be obvious. Extensions for more than seven or fifteen fields are left as an exercise for the reader; the basic issue is the presence in the indicator bytes of an x'0A' (UNIX) or x'0D0A' (Windows) pattern.

The likelihood of such a pattern may seem unlikely. However, there is a larger problem. If non-character data is present in the record, one or more bytes within those non-character values can cause the same problem as indicator byte(s). For example, if there is an INTEGER field with a value of decimal 10, the value will be x'0000000A' (big-endian) or x'0A000000' (little-endian), and in either case under UNIX an end-of-record will be assumed when the x'0A' byte is seen. An equivalent (if less likely) problem arises with Windows (always little-endian): an INTEGER field with a value of 2573 (x'0D0A0000') will be seen as end-of-record.

Similar problems can arise with VARCHAR data. VARCHAR fields have a two-byte length ("n") followed by "n" bytes of character data. So, if on Unix a VARCHAR field has a length of ten bytes, the two-byte length will be either x'000A' (big-endian) or x'0A00' (little-endian), and in either case an end-of-record will be assumed. Similarly (although less likely), with Windows, if a VARCHAR field with a length of 2573 bytes (x'0D0A') is present, the length field will cause an end-of-record to be assumed.

Definition of "non-character data"

For purposes of this note, the following data types contain character data; all other data types contain (at least potentially) non-character data, based on the Client-side (external) data representation (based on information in SQL Reference: Data Types and Literals):

  • DATE (only when DATEFORM is ANSIDATE)
  • TIME
  • TIME WITH TIME ZONE
  • TIMESTAMP
  • TIMESTAMP WITH TIME ZONE
  • INTERVAL (all variants)
  • PERIOD(DATE) (only when DATEFORM is ANSIDATE)
  • PERIOD(TIME)
  • CHARACTER
  • Distinct UDTs, if the predefined SQL data type is one of the above types
  • Structured UDTs, if the predefined SQL data type used for import/export is one of the above types

Notes:

  • VARCHAR is not on the list, since the two-byte length field is binary (non-character). Similarly, CLOB is not on the list due to its eight-byte binary length field.
  • UTF-16 character data with Latin characters will have bytes containing binary zero (for example, 'A' (U+0041) is encoded at x'0041' or x'4100', depending on the endianness); this does not cause any problems, since those bytes of zero are part of the character encoding; similar "binary" bytes can occur in other character encodings.

What had been done

1. Strengthen Documentation Warnings in Teradata Standalone Load/Unload Utilities manuals

The documentation warnings already proposed should be strengthened to strongly discourage the use of INDICATORS or non-character data with TEXT format files on network-attached Clients. In particular:

  • FastExport: warn that such files may not be readable by other programs (either Teradata utilities or vendors' products) as TEXT files. Recommend use of UNFORMATTED format instead (assuming the receiving program can handle it).
  • FastLoad, MultiLoad, TPump, FastExport: warn that such files may not be readable as TEXT files. Frequently, the file can be defined as UNFORMATTED and read successfully (see below for UNFORMATTED workaround). Note: FastExport is listed here for input, since FastExport supports an optional selection file as input.

Teradata utilities can generally read a TEXT file as UNFORMATTED by changing the description of the input to include the end-of-record indicator as a filler CHARACTER field of length appropriate for the platform and character set.

2. Add Warning to Teradata Standalone Load/Unload Utilities

Stand Alone load/unload utilities (network platforms only) are changed to issue messages when an attempt is made to use a TEXT file with INDICATORS and/or non-character data. This change does not impact mainframe because the TEXT record format is not supported on mainframe.

The rationale for generating such messages is two-fold:

  • Some users might not  read the document; this provides a fallback method of communication.
  • If the user receives a "short record" error while reading such a TEXT file, there's a smoking gun in the STDOUT to help them diagnose the problem and identify the cause.

The above SA changes have been shipped into TTU14.0. TPT will provide equivalent warnings in TPT documentations, and the Data Connector Operator will be modified to:

  • Error if the user attempts to output a TEXT file with INDICATORS and/or non-character data
  • Warn (or inform) if the user attempts to read a TEXT file with INDICATORS and/or non-character data)

Sample TPump scripts and corresponding outputs

This TPump script wrongly uses TEXT format and INDICATORS mode:

/*************************************************/
/* */
/* Author: Ivy Yuan */
/* */
/*************************************************/

.LOGTABLE TPump_text_log;
.LOGON TDPid/Username, Password;

drop table testtbl;
drop table myerrtbl;

/*************************************************/
/* Create table */
/*************************************************/

CREATE MULTISET TABLE testtbl, FALLBACK (
c1 char(5),
c2 char(5))
PRIMARY INDEX(C1);

.BEGIN LOAD
SESSIONS 4 1
ERRORTABLE myerrtbl
ERRLIMIT 20
ROBUST off
PACK 5
LATENCY 10
;

.LAYOUT LAY1A INDICATORS;
.FIELD c1 * char(5);
.FIELD c2 * char(5);
.DML LABEL LABELA;

INSERT INTO testtbl VALUES (:c1,:c2);

.IMPORT INFILE mydata.txt
FORMAT text
LAYOUT lay1a
APPLY labela;

.END LOAD;

.logoff ;

Here is the corresponding TPump output, a warning message UTY8758 is issued:

**** 14:09:29 UTY6633 WARNING: No configuration file, using build defaults

========================================================================
= =
= Teradata Parallel Data Pump Utility Release 14.00.00.004 =
= Platform WIN32 =
= =
========================================================================
= =
= Copyright 1997-2011 Teradata Corporation. ALL RIGHTS RESERVED. =
= =
========================================================================

**** 14:09:29 UTY2411 Processing start date: WED SEP 21, 2011

========================================================================
= =
= Logon/Connection =
= =
========================================================================

0001 /*************************************************/
/* */
/* Author: Ivy Yuan */
/* */
/*************************************************/

.LOGTABLE TPump_text_log;

0002 .LOGON TDPid/Username,;

**** 14:09:31 UTY8400 Teradata Database Release: 14f.00.00.351

**** 14:09:31 UTY8400 Teradata Database Version: 14f.00.00.351

**** 14:09:31 UTY8400 Default character set: ASCII

**** 14:09:31 UTY8400 Current RDBMS has UDT support

**** 14:09:31 UTY8400 Maximum supported buffer size: 1M

**** 14:09:31 UTY8400 Upsert supported by RDBMS server

**** 14:09:31 UTY8400 Data Encryption supported by RDBMS server

**** 14:09:31 UTY8400 Array Support supported by RDBMS server

**** 14:09:31 UTY8400 Statement Independence supported by RDBMS server

**** 14:09:31 UTY6211 A successful connect was made to the RDBMS.

**** 14:09:31 UTY6217 Logtable 'IVY.TPump_text_log' has been created.

========================================================================
= =
= Processing Control Statements =
= =
========================================================================

0003 drop table testtbl;

**** 14:09:32 UTY1016 'DROP' request successful.

0004 drop table myerrtbl;

**** 14:09:32 UTY1008 RDBMS failure: 3807, Object 'myerrtbl' does not exist.

0005 /*************************************************/
/* Create table */
/*************************************************/

CREATE MULTISET TABLE testtbl, FALLBACK (
c1 char(5),
c2 char(5))
PRIMARY INDEX(C1);

**** 14:09:32 UTY1016 'CREATE' request successful.

0006 .BEGIN LOAD
SESSIONS 4 1
ERRORTABLE myerrtbl
ERRLIMIT 20
ROBUST off
PACK 5
LATENCY 10
;

========================================================================
= =
= Processing TPump Statements =
= =
========================================================================

0007 .LAYOUT LAY1A INDICATORS;

0008 .FIELD c1 * char(5) ;

0009 .FIELD c2 * char(5) ;

0010 .DML LABEL LABELA;

0011 INSERT INTO testtbl VALUES (:c1,:c2);

0012 .IMPORT INFILE mydata.txt
FORMAT text
LAYOUT lay1a
APPLY labela;

**** 14:09:32 UTY8758 WARNING: lay1a has INDICATORS, which is not recommended
when using TEXT record format. Please use UNFORMATTED record format
instead.

0013 .END LOAD;

**** 14:09:32 UTY6609 Starting to log on sessions...

**** 14:09:32 UTY6610 Logged on 4 sessions.

========================================================================
= =
= TPump Import(s) Beginning =
= =
========================================================================

**** 14:09:32 UTY6630 Options in effect for following TPump Import(s):
. Tenacity: 4 hour limit to successfully connect load sessions.
. Max Sessions: 4 session(s).
. Min Sessions: 1 session(s).
. Checkpoint: 15 minute(s).
. Errlimit: 20 rejected record(s).
. Restart Mode: SIMPLE.
. Serialization: OFF.
. Packing: 5 Statements per Request.
. StartUp Rate: UNLIMITED Statements per Minute.
. Latency: 10 Seconds.

**** 14:09:33 UTY8802 WARNING: Rate Monitoring turned off - database TPumpMacro
does not exist.

**** 14:09:33 UTY6608 Import 1 begins.

**** 14:09:33 UTY6641 Since last chkpt., 2 recs. in, 2 stmts., 1 reqs

**** 14:09:33 UTY6647 Since last chkpt., avg. DBS wait time: 109.00

**** 14:09:33 UTY6612 Beginning final checkpoint...

**** 14:09:33 UTY6641 Since last chkpt., 2 recs. in, 2 stmts., 1 reqs

**** 14:09:33 UTY6647 Since last chkpt., avg. DBS wait time: 109.00

**** 14:09:33 UTY6607 Checkpoint Completes with 2 rows sent.

**** 14:09:33 UTY6642 Import 1 statements: 2, requests: 1

**** 14:09:33 UTY6643 Import 1 average statements per request: 2.00

**** 14:09:33 UTY6644 Import 1 average statements per record: 1.00

**** 14:09:33 UTY6645 Import 1 statements/session: avg. 0.50, min. 0.00, max.

2.00

**** 14:09:33 UTY6646 Import 1 requests/session: average 0.25, minimum 0.00,

maximum 1.00

**** 14:09:33 UTY6648 Import 1 DBS wait time/session: avg. 27.25, min. 0.00,

max. 109.00

**** 14:09:33 UTY6649 Import 1 DBS wait time/request: avg. 27.25, min. 0.00,

max. 109.00

**** 14:09:33 UTY1803 Import processing statistics
. IMPORT 1 Total thus far
. ========= ==============
Candidate records considered:........ 2....... 2
Apply conditions satisfied:.......... 2....... 2
Records logable to error table:...... 0....... 0
Candidate records rejected:.......... 0....... 0

**** Statistics for Apply Label : LABELA
Type: I
Database: IVY
Table or Macro Name: testtbl
Activity: 2

**** 14:09:33 UTY6677 Loading phase statistics
Elapsed time: 00:00:00:00(dd:hh:mm:ss)
CPU time: 0 Seconds
MB/sec: N/A
MB/cpusec: N/A

**** 14:09:33 UTY0821 Error table IVY.myerrtbl is EMPTY, dropping table.

0014 .logoff ;

========================================================================
= =
= Logoff/Disconnect =
= =
========================================================================

**** 14:09:34 UTY6216 The restart log table has been dropped.

**** 14:09:34 UTY6212 A successful disconnect was made from the RDBMS.

**** 14:09:34 UTY2410 Total processor time used = '0.234375 Seconds'
. Start : 14:09:28 - WED SEP 21, 2011
. End : 14:09:34 - WED SEP 21, 2011
. Highest return code encountered = '4'.

This TPump script correctly uses UNFORMAT format instead to load the same data file:

/*************************************************/
/* */
/* Author: Ivy Yuan */
/* */
/*************************************************/

.LOGTABLE TPump_unformat_log;
.LOGON TDPid/Username, Password;

drop table testtbl;
drop table myerrtbl;

/*************************************************/
/* Create table */
/*************************************************/

CREATE MULTISET TABLE testtbl, FALLBACK (
c1 char(5),
c2 char(5))
PRIMARY INDEX(C1);

.BEGIN LOAD
SESSIONS 4 1
ERRORTABLE myerrtbl
ERRLIMIT 20
ROBUST off
PACK 5
LATENCY 10
;

.LAYOUT LAY1A INDICATORS;

.FIELD c1 * char(5);
.FIELD c2 * char(5);
.FILLER c3 * char(2);

.DML LABEL LABELA;

INSERT INTO testtbl VALUES (:c1,:c2);

.IMPORT INFILE mydata.txt
FORMAT unformat
LAYOUT lay1a
APPLY labela;

.END LOAD;

.logoff ;

Here is the corresponding TPump output:

**** 14:09:36 UTY6633 WARNING: No configuration file, using build defaults

========================================================================
= =
= Teradata Parallel Data Pump Utility Release 14.00.00.004 =
= Platform WIN32 =
= =
========================================================================
= =
= Copyright 1997-2011 Teradata Corporation. ALL RIGHTS RESERVED. =
= =
========================================================================

**** 14:09:36 UTY2411 Processing start date: WED SEP 21, 2011

========================================================================
= =
= Logon/Connection =
= =
========================================================================

0001 /*************************************************/
/* */
/* Author: Ivy Yuan */
/* */
/*************************************************/

.LOGTABLE TPump_unformat_log;

0002 .LOGON TDPid/Username,;

**** 14:09:39 UTY8400 Teradata Database Release: 14f.00.00.351

**** 14:09:39 UTY8400 Teradata Database Version: 14f.00.00.351

**** 14:09:39 UTY8400 Default character set: ASCII

**** 14:09:39 UTY8400 Current RDBMS has UDT support

**** 14:09:39 UTY8400 Maximum supported buffer size: 1M

**** 14:09:39 UTY8400 Upsert supported by RDBMS server

**** 14:09:39 UTY8400 Data Encryption supported by RDBMS server

**** 14:09:39 UTY8400 Array Support supported by RDBMS server

**** 14:09:39 UTY8400 Statement Independence supported by RDBMS server

**** 14:09:39 UTY6211 A successful connect was made to the RDBMS.

**** 14:09:39 UTY6217 Logtable 'IVY.TPump_unformat_log' has been created.

========================================================================
= =
= Processing Control Statements =
= =
========================================================================

0003 drop table testtbl;

**** 14:09:40 UTY1016 'DROP' request successful.

0004 drop table myerrtbl;

**** 14:09:40 UTY1008 RDBMS failure: 3807, Object 'myerrtbl' does not exist.

0005 /*************************************************/
/* Create table */
/*************************************************/

CREATE MULTISET TABLE testtbl, FALLBACK (
c1 char(5),
c2 char(5))
PRIMARY INDEX(C1);

**** 14:09:40 UTY1016 'CREATE' request successful.

0006 .BEGIN LOAD
SESSIONS 4 1
ERRORTABLE myerrtbl
ERRLIMIT 20
ROBUST off
PACK 5
LATENCY 10
;

========================================================================
= =
= Processing TPump Statements =
= =
========================================================================

0007 .LAYOUT LAY1A INDICATORS;

0008 .FIELD c1 * char(5);

0009 .FIELD c2 * char(5);

0010 .FILLER c3 * char(2);

0011 .DML LABEL LABELA;

0012 INSERT INTO testtbl VALUES (:c1,:c2);

0013 .IMPORT INFILE mydata.txt
FORMAT unformat
LAYOUT lay1a
APPLY labela;

0014 .END LOAD;

**** 14:09:40 UTY6609 Starting to log on sessions...

**** 14:09:41 UTY6610 Logged on 4 sessions.

========================================================================
= =
= TPump Import(s) Beginning =
= =
========================================================================

**** 14:09:41 UTY6630 Options in effect for following TPump Import(s):

. Tenacity: 4 hour limit to successfully connect load sessions.
. Max Sessions: 4 session(s).
. Min Sessions: 1 session(s).
. Checkpoint: 15 minute(s).
. Errlimit: 20 rejected record(s).
. Restart Mode: SIMPLE.
. Serialization: OFF.
. Packing: 5 Statements per Request.
. StartUp Rate: UNLIMITED Statements per Minute.
. Latency: 10 Seconds.

**** 14:09:42 UTY8802 WARNING: Rate Monitoring turned off - database TPumpMacro
does not exist.

**** 14:09:42 UTY6608 Import 1 begins.

**** 14:09:42 UTY6641 Since last chkpt., 2 recs. in, 2 stmts., 1 reqs

**** 14:09:42 UTY6647 Since last chkpt., avg. DBS wait time: 94.00

**** 14:09:42 UTY6612 Beginning final checkpoint...

**** 14:09:42 UTY6641 Since last chkpt., 2 recs. in, 2 stmts., 1 reqs

**** 14:09:42 UTY6647 Since last chkpt., avg. DBS wait time: 94.00

**** 14:09:42 UTY6607 Checkpoint Completes with 2 rows sent.

**** 14:09:42 UTY6642 Import 1 statements: 2, requests: 1

**** 14:09:42 UTY6643 Import 1 average statements per request: 2.00

**** 14:09:42 UTY6644 Import 1 average statements per record: 1.00

**** 14:09:42 UTY6645 Import 1 statements/session: avg. 0.50, min. 0.00, max.

2.00

**** 14:09:42 UTY6646 Import 1 requests/session: average 0.25, minimum 0.00,

maximum 1.00

**** 14:09:42 UTY6648 Import 1 DBS wait time/session: avg. 23.50, min. 0.00,

max. 94.00

**** 14:09:42 UTY6649 Import 1 DBS wait time/request: avg. 23.50, min. 0.00,

max. 94.00

**** 14:09:42 UTY1803 Import processing statistics

. IMPORT 1 Total thus far
. ========= ==============
Candidate records considered:........ 2....... 2
Apply conditions satisfied:.......... 2....... 2
Records logable to error table:...... 0....... 0
Candidate records rejected:.......... 0....... 0

**** Statistics for Apply Label : LABELA

Type: I
Database: IVY
Table or Macro Name: testtbl
Activity: 2

**** 14:09:42 UTY6677 Loading phase statistics

Elapsed time: 00:00:00:00(dd:hh:mm:ss)
CPU time: 0 Seconds
MB/sec: N/A
MB/cpusec: N/A

**** 14:09:43 UTY0821 Error table IVY.myerrtbl is EMPTY, dropping table.

0015 .logoff ;

========================================================================
= =
= Logoff/Disconnect =
= =
========================================================================

**** 14:09:43 UTY6216 The restart log table has been dropped.

**** 14:09:43 UTY6212 A successful disconnect was made from the RDBMS.

**** 14:09:43 UTY2410 Total processor time used = '0.21875 Seconds'

. Start : 14:09:36 - WED SEP 21, 2011
. End : 14:09:43 - WED SEP 21, 2011
. Highest return code encountered = '0'.