Although Teradata Utilities allow binary information in text files, doing so can have unintended consequences.
On network platforms, use of the text file format (each line/record containing zero or more characters followed by a platform-appropriate end-of-line) is problematic when indicators and/or non-character data (collectively, binary information) is present. In particular, false early end-of-lines can occur.
While this causes no problems on output (for example, BTEQ, FastExport, and the TPT Data Connector Operator), it can cause problems for an application or Teradata utility reading the resulting file. In particular, depending on the particular file, it will cause problems for Teradata utilities on input (for example, BTEQ, FastLoad, MultiLoad, TPump, and the TPT Data Connector Operator).
The Teradata utilities do not enforce a no-binary-information restriction. For most Teradata utilities, imposing such a rule at this time could cause existing operational jobs to fail.
This article explains why binary information in text files potentially causes problems, and how to avoid those problems.
A text file is a file that consists of zero or more lines. Each line consists of zero or more characters, followed by a platform-dependent end-of-line marker, which is:
A text file is specified as follows in the various Teradata utilities:
For purposes of this article, binary information is of two kinds:
The following data types contain character data; all other data types contain (at least potentially) non-character data, based on the Client-side (external) data representation (refer to Teradata Database SQL Data Types and Literals for further information):
The observant reader will note that VARCHAR and CLOB are not on the list. That's because VARCHAR data has a two-byte length prefix containing binary (non-character) information. Similarly, CLOB is not on the list due to its eight-byte length prefix.
UTF-16 character data with Latin characters will have bytes containing binary zero (for example, 'A' (U+0041) is encoded at X'0041' or X'4100', depending on the endianness). This does not cause any problems, since those bytes of zero are part of the character encoding, and are treated as such; similar "binary" bytes can occur in other character encodings.
Indicator bits are used to communicate whether or not a particular field is NULL. If indicator bits are used, the start of each record (or line) contains one or more indicator bytes, containing one bit per field assigned left-to-right (most significant to least significant). The number of indicator bytes is equal to the (number of fields + 7) / 8. An indicator bit is present for each field, whether or not a particular field is nullable.
The presence of indicators is specified as follows in the various Teradata utilities:
BTEQ does not allow indicators with REPORT format files.
Let's consider a Unix text input file with seven fields and indicator bits (one indicator byte). If the 5th and 7th fields are NULL, the indicator byte for such a record/line will be binary 00001010 (5th and 7th bits set), or X'0A', which will appear as a newline (a very early newline) when read by a Teradata utility (or any other program reading in text mode). The result will be an erroneous short record.
Similarly, let's consider a Windows text input file with fifteen fields and indicator bits (two indicator bytes). If the 5th, 6th, 8th, 13th, and 15th fields are NULL, the indicator bytes for such a record will be binary 0000110100001010, or X'0D0A', which will appear as a carriage return/line feed sequence when read by a Teradata utility (or any other program). Again, the result will be an erroneous short record.
Extensions for more than seven or fifteen fields are left as an exercise for the reader; the basic issue is the presence in the indicator bytes of a X'0A' (Unix) or X'0D0A' (Windows) pattern somewhere in the indicator bytes.
The previous examples assume a non-UTF-16 client session character set; similar examples for UTF-16 should be obvious.
The likelihood of the bit patterns described above may seem unlikely. However, there is a larger problem.
If non-character data is present in the record/line, one or more bytes within those non-character values can cause the same problem as indicator byte(s). For example, if there is an INTEGER field with a value of decimal 10, the value will be X'0000000A' (big-endian) or X'0A000000' (little-endian), and in either case under Unix an end-of-line will be assumed when the X'0A' byte is seen. An equivalent (if less likely) problem arises with Windows (which is always little-endian): an INTEGER field with a value of 2573 (X'0D0A0000') will be seen as end-of-line.
Similar problems can arise with variable-length fields.
VARCHAR fields have a two-byte length ("n") followed by "n" bytes of character data. So, if on Unix a VARCHAR field has a length of ten bytes, the two-byte length will be either X'000A' (big-endian) or X'0A00' (little-endian), and in either case an end-of-line will be assumed. Similarly (although less likely), with Windows, if a VARCHAR field with a length of 2573 bytes (x'0D0A') is present, the length field will cause an end-of-line to be assumed.
Similar problems arise with CLOB data, which has an eight-byte length prefixing the character data.
VARBYTE and BLOB data suffer from the same length-prefix problems, but since those data types contain binary data, they're problematic in any case.
While not always be possible, the most straightforward workaround is to switch to a file format other than text. In the case of most Teradata utilities, it is usually possible to switch to unformatted format, which uses a layout or schema to map the incoming data to fields. An additional BYTE(n) filler field must be added to absorb the one, two, or four bytes that comprise the end-of-line marker.
Other alternatives include:
Additional information on the Teradata utilities mentioned above may be found in:
Additional information related to NULL may be found in the Manipulating Nulls topic in chapter 4 (SQL Data Handling) of:
Additional information on external data representation may be found in: