Why binary information and text files don't mix

Tools
Tools covers the tools and utilities you use to work with Teradata and its supporting ecosystem. You'll find information on everything from the Teradata Eclipse plug-in to load/extract tools.
bwb
Teradata Employee

Why binary information and text files don't mix

Although Teradata Utilities allow binary information in text files, doing so can have unintended consequences.

Summary

On network platforms, use of the text file format (each line/record containing zero or more characters followed by a platform-appropriate end-of-line) is problematic when indicators and/or non-character data (collectively, binary information) is present. In particular, false early end-of-lines can occur.

While this causes no problems on output (for example, BTEQ, FastExport, and the TPT Data Connector Operator), it can cause problems for an application or Teradata utility reading the resulting file. In particular, depending on the particular file, it will cause problems for Teradata utilities on input (for example, BTEQ, FastLoad, MultiLoad, TPump, and the TPT Data Connector Operator).

The Teradata utilities do not enforce a no-binary-information restriction. For most Teradata utilities, imposing such a rule at this time could cause existing operational jobs to fail.

This article explains why binary information in text files potentially causes problems, and how to avoid those problems.

What's a text file?

A text file is a file that consists of zero or more lines. Each line consists of zero or more characters, followed by a platform-dependent end-of-line marker, which is:

  • On Unix platforms, the newline character (identified in Unicode as LINE FEED U+000A):
    • X'0A' for all client session character sets except UTF-16
    • U+000A for UTF-16 (X'000A' for big-endian; X'0A00' for little-endian)
  • On Windows platforms, the two-character sequence carriage return followed by line feed (identified in Unicode as CARRIAGE RETURN U+000D and LINE FEED U+000A, respectively)
    • X'0D0A' for all character sets except UTF-16
    • U+000D U+000A for UTF-16 (X'000D000A' for big-endian; X'0D000A00' for little endian)
  • On mainframe platforms, there is no special end-of-line character or sequence; the record boundary inherent in the underlying file system serves as the end-of-line indicator

A text file is specified as follows in the various Teradata utilities:

  • In BTEQ, by the IMPORT REPORT and EXPORT REPORT commands
  • In FastLoad, by the SET RECORD TEXT command
  • In MultiLoad and TPump, by the FORMAT TEXT option on the IMPORT command
  • In FastExport, by the FORMAT TEXT option on the IMPORT and EXPORT commands
  • In TPT, by the Format = 'Text' Data Connector Operator attribute

What's binary information?

For purposes of this article, binary information is of two kinds:

  • Binary (non-character) data
  • Indicator bytes

Character vs. non-character data

The following data types contain character data; all other data types contain (at least potentially) non-character data, based on the Client-side (external) data representation (refer to Teradata Database SQL Data Types and Literals for further information):

  • DATE (only when DATEFORM is ANSIDATE)
  • TIME
  • TIME WITH TIME ZONE
  • TIMESTAMP
  • TIMESTAMP WITH TIME ZONE
  • INTERVAL (all variants)
  • CHARACTER
  • Distinct UDTs, if the predefined SQL data type is one of the above types
  • Structured UDTs, if the predefined SQL data type used for import/export is one of the above types

The observant reader will note that VARCHAR and CLOB are not on the list. That's because VARCHAR data has a two-byte length prefix containing binary (non-character) information. Similarly, CLOB is not on the list due to its eight-byte length prefix.

UTF-16 character data with Latin characters will have bytes containing binary zero (for example, 'A' (U+0041) is encoded at X'0041' or X'4100', depending on the endianness). This does not cause any problems, since those bytes of zero are part of the character encoding, and are treated as such; similar "binary" bytes can occur in other character encodings.

Indicator bytes

Indicator bits are used to communicate whether or not a particular field is NULL. If indicator bits are used, the start of each record (or line) contains one or more indicator bytes, containing one bit per field assigned left-to-right (most significant to least significant). The number of indicator bytes is equal to the (number of fields + 7) / 8. An indicator bit is present for each field, whether or not a particular field is nullable.

The presence of indicators is specified as follows in the various Teradata utilities:

  • In FastLoad, by the INDICATORS option on the BEGIN LOADING command
  • In MultiLoad and TPump, by the INDICATORS option on the LAYOUT command
  • In FastExport, by the INDICATORS option on the IMPORT command and the MODE INDICATOR on the EXPORT command
  • In TPT, by the IndicatorMode = 'Yes' Data Connector Operator attribute

BTEQ does not allow indicators with REPORT format files.

Why binary information causes problems

Indicators

Let's consider a Unix text input file with seven fields and indicator bits (one indicator byte). If the 5th and 7th fields are NULL, the indicator byte for such a record/line will be binary 00001010 (5th and 7th bits set), or X'0A', which will appear as a newline (a very early newline) when read by a Teradata utility (or any other program reading in text mode). The result will be an erroneous short record.

Similarly, let's consider a Windows text input file with fifteen fields and indicator bits (two indicator bytes). If the 5th, 6th, 8th, 13th, and 15th fields are NULL, the indicator bytes for such a record will be binary 0000110100001010, or X'0D0A', which will appear as a carriage return/line feed sequence when read by a Teradata utility (or any other program). Again, the result will be an erroneous short record.

Extensions for more than seven or fifteen fields are left as an exercise for the reader; the basic issue is the presence in the indicator bytes of a X'0A' (Unix) or X'0D0A' (Windows) pattern somewhere in the indicator bytes.

The previous examples assume a non-UTF-16 client session character set; similar examples for UTF-16 should be obvious.

Non-Character Data

The likelihood of the bit patterns described above may seem unlikely. However, there is a larger problem.

If non-character data is present in the record/line, one or more bytes within those non-character values can cause the same problem as indicator byte(s). For example, if there is an INTEGER field with a value of decimal 10, the value will be X'0000000A' (big-endian) or X'0A000000' (little-endian), and in either case under Unix an end-of-line will be assumed when the X'0A' byte is seen. An equivalent (if less likely) problem arises with Windows (which is always little-endian): an INTEGER field with a value of 2573 (X'0D0A0000') will be seen as end-of-line.

Variable-length field length information

Similar problems can arise with variable-length fields.

VARCHAR fields have a two-byte length ("n") followed by "n" bytes of character data. So, if on Unix a VARCHAR field has a length of ten bytes, the two-byte length will be either X'000A' (big-endian) or X'0A00' (little-endian), and in either case an end-of-line will be assumed. Similarly (although less likely), with Windows, if a VARCHAR field with a length of 2573 bytes (x'0D0A') is present, the length field will cause an end-of-line to be assumed.

Similar problems arise with CLOB data, which has an eight-byte length prefixing the character data.

VARBYTE and BLOB data suffer from the same length-prefix problems, but since those data types contain binary data, they're problematic in any case.

Workarounds

While not always be possible, the most straightforward workaround is to switch to a file format other than text. In the case of most Teradata utilities, it is usually possible to switch to unformatted format, which uses a layout or schema to map the incoming data to fields. An additional BYTE(n) filler field must be added to absorb the one, two, or four bytes that comprise the end-of-line marker.

  • In FastLoad, use the SET RECORD UNFORMATTED command and add a BYTE(n) or CHAR(n) to the end of the DEFINE command field list
  • In MultiLoad and TPump, use the FORMAT UNFORMAT option on the IMPORT command and add a BYTE(n) or CHAR(n) to the end of the layout using the FILLER or FIELD command
  • In FastExport, use the FORMAT UNFORMAT option on the IMPORT command and add a BYTE(n) or CHAR(n) to the end of the layout using the FILLER or FIELD command and/or use the FORMAT UNFORMAT option on the EXPORT command
  • In TPT, use the Format = 'Unformatted' Data Connector Operator attribute and add a BYTE(n) or CHAR(n) to the end of the schema

Other alternatives include:

  • Changing applications that generate text files to be read by Teradata utilities to generate one of the other supported file formats
  • Changing applications that read text files generated by Teradata utilities to accept one of the other supported file formats

References

Additional information on the Teradata utilities mentioned above may be found in:

  • Basic Teradata Query Reference, B035-2414
  • Teradata FastLoad Reference, B035-2411
  • Teradata MultiLoad Reference, B035-2409
  • Teradata Parallel Data Pump Reference, B035-3021
  • Teradata FastExport Reference, B035-2410
  • Teradata Parallel Transporter Reference, B035-2436
  • Teradata Parallel Transporter User Guide, B035-2445

Additional information related to NULL may be found in the Manipulating Nulls topic in chapter 4 (SQL Data Handling) of:

  • Teradata Database SQL Fundamentals, B035-1141

Additional information on external data representation may be found in:

  • Teradata Database SQL Data Types and Literals, B035-1143
1 REPLY
Enthusiast

Re: Why binary information and text files don't mix

we generate some simple reports for mgmt with BTEQ.
In windows, the resulting file from BTEQ appears to have .
We moved the BTEQ script to Unix (AIX).
The report is emailed from AIX to a Windows client.
The Windows user tries to open the report with Notepad.
The user sees that BTEQ on AIX just used NewLines.
Is there a setting in BTEQ to supply or do
we need to add those characters to the end of each
record when creating the report ?

Thanks.
-- Mike G.