A Byte Order Mark (BOM) is the Unicode character used to denote the endianness of a text file or stream. This article explores how the Teradata Standalone Utilities (FastLoad, MultiLoad, TPump, FastExport) handle it in both their job scripts and their data files.
The code point of the Byte Order Mark is U+FEFF, and each encoding form represents it as a different byte sequence:
| UTF-8 || EF BB BF |
| UTF-16 Big Endian || FE FF |
| UTF-16 Little Endian || FF FE |
| UTF-32 Big Endian || 00 00 FE FF |
| UTF-32 Little Endian || FF FE 00 00 |
(Note: The UTF-8 BOM does not determine endianness, because UTF-8 is a byte stream with a single fixed byte order rather than a sequence of multi-byte numeric code units.)
The Byte Order Mark provides a way for an application to determine in which specific encoding form a Unicode file is written. It also specifies the endianness of that file with respect to UTF-16 and UTF-32. A Byte Order Mark takes up two to four bytes at the beginning of the file, but it is typically not displayed. It is found only in pure character files, as it is invalid in mixed character/binary data files. A Byte Order Mark is optional, and the same code point is also defined as a “zero-width non-breaking space”. (Refer to 541-0006136 A03)
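The table above can be turned into a small detection routine. The following is a minimal Python sketch, not the utilities' own logic; the name `detect_bom` and the table layout are assumptions made for illustration. It checks the UTF-32 signatures first, because the UTF-32 Little Endian BOM (FF FE 00 00) begins with the UTF-16 Little Endian BOM (FF FE).

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE),
# so checking UTF-16 first would misreport a UTF-32 LE file.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) when no BOM is present.

    Hypothetical helper for illustration only.
    """
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One caveat of this approach: UTF-16 Little Endian text whose first character after the BOM is U+0000 is indistinguishable from a UTF-32 Little Endian BOM, so a caller that already knows the encoding family should restrict the check accordingly.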
The Teradata Standalone Utilities (FastLoad, MultiLoad, TPump and FastExport) currently support UTF-8 and UTF-16 Byte Order Marks in the job script file and/or in the data file. At the time of this writing, Teradata DBMS does not support the UTF-32 session character set.
The user can specify the encoding form of the job script using the “-i scriptencoding” runtime parameter. The available input encoding options are:
The UTF-16 or UTF-8 Byte Order Mark can be present or absent in the job script file.
When the user specifies the runtime parameter “-i UTF-16BE”, “-i UTF-16LE” or “-i UTF-16”, the utilities try to detect FE FF or FF FE at the beginning of the text stream.
A mismatch between the encoding the user expects and the actual encoding of the text stream occurs in the following two situations:
The UTF-16BE Byte Order Mark FE FF is detected, but “-i UTF-16LE” is specified.
The UTF-16LE Byte Order Mark FF FE is detected, but “-i UTF-16BE” is specified.
In either case, error message UTY2420 is issued to indicate that the Byte Order Mark in the input file conflicts with the endianness specified by the runtime parameter.
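The conflict rule for UTF-16 job scripts can be illustrated with a short Python sketch. It is not the utilities' implementation: the function name `check_script_bom` is hypothetical, a `ValueError` stands in for the UTY2420 message, and returning the declared encoding when no BOM is present is an assumption, since the behavior in that case is not described here.

```python
def check_script_bom(data: bytes, declared: str) -> str:
    """Return the effective encoding, or raise if the BOM contradicts `declared`.

    `declared` is one of the "-i" values named in the text:
    "UTF-16", "UTF-16BE" or "UTF-16LE". Hypothetical helper.
    """
    if data.startswith(b"\xfe\xff"):
        found = "UTF-16BE"
    elif data.startswith(b"\xff\xfe"):
        found = "UTF-16LE"
    else:
        # Assumption: without a BOM there is nothing to cross-check.
        return declared

    # Plain "UTF-16" names no byte order, so either BOM is acceptable.
    if declared in ("UTF-16BE", "UTF-16LE") and declared != found:
        # The utilities report UTY2420 here; this sketch raises instead.
        raise ValueError(
            f"BOM indicates {found}, but -i {declared} was specified"
        )
    return found
```

For example, `check_script_bom(b"\xff\xfea\x00", "UTF-16")` returns `"UTF-16LE"`, while the same bytes with `"UTF-16BE"` raise the error.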
When the user specifies the runtime parameter “-i UTF-8”, the utilities try to detect EF BB BF at the beginning of the text stream.
UTF-8 can contain a BOM, but it makes no difference to the endianness of the byte stream, because UTF-8 always has the same byte order. An initial BOM is used only as a signature: an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the "#!" at the beginning of UNIX shell scripts.
While the utilities support the UTF-8 session character set on the z/OS platform, a UTF-8 BOM in the script file is not supported on the z/OS platform.
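The signature role of the UTF-8 BOM, and the way it can interfere with a leading "#!", can be seen in a few lines of Python. This is an illustrative sketch only; the helper name `strip_utf8_signature` is hypothetical. Python's standard `utf-8-sig` codec removes a leading BOM when decoding, whereas the plain `utf-8` codec keeps it as the character U+FEFF.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Return `data` without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A shell script that was saved with a UTF-8 BOM in front of the "#!" line.
raw = codecs.BOM_UTF8 + "#!/bin/sh\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # BOM survives as a leading U+FEFF
as_sig = raw.decode("utf-8-sig")   # BOM is treated as a signature and dropped

# The first two bytes are no longer "#!", which is what breaks the script.
shebang_visible = raw.startswith(b"#!")
shebang_after_strip = strip_utf8_signature(raw).startswith(b"#!")
```

Here `shebang_visible` is false and `shebang_after_strip` is true, which is why tools that expect specific ASCII bytes at the start of a file need the signature removed first.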
The user can specify the encoding form of the job output using the “-u outputencoding” runtime parameter. The available output encoding options are:
At the time of this writing, the UTF-16 BOM is not printed as part of the job output. Likewise, if the ROUTE MESSAGES command is used to redirect the output messages to a file, the UTF-16 BOM is not printed as part of the output messages.
This behavior causes no problems when viewing the result on the console. The issue arises when redirecting the output to a result file. Because no UTF-16 BOM is written to the result file, most text editors cannot process the file properly if the encoding form of the job output does not match the platform default endianness (on Windows the contents of the result file can only be viewed using Notepad, while on Unix it is necessary to use the cat utility).
For example, if we run the following MultiLoad test on a Windows machine:
mload -u utf-16le -i utf16le -c utf-16 < test.ml > test.res 2>&1
We will get a "test.res" result file encoded in UTF-16 little-endian, and we can view its contents using Notepad.
However, when we open the file using WordPad or a third-party text editor such as EditPlus or Notepad++, we will not be able to see the result file contents in text format.
If we put the little-endian BOM (FF FE) at the very beginning of the file to indicate the endianness, the contents display properly when the file is opened in any text editor.
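As a workaround sketch, the BOM can be prepended to an existing result file with a few lines of Python. The function name `add_utf16le_bom` is hypothetical, and the sketch assumes the file is already UTF-16 little-endian text, as in the test above; it does not verify the encoding.

```python
import codecs
from pathlib import Path


def add_utf16le_bom(path: str) -> bool:
    """Prepend FF FE to the file unless it already starts with it.

    Returns True if the file was modified. Assumes the file content is
    UTF-16 little-endian; nothing here checks that assumption.
    """
    target = Path(path)
    data = target.read_bytes()
    if data.startswith(codecs.BOM_UTF16_LE):
        return False  # already marked, leave the file untouched
    target.write_bytes(codecs.BOM_UTF16_LE + data)
    return True
```

Calling `add_utf16le_bom("test.res")` once marks the file; a second call is a no-op, so the function is safe to run repeatedly.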
Based on the above research, MultiLoad will be enhanced to add the UTF-16 BOM to the output file when the user specifies UTF-16 as the output encoding, so that the result file can be viewed in any text editor and is more user friendly. The enhancement is planned for the TTU 14.0 release.
Unless the data format is indicated as text, no check is made for BOM characters, because the control bytes are meaningful only in text format. In any other format, the leading bytes are interpreted strictly as binary data and may match UTF control bytes only by coincidence. As with all required parameters, it is incumbent upon the user to define the data format properly so that the data can be interpreted correctly.
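The rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the utilities' code: the function name `data_bom_conflicts`, the format label `"text"` and the charset strings `"UTF-8"` and `"UTF-16"` are all hypothetical stand-ins. The sketch reports a conflict only when the format is text and the leading bytes are a BOM of a different encoding family than the session character set.

```python
import codecs

# Map each supported BOM to its encoding family (hypothetical labels).
_BOM_FAMILY = {
    codecs.BOM_UTF8: "UTF-8",
    codecs.BOM_UTF16_BE: "UTF-16",
    codecs.BOM_UTF16_LE: "UTF-16",
}


def data_bom_conflicts(head: bytes, session_charset: str, data_format: str) -> bool:
    """Return True if the data file's BOM contradicts the session character set.

    `head` is the first few bytes of the data file. For any format other
    than text the bytes are treated as binary and never checked.
    """
    if data_format.lower() != "text":
        return False
    for bom, family in _BOM_FAMILY.items():
        if head.startswith(bom):
            return family != session_charset.upper()
    return False  # no BOM, so nothing to compare
```

With these assumptions, a text-format file that starts with EF BB BF conflicts with a UTF-16 session, while the same bytes in a binary-format file are passed through unchecked.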
For example, consider a situation where we purposely submit a job where there is a mismatch between the session character set and the data file encoding.
Specifically, by specifying the runtime parameters as below:
-c UTF-16 -i UTF-8 -u UTF-8
The session character set is UTF-16, the job script encoding form is UTF-8, and the job output encoding form is UTF-8. If the data file is encoded in UTF-8, there is a mismatch between the session character set and the data file encoding. The following error message is expected:
"I/O Error on File Read: 46, Text: Input data UTF control bytes conflict with requested character set"