The Unicode™ standard defines five encoding schemes (the first three are currently supported by Teradata):

- UTF-8
- UTF-16BE
- UTF-16LE
- UTF-32BE
- UTF-32LE
When examining a character data file, and absent any external information, it may not be obvious which, if any, Unicode encoding has been used for the file. It may be possible to programmatically determine the Unicode encoding (if any), but there are some pathological cases where such logic will fail. Enter the BOM.
The Unicode byte order mark (BOM) acts as a file signature that indicates:

- that the file is Unicode-encoded,
- which Unicode encoding form is used, and
- for UTF-16 and UTF-32, the byte order (endianness) of the encoding.
The BOM consists of the initial two (UTF-16), three (UTF-8), or four (UTF-32) bytes of the file. It is important to note that the BOM is file-oriented; it should not be used to specify the encoding of a line of text, a record, a character string, or a database table column. Also, a BOM should only be used for pure character files; it should not be used for files that have any binary information (in the case of Teradata, this means that a BOM should only be used for TEXT or VARTEXT (delimited data) data files, or for utility scripts).
The BOM is actually the encoding of U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). Using ZWNBSP for its word-joining function (that is, anywhere other than at the start of a file) is deprecated; U+2060 WORD JOINER should be used for that purpose instead.
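The relationship between U+FEFF and the BOM byte sequences can be seen by serializing the character in each encoding form; a quick Python illustration:

```python
# The BOM for each encoding form is simply the character U+FEFF
# serialized in that form.
for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    bom = "\ufeff".encode(codec)
    print(f"{codec:10} {bom.hex(' ').upper()}")
# utf-8      EF BB BF
# utf-16-be  FE FF
# utf-16-le  FF FE
# utf-32-be  00 00 FE FF
# utf-32-le  FF FE 00 00
```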
| Initial Bytes | Encoding Form |
|---------------|---------------|
| 00 00 FE FF   | UTF-32BE      |
| FF FE 00 00   | UTF-32LE      |
| FE FF         | UTF-16BE      |
| FF FE         | UTF-16LE      |
| EF BB BF      | UTF-8         |
These initial byte sequences are very unlikely to appear in a plain (for example, ASCII) text file, so the presence of any of them is a very good indication that the (character) file is Unicode-encoded.
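The signature check described above can be sketched in a few lines of Python (the function name is illustrative). Note that the longer signatures must be tested first, because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE):

```python
# BOM signatures, longest first, so UTF-32LE is not misread as UTF-16LE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
]

def detect_bom(path):
    """Return the encoding named by the file's BOM, or None if there is none."""
    with open(path, "rb") as f:
        head = f.read(4)          # the longest BOM is four bytes
    for signature, name in BOMS:
        if head.startswith(signature):
            return name
    return None
```

A file with no recognized signature may still be Unicode (for example, BOM-less UTF-8), which is exactly the ambiguity the BOM exists to remove.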
Some applications (including certain Teradata client products) will insert a BOM when generating character data files and/or scripts. If a BOM is desired (or required), but the generating application doesn't provide it, the simplest way, at least on a Windows platform, is to edit the script or character data file with Notepad.
First, open the file in Notepad, specifying the encoding of the file in the Open dialog box (in the Encoding drop-down list, ANSI means ASCII, Unicode means UTF-16LE, and Unicode big endian means UTF-16BE):
Then, save the file using Notepad Save As... (not Save):
If ANSI is chosen as the encoding, the saved file will not have a BOM; if any encoding other than ANSI is chosen, the saved file will have the appropriate BOM. In either case, the file will be saved in the specified encoding.
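The same with-or-without-BOM choice can also be scripted. As a minimal sketch (the file names are illustrative, not from any Teradata utility), Python's `utf-8-sig` codec writes a UTF-8 BOM on output, while plain `utf-8` does not:

```python
import os
import tempfile

# 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF) when writing;
# plain 'utf-8' writes no BOM.
workdir = tempfile.mkdtemp()
for codec in ("utf-8-sig", "utf-8"):
    path = os.path.join(workdir, codec + ".txt")
    with open(path, "w", encoding=codec) as f:
        f.write("hello")
    with open(path, "rb") as f:
        print(codec, f.read())
# utf-8-sig b'\xef\xbb\xbfhello'
# utf-8 b'hello'
```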
If the encoding of an existing file that has a BOM needs to be changed, simply open the file in Notepad without specifying the encoding, and then use Save As with the desired encoding specified.
Teradata client products generally require that the endianness of a UTF-16 encoded file match the native endianness of the client platform (for example, big endian for mainframe and SPARC and little endian for Intel). In cases where the UTF-16 endianness of a character data file or a script doesn't match the platform endianness, the Notepad technique described above can be used to switch the endianness.
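Where Notepad is unavailable, or the file is too large to edit comfortably, the endianness swap can be scripted instead; a minimal Python sketch (the function name is my own) that re-encodes a UTF-16LE file as UTF-16BE and writes an explicit BOM:

```python
def switch_utf16_endianness(src, dst,
                            src_encoding="utf-16-le",
                            dst_encoding="utf-16-be"):
    """Re-encode src (UTF-16 of one endianness) as dst (the other),
    writing an explicit BOM on the output file."""
    with open(src, "r", encoding=src_encoding) as f:
        text = f.read()
    # If the source had a BOM, it decodes as a leading U+FEFF; drop it
    # so it is not duplicated in the output.
    text = text.lstrip("\ufeff")
    with open(dst, "w", encoding=dst_encoding) as f:
        f.write("\ufeff" + text)   # emit the BOM, then the content
```

Swapping the defaults for `src_encoding` and `dst_encoding` converts in the other direction.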
More Unicode-related information can be found in the individual Teradata client product manuals. There are also some product-specific articles here on the Teradata Developer Exchange that discuss Unicode BOM issues (search for "BOM").
The Unicode Consortium's web site (http://www.unicode.org) is highly recommended. It contains a huge amount of authoritative information on Unicode, ranging from overviews to history to incredible amounts of detailed information. This is also the repository of the Unicode standard, for which the Consortium is responsible. There are discussions of the byte order mark, of course (see, for example, http://www.unicode.org/faq/utf_bom.html).
Additional Unicode overview information can be found on Wikipedia:
Hardcopy Unicode references include: