What's a BOM and Why do I Care?

bwb
Teradata Employee

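The Unicode™ standard defines three encoding forms (UTF-8, UTF-16, and UTF-32). Once byte order is taken into account, these yield five byte-level encodings (the first three are currently supported by Teradata):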

  • UTF-8
  • UTF-16 BE (big endian)
  • UTF-16 LE (little endian)
  • UTF-32 BE (big endian)
  • UTF-32 LE (little endian)
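
As a small illustration (a Python sketch, not part of any Teradata tool), the same character produces a different byte sequence under each of the five encodings. The codec names are Python's spellings, the helper name encode_all is hypothetical, and the sample character U+20AC EURO SIGN is an arbitrary choice.

```python
# Encode one sample character in each of the five encodings listed above.
# The codec names are Python's, not Teradata session character set names.
SAMPLE = "\u20ac"  # U+20AC EURO SIGN, chosen arbitrarily

CODECS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def encode_all(text):
    """Return a dict mapping each codec name to the hex bytes of `text`."""
    # Codecs with an explicit byte order never add a BOM of their own.
    return {codec: text.encode(codec).hex() for codec in CODECS}


if __name__ == "__main__":
    for codec, hex_bytes in encode_all(SAMPLE).items():
        print(f"{codec:10} {hex_bytes}")
```

The big endian and little endian variants contain the same bytes in opposite order, which is why a reader of the file needs to know the byte order.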

When examining a character data file, and absent any external information, it may not be obvious which, if any, Unicode encoding has been used for the file. It may be possible to programmatically determine the Unicode encoding (if any), but there are some pathological cases where such logic will fail. Enter the BOM.

Enter the Unicode Byte Order Mark

The Unicode byte order mark (BOM) acts as a file signature to determine:

  • Whether or not a file is Unicode encoded
  • Which Unicode encoding form is being used
  • The endianness of a UTF-16 or UTF-32 encoding

The BOM consists of the initial two (UTF-16), three (UTF-8), or four (UTF-32) bytes of the file. Note that the BOM is file-oriented: it should not be used to specify the encoding of a line of text, a record, a character string, or a database table column. A BOM should also be used only for pure character files, never for files that contain any binary information. For Teradata, this means that a BOM should be used only for TEXT or VARTEXT (delimited) data files, or for utility scripts.

The BOM is actually the encoding of the character U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). Using U+FEFF as a zero-width no-break space anywhere other than the start of a file has been deprecated; U+2060 WORD JOINER should be used for that purpose instead.
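
The relationship between U+FEFF and the BOM bytes can be checked directly. The following is a minimal Python sketch (the helper name bom_bytes is hypothetical) that encodes U+FEFF with each codec and compares the result with the BOM constants from Python's standard codecs module.

```python
import codecs

BOM_CHAR = "\ufeff"  # U+FEFF, the character whose encoding is the BOM


def bom_bytes(codec):
    """Return the bytes of U+FEFF in the given (explicit byte order) codec."""
    return BOM_CHAR.encode(codec)


# Each pair: Python codec name, and the matching constant from `codecs`.
EXPECTED = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
}

if __name__ == "__main__":
    for codec, constant in EXPECTED.items():
        print(codec, bom_bytes(codec).hex(), bom_bytes(codec) == constant)
```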



BOM values for each Unicode encoding

  Initial Bytes    Encoding Form
  00 00 FE FF      UTF-32BE
  FF FE 00 00      UTF-32LE
  FE FF            UTF-16BE
  FF FE            UTF-16LE
  EF BB BF         UTF-8

These initial byte sequences are very unlikely to appear at the start of a "vanilla" (e.g., ASCII) text file, so finding one of them there is a very good indication that the (character) file is Unicode-encoded.
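
A BOM check along these lines might be sketched in Python as follows. The function names detect_bom and detect_file_bom are hypothetical, and the order of the checks is an implementation detail of this sketch: the UTF-32LE signature (FF FE 00 00) begins with the UTF-16LE signature (FF FE), so the four-byte signatures are tested first.

```python
import codecs

# Longest signatures first, because FF FE 00 00 (UTF-32LE) starts with
# FF FE (UTF-16LE) and would otherwise be reported as UTF-16LE.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom(data):
    """Return (encoding_form, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def detect_file_bom(path):
    """Read the first four bytes of the file at `path` and look for a BOM."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4))
```

One limit of this approach: a UTF-16LE file whose first character after the BOM is U+0000 starts with FF FE 00 00 and is reported as UTF-32LE. This is one of the pathological cases in which signature-based detection fails.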

Adding a Byte Order Mark to a File

Some applications (including certain Teradata client products) will insert a BOM when generating character data files and/or scripts. If a BOM is desired (or required), but the generating application doesn't provide it, the simplest way to add one, at least on a Windows platform, is to open the script or character data file in Notepad and save it again.

First, open the file in Notepad, specifying the encoding of the file in the Open dialog box (in the Encoding drop-down list, ANSI means the system's default Windows code page, which is a superset of ASCII; Unicode means UTF-16LE; and Unicode big endian means UTF-16BE):

[Image: Notepad Open dialog box]

Then, save the file using Notepad Save As... (not Save):

[Image: Notepad Save As dialog box]

If ANSI is chosen as the encoding, the saved file will not have a BOM; if any encoding other than ANSI is chosen, the saved file will have the appropriate BOM. In either case, the file will be saved in the specified encoding.
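
Where Notepad is not available, a BOM can also be added with a few lines of script. The following Python sketch is an alternative to the Notepad technique, not something taken from a Teradata product. The function names add_bom and add_bom_to_file are hypothetical, the caller is assumed to already know the file's encoding, and only the three encodings that Teradata supports are covered.

```python
import codecs

# BOM bytes for the three encodings covered by this sketch.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
}


def add_bom(data, encoding):
    """Prepend the BOM for `encoding` to `data` unless it is already there.

    `data` must already be encoded in `encoding`; nothing is re-encoded.
    """
    bom = BOMS[encoding]
    return data if data.startswith(bom) else bom + data


def add_bom_to_file(src_path, dst_path, encoding):
    """Copy the file at src_path to dst_path, adding a BOM if it is missing."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(add_bom(data, encoding))
```

Writing to a separate destination path keeps the original file intact in case the assumed encoding turns out to be wrong.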

If the encoding of an existing file that has a BOM needs to be changed, simply open the file in Notepad without specifying the encoding, and then use Save As with the desired encoding specified.
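
The same kind of conversion can be sketched in Python for a file that already carries a UTF-8 or UTF-16 BOM. The function name transcode_with_bom is hypothetical. The sketch relies on two standard Python codecs: "utf-8-sig", which strips a leading UTF-8 BOM when decoding, and "utf-16", which uses the BOM to select the byte order and then discards it.

```python
import codecs


def transcode_with_bom(data, target):
    """Convert BOM-prefixed UTF-8 or UTF-16 bytes to `target`.

    `target` is a Python codec name with an explicit byte order where
    relevant: "utf-8", "utf-16-be", or "utf-16-le". The result starts
    with the BOM that matches `target`.
    """
    if data.startswith(codecs.BOM_UTF8):
        text = data.decode("utf-8-sig")  # strips the UTF-8 BOM
    elif data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        text = data.decode("utf-16")     # BOM selects byte order, then is dropped
    else:
        raise ValueError("no UTF-8 or UTF-16 BOM found")
    # Encoding U+FEFF in the target codec yields the target's BOM.
    return "\ufeff".encode(target) + text.encode(target)
```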

Teradata Client and UTF-16 Endianness

Teradata client products generally require that the endianness of a UTF-16 encoded file match the native endianness of the client platform (for example, big endian for mainframe and SPARC and little endian for Intel). In cases where the UTF-16 endianness of a character data file or a script doesn't match the platform endianness, the Notepad technique described above can be used to switch the endianness.
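
For scripted environments, the endianness switch might look like the following Python sketch. It is an alternative to the Notepad technique, not a Teradata utility, and the function name to_native_utf16 is hypothetical. It assumes the input really is UTF-16 with a BOM, and it uses sys.byteorder to find the byte order of the machine on which it runs.

```python
import codecs
import sys


def to_native_utf16(data):
    """Re-encode BOM-prefixed UTF-16 bytes in this machine's byte order.

    The result begins with the BOM for the native byte order. Input that is
    already in native order comes back unchanged.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    else:
        raise ValueError("no UTF-16 BOM found; source byte order is unknown")
    native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    # codecs.BOM_UTF16 is the BOM for the native byte order.
    return codecs.BOM_UTF16 + text.encode(native)
```

The conversion must run on (or target the byte order of) the client platform that will read the file, since the native byte order is taken from the machine running the script.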

More Unicode-related information can be found in the individual Teradata client product manuals. There are also some product-specific articles here on the Teradata Developer Exchange that discuss Unicode BOM issues (search for "BOM").

Unicode Resources

The Unicode Consortium's web site (http://www.unicode.org) is highly recommended. It contains a huge amount of authoritative information on Unicode, ranging from overviews to history to incredible amounts of detailed information. This is also the repository of the Unicode standard, for which the Consortium is responsible. There are discussions of the byte order mark, of course (see, for example, http://www.unicode.org/faq/utf_bom.html).

Additional Unicode overview information can be found on Wikipedia.

Hardcopy Unicode references include:

  • The Unicode Standard, Version 5.0, The Unicode Consortium (Boston, MA, Addison-Wesley, 2007, ISBN–10 0–321–48091–0, ISBN–13 978–0–321–48091–0) [the current Unicode version is 6.0; 5.0 is the last version of the standard for which a hardcopy edition was created]

  • Unicode Explained, Jukka K. Korpela (Sebastopol, CA, O’Reilly, 2006, ISBN–10 0–596–10121–X, ISBN–13 978–0–596–10121–3) [a combination tutorial and reference]
