Teradata Parallel Transporter Unicode Usage

Teradata Employee

This article provides tips on loading and unloading Unicode data with the UTF8 and UTF16 Teradata client session character sets using Teradata Parallel Transporter (TPT).

As of this writing, Teradata Parallel Transporter supports Unicode only on network-attached platforms. 

What is Unicode?

Unicode is an industry standard designed to allow text and symbols from all languages to be consistently represented. Unicode characters, each identified by an unambiguous name and an integer number called its code point, can be encoded using any of several schemes termed Unicode Transformation Formats (UTF).

Unicode encodings include:

• UTF-8 – an 8-bit, variable-width encoding, backward compatible with 7-bit ASCII

• UCS-2 – a 16-bit, fixed-width encoding limited to the Basic Multilingual Plane

• UTF-16 – a variable-width encoding using 16-bit units, with 32-bit surrogate pairs for supplementary characters

• UTF-32 – a 32-bit, fixed-width encoding

With the exception of UCS-2, which cannot represent supplementary characters, all Unicode encoding forms cover the same character repertoire; only the byte encodings differ between the Unicode Transformation Formats.
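The shared repertoire can be illustrated with a short Python sketch (illustrative only; TPT itself is configured through its job script, not Python). The same character round-trips through each encoding form; only the byte sequences differ:

```python
# One Unicode character (U+20AC, the euro sign) in three Unicode
# Transformation Formats; only the byte encodings differ.
ch = "\u20ac"  # EURO SIGN

utf8 = ch.encode("utf-8")       # 3 bytes: E2 82 AC
utf16 = ch.encode("utf-16-be")  # 2 bytes: 20 AC
utf32 = ch.encode("utf-32-be")  # 4 bytes: 00 00 20 AC

# Decoding each form yields the identical character.
assert utf8.decode("utf-8") == ch
assert utf16.decode("utf-16-be") == ch
assert utf32.decode("utf-32-be") == ch
```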

Character Set Encodings

ASCII

7-bit ASCII characters occupy one byte each, using only the low 7 bits; the high bit is always zero.

1-byte character: 0xxxxxxx

8-bit ASCII, also called “extended ASCII” or “high ASCII”, describes eight-bit character encodings that include standard 7-bit ASCII as a subset, plus 128 additional characters.

1-byte character: xxxxxxxx

 

ANSI

ANSI is a loose, general term for platform code pages. These can use one byte per character (for example, Windows-1252) or multiple bytes per character (for example, Shift JIS).

1-byte character: xxxxxxxx
2-byte character: xxxxxxxx xxxxxxxx

 

UTF-8

UTF-8 is a variable-length encoding for Unicode. It can represent any character in the Unicode standard, yet is backward compatible with 7-bit ASCII; in other words, UTF-8 is a superset of 7-bit ASCII, and a plain 7-bit ASCII string is also a valid UTF-8 string. Because of this, no conversion is needed for 7-bit ASCII text, and existing software written for 7-bit ASCII and its extensions can often handle UTF-8 unchanged.

The default Unicode character encoding form on UNIX platforms is UTF-8. UTF-8 works on systems designed around single-byte character sets without modification: 7-bit ASCII characters use one byte, and all other characters use two or more bytes.

This encoding is also widely used on the Internet for transmitting Unicode text.

UTF-8 uses one to four bytes per character, depending on the Unicode code point. The first byte of a multi-byte character begins with one 1-bit for each byte the character occupies, followed by a 0-bit; each continuation byte begins with the bits 10.

1-byte character: 0xxxxxxx
2-byte character: 110xxxxx 10xxxxxx
3-byte character: 1110xxxx 10xxxxxx 10xxxxxx
4-byte character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
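These byte-length rules can be checked against any Unicode-aware encoder. A small Python sketch (illustrative, not part of TPT; the sample characters are arbitrary picks from each length class):

```python
# UTF-8 byte-length rules: 7-bit ASCII -> 1 byte, Latin-1 supplement -> 2,
# the rest of the BMP (e.g. CJK ideographs) -> 3, supplementary planes -> 4.
samples = {
    "A":          1,  # U+0041, 7-bit ASCII
    "\u00e9":     2,  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    "\u4e2d":     3,  # U+4E2D, a CJK ideograph
    "\U0001f600": 4,  # U+1F600, outside the BMP
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    # Every continuation byte matches the 10xxxxxx pattern shown above.
    assert all(b & 0b11000000 == 0b10000000 for b in encoded[1:])
```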

 

UCS-2

UCS-2 is the standard Unicode encoding format used in Win32 environments up to Windows NT. Characters are stored as fixed-length 2-byte values; the high-order byte is zero for characters in the 7-bit ASCII and Latin-1 ranges.

2-byte character: 00000000 xxxxxxxx
2-byte character: xxxxxxxx xxxxxxxx

 

UTF-16

UTF-16 is the extension of the UCS-2 encoding format and is the default encoding format on Microsoft Windows 2000 and XP. UTF-16 includes the entire UCS-2 character repertoire, but has been extended so that two 16-bit values (called a surrogate pair) can together form one character.

Surrogate pairs are a mechanism for encoding more than the 2^16 characters available in UCS-2 (and in UTF-16 before Unicode 3.1). This extension mechanism allows for more than one million additional characters.

This is accomplished by using two 16-bit values (surrogates) to represent one character. Each of the two surrogates can take one of 1,024 different values, which gives 1024^2 = 1,048,576 new character values.

The first 16-bit value lies in the range D800-DBFF and is called the high surrogate; the second lies in the range DC00-DFFF and is called the low surrogate. These 2 * 1024 code positions are permanently reserved in Unicode and are never used as characters on their own.

2-byte character: 00000000 xxxxxxxx
2-byte character: xxxxxxxx xxxxxxxx
4-byte character: 110110xx xxxxxxxx  110111xx xxxxxxxx
                  (high surrogate)   (low surrogate)
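The surrogate-pair arithmetic can be sketched in Python; the constants below follow the bit layout shown above, and the result is compared against Python's own UTF-16 encoder:

```python
import struct

# Compute the UTF-16 surrogate pair for a supplementary character by hand:
# subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.
cp = 0x1F600  # an arbitrary code point above U+FFFF

offset = cp - 0x10000            # 20-bit value
high = 0xD800 + (offset >> 10)   # high surrogate: D800-DBFF
low = 0xDC00 + (offset & 0x3FF)  # low surrogate:  DC00-DFFF

assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF

# Python's encoder produces the same two 16-bit values (big-endian here).
assert chr(cp).encode("utf-16-be") == struct.pack(">HH", high, low)
print(hex(high), hex(low))  # 0xd83d 0xde00
```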

Specifying Character Sets

Prior to Unicode support in TPT, the architecture for specifying character sets required that all of the following be in the same character set:

• TPT job script
• Client session character set
• Data

For example, to load KANJISJIS_0S data:

• Job script must be encoded in KANJISJIS_0S
• Job script must specify the USING CHARACTER SET KANJISJIS_0S client session character set clause
• The data must be in KANJISJIS_0S

If the job script contains no extended characters, it can also be encoded in ASCII, which makes sense because the lower 7 bits are the same in both character sets.

With support for UTF-16, however, there may be situations where users want their job script encoded in UTF-8 and the data in UTF-16 (along with the client session character set UTF16); or vice-versa.

To accommodate this, TPT adopts the following architecture for specifying the job script encoding and the client session character set when using UTF-16.

 

Client Session Character Set, SQL Request Text, & Data

TPT will maintain the Teradata DBS requirement that the SQL request text and all character data must be in the same client session character set.

 

Job Script Encoding

A job script encoded in UTF-16 must be specified via a command line argument. This is necessary because TPT will (by default) expect job scripts that are encoded in a 7-bit ASCII-compatible character set.

 

Job Variables / INCLUDE Directive

TPT allows job variables and INCLUDE directives to be located in an external file. These job variables and directives get substituted into the TPT script at compile time by the TPT Preprocessor. TPT will maintain the requirement that these external files must be in a character set that is compatible with the character set in which the job script is encoded. 

Unicode Job Scenarios in TPT

There are four scenarios for UTF-8 & UTF-16 job script & data encoding. They are outlined below.

Scenario 1: UTF-8 Job Script w/ UTF-8 Data

The following must be specified:

• Job script must be encoded in UTF-8
• Job script must specify the USING CHARACTER SET UTF8 client session character set clause
• Data must be in UTF-8

 

Scenario 2: UTF-8 Job Script w/ UTF-16 Data

The following must be specified: 

• Job script must be encoded in UTF-8
• Job script must specify the USING CHARACTER SET UTF16 client session character set clause
• Data must be in UTF-16

The endianness of the UTF-16 data must be the native endianness for the hardware platform on which TPT is running.
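As an illustration of producing UTF-16 data in the platform's native byte order, here is a hedged Python sketch (the codec names are Python's; whether a given TPT version expects a byte order mark in the data file is not stated here, so none is written):

```python
import sys

# Encode sample row data in UTF-16 using the machine's native byte order,
# matching TPT's requirement that UTF-16 data be native-endian.
text = "some row data\n"
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

encoded = text.encode(native)  # explicit -le/-be codecs write no BOM

# Each BMP character occupies exactly two bytes, and decoding with the
# same native codec round-trips the text.
assert len(encoded) == 2 * len(text)
assert encoded.decode(native) == text
```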

 

Scenario 3: UTF-16 Job Script w/ UTF-8 Data

The following must be specified: 

• Job script must be encoded in UTF-16
• Command line argument must specify -e UTF16
• Job script must specify the USING CHARACTER SET UTF8 client session character set clause
• The data must be in UTF-8

The endianness of the UTF-16 job script must be the native endianness for the hardware platform on which TPT is running if -e UTF16 is specified.

 

Scenario 4: UTF-16 Job Script w/ UTF-16 Data

The following must be specified:

• Job script must be encoded in UTF-16
• Command line argument must specify -e UTF16
• Job script must specify the USING CHARACTER SET UTF16 client session character set clause
• The data must be in UTF-16

The endianness of the UTF-16 job script must be the native endianness for the hardware platform on which TPT is running if -e UTF16 is specified.

The endianness of the UTF-16 data must be the native endianness for the hardware platform on which TPT is running.

6 REPLIES

Re: Teradata Parallel Transporter Unicode Usage

Hey...

This was very useful. As a test, I ran TPT with a few Chinese characters in the data, using the two commands below:

1. tbuild -f TPT_UTF.txt test_load -e UTF8: the job loaded the data successfully to the target.

2. tbuild -f TPT_UTF.txt test_load -e UTF16: the job completed successfully, but all the Chinese characters were loaded into the error table.

Any reason?
Teradata Employee

Re: Teradata Parallel Transporter Unicode Usage

Hi, this would depend on what your Client Session Character Set is. As stated above, the Client Session Character Set is defined inside the TPT script via the USING CHARACTER SET clause. If no such clause is specified, the default Client Session Character Set is ASCII (network-attached) or EBCDIC (mainframe).

Remember, the -e command line option only specifies the encoding of the TPT job script -- not the Client Session Character Set.
Enthusiast

Re: Teradata Parallel Transporter Unicode Usage

Hi, I want to use TPT in one of my project and I am facing the following issue,

table DDL
--------
Id DECIMAL(18,0) TITLE 'Identifier' NOT NULL,
Vendor_Id VARCHAR(10) CHARACTER SET LATIN CASESPECIFIC TITLE 'Vendor Identifier' NOT NULL,
Name VARCHAR(4000) CHARACTER SET UNICODE NOT CASESPECIFIC TITLE 'Name',
Content_Name VARCHAR(600) CHARACTER SET UNICODE NOT CASESPECIFIC FORMAT 'X(300)' TITLE 'Content Name',

In TPT DEFINE SCHEMA all are casted to appropriate VARCHAR(n)
In EXPORT_OPERATOR SQL all four fields are casted to VARCHAR(n)

I am using FILE_WRITER operator to write data into delimited file.

Scenario 1: When I run the script with the following command, it runs fine, but the Chinese Unicode characters do not show up in the output file. (The script does not have USING CHAR SET UTF8 before DEFINE JOB MOVE_DATA_TO_FLAT_FILE.)
tbuild -f

Scenario 2: When I put USING CHAR SET UTF8 before DEFINE JOB MOVE_DATA_TO_FLAT_FILE and run through tbuild -f, it gives me the error EXPORT_OPERATOR: TPT12108: Output Schema does not match data from SELECT statement.

My script is written in the ASCII character set (all English) and I am using an AIX machine to write it. I do not have the Windows script generator tool.

How do I configure my script so that it gives me UTF8 characters in the file?

Teradata Employee

Re: Teradata Parallel Transporter Unicode Usage

Hi, when specifying the USING CHAR SET UTF8 session character set specification, you need to triple the lengths of your CHAR and VARCHAR columns as defined in the TPT script.
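A quick Python sketch of the arithmetic behind the tripling rule (the column width of 10 is just an example):

```python
# A VARCHAR(10) UNICODE column can hold 10 characters. In the UTF8
# session character set, each character may need up to 3 bytes on the
# client side, so the TPT schema must allow 10 * 3 = 30 bytes.
name = "\u4e2d" * 10  # 10 CJK characters, each 3 bytes in UTF-8

utf8_bytes = name.encode("utf-8")
assert len(name) == 10
assert len(utf8_bytes) == 30  # hence VARCHAR(30) in the TPT schema
```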

Try that?
Enthusiast

Re: Teradata Parallel Transporter Unicode Usage

Could you explain, please? Why do we need to triple UTF8 VARCHAR schemas? UTF-8 is a variable-length encoding of 1-4 bytes per character, so the maximum could be 4 bytes. Thanks.
Teradata Employee

Re: Teradata Parallel Transporter Unicode Usage

The reason for tripling UTF8 VARCHAR columns is that Teradata does not currently support 4-byte UTF8 characters, so every supported character fits in at most 3 bytes.
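That 3-byte ceiling can be verified exhaustively for the Basic Multilingual Plane with a short Python sketch:

```python
# Every code point in the BMP (U+0000-U+FFFF, excluding the surrogate
# range, which encodes no characters) takes at most 3 bytes in UTF-8.
max_len = max(
    len(chr(cp).encode("utf-8"))
    for cp in range(0x10000)
    if not 0xD800 <= cp <= 0xDFFF  # surrogates are not characters
)
assert max_len == 3

# Only supplementary characters (U+10000 and above) need 4 bytes.
assert len("\U00010000".encode("utf-8")) == 4
```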