TPump macrocharset support

Tools
Tools covers the tools and utilities you use to work with Teradata and its supporting ecosystem. You'll find information on everything from the Teradata Eclipse plug-in to load/extract tools.
Teradata Employee

TPump macrocharset support

TPump macrocharset support

TPump now forces CHARSET internally when building its macros! This feature is new starting in TPump 13.10.00.03 release.

Symptom description

TPump jobs can, in some cases, receive character set translation errors 6706 unnecessarily. The most common case would be a non-Latin client session character set, a target column with CHARACTER SET UNICODE, and a CREATE USER setting of CHARACTER SET LATIN (there are other combinations).

Here is a TPump example; LATIN1250_1A0 is used as the run-time client session charset name.

Example 1:

.LOGTABLE <LOGTABLENAME>;

.LOGON <TDPID>/<USERID>, <PASSWORD>;

DROP TABLE TPTBL;

DROP TABLE TPERR;

CREATE TABLE TPTBL, FALLBACK (

F1 INTEGER,

F2 CHAR(10) CHARACTER SET UNICODE NOT CASESPECIFIC,

F3 VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,

F4 VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC )

UNIQUE PRIMARY INDEX (F1);

.BEGIN LOAD

SESSIONS 4

NOMONITOR

ERRORTABLE TPERR;

.LAYOUT LAY0;

.FIELD FF1 * VARCHAR(5) KEY;

.FIELD FF2 * VARCHAR(10);

.FIELD FF3 * VARCHAR(30);

.FIELD FF4 * VARCHAR(50);

.DML LABEL LABEL0;

INSERT TPTBL0846(:FF1,:FF2,:FF3,:FF4);

.IMPORT INFILE <DATAFILENAME>

FORMAT VARTEXT '|' NOSTOP DISPLAY ERRORS

LAYOUT LAY0

APPLY LABEL0;

.END LOAD;

.LOGOFF;

Here is part of the data file:

245|õ|_Q|LATIN SMALL LETTER O WITH DOUBLE ACUTE|

246|ö||LATIN SMALL LETTER O WITH DIAERESIS|

247|÷||DIVISION SIGN|

248|ø|_Y|LATIN SMALL LETTER R WITH CARON|

249|ù|_o|LATIN SMALL LETTER U WITH RING ABOVE|

250|ú||LATIN SMALL LETTER U WITH ACUTE|

251|û|_q|LATIN SMALL LETTER U WITH DOUBLE ACUTE|

252|ü||LATIN SMALL LETTER U WITH DIAERESIS|

253|ý||LATIN SMALL LETTER Y WITH ACUTE|

254|þ|_c|LATIN SMALL LETTER T WITH CEDILLA|

255|ÿ|_Ù|DOT ABOVE|

The above example would result in one or more “6706 The string contains an untranslatable character." errors.

Background

Client vs. server character sets

The Teradata RDBMS deals with two types of character sets:

  • Client session character sets, used for (essentially all) transmission of character information (data, request text, metadata, etc.) between client and Teradata server. The client session character set is specified by the client application on a request-level basis (although typically it is static for an entire session).
  • Server storage character sets, used to store character data (and metadata) on the Teradata server. The server storage character set is either specified by the user or defaulted for each character column/parameter (see below for the explanations).

The Teradata RDBMS is responsible for all translation between the client session character set and the server storage character set and vice versa.

DEFAULT CHARACTER SET clause

The Teradata RDBMS allows the DBA to include a DEFAULT CHARACTER SET clause when creating a new user (or when modifying an existing user). In the absence of a DEFAULT CHARACTER SET clause at CREATE USER time, DEFAULT CHARACTER SET LATIN (for non-Japanese machines) or DEFAULT CHARACTER SET UNICODE (for Japanese machines) is defaulted.

The server storage character set specified in (or defaulted to) the DEFAULT CHARACTER SET clause is used in two cases (one well known, one obscure):

  • When a character table column is defined (either at CREATE TABLE or at ALTER TABLE ADD column time), if no CHARACTER SET clause is provided for the column, the user's DEFAULT CHARACTER SET is used to specify the server storage character set.
  • When a character macro parameter is defined (CREATE/REPLACE MACRO), if no CHARACTER SET clause is provided for the parameter, the user's DEFAULT CHARACTER SET is used to specify the intermediate macro storage character set.

TPump use of macros

TPump creates a macro for each DML statement (or UPDATE/INSERT pair when DO INSERT ON MISSING UPDATE is used) specified in the job. This simplifies generation of the data loading requests, reduces the size of the request text, and may increase the likelihood of dictionary caching.

The Problem

Normally, the translation of incoming character data from client session character set to server storage character set(s) is done in a single step and is straightforward. However, the necessary use of macros by TPump results in two translations for each column's data:

  • First, from the client session character set to the intermediate macro storage character set, and
  • Second, from the intermediate macro storage character set to the server storage character set.

TPump generates CHARACTER SET clauses on character parameters for the macros it generates only if the client session character set is UTF-16.

Given the following scenario:

  • Client session character set of LATIN1250_1A0 (central/eastern European), which includes characters not represent-able in Teradata Latin
  • User DEFAULT CHARACTER SET LATIN
  • Server storage character set of UNICODE for a target character column

The intermediate macro storage character set will be LATIN, and if there are any characters in the character input data outside of the Teradata Latin repertoire, a translation error will occur during the translation from LATIN1250_1A0 to Teradata LATIN.

More generally, any time the client session character set include characters in its repertoire that are not included in the repertoire of the DEFAULT CHARACTER SET, translation errors will occur if the character input data contains characters not represent-able in the DEFAULT CHARACTER SET. This is true even if the server storage character set for the column in question can represent all characters in the input data.

Note that if the server storage character set cannot represent one or more characters in the input data, a translation error will occur when attempting the translation into the server storage character set. For example, the client session character set is UTF8 and the server storage character set is Teradata LATIN, and characters not represent-able in Teradata LATIN occur in the input data, a translation error will occur. The solution outlined below does not address this scenario, since it is a user error, is not restricted to TPump, and must be addressed by the user (either by changing the server storage character set or by restricting the character repertoire used in the input data).

Unpalatable Workarounds

The two more-or-less obvious workarounds are both unpalatable and non-user-friendly:

  • Change the user's DEFAULT CHARACTER SET setting to UNICODE.
  • For each TPump job:

          Run the job with no data, saving the macros.

   Hand-modify the macros to specify CHARACTER SET UNICODE for each character parameter.

         Change the job to EXECUTE the saved macros.

The Solution

The solution to the problem described above is an expansion of the technique already used by TPump for UTF-16 client session character set. In particular, TPump must generate CHARACTER SET clauses for character macro parameters in such a way as to ensure that the intermediate macro storage character set can represent all possible characters in the input data, based on the client session character set.

The particular CHARACTER SET clauses generated can be controlled by the user, either on an input field basis or globally for a job. Absent user specification, TPump will generate fail-safe (although potentially non-optimal) CHARACTER SET clauses.

Default Behavior

Absent user specification, the following clause will be generated for each character macro parameter:

  • CHARACTER SET LATIN, when the client session character set is:
    • ASCII
    • EBCDIC
    • EBCDIC037_0E
    • EBCDIC277_0E
    • EBCDIC273_0E
    • LATIN1_0A
    • LATIN9_0A
    • LATIN1252_0A
    • Any character set whose name ends in _0A or _0E
    • Any character set whose name ends in _zx, where "z" is other than 0 (digit zero)
    • Any character set whose name does not end in _zx or _zxx, except for KATAKANAEBCDIC, UTF8, or UTF16

The eight explicitly enumerated character sets, and character sets with names conforming to the name rules in the last two bullets, must, by definition, have character repertoires that are subsets of Teradata LATIN

  • CHARACTER SET UNICODE, for all other client session character sets, including:
    • UTF8
    • UTF16
    • KATAKANAEBCDIC
    • KANJIEBCDIC5026_0I
    • KANJIEBCDIC5035_0I
    • KANJISJIS_0S
    • KANJIEUC_0U
    • KANJI932_1S0
    • SCHEBCDIC935_2IJ
    • TCHEBCDIC937_3IB
    • SCHGB2312_1T0
    • TCHBIG5_1R0
    • SCHINESE936_6R0
    • TCHINESE950_8R0
    • HANGULEBCDIC933_1II
    • HANGULKSC5601_2R4
    • KANGUL94_7R0
    • THAI874_4A0
    • LATIN1250_1A0
    • CYRILLIC1251_2A0
    • LATIN1252_3A0
    • LATIN1254_7A0
    • HEBREW1255_5A0
    • ARABIC1256_6A0
    • LATIN1258_8A0
    • Any character set whose name ends in _0x, where "x" is other than "A" or "E"
    • Any character set whose name ends in _zx or _zxx, except for _zx cases noted above

After the solution is implemented in TPump, above mentioned TPump example 1 now loads the data successfully.

User Specification/New Syntax

The user can specify the intermediate macro storage character set at the input field level, at the global layout level, or both:

  • For individual character fields, the intermediate macro storage character set can be specified:

In the .FIELD command, an optional MACROCHARSET <server-character-set> clause can be appended to the datadesc operand.

  • For all character fields, a global intermediate macro storage character set can be specified:

In the .LAYOUT command, an optional MACROCHARSET <server-character-set> clause can follow the layoutname operand.

The semantics of the specifications for each character field are:

  • If a field-level MACROCHARSET <field-server-character-set> clause is specified for a given field, a CHARACTER SET <field-server-character-set> clause is generated for the corresponding character macro parameter.
  • If a global layout-level MACROCHARSET <global-server-character-set> clause is specified, a CHARACTER SET <server-character-set> clause is generated for the each character macro parameter for which there is no field-level MACROCHARSET <field-server-character-set> clause.
  • If no global layout-level MACROCHARSET <global-server-character-set> clause is specified, then the default CHARACTER SET LATIN or CHARACTER SET UNICODE clause is generated for the each character macro parameter for which there is no field-level MACROCHARSET <field-server-character-set> clause.

The intent of the default behavior above is to ensure that no character-repertoire-related translation errors will occur during translation from client session character set to the intermediate macro storage character set. Although not necessarily the most efficient in terms of storage usage, this is a conservative (safe) algorithm; users can override the default behavior as appropriate/desired.

The <server-character-set> refers to server storage character set name. Refer to the following two documents for the valid server storage character set names:

  • Chapter 4 of the International Character Set Support document (B035-1125)
  • CREATE USER statement (DEFAULT CHARACTER SET discussion) in Chapter 6 of the SQL Data Definition Language document (B035-1144).

Example 2:

.LOGTABLE <LOGTABLENAME>;

.LOGON <TDPID>/<USERID>, <PASSWORD>;

DROP TABLE TPTBL1;

DROP TABLE TPERR1;

CREATE MULTISET TABLE TPTBL1, FALLBACK,

NO BEFORE JOURNAL,

NO AFTER JOURNAL,

CHECKSUM = Default

(

id INTEGER,

Description CHAR(50) CHARACTER SET KANJI1,

HexOrg VARCHAR(10) CHARACTER SET KANJI1,

HexUni VARCHAR(10) CHARACTER SET KANJI1,

HexKnji1 VARCHAR(10) CHARACTER SET KANJI1,

CharData VARCHAR(10) CHARACTER SET KANJI1

)

UNIQUE PRIMARY INDEX (id);

.BEGIN LOAD SESSIONS 22

PACK 10

NOMONITOR

ERRLIMIT 500

ROBUST ON

SERIALIZE OFF

CHECKPOINT 10

ERRORTABLE TPERR1;

.LAYOUT lay1a INDICATORS MACROCHARSET KANJI1;

.FIELD fld1 * INTEGER;

.FIELD fld2 * CHAR(100) ;

.FIELD fld3 * VARCHAR(20);

.FIELD fld4 * VARCHAR(20);

.FIELD fld5 * VARCHAR(20);

.FIELD fld6 * VARCHAR(20);

.DML LABEL labela USE (fld1, fld2, fld3, fld4, fld5, fld6);

INSERT INTO TPTBL1(Id, Description, HexOrg, HexUni, HexKnji1, CharData)

VALUES(:fld1, :fld2, :fld3, :fld4, :fld5, :fld6);

.IMPORT INFILE <DATAFILENAME>

FORMAT fastload

LAYOUT lay1a

APPLY labela;

.END LOAD;

.LOGOFF;

Here is an example to demonstrate how to specify field-level MACROCHARSET info; KANJISJIS_0S is used as the run-time client session charset name.

Example 3:

.LOGTABLE <LOGTABLENAME>;

.LOGON <TDPID>/<USERID>, <PASSWORD>;

DROP TABLE TPTBL1;

DROP TABLE TPERR1;

CREATE MULTISET TABLE TPTBL1, FALLBACK,

NO BEFORE JOURNAL,

NO AFTER JOURNAL,

CHECKSUM = Default

(

id INTEGER,

Description CHAR(50) CHARACTER SET LATIN,

HexOrg VARCHAR(10) CHARACTER SET LATIN,

HexUni VARCHAR(10) CHARACTER SET LATIN,

HexKnji1 VARCHAR(10) CHARACTER SET LATIN,

CharData VARCHAR(10) CHARACTER SET KANJI1

)

UNIQUE PRIMARY INDEX (id);

.BEGIN LOAD SESSIONS 22

PACK 10

NOMONITOR

ERRLIMIT 500

ROBUST ON

SERIALIZE OFF

CHECKPOINT 10

ERRORTABLE TPERR1;

.LAYOUT lay1a INDICATORS;

.FIELD fld1 * INTEGER;

.FIELD fld2 * CHAR(100) MACROCHARSET LATIN;

.FIELD fld3 * VARCHAR(20) MACROCHARSET LATIN;

.FIELD fld4 * VARCHAR(20) MACROCHARSET LATIN;

.FIELD fld5 * VARCHAR(20) MACROCHARSET LATIN;

.FIELD fld6 * VARCHAR(20) MACROCHARSET KANJI1;

.DML LABEL labela USE (fld1, fld2, fld3, fld4, fld5, fld6);

INSERT INTO TPTBL1(Id, Description, HexOrg, HexUni, HexKnji1, CharData)

VALUES(:fld1, :fld2, :fld3, :fld4, :fld5, :fld6);

.IMPORT INFILE <DATAFILENAME>

FORMAT fastload

LAYOUT lay1a

APPLY labela;

.END LOAD;

.LOGOFF;

Here is an example to demonstrate how to specify a global layout-level MACROCHARSET specification in the .LAYOUT command, and then override the setting on one or more .FIELD commands; KANJISJIS_0S is used as the run-time client session charset name.

Example 4:

.LOGTABLE <LOGTABLENAME>;

.LOGON <TDPID>/<USERID>, <PASSWORD>;

DROP TABLE TPTBL1;

DROP TABLE TPERR1;

CREATE MULTISET TABLE TPTBL1, FALLBACK,

NO BEFORE JOURNAL,

NO AFTER JOURNAL,

CHECKSUM = Default

(

id INTEGER,

Description CHAR(50) CHARACTER SET LATIN,

HexOrg VARCHAR(10) CHARACTER SET LATIN,

HexUni VARCHAR(10) CHARACTER SET KANJI1,

HexKnji1 VARCHAR(10) CHARACTER SET KANJI1,

CharData VARCHAR(10) CHARACTER SET KANJI1

)

UNIQUE PRIMARY INDEX (id);

.BEGIN LOAD SESSIONS 22

PACK 10

NOMONITOR

ERRLIMIT 500

ROBUST ON

SERIALIZE OFF

CHECKPOINT 10

ERRORTABLE TPERR1;

.LAYOUT lay1a INDICATORS MACROCHARSET LATIN;

.FIELD fld1 * INTEGER;

.FIELD fld2 * CHAR(100);

.FIELD fld3 * VARCHAR(20);

.FIELD fld4 * VARCHAR(20) MACROCHARSET KANJI1;

.FIELD fld5 * VARCHAR(20) MACROCHARSET KANJI1;

.FIELD fld6 * VARCHAR(20) MACROCHARSET KANJI1;

.DML LABEL labela USE (fld1, fld2, fld3, fld4, fld5, fld6);

INSERT INTO TPTBL1(Id, Description, HexOrg, HexUni, HexKnji1, CharData)

VALUES(:fld1, :fld2, :fld3, :fld4, :fld5, :fld6);

.IMPORT INFILE <DATAFILENAME>

FORMAT fastload

LAYOUT lay1a

APPLY labela;

.END LOAD;

.LOGOFF;

Special thanks to Bruce Britton’s elaboration for this TPump specific issue and guideline to implement this feature.

Tags (2)