CHAR(n) CHAR SET UNICODE/LATIN define schema in 14.xx - multiply by 2 or 3

Enthusiast


In a TPT script that uses "USING CHARACTER SET UTF8" in 13.10, we multiply the char_length() of a CHAR or VARCHAR column by 3, so for example:

  • CHAR(2) in DDL = CHAR(6) in TPT
  • VARCHAR(20) in DDL = VARCHAR(60) in TPT

It seems that in TPT 14.00 and 14.10, if we use UTF8, the following definitions work:

  • CHAR(2) CHAR SET UNICODE in DDL = CHAR(6) in TPT
  • CHAR(2) CHAR SET LATIN in DDL = CHAR(4) in TPT
  • VARCHAR(20) CHAR SET UNICODE in DDL = VARCHAR(60) in TPT
  • VARCHAR(20) CHAR SET LATIN in DDL = VARCHAR(60) in TPT

The following DDL will only work with the following TPT schema definition:

CREATE SET TABLE tpt_data_type, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
       PRODUCT_ID INTEGER NOT NULL,
       ORDER_ID BIGINT NOT NULL,
       ORDER_DESC VARCHAR(100) CHARACTER SET UNICODE NOT CASESPECIFIC,
       STORE_CODE VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
       ACTIVE_FLAG CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
       UNICODE_FLAG CHAR(1) CHARACTER SET UNICODE NOT CASESPECIFIC,
       ORDER_AMT DECIMAL(15,2),
       ORDER_TIME TIMESTAMP(0),
       ORDER_STATE CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
       UPDATED_TS TIMESTAMP(0)
     )
     PRIMARY INDEX ( ORDER_ID );

USING CHARACTER SET UTF8
DEFINE JOB EXPORT_TO_FASTLOAD_FORMAT
DESCRIPTION 'Export from ' || @SourceTableName || ' to the INDICDATA file: ' || @DataFileName
(
  DEFINE SCHEMA DATA_FILE_SCHEMA
  (
    "PRODUCT_ID"   Int,
    "ORDER_ID"     BigInt,
    "ORDER_DESC"   Varchar(300), /* 100 x 3 */
    "STORE_CODE"   Varchar(60),  /* 20 x 3 */
    "ACTIVE_FLAG"  Char(2),      /* 1 x 2 */
    "UNICODE_FLAG" Char(3),      /* 1 x 3 */
    "ORDER_AMT"    Decimal(15,2),
    "ORDER_TIME"   Timestamp(0),
    "ORDER_STATE"  Char(2),      /* 1 x 2 */
    "UPDATED_TS"   Timestamp(0)
  );

  DEFINE OPERATOR EXPORT_OPERATOR
  TYPE EXPORT
  SCHEMA DATA_FILE_SCHEMA
  ATTRIBUTES
  (
    VARCHAR PrivateLogName = @SourceTableName || '_log',
    VARCHAR TdpId = @TdpId,
    VARCHAR UserName = @UserName,
    VARCHAR UserPassword = @UserPassword,
    VARCHAR QueryBandSessInfo = 'Action=TPT_EXPORT; Format=Fastload;',
    VARCHAR SpoolMode = 'noSpool',
    INTEGER MaxDecimalDigits = 18,
    VARCHAR DateForm = 'INTEGERDATE',
    VARCHAR SelectStmt = 'select * from ' || @SourceTableName
  );

  DEFINE OPERATOR FILE_WRITER
  TYPE DATACONNECTOR CONSUMER
  SCHEMA *
  ATTRIBUTES
  (
    VARCHAR PrivateLogName = 'indicdata_writer_log',
    VARCHAR DirectoryPath = @DataFilePath,
    VARCHAR FileName = @DataFileName,
    VARCHAR Format = 'Formatted',
    VARCHAR OpenMode = 'Write',
    VARCHAR IndicatorMode = 'Y'
  );

  APPLY TO OPERATOR (FILE_WRITER[@DataFileCount])
  SELECT * FROM OPERATOR (EXPORT_OPERATOR[@NumOfReader]);
);

$ hexdump -C /var/tmp/tpt_data_type.fastload 
00000000 64 00 00 00 12 27 00 00 15 cd 5b 07 00 00 00 00 |d....'...�[.....|
00000010 19 00 44 55 4d 4d 59 20 44 45 53 43 20 e6 95 99 |..DUMMY DESC �..|
00000020 e5 ad a6 e8 ae be e8 ae a1 20 21 04 00 53 45 41 |学设计 !..SEA|
00000030 32 59 20 e5 85 a8 bf 09 00 00 00 00 00 00 32 30 |2Y �.��.......20|
00000040 31 34 2d 30 32 2d 31 32 20 31 32 3a 31 32 3a 32 |14-02-12 12:12:2|
00000050 32 50 20 32 30 31 34 2d 30 32 2d 31 39 20 31 37 |2P 2014-02-19 17|
00000060 3a 32 37 3a 35 37 0a 64 00 00 00 19 27 00 00 45 |:27:57.d....'..E|
00000070 ef 54 07 00 00 00 00 19 00 44 55 4d 4d 59 20 44 |�T.......DUMMY D|
00000080 45 53 43 20 e5 af bb e6 89 be e9 87 91 e6 98 9f |ESC 寻�.��..�..|
00000090 20 21 04 00 4c 41 58 33 4e 20 e5 85 8d 83 13 00 | !..LAX3N �.....|
000000a0 00 00 00 00 00 32 30 31 34 2d 30 31 2d 31 31 20 |.....2014-01-11 |
000000b0 31 31 3a 31 31 3a 31 31 57 20 32 30 31 34 2d 30 |11:11:11W 2014-0|
000000c0 32 2d 31 39 20 31 37 3a 32 37 3a 35 37 0a |2-19 17:27:57.|
000000ce

CHAR(n) CHAR SET LATIN behaves differently than in previous versions. Can someone please confirm?

6 REPLIES
Enthusiast

Re: CHAR(n) CHAR SET UNICODE/LATIN define schema in 14.xx - multiply by 2 or 3

According to 2436.pdf - TPT Reference

Use of UNICODE Affects Column Width Requirements

When specifying a UTF8 character set in a Teradata PT script, the output SCHEMA definition must define column widths three times larger.

When using the UTF16 character set in a Teradata PT job script, the output SCHEMA definition must define column widths two times larger. The width values must be an even and positive number.



I don't see anything special documented for CHAR(n).
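The documented rule can be restated as a tiny helper (a hypothetical Python function, not part of TPT; it only encodes the multipliers quoted above):

```python
def tpt_schema_width(char_length, session_charset):
    """Byte width a TPT SCHEMA column must declare for a CHAR/VARCHAR
    defined with char_length characters in the database, per the rule
    quoted from the TPT Reference."""
    if session_charset == "UTF8":
        # UTF8: each character may occupy up to 3 bytes
        return char_length * 3
    if session_charset == "UTF16":
        # UTF16: 2 bytes per character; width must be even and positive
        width = char_length * 2
        assert width > 0 and width % 2 == 0
        return width
    # single-byte session character sets: width equals character count
    return char_length

print(tpt_schema_width(100, "UTF8"))   # VARCHAR(100) -> 300
print(tpt_schema_width(20, "UTF16"))   # VARCHAR(20)  -> 40
```

By that rule alone, CHAR(n) CHAR SET LATIN under UTF8 would also be 3x, which is what makes the observed 2x behavior surprising.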

Enthusiast

Re: CHAR(n) CHAR SET UNICODE/LATIN define schema in 14.xx - multiply by 2 or 3

I double-checked the above observation against the auto-generated schema, and it confirms that "CHAR(n) CHAR SET LATIN" is estimated as "CHAR(n x 2)" in the TPT definition when exporting/loading a UTF8 data file.

DEFINE SCHEMA DATA_FILE_SCHEMA FROM TABLE 'tpt_data_type';

     DataConnector operator for Linux release 2.6.32-358.6.2.el6.x86_64 on eat1-etl07.corp
TDICU................................... 14.10.00.00
PXICU................................... 14.10.00.03
PMPROCS................................. 14.10.00.07
PMRWFMT................................. 14.10.00.03
PMTRCE.................................. 13.00.00.02
PMMM.................................... 03.00.00.01
DCUDDI.................................. 14.10.00.43
PMHEXDMP................................ 14.10.00.01
PMUNXDSK................................ 14.10.00.10

UseGeneralUDDIcase: 'NO (defaulted)' (=0)
WriteBOM: 'NO (defaulted)' (=0)
AcceptExcessColumns: 'NO (defaulted)' (=0)
AcceptMissingColumns: 'NO (defaulted)' (=0)
TruncateColumnData: 'NO (defaulted)' (=0)
TruncateColumns: 'NO (defaulted)' (=0)
TruncateLongCols: 'NO (defaulted)' (=0)
WARNING! RecordErrorFilePrefix attribute not specified, there is no default
RecordErrorVerbosity: OFF (default) (=0)
FileName: 'tpt_data_type.fastload'
OpenMode: 'WRITE' (2)
Format: 'FORMATTED' (3)
IOBufferSize: 131072

Full File Path: /var/tmp/tpt_data_type.fastload
Data Type       Ind  Length  Offset  M
INTEGER  ( 1)     1       4       0  N
BIGINT   ( 37)    1       8       4  N
VARCHAR  ( 7)     1     300      12  N
VARCHAR  ( 7)     1      40     312  N
CHAR     ( 5)     1       2     352  N
CHAR     ( 5)     1       3     354  N
DECIMAL  ( 4)     1       8     357  N
CHAR     ( 5)     1      19     365  N
CHAR     ( 5)     1       2     384  N
CHAR     ( 5)     1      19     386  N
Schema is not all character data
Schema is compatible with delimited data

===================================================================
=                                                                 =
=                    Column/Field Definition                      =
=                                                                 =
===================================================================

Column Name                    Offset  Length  Type
============================== ======  ======  ========================
"PRODUCT_ID"                        0       4  INTEGER
"ORDER_ID"                          4       8  BIGINT
"ORDER_DESC"                       12     300  VARCHAR
"STORE_CODE"                      314      40  VARCHAR
"ACTIVE_FLAG"                     356       2  CHAR
"UNICODE_FLAG"                    358       3  CHAR
"ORDER_AMT"                       361       8  DECIMAL(15,2)
"ORDER_TIME"                      369      19  CHAR
"ORDER_STATE"                     388       2  CHAR
"UPDATED_TS"                      390      19  CHAR
============================== ======  ======  ========================
INDICATOR BYTES NEEDED: 2
EXPECTED RECORD LENGTH: 411
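As a sanity check, the expected record length above can be reconstructed from the column widths in the layout (plain Python; the 2-byte VARCHAR length prefix and the 2 indicator bytes are taken from the output shown):

```python
# Reconstruct the FastLoad-format record length from the layout above.
fixed = [4, 8, 2, 3, 8, 19, 2, 19]   # INTEGER, BIGINT, CHARs, DECIMAL, timestamps-as-CHAR
varchars = [300, 40]                 # ORDER_DESC, STORE_CODE payload widths
indicator_bytes = 2                  # 10 columns -> 2 indicator bytes

# Each VARCHAR field carries a 2-byte length prefix in the record.
record_len = indicator_bytes + sum(fixed) + sum(v + 2 for v in varchars)
print(record_len)  # 411, matching EXPECTED RECORD LENGTH above
```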

Enthusiast

Re: CHAR(n) CHAR SET UNICODE/LATIN define schema in 14.xx - multiply by 2 or 3

Hello,

Can I ask something about this?

Does this mean that during the process TD transfers two or three times more than the needed space (if we had only VARCHAR columns)?

Thank you.

Teradata Employee

Re: CHAR(n) CHAR SET UNICODE/LATIN define schema in 14.xx - multiply by 2 or 3

TPT does not know anything about the table definition of the source table, whether the columns are of type LATIN or UNICODE.

We only work off of the client session character set.

Database column definitions are in terms of characters.

TPT schema is in terms of bytes.

Any character in the UTF8 client session character set can be 1, 2, or 3 bytes.

And TPT does not know which.

Thus, we will document that the schema definition must be 3x the size of the table definition so that enough space is reserved for the data.
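A quick plain-Python illustration of why 3x is the safe reservation: UTF-8 encodes BMP characters in 1 to 3 bytes, and the client cannot know in advance which it will receive.

```python
# UTF-8 byte lengths for an ASCII, an accented Latin, and a CJK character.
for ch in ("A", "é", "学"):
    print(ch, len(ch.encode("utf-8")))
# A -> 1 byte, é -> 2 bytes, 学 -> 3 bytes
```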

The actual bytes exported from Teradata are controlled by several factors, of which column definition and database export width are some factors.

(And this is just CHAR/VARCHAR. It gets worse with the TIMESTAMP data types.)

-- SteveF
Enthusiast

Re: CHAR(n) CHAR SET UNICODE/LATIN define schema in 14.xx - multiply by 2 or 3

Thanks very much for your answer.

To be honest, I have not worked a lot with TD loading tools, but I am trying to understand one thing about them: network traffic.

If, for example, I have a table with two columns,

clm1 CHAR(10) LATIN

clm2 VARCHAR(10) UNICODE,

and I want to insert 1M rows, no NULLs, then from this example I need at least:

clm1: 1M * 10 * 2 bytes

clm2: 1M * 10 * 2 bytes

I meant before that a defined DDL must have a corresponding schema in TPT.

My question is the following: if my real data are not 10 characters, does TPT send the full size, or does it send only the characters it finds?

If I know that my data are at most 5 characters for every column, would it be better to change the defined length to (5) both in the table and in the TPT script, in order to decrease the time spent sending the data, or even to use fewer resources on the client side to create the data that must be transferred to TD?

Thank you.

Teradata Employee

Re: CHAR(n) CHAR SET UNICODE/LATIN define schema in 14.xx - multiply by 2 or 3

If you are uncertain about your data, you are better off using VARCHAR instead of CHAR.

Less traffic across the network (as you pointed out). We only send what we receive.

As far as resources to create the data, there is no difference.
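A rough sketch of the difference, assuming single-byte data and the 2-byte VARCHAR length prefix seen in the FastLoad-format file earlier (hypothetical helpers, not TPT APIs):

```python
def char_bytes(declared_chars, bytes_per_char=1):
    """CHAR is fixed width: the full declared size is always sent,
    shorter values are space-padded."""
    return declared_chars * bytes_per_char

def varchar_bytes(actual_chars, bytes_per_char=1, prefix=2):
    """VARCHAR carries a small length prefix but only the actual data."""
    return prefix + actual_chars * bytes_per_char

# 5 actual characters in a column declared as (10), single-byte data:
print(char_bytes(10))    # 10 bytes per row, regardless of content
print(varchar_bytes(5))  # 7 bytes per row: 2-byte prefix + 5 bytes of data
```

With multi-byte session character sets the per-character byte count varies, but the principle is the same: VARCHAR only carries what is actually there.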

-- SteveF