Convert UTF8 data in Latin Column to UNICODE

Database
Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Teradata LATIN contains a superset of the printable characters in ISO 8859-1 and ISO 8859-15, but some of the code points (hex values) between x'80' and x'FF' are assigned to different characters in Teradata internal "LATIN" vs ISO Latin character sets.  If you search for Document ID 1207 at http://info.teradata.com you can see the TD LATIN definitions. (The table is sorted in TD LATIN code point order from x'00' - x'FF', though unfortunately a column with that LATIN code point value isn't included in the current version of the document, only the corresponding Unicode code point and character name.)

 

So if your target column is UNICODE, then Teradata LATIN x'80' gets translated to U+00A0. If you are going to load UTF8 as ASCII and use the pt_utf8to16 UDF to do the translation, then the staging table column (input to the UDF) should be defined as LATIN.

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Interesting. That make sense now!

 

So, Teradata Internal LATIN is not exactly ISO 8859-x nor map to any such standard LATIN character set. Any idea what usecase or reason for which Teradata Internal Latin maps differently ? 

 

Yes, Staging is defined as LATIN. (Above scenario was noted in some trial runs & not related to my problem in suubject, please excuse)

I have 100+ tables that are defined LATIN and ETL scripts load UTF8 encoded source data into these tables with connection strings as well defined as LATIN. Based on what i learnt, i will be performing pt_utf8to16 on existing historical data in LATIN columns to load onto table replica of UNICODE char set for strings. Please correct me or suggest if this is better approach.

 

I dont see any gaps as long as source was properly coded in utf8 and all utf8 hex is preserved in LATIN columns.

 

Thank You,

Ranjan

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Hi Fred, Please share your comments.


@Ranjan_TD wrote:

Interesting. That make sense now!

 

So, Teradata Internal LATIN is not exactly ISO 8859-x nor map to any such standard LATIN character set. Any idea what usecase or reason for which Teradata Internal Latin maps differently ? 

 

Yes, Staging is defined as LATIN. (Above scenario was noted in some trial runs & not related to my problem in suubject, please excuse)

I have 100+ tables that are defined LATIN and ETL scripts load UTF8 encoded source data into these tables with connection strings as well defined as LATIN. Based on what i learnt, i will be performing pt_utf8to16 on existing historical data in LATIN columns to load onto table replica of UNICODE char set for strings. Please correct me or suggest if this is better approach.

 

I dont see any gaps as long as source was properly coded in utf8 and all utf8 hex is preserved in LATIN columns.

 

Thank You,

Ranjan


 

Highlighted
Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Yes. Loading UTF8 source to LATIN columns using ASCII session character set, then using pt_utf8to16 on the "LATIN" column to convert to UNICODE should store correct UTF16 values.

 

One additional note mentioned in the UTK documentation: Any 4-byte UTF8 sequences (code points above U+FFFF) are stored as a pair of "surrogate" code points in UTF16 representation. You don't have any of those in your examples. If there are any, prior to TD16/TTU16 you would need to use UTF16 character set for subsequent SELECT / Export sessions to have those surrogates correctly returned and interpreted by the client.

Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Why is Teradata LATIN slightly different from other "standard" character sets?

Teradata LATIN was actually defined prior to formal publication of the initial ECMA standard that became ISO 8859-1 in 1985, and well before other now-standardized "LATIN" variants that assign specific values to all 256 single-byte code points, such as ISO 8859-1:1987. The original ISO 8859-1 left a number of code points unused / undefined, and Teradata's choices for how to use those code points didn't exactly match what others eventually adopted. The only real change to Teradata LATIN that I can think of was to add the Euro symbol (in place of a seldom-used character).

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

That is good to know. Thanks for such valuable inputs. 


@Fred wrote:

Why is Teradata LATIN slightly different from other "standard" character sets?

Teradata LATIN was actually defined prior to formal publication of the initial ECMA standard that became ISO 8859-1 in 1985, and well before other now-standardized "LATIN" variants that assign specific values to all 256 single-byte code points, such as ISO 8859-1:1987. The original ISO 8859-1 left a number of code points unused / undefined, and Teradata's choices for how to use those code points didn't exactly match what others eventually adopted. The only real change to Teradata LATIN that I can think of was to add the Euro symbol (in place of a seldom-used character).