Convert UTF8 data in Latin Column to UNICODE

Enthusiast

Hi Everyone,

 

I have UTF-8 source data that was read as Latin and written into a Latin column in Teradata. Now I am changing the Teradata column to UNICODE, as that is the right thing to do. How do I make sure the data is translated properly and ultimately renders the intended character?

 

Example: RIGHT SINGLE QUOTATION MARK (U+2019)

UTF-8 (hex): 0xE2 0x80 0x99 (E28099)

 

Data looks like this when I query from a session with character set Latin:

Latin Column    Hex Value

Kohl™s           4B6F686CE2809973

 

I tried:

Create a new table with a UNICODE column and do a plain INSERT INTO the new table, SELECT from the existing table.

Data looks like this when I query from a session with character set UTF8:

Latin Column    Hex Value

Kohlâ ™s          004B006F0068006C00E200A000990073
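That corrupted UNICODE hex is consistent with a byte-for-byte widening: the plain INSERT...SELECT converted each stored Latin byte to a separate Unicode code point instead of decoding the three-byte UTF-8 sequence. A hedged Python sketch of that step (the 0x80 → U+00A0 mapping is an assumption inferred from the hex in the post, not an official Teradata mapping table; strict ISO-8859-1 would give U+0080):

```python
# Emulate (roughly) what the plain INSERT...SELECT did: each byte in the
# LATIN column becomes one Unicode code point, with no UTF-8 decoding.
# Assumption: byte 0x80 maps to U+00A0, inferred from the hex in the post.
latin_bytes = bytes.fromhex("4B6F686CE2809973")
mapped = "".join("\u00A0" if b == 0x80 else chr(b) for b in latin_bytes)
print(mapped.encode("utf-16-be").hex().upper())
# 004B006F0068006C00E200A000990073 -- the corrupted value in the post
```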

 

Also tried:

Session character set as Latin

Create table with the column as UNICODE

INSERT INTO the new table SELECT TRANSLATE(col USING LATIN_TO_UNICODE) FROM the existing table

 

I still see -

Latin Column    Hex Value

Kohlâ ™s          004B006F0068006C00E200A000990073

 

As we can see, the intended character is corrupted either way.


Any suggestions are appreciated!

 

Thank You!

6 REPLIES
Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Download the Unicode ToolKit from downloads.teradata.com/tools, which includes a UDF to perform this conversion.

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Thanks for the information. Can you please point me to the UDF that I need to run? Is Teradata LATIN nothing but Windows-1252? In that case I do see udf_w1252to16, but I am not sure that is what I need to run, because my source is UTF-8.

 

Thank you,

Ranjan

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

I understand Latin is ISO-8859-1, and I don't see any UDF named along the lines of udf_latin_to_utf8.

Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Many of the Teradata "translation UDFs" take a value defined as LATIN (but which actually contains something else) as input and return a Teradata UNICODE value. Some do the reverse, as indicated in the documentation - take a Teradata UNICODE value and return a string of bytes that the database is told are LATIN characters (but may not be).

 

You select the function based on the actual encoding of the string the database was told to treat as LATIN. Based on your description of the situation you are facing, udf_utf8to16 seems to do what you want.

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Thanks. That somewhat makes sense. But to start with: when I load data into LATIN columns, shouldn't it have been loaded as LATIN code values? Why do I see the UTF-8 hex value in the LATIN column? Or is it because I run the CHAR2HEXINT function that it gives me the UTF-8 hex value?

My example when using CHAR2HEXINT

Name : Kohlâ ™s - hex value : 4B6F686CE2809973

 

Question is -

1. Is my data stored as actual Latin code values just because the column in Teradata is defined as LATIN?

2. Is my data stored as actual UTF-8 code values because the source is UTF-8, even though the target is a Latin column? I can't quite believe that, but I'm not sure why I see the UTF-8 hex value retained for E28099 (right single quotation mark).

3. Is it a combination of both Latin and UTF-8? In that case, will we be able to differentiate only those characters which have UTF-8 code values and need udf_utf8to16, as you were pointing out, while the rest need some kind of Latin-to-UTF16 translation?
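For the third question, one rough screening idea can be sketched in Python. This is a heuristic under the assumption that genuine Latin text containing high bytes is rarely also valid multi-byte UTF-8; short sequences can be ambiguous, so this flags candidates rather than proving the original encoding:

```python
# Heuristic sketch: classify a stored byte string by whether it would need
# a UTF-8-to-UTF-16 conversion (like udf_utf8to16) or a Latin-to-Unicode
# mapping. Caveat: some CP1252 strings happen to be valid UTF-8 too, so
# this can only flag candidates, not prove the original encoding.
def classify(raw: bytes) -> str:
    if all(b < 0x80 for b in raw):
        return "ascii (same either way)"
    try:
        raw.decode("utf-8")
        return "likely utf-8"
    except UnicodeDecodeError:
        return "latin/cp1252 (not valid utf-8)"

print(classify(bytes.fromhex("4B6F686CE2809973")))  # likely utf-8
print(classify(bytes.fromhex("4B6F686C9273")))      # latin/cp1252 (not valid utf-8)
```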

 

Please clarify and shed some light on this, as I am not clear.

In the end, I do not want my company's data to be corrupted by wrong decoding and encoding, as this rule will be applied across the entire data warehouse, and validating character by character for each data element is tedious and a HUGE effort.

 

Thank You,

Ranjan

Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

The database relies on the application to identify the character set / code page of the incoming data (or data to be returned). When you load to a LATIN column using the "ASCII" session character set, you are in effect saying the input is already LATIN and no translation is needed - the bytes are stored as presented and assumed to be LATIN.

 

In UTF-8 encoding, the ASCII characters (x'00'-x'7F') are represented as those same single-byte values, so (except for the error substitution character x'1A') they will be mapped directly to U+0000 to U+007F by udf_utf8to16. The x'1A' will be mapped to the errorchar you supply. Unicode characters that are above U+007F are represented by specific multi-byte sequences (which always have the high order bit set). Depending on your Teradata release, some valid Unicode characters (e.g. supplemental plane emoji) were not acceptable in a Teradata UNICODE column, so if present they will also be replaced by the errorchar.
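The ASCII pass-through property described above is easy to verify in any language; a one-line Python sanity check (ignoring Teradata's special handling of x'1A'):

```python
# Every single byte 0x00-0x7F decodes under UTF-8 to the identical code
# point, which is why pure-ASCII data survives the conversion unchanged.
assert all(bytes([b]).decode("utf-8") == chr(b) for b in range(0x80))
print("ASCII range passes through UTF-8 decoding unchanged")
```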

 

The issue would be if you have single-byte values x'80'-x'FF' that are not valid UTF-8, e.g. if the data loads were done using a mixture of UTF-8 and Windows-1252. Taking your example of RIGHT SINGLE QUOTATION MARK (U+2019), rows that have the UTF-8 sequence 'E28099' would be handled correctly; but if a row had the Windows equivalent x'92' (which is not a valid initial byte in UTF-8), then udf_utf8to16 would fail with a message to the effect that the input was not valid UTF-8.
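That failure mode can be sketched in Python (an analogue of the validity check a UTF-8 converter must perform; the actual udf_utf8to16 behavior and error text are as documented by Teradata, not shown here):

```python
# UTF-8 sequence for Kohl's with U+2019: decodes cleanly.
good = bytes.fromhex("4B6F686CE2809973")
print(good.decode("utf-8"))               # Kohl’s

# Windows-1252 byte 0x92 for the same character: 0x92 is a continuation
# byte and can never start a UTF-8 sequence, so decoding fails.
bad = bytes.fromhex("4B6F686C9273")
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e.reason)   # invalid start byte
```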