Convert UTF8 data in Latin Column to UNICODE

Database
Enthusiast

Convert UTF8 data in Latin Column to UNICODE

Hi Everyone,

 

I have UTF8 source data read as Latin and written into Latin column in Teradata. Now, I am changing TD column to UNICODE as that is the right thing to do. Now, how do i make sure data is translated properly & ultimately render intended character ?

 

Ex - RIGHT SINGLE QUOTATION MARK (U+2019)

UTF-8 (hex)0xE2 0x80 0x99 (e28099)

 

Data looks like this when i am querying from session with character set as Latin-

Latin Column    Hex Value

Kohl™s           4B6F686CE2809973

 

I Tried  -

create a new table with UNICODE as column and do just plain plain insert into new table sel from existing table

Data looks like this when i am querying from session with character set as UTF8

Latin Column    Hex Value

Kohlâ ™s          004B006F0068006C00E200A000990073

 

Also tried,

Session setting as Latin

Create table with Column as UNICODE

Insert into new table select translate(col latin_to_unicode)) from existing table

 

I still see -

Latin Column    Hex Value

Kohlâ ™s          004B006F0068006C00E200A000990073

 

As we can see, intended character is corrupted anyway.


Any suggestions is appreciated ?

 

Thank You!


Accepted Solutions
Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Yes. Loading UTF8 source to LATIN columns using ASCII session character set, then using pt_utf8to16 on the "LATIN" column to convert to UNICODE should store correct UTF16 values.

 

One additional note mentioned in the UTK documentation: Any 4-byte UTF8 sequences (code points above U+FFFF) are stored as a pair of "surrogate" code points in UTF16 representation. You don't have any of those in your examples. If there are any, prior to TD16/TTU16 you would need to use UTF16 character set for subsequent SELECT / Export sessions to have those surrogates correctly returned and interpreted by the client.

1 ACCEPTED SOLUTION
15 REPLIES
Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Download the Unicode ToolKit from downloads.teradata.com/tools, which includes a UDF to perform this conversion.

Highlighted
Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Thanks for information. can you please point me to UDF thati need to run ? Is Teradata LATIN nothing but Windows1252? in which case i do see udf_w1252to16.0 but not sure if this is what i need to run because my source is utf8

 

Thank you,

Ranjan

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

I understand Latin is 1SO-8859 -1 and dont see any UDF which reads UDFlatin_to_utf8.

Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

Many of the Teradata "translation UDFs" take a value defined as LATIN (but which actually contains something else) as input and return a Teradata UNICODE value. Some do the reverse, as indicated in the documentation - take a Teradata UNICODE value and return a string of bytes that the database is told are LATIN characters (but may not be).

 

You select the function based on the actual encoding of the string the database was told to treat as LATIN. Based on your description of the situation you are facing, udf_utf8to16 seems to do what you want.

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Thanks. That somewhat make sense But, to start with - when i load data into LATIN columns, sholudnt they have been loaded with LATIN code values ? why do i see UTF8 hex value in LATIN column. Or is it because i run CHAR2HEXINT function that it is giving me UTF8 hex value?  

My example when using CHAR2HEXINT

Name : Kohlâ ™s - hex value : 4B6F686CE2809973

 

Question is -

1. is my data storing actual values in Latin code just because column in teradata is defined as latin 

2. is my data storing actual values in utf8 code because source is utf8 though target is latin column - cant agree but not sure why i see utf8 hex value reatined for E28099 (right quotation mark)

3. is it combination of both Latin & utf8 in this case will we be able to differentiate only those characters which has utf8 code values and needs udf_utf8to16 as you were pointing but rest needs to be some kind of Latin to utf16?

 

Please clarify and put light on this as i am not clear.

At the end i do not want my company data to be corrupted with wrong decoding and encoding as this rule will be applied on entire datawarehouse and validating character by character for each data element is tedious and HUGE effort.

 

Thank You,

Ranjan

Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

The database relies on the application to identify the character set / code page of the incoming data (or data to be returned). When you load to a LATIN column using the "ASCII" session character set, you are in effect saying the input is already LATIN and no translation is needed - the bytes are stored as presented and assumed to be LATIN.

 

In UTF-8 encoding, the ASCII characters (x'00'-x'7F') are represented as those same single-byte values, so (except for the error substitution character x'1A') they will be mapped directly to U+0000 to U+007F by udf_utf8to16. The x'1A' will be mapped to the errorchar you supply. Unicode characters that are above U+007F are represented by specific multi-byte sequences (which always have the high order bit set). Depending on your Teradata release, some valid Unicode characters (e.g. supplemental plane emoji) were not acceptable in a Teradata UNICODE column, so if present they will also be replaced by the errorchar.

 

The issue would be if you have single-byte values x'80'-x'FF' that are not valid UTF-8, e.g. if the data loads were done using a mixture of UTF-8 and Windows-1252. Taking your example of RIGHT SINGLE QUOTATION MARK (U+2019), rows that have UTF-8 sequence 'E28099' would be handled correctly; but a row had Windows equivalent x'92' (which is not a valid initial byte in UTF-8) then udf_utf8to16 would fail with a message to the effect that the input was not valid UTF-8.

 

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Thanks Fred.  Excuse me, for delayed response here.

 

 

I was able to test with 4 byte characters in supplimentary plane U+1F451 (Crown), U+1D442 (Italic O). Inserted them into a LATIN column with ASCII as character set in mload & further used pass through pt_utf8to16 that successfully converted and loaded into a UNICODE column.

Also, i tested right quotation mark U+2019 similarly.

 

what i am not clear about is -

1. your comment - "Depending on your Teradata release, some valid Unicode characters (e.g. supplemental plane emoji) were not acceptable in a Teradata UNICODE column, so if present they will also be replaced by the errorchar"

what is errorchar & how to set it ? (udf_utf8to16  and pt is different i assume)

 

2. When i load right quotation mark U+2019 which is E28099 into an UNICODE column using MLOAD with session character Set as ASCII, i see in target hex value as E2A099. How did 80 become A0 ?

 

3. is there a Universal Font that covers all Unicode characters ? for example - Crown character U+1F451 can only be seen with "Segoe UI Symbol" whereas Mathematical characters like U+1D442 (Italic O) wont show up with Segoe and we have to use compatible font like "Cambria Math" in which Crown emoji wont render right. Any suggestion is much appreciated.

 

This forum is great. Thanks for continous support.

 

Thank You,

Ranjan

Teradata Employee

Re: Convert UTF8 data in Latin Column to UNICODE

1. I was trying to say that if the database does the translation (rather than using the UDF) such characters get replaced by the Unicode error substitution character U+FFFD.

2. There should be no translation happening between session character set ASCII and server LATIN. If the resulting character in the UNICODE column is correct, then the LATIN column must be correct also. How are you seeing the hex values, CHAR2HEXINT function? If the method you are using to view the hex values does some translation behind the scenes, that might explain it; for example, x'80' in Teradata LATIN is "non-breaking space" which is x'A0' in Windows-1252.

3. I'm certainly no expert on fonts, but I don't know of any "universal font".

Enthusiast

Re: Convert UTF8 data in Latin Column to UNICODE

Thanks for your valuable inputs and time.

 

Regarding 2. You said - x'80' in Teradata LATIN is "non-breaking space". How is this possible ? x'80' is control character in LATIN to what i understand.

In Teradata Document i see under International Character Set Support, Chapter 1 it reads "Teradata Database internal coding for the entire set of printable characters from the ISO 8859-1 (Latin1) and ISO 8859-15 (Latin9) standard"

Also, we know Latin 1 & Latin 15 do not have x'80' as "non-breaking space". 

 

Please clarify.

 

PS - Yes, i am using CHAR2HEXINT function to check hex value in TD.

My source is UTF8 & i am running at terminal an mload script -> mload -c ASCII< script.txt

My target is UNICODE

Behavoiur is that every x'80' is convereted to x'A0'. 

 

Thank You,

Ranjan