Teradata Server Side Character Data Translation

Database
Enthusiast

Teradata Server Side Character Data Translation

Hi all,

Apologies in advance for the long post.  Also, here are some technical details probably necessary to know in relation to my question.

Teradata Server: Teradata 14.10.07.08


SQL Assistant: 15.10.0.2


Connection Mode: Teradata.NET (So I can change the session mode to UTF8, I can’t in ODBC as I don’t have administrator access on my workstation)


Client Side Character Set: UTF8 (I think – how can I confirm this?  I am going by the HELP SESSION and the Character Set column, which changes each time I adjust the settings in Teradata.NET)


Server Side Character Set: Latin (I think – how can I confirm this?  I am going by HELP SESSION and the Default Character Type column).
 
I have recently ran into some character set translation issues when using a Java UDF that I wrote.  Essentially whenever I pass a string containing non-Latin characters (i.e. Unicode) to the function, they are replaced with ‘?’.  Whenever the function passes back non-Latin characters, they are also replaced with ‘?’.  The function works with UTF8 fine outside of Teradata, it has been successfully used with Eclipse and IntelliJ in Windows and Linux, and in the Windows and Linux command line.  So I started researching and I found a Teradata document called “International Character Set Support”, which I downloaded and had a quick read through.  Some things that jumped out at me that appear related to my issue are:

Page 12

Client systems communicate with the Teradata Database using their own external format for numbers and character strings.
The Teradata Database converts numbers and strings to its own internal format when data is imported into it, and converts numbers and strings back to the appropriate form for the client when data is exported.

Page 13

Character Data Translation
The Teradata Database translates characters:
• Received from a client system into a form suitable for storage and processing on the server.
• Returned to a client into a form suitable for storage, display, printing, and processing on that client.
Thus the server translates data from client form to server form and from server form to client form.

No matter which server character set is chosen, communication with the client is always in the client character set (also known as the session charset).

Page 19

A client character set defines how the Teradata Database translates character data from client form to server form and from server form to client form. Specifically, a client character set consists of a set of translation codes that map each character in the client character set to an equivalent character in a server character set.

Page 22

During data manipulation, the Teradata Database translates characters:
• Received from a client system into a form suitable for storage and processing on the server (external-to-internal, or E2I).
• Returned to a client into a form suitable for storage, display, printing, and processing on that client (internal-to-external, or I2E).
Thus the server translates data from client form to server form and from server form to client form.

Page 136

IF the server character set is … LATIN
THEN the DBC.Translation table defines …  a translation between the character set and LATIN. All characters are accounted for using the table for conversion.

IF the server character set is … UNICODE
THEN the DBC.Translation table defines … a translation between the character set and UNICODE.  All characters are accounted for using the table for conversion.

Page 142

Conversion exception
A conversion exception occurs when there is no character in the target repertoire that corresponds to a character from the source
string.
• “E2I Conversion Exceptions” on page 143
• “I2E Conversion Exceptions” on page 143

It seems like the problem is during the E2I (external-to-internal) or I2E (internal-to-external) process, the Teradata server is performing character translation but failing because the Unicode symbols aren’t present in Latin.  For example, with my client session in UTF8 I can write my name in Korean and I can see it in SQL Assistant and I can select it as a string literal, but when I send it to my Java UDF it gets translated into ‘?’s.   So I would like to confirm if Teradata server translations sounds like a reason this is happening?  Also, what would be the solution to this problem?  Would it be changing the server character set to Unicode, and if so, how easy it that to do?  Thanks in advance.

TL;DR - I think Teradata server side character set translation is breaking my Java UDF when handling Unicode, as the server is in Latin mode.

Dallas.

Tags (1)

Accepted Solutions
Apprentice

Re: Teradata Server Side Character Data Translation

Hi Dallas,

 

I'll be upfront here and say that I don't have all the answers for you, but I think I can point you in the right direction.

 

Firstly, I think some of your thoughts below are not correct - although I can see how you get there from reading the manuals!

 

Server side character set: your comment about 'how to check this?' is spot on and holds part of the answer. This is a default for character columns that you define in tables whilst logged on.

 

The important things to know for your data are:

Client Side Character Set

This controls how Teradata expects to receive character data from the client and how it sends character data back out to the client.

Character Column definition (specifically the CHARACTER SET attribute)

This controls how the data is stored on disk and may therefore require translation of data at various points.

 

So using SQLA through NET you've set character set to UTF8. The dbms is expecting to receive character data in the UTF8 character set.

 

Assume that you send data from SQLA to the dbms and store it in a character column defined as UNICODE. The dbms will translate from UTF8 to UTF16 ('UNICODE' inside the dbms is UTF16) and store the data. Each character will require 2 bytes of storage.

 

If that data is then retrieved in your UTF8 session, the dbms will read it from the column, convert to UTF8 (the session character set) and send it back to you.

 

If the same data is retrieved by a session using the ASCII character set, the dbms will translate the data from UNICODE to ASCII and send it back (the data may or may not make sense).

 

What I'm not sure of is how this links into your Java UDF processing.

 

Your UDF will be defined to with one or more input parameters which are character.

 

Are those character parameters defined as Unicode or Latin?

What is your actual Java code expecting?

     (I suspect that these have to match up.)

 

I would expect that if the UDF parameter is defined as Unicode the the dbms will translate it to Unicode after retrieving it from a table column - however that table column is defined.

 

How does that lot sound?

 

Cheers,

Dave

 

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
1 ACCEPTED SOLUTION
3 REPLIES
Apprentice

Re: Teradata Server Side Character Data Translation

Hi Dallas,

 

I'll be upfront here and say that I don't have all the answers for you, but I think I can point you in the right direction.

 

Firstly, I think some of your thoughts below are not correct - although I can see how you get there from reading the manuals!

 

Server side character set: your comment about 'how to check this?' is spot on and holds part of the answer. This is a default for character columns that you define in tables whilst logged on.

 

The important things to know for your data are:

Client Side Character Set

This controls how Teradata expects to receive character data from the client and how it sends character data back out to the client.

Character Column definition (specifically the CHARACTER SET attribute)

This controls how the data is stored on disk and may therefore require translation of data at various points.

 

So using SQLA through NET you've set character set to UTF8. The dbms is expecting to receive character data in the UTF8 character set.

 

Assume that you send data from SQLA to the dbms and store it in a character column defined as UNICODE. The dbms will translate from UTF8 to UTF16 ('UNICODE' inside the dbms is UTF16) and store the data. Each character will require 2 bytes of storage.

 

If that data is then retrieved in your UTF8 session, the dbms will read it from the column, convert to UTF8 (the session character set) and send it back to you.

 

If the same data is retrieved by a session using the ASCII character set, the dbms will translate the data from UNICODE to ASCII and send it back (the data may or may not make sense).

 

What I'm not sure of is how this links into your Java UDF processing.

 

Your UDF will be defined to with one or more input parameters which are character.

 

Are those character parameters defined as Unicode or Latin?

What is your actual Java code expecting?

     (I suspect that these have to match up.)

 

I would expect that if the UDF parameter is defined as Unicode the the dbms will translate it to Unicode after retrieving it from a table column - however that table column is defined.

 

How does that lot sound?

 

Cheers,

Dave

 

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Enthusiast

Re: Teradata Server Side Character Data Translation

Hey Dave,

 

Firstly thanks for the quick reply and the explainations.  Those explainations certainly clear up for me how Teradata works with data on the client and server side. So I have changed my SQLA session character set to UTF16 in NET, as from what I have been reading Java strings are UTF16 also.  I had a look at the Java UDF definition and noticed a couple of issues which may or may not be related to my problem.

 

REPLACE FUNCTION SYSLIB.AES128_CBC_ENCRYPT 
  (plaintext VARCHAR(10000) CHARACTER SET UNICODE, --TO REMAIN UNICODE, AS IT WILL ACCEPT UNICODE (UTF16)
   secretkey CHAR(16) CHARACTER SET LATIN, --BEING CONVERTED TO UTF8 IN JAVA, WHICH IS NECESSARY FOR CREATING A SECRET KEY
   iv CHAR(16) CHARACTER SET LATIN) --BEING CONVERTED TO UTF8 IN JAVA, WHICH IS NECESSARY FOR CREATING AN IV
 RETURNS VARCHAR(14000) CHARACTER SET UNICODE --SHOULD RETURN LATIN, AS IT IS IN BASE64 WHICH ARE ALL LATIN CHARS
 SPECIFIC AES128_CBC_Encrypt 
 LANGUAGE JAVA 
 NO SQL
 NO EXTERNAL DATA
 PARAMETER STYLE JAVA 
 NOT DETERMINISTIC 
 RETURNS NULL ON NULL INPUT 
 EXTERNAL NAME 'AES_CBC_CRYPT:AES128_CBC_Java16.encrypt(java.lang.String,java.lang.String,java.lang.String) returns java.lang.String'

REPLACE FUNCTION SYSLIB.AES128_CBC_DECRYPT 
  (plaintext VARCHAR(14000) CHARACTER SET UNICODE, --SHOULD RETURN LATIN, AS IT IS IN BASE64 WHICH ARE ALL LATIN CHARS
   secretkey CHAR(16) CHARACTER SET LATIN, --BEING CONVERTED TO UTF8 IN JAVA, WHICH IS NECESSARY FOR CREATING A SECRET KEY
   iv CHAR(16) CHARACTER SET LATIN) ---BEING CONVERTED TO UTF8 IN JAVA, WHICH IS NECESSARY FOR CREATING AN IV
 RETURNS VARCHAR(10000) CHARACTER SET UNICODE --TO REMAIN UNICODE, AS IT WILL RETURN UNICODE (UTF16)
 SPECIFIC AES128_CBC_Decrypt 
 LANGUAGE JAVA 
 NO SQL
 NO EXTERNAL DATA
 PARAMETER STYLE JAVA 
 NOT DETERMINISTIC 
 RETURNS NULL ON NULL INPUT 
 EXTERNAL NAME 'AES_CBC_CRYPT:AES128_CBC_Java16.decrypt(java.lang.String,java.lang.String,java.lang.String) returns java.lang.String'

So I will just wait for the DBA to make the changes to the UDF and do some further testing.  Will report back once I have.  Thanks again.

 

Dallas.

Enthusiast

Re: Teradata Server Side Character Data Translation

Hi Dave,

 

Thanks again for the help and the information, in the end you were right it was the way that the UDF was defined which was the problem.  Once the function definition was changed to Unicode it was passing and returning the correct data.

 

Dallas.