UTF-8 as Character Set in SQL Assistant

Teradata Applications

UTF-8 as Character Set in SQL Assistant

I'm using UTF-8 as the character set when connecting with SQL Assistant but it seems to have trouble exporting the en dash. I get a SPA character or double blank instead. It works if I switch to ASCII which doesn't make much sense because I thought ASCII was a subset of UTF-8.

 

Anyone else experience this or have a solution?

 

Thanks.

3 REPLIES
Teradata Employee

Re: UTF-8 as Character Set in SQL Assistant

If Windows-1252 single-byte characters were (incorrectly) loaded by a session that specified or defaulted to ASCII client character set, then the Windows "en dash" would be stored essentially unchanged as x'96' in a LATIN column or U+0096 in a UNICODE column, versus the "correct" U+2013 equivalent (which could not be stored in a TD LATIN column). If you then query as ASCII, again the single-byte characters are essentially unchanged and you get back x'96'; so everything appears on the surface to be correct.

 

But if you ask for translation to UTF-8, the internal value is interpreted as the "Start of Guarded Area" control character. It's also interpreted as this control character internally in the collation sequence (for sorting and ordering).

Re: UTF-8 as Character Set in SQL Assistant

Is there any workaround? Is there a character set I should be using as the default when connecting? What is the benefit to UTF-8 over ASCII? I'm really not very clear on what the session character set does compared to setting columns in tables as LATIN vs UNICODE. If anyone could explain it or point me in the right direction for documentation that would be awesome. Thanks.

Teradata Employee

Re: UTF-8 as Character Set in SQL Assistant

If non-ASCII data is incorrectly loaded as ASCII to a LATIN column, a potential workaround is to query it as ASCII. The combination of ASCII / LATIN involves no translation, so the display will generally appear "correct" as long as the user interface uses the same character set as the source. A better solution would be to load and store the data correctly.

 

In the general case, the Session Character Set should accurately describe the data being sent or received by the client, and the database will attempt to translate that to the internal storage format (Teradata-specific LATIN or Teradata UNICODE subset) indicated by the column attribute.

 

On the other hand, many client applications or APIs already deal with Unicode characters internally and handle the translation to and from whatever "locale" is specified for the user interface. So we can use standard UTF8/UTF16 encodings for transmission rather than having the database do character set translation, and this has become the preferred approach in most cases (not just when working with Teradata). 

 

There is an "International Character Set Support" manual that explains how this works in more detail.