BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

My question is about the output of a UNICODE column in BTEQ.

If I select a column defined like this...

   txt VARCHAR(11) CHARACTER SET UNICODE CASESPECIFIC,

... it is not 11 characters wide, but 33 characters wide (3 times the defined number of characters).

The BTEQ session character set is UTF8: .set session charset 'UTF8';

I've already tried some CAST and FORMAT expressions. For example, if I use cast(substr(txt,1,4) as char(4) format 'X(4)') c, the column requires 12 character positions in the BTEQ output (4 characters multiplied by 3).
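For reference, the full statement looks like this (a minimal sketch; uc_test is the test table whose DDL I post further down in this thread):

   SELECT CAST(SUBSTR(txt, 1, 4) AS CHAR(4) FORMAT 'X(4)') AS c
   FROM   uc_test;

The column c still occupies 12 positions in the BTEQ output.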

If I use an ASCII session character set, the column is 11 characters wide, as expected.

In the Teradata documentation, and in an Orange Book about "International Character Sets", I read that when a Unicode column is exported in the UTF8 character set, it always takes 3 times the length. (Why?)

"My" Unicode column contains words in different European, especially Eastern European, languages, which do not require the space of 3 'characters', but I have to use a UTF8 session to display those Eastern European characters correctly (an ASCII session character set does not display them correctly).

So, at the moment, I see no possibility to format such a Unicode column to a desired length?!

Does anyone have any ideas to help me? Thanks in advance.

11 REPLIES

Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

Don't confuse the number of characters with the length in bytes needed to store those characters:

in Latin the ratio is 1:1,

in UTF8 it is 1:3.

So what is your problem with BTEQ?

Can you share the code?

Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

I agree that the length in bytes to STORE a character in UTF8 can be up to 3. But I don't understand why 3 character positions are needed to DISPLAY it in BTEQ...

So the column 'txt' in the following example can store 11 characters, and the space required to store it is up to 33 bytes. But why is the space needed to display it in a BTEQ report 33 characters?

Here is the Example:

In SQL Assistant (but that should make no difference; it could also be BTEQ):

DATABASE testdb;

CREATE SET TABLE uc_test, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT
     (
      id INTEGER,
      txt VARCHAR(11) CHARACTER SET UNICODE CASESPECIFIC
     )
UNIQUE PRIMARY INDEX PI_uc_test (id);

INSERT INTO uc_test VALUES (1, 'Zeile1');

$ bteq

 Teradata BTEQ 12.00.00.09 for LINUX.

 Copyright 1984-2010, NCR Corporation. ALL RIGHTS RESERVED.

 Enter your logon or BTEQ command:

.set session transaction ansi;

.set session transaction ansi;

 Teradata BTEQ 12.00.00.09 for LINUX. Enter your logon or BTEQ command:

.set session charset 'UTF8';

.set session charset 'UTF8';

 Teradata BTEQ 12.00.00.09 for LINUX. Enter your logon or BTEQ command:

.logon TDUSER

.logon TDUSER

Password:

 *** Logon successfully completed.

 *** Teradata Database Release is 12.00.03.03

 *** Teradata Database Version is 12.00.03.25a

 *** Transaction Semantics are ANSI.

 *** Character Set Name is 'UTF8'.

 *** Total elapsed time was 1 second.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:

database testdb;

database testdb;

 *** New default database accepted.

 *** Total elapsed time was 1 second.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:

commit work;

commit work;

 *** COMMIT done.

 *** Total elapsed time was 1 second.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:

select id,txt from uc_test;

select id,txt from uc_test;

 *** Query completed. One row found. 2 columns returned.

 *** Total elapsed time was 1 second.

         id  txt

-----------  ---------------------------------

          1  Zeile1

--> In this output, 33 character positions (number of characters, not length in bytes) are used for the field 'txt'. I would like to have 11...


Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

I checked it with some real Unicode values (and used only char(2) in the table definition).

select * from uc_test;


 *** Query completed. 4 rows found. 2 columns returned. 

 *** Total elapsed time was 1 second.

         id  txt

-----------  ------

          1  ZZ

          4  शक

          3  ვე

          2  ₪Z

It is as you stated: 3 * the number of characters seems to be reserved, but is not needed here.

Using TITLE also doesn't help: the width seems to be determined by the maximum of the title length and the calculated column length.

Maybe - but this is just a guess - this is related to the fact that BTEQ can also be used for exporting data, where the byte length becomes an issue.
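To double-check that each of these values really is only 2 characters, one can compare character and byte counts (a quick sketch; note that OCTET_LENGTH reports the internal UTF16 server storage, i.e. 2 bytes per character, not the UTF8 export width):

   SELECT id,
          txt,
          CHARACTER_LENGTH(txt) AS chars,
          OCTET_LENGTH(txt)     AS bytes
   FROM   uc_test;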

Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

... I see; that might be the reason...

The problem occurred when we changed a text column from Latin to Unicode character set in the DDL. Afterwards, the export file of a BTEQ report, which contains several columns of fixed character length, didn't fit on a paper page anymore, because the text columns were 3 times wider than before.

I understand that if the Unicode column contains letters which require 3 bytes in UTF8 encoding, then the export of such a letter requires 3 bytes in an export produced by BTEQ in UTF8 encoding. To have a fixed column length for all possible cases, BTEQ uses the maximum width, which is 3 times the number of characters.

What I am searching for is a way to specify the character length in the BTEQ output, but I am afraid that there is no way to do that (using SQL/BTEQ)?!


Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

Not sure that I understand your last question - you can check dbc.columns:

ColumnLength should give you the number of bytes needed, and CharType gives you the character type.

TD internally stores the UTF data as UTF16, which means characters * 2 = ColumnLength.

So for CharType = 2 you need to compute ColumnLength / 2 * 3 to get the width in BTEQ for UTF8.

Is this what you were looking for?
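As a sketch, the whole calculation can be put into one query against dbc.columns (CharType 2 = Unicode; ColumnLength holds the internal UTF16 byte length):

   SELECT ColumnName,
          ColumnLength,                       -- internal UTF16 bytes
          ColumnLength / 2     AS chars,      -- characters as defined in the DDL
          ColumnLength / 2 * 3 AS utf8_width  -- positions BTEQ reserves in UTF8 mode
   FROM   DBC.Columns
   WHERE  DatabaseName = 'testdb'
   AND    TableName    = 'uc_test'
   AND    CharType     = 2;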

Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

I agree with your explanation; the maximum length can be computed like that using dbc.columns. But I am looking for a way to specify that the BTEQ output of a UNICODE column does not require 3 times the number of characters defined in the DDL - e.g. using FORMAT, CAST, another charset, <other ideas>...?

And after all, I still don't see why the export of a UNICODE column in the UTF8 character set is implemented in BTEQ the way it is. Why wouldn't it be correct to fill missing characters (if it's VARCHAR) with spaces up to the maximum number of characters, rather than the maximum number of bytes used to encode the characters in UTF8? (In your example, the char(2) field always contains 2 characters; they are exported correctly, and the Asian characters probably use 2 or 3 bytes; in a browser/editor that can display these UNICODE characters, each of them requires the space of one character; so why the 4 extra spaces?)


Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

Can't say why it is the way it is - maybe someone from TD can add some info.

My personal bet would be that you won't find a solution in BTEQ as long as you have to stick with UTF8.

Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

Thanks so far! If someone else can add some info, feel free to do so...

Re: BTEQ Output: Unicode Column wider than defined in DDL or FORMAT

It is a pretty common yet often overlooked standard: when exporting/importing database columns to/from files (binary or flat) in UTF8 mode, a VARCHAR(**) column in TD (or VARCHAR(**) in Oracle) is always calculated as ** X 3 bytes, regardless of whether the DDL uses Latin or Unicode (Oracle DDL uses BYTE or CHAR). The same rule is used in OCI and SQL*Loader as well.

The funny thing is, if you switch from UTF8 mode to UTF16 mode in BTEQ/FEXP/TPT, the estimated column size will be ** X 2 bytes, which is smaller than ** X 3 bytes. This sounds unbelievable, but folks should check it out.
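One quick way to check it with the test table from earlier in this thread (a sketch; the expected width follows from the ** X 2 rule just described):

   .set session charset 'UTF16';
   .logon TDUSER
   select id, txt from uc_test;

The txt column should then occupy 22 positions instead of 33.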

I bet folks have all run into the following error in BTEQ/ODBC/JDBC/TPT at some point: the rows on disk are definitely smaller than 64KB, but you just can't retrieve them without chopping off several columns! Chances are that if you change your session charset from UTF8 to UTF16, you may avoid this annoying error in some cases, because the estimate for VARCHAR/CHAR columns in UTF16 mode is 33% smaller than in UTF8 mode:

[Error 3577] [SQLState HY000] Row size or Sort Key size overflow.

The reality is that we need Unicode support for multiple languages, but only NAME, ADDRESS, and DESCRIPTION types of columns need Unicode storage; the rest of the VARCHAR/CHAR columns, such as CODE or ID, are happy with Latin. A feature request for the Teradata team: when estimating the row size for client/server communication parcels (BTEQ, FEXP, TPT, ODBC, JDBC), it would be great

  • in UTF8 mode: to use ** X 1 for Latin, to use ** X 3 for Unicode
  • in UTF16 mode: to use ** X 1 for Latin, to use ** X 2 for Unicode
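To make the current behavior concrete, here is a hypothetical helper query (the CASE logic assumes ColumnLength = characters for Latin and 2 bytes per character for Unicode, as discussed above); it sums the character counts of a table's CHAR/VARCHAR columns and applies the flat X 3 (UTF8) and X 2 (UTF16) estimates:

   SELECT SUM(CASE WHEN CharType = 2 THEN ColumnLength / 2 ELSE ColumnLength END)     AS total_chars,
          SUM(CASE WHEN CharType = 2 THEN ColumnLength / 2 ELSE ColumnLength END) * 3 AS utf8_estimate,
          SUM(CASE WHEN CharType = 2 THEN ColumnLength / 2 ELSE ColumnLength END) * 2 AS utf16_estimate
   FROM   DBC.Columns
   WHERE  DatabaseName = 'testdb'
   AND    TableName    = 'uc_test'
   AND    CharType IN (1, 2);

If utf8_estimate pushes past the 64KB response-row limit while utf16_estimate stays below it, switching the session charset may be enough to avoid error 3577.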