Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

Tools
Tools covers the tools and utilities you use to work with Teradata and its supporting ecosystem. You'll find information on everything from the Teradata Eclipse plug-in to load/extract tools.
Teradata Employee

Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

We have made great strides in improving our handling of delimited data (i.e. CSV data) in Teradata Parallel Transporter for the TTU14.00 release. This article will describe the background of the data format, the original support, and the enhancements we have made.

Background and Theory

Comma-Separated Values

Comma-separated values (CSV) refers to a platform-independent data transport format, consisting of data values (expressed as character sequences), separated by commas. Of course, using comma as the value separator causes problems if the data values themselves include commas (for example, a single value containing both city and state, or a last-comma-first full name).

Delimiter-Separated Values

To avoid the problems of commas in data values, a delimiter-separated values (DSV) format was introduced. The DSV format allows the use of any separator, so that conflicts between data values and the delimiter can generally be avoided.

Although the delimiter is typically a single character, there is no technical reason that it cannot be a multi-character sequence.

Enclosing Data Values

Because there are cases where the nature of the data precludes certain knowledge as to what characters may or may not occur in the data values, quoted DSVs were introduced. Using quoted DSVs, the delimiter can occur within the data. The only thing that cannot occur within the data is the character used to enclose the data values.

General Rules

For the most generality, an application should accept the widest variety of DSV input, especially since there is no single widely accepted standard for CSV/DSV.

On the other hand, applications that emit CSV/DSV should be careful to limit their output so that it is acceptable to less forgiving receiving applications.

Teradata Implementation

Teradata’s Original Implementation

Teradata’s original implementation of CSV/DSV, referred to variously as VARTEXT or Delimited Data, had the following restrictions:

  • The delimiter was limited to one single-byte character (default: |, the vertical bar/pipe).
  • There was no support for quoted data values.
  • Empty values (indicated, in the case of the first field, by a delimiter at the start of the line; in the case of all fields other than the first and last fields, by adjacent delimiters; or, in the case of the last field, by a delimiter immediately followed by end-of-line) were passed to the DBS as NULL.

The result of parsing each input line was a series of VARCHAR() fields, each holding the value of the corresponding input data value.

MBCS Support

Subsequently, support for the delimiter character was expanded to allow a single multi-byte character.

TPT Enhancements For TTU 14.00

The biggest and most important aspect of the delimited data enhancement in TPT14.00 is the support of quoted delimited data. This is very important because more and more customers are moving data from non-Teradata databases into Teradata. And the export tools being used to extract data from those non-Teradata databases often write out the data to flat files in delimited format, where one or more fields are enclosed in quotes.

New DataConnector Operator Attribute

In order to enable this new feature, we have introduced a new attribute to the TPT DataConnector operator:

QuotedData

  • No (default, quoted data is not supported)
    • current rules apply
    • quotes are considered a part of the data
  • Yes (all fields must be quoted)
    • quotes are not considered to be part of the data
  • Optional (fields may be a mixture of quoted and unquoted)

Enclosing Data Values

Although the term “quoted” may seem to imply that either single quotes (apostrophes) or double quotes (quotation marks) are used to enclose the values, which is not the case:

  • The enclosing characters need not be quotes.
  • The enclosing characters need not be single characters (that is, a multi-character sequence can be used)
  • The open quote and close quote need not be the same; they can be distinct.

Note: for purposes of this article, “close quote” includes both a distinct close quote, and a common open and close quote.

Rules For Quotes

The following rules apply to both the open and close quote:

  • If the open quote and close quote are distinct from each other, neither can be a substring of the other.
  • Neither open quote nor close quote can be a substring of the delimiter.
  • The delimiter cannot be a substring of either open quote or close quote.
  • The backslash character (\) cannot occur in either the open quote or close quote.

Rules For Parsing The Input Line

Parsing the input line is relatively straightforward:

  • If values are unquoted, scan for the delimiter or end-of-line, since those are the only significant characters.
  • If values are always quoted, the following rules apply:
    • At the start of the input line, or after a delimiter, an open quote must be present (otherwise, it’s a malformed input line).
    • Following an open quote, all characters become part of the data value, with the following exceptions:
      • A doubled close quote causes one close quote to become part of the data value.
      • A backslash-escaped close quote causes the escape backslash to be discarded and the close quote to become part of the data value.
      • A backslash-escaped backslash causes the escape backslash to be discarded and the second backslash to become part of the data value.
      • An undoubled, unescaped close quote terminates the data value, and must be immediately followed by a delimiter or end-of-line.
  • If values are optionally quoted, the following rules apply:
    • At the start of the input line, or after a delimiter, if an open quote is present, the value is quoted and the rules above for always-quoted values apply.
    • Otherwise, the value is unquoted, and the rules above for unquoted values apply.

Examples of Quoted Data

Some typical five-field input lines with quoted values and using the default open/close quote might look like:

"abc"|"def"|"g|i"|"jkl"|"mno"|

"123"|"456"|"|||"|"pqr"|"xyz"

A typical five-field input line with quoted values using “sexed” quotes (“ and ”, for open and close quote, respectively) might look like:

“Smith”,“Jane”,“Dec 25, 1980”,“F”,“PhD”

A typical five-field input line with quoted values using distinct open (<#) and close (#>) quotes and comma as the delimiter might look like:

<#Smith#>,<#Jane#>,<#Dec 25, 1980#>,<#F#>,<#PhD#>

The previous examples, changed to only quote values when necessary (i.e., only when the value contains the delimiter) might look like:

abc|def|"g|i"|jkl|mno|

123|456|"|||"|pqr|xyz

Smith,Jane,“Dec 25, 1980”,F,PhD

Smith,Jane,<#Dec 25, 1980#>,F,PhD

Empty values

An unquoted value is empty:

  • When a delimiter occurs at the start of the line, then the first field is empty.
  • When two delimiters are adjacent to each other, then the field corresponding to that relative position is empty.
  • When the delimiter following the last value is omitted, and there are no characters between the preceding delimiter and end-of-line, then the last field is empty.

In each of these input lines, the first, third, and fifth fields are empty:

|abc||xyz||

|123||456|

A quoted value is empty:

  • When the open quote is immediately followed by the close quote.

In this input line, the second and fourth fields are empty:

"abc"|""|"ghi"|""|"mno"

The user can specify how empty data values are to be handled:

  • They can be assigned NULL (the default, for backward compatibility).
  • They can be set to zero-length VARCHARs.

Escapes

When dealing with quoted values, it may be necessary to include the close quote as part of the data value. Two escape mechanisms are provided for this:

  • Doubling:
    • If the open quote and close quote are the same, and a quote is needed as part of the data value, the quote is repeated (doubled) at the location where a quote is needed in the value. A single occurrence of the quote is included in the data value for each doubled quote.
    • If the open quote and close quote are distinct, and a close quote is needed as part of the data value, the close quote is repeated (doubled) at the location where a close quote is needed in the value. A single occurrence of the close quote is included in the data value for each doubled close quote. Note that no doubling is necessary to include the distinct open quote in the value.
  • Backslash escape:
    • If the open quote and close quote are the same, and a quote is needed as part of the data value, the quote is preceded by backslash (\) at the location where a quote is needed in the value. Only the quote becomes part of the value; the backslash escape is discarded.
    • If the open quote and close quote are distinct, and a close quote is needed as part of the data value, the close quote is preceded by backslash (\) at the location where a close quote is needed in the value. Only the close quote becomes part of the value; the backslash escape is discarded. Note that no backslash escape is necessary to include the distinct open quote in the value.
    • If a backslash is needed as part of the data value, it is doubled. That is, \ as part of the quoted input data value becomes a single backslash in the resultant value.

Neither escape mechanism has any meaning when a data value is unquoted. For quoted data values, the user may use either or both escape mechanisms (note that, regardless of the escape mechanism used, backslashes must be doubled).

Some examples:

"ab\"c"|"\"def"|"ghi\""|

results in three fields with values ab"c, "def, and ghi".

 

"ab""c"|"""def"|"ghi"""|

results in three fields with values ab"c, "def, and ghi".

-- SteveF
24 REPLIES
Enthusiast

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

Hi, we recently get the tpt14.0 installed and testing out the features of quoted delimited file. However, I am having trouble to deal with "escape" feature by "doubling".
The data looks like "good ""job"|20333
the tpt always throw error
Delimiter did not immediately follow close quote mark in row xxxxxxx, col x

I am trying to figure out what flags needs to be changed in the data connector . So far, no success. Can you please elaborate how can I enable the "doubling" escape mechanism for data connector?
Here is the data connector's definition

DEFINE OPERATOR FILE_READER
TYPE DATACONNECTOR PRODUCER
SCHEMA W_0_s_OTHER_PAYER_1
ATTRIBUTES
(
VARCHAR FileName = 'db2_export.del',
VARCHAR Format = 'DELIMITED',
VARCHAR QuotedData = 'Optional',
VARCHAR TextDelimiter = '|',
VARCHAR OpenQuoteMark = '"',
VARCHAR CloseQuoteMark = '"',
VARCHAR OpenMode = 'Read',
VARCHAR TrimColumns = 'Both',
VARCHAR NullColumns = 'N',
VARCHAR RowErrFileName = 'db2export.del.err'
);
Teradata Employee

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

I wrote the article prior to the feature being implemented and I I believe it is possible that the "doubling" did not get fully functional.

Thus, you will need to "escape" the quote character.

You would need to do:

"good\"job"|20333

assuming you have set up the escape character:

VARCHAR EscapeQuoteDelimiter = '\'

-- SteveF
Enthusiast

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

Thanks for the prompt reply... unfortunately, the source data is coming out of DB2 extracts, during which the db2 export facility default "doubling" the delimiter. the utility does not have capability to specify escape from the source side. Do you mind help to check whether the doubling feature is / will in place ? I will think about other ways to export the data out of DB2 (a lot of free text field which can contains anything... choosing a single char delimiter is quite a challenge here).
Enthusiast

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

Hi,
I had similar issue,
Please consider the following data,
"Good""Job"|"Bad""Job",
The data that I wanted to load was,
Good""Job|Bad""Job,
Initially I tried with the same method as you used and it did give me same error,but later I tried it with following set of attributes and it did work,
ATTRIBUTES
(
FILENAME='/home/vw186001/data/data.dat',
Format = 'DELIMITED',
OpenMode = 'Read',
IndicatorMode = 'N',
PrivateLogName = 'Read',
AcceptExcessColumns = 'Y',
RowErrFileName = 'vw186001.jap_test.err',
TextDelimiter = '|',
TrimChar='"',
TrimColumns='Both'
)

Regards,
Vinay
Teradata Employee

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

The support of embedded characters that match the closing quote mark was not implemented in the initial "quoted VARTEXT feature. We are working on that in 14.10. However, when you do this:

"Good""Job"

you will get this:

Good"Job

(if you want to preserve a character that matches the closing quote mark, you have to double it).

This means if you want:

Good""Job

then your data will need to be:

"Good""""Job"

But again, this will not be available until 14.10.
-- SteveF
mzs
Enthusiast

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

Hello,

I found this article while searching for the way to specify how TPT would handle zero-length strings.  It says

"The user can specify how empty data values are to be handled:

  • They can be assigned NULL (the default, for backward compatibility).
  • They can be set to zero-length VARCHARs."

I did not find any parameter that would modify TPT behavior - I only get NULLs if I try to load zero-length string.  Please help

Thank you

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

Hi,

I ahve a similer issue.

"Good""Job"|"Bad""Job",

The data that I wanted to load was,

Good""Job|Bad""Job

The same i want to implement in version 13.1 what attribute i should use.

I am using  VARCHAR TrimChar = '"'

                    VARCHAR TrimColumns = 'Both'

but TPT is throwing error on this "line 82: syntax error at "VARCHAR" missing RPAREN_ in Rule: Attribute List Definition

TPT_INFRA: TPT02932: Error: TPT_INFRA: TPT02934: Error: invalid token

"

Teradata Employee

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

I cannot assist on any syntax errors unless you include the entire script.

-- SteveF
Enthusiast

Re: Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00

Hi Steve,

We have data as below.

12345|1234567|"abcdef" Lane|1234|xnybkhhh|2015-04-25 10:00:00

Data is pipe delimited(in tpt job it is actually hex-10) only. Some of the character fileds contains double quotes like mentioned above and they are part of data. We are not enclosing column data between quotes, just pipe delimited. We are using TTU14.10. When I am trying to load this data, it is failing with below error. I tested the same script/data in different environment where we have TTU13.10, it is loading without any issues. Can you please assist on any specific properties to be set at script level to fix this?

DATA_CONNECTOR_PRODUCER: TPT19134 !ERROR! Fatal data error processing file '/inbound/input.txt. Delimited Data Parsing error: Delimiter did not immediately follow close quote mark in row 5937, col 5.

Below is the TPT script.

USING CHARACTER SET UTF8

DEFINE JOB TABLE_TPT

DESCRIPTION 'LOAD DATA FROM FILE TO TABLE'

(

INCLUDE @SCHEM

/*****************************/

/*****************************/

DEFINE OPERATOR DATA_CONNECTOR_PRODUCER

DESCRIPTION 'TERADATA PARALLEL TRANSPORTER DATA CONNECTOR OPERATOR'

TYPE DATACONNECTOR PRODUCER

SCHEMA PP_DB

ATTRIBUTES

(

VARCHAR PrivateLogName  = @PRIVLOG,

VARCHAR DirectoryPath   = @INDIRPATH,

VARCHAR FileName        = @FILENAME,

VARCHAR Format            = 'DELIMITED',

VARCHAR TextDelimiterHex  = '10',

VARCHAR IndicatorMode     = 'N',

VARCHAR OpenMode          = 'Read',

VARCHAR ValidUTF8 = 'UTFX',

VARCHAR ReplacementUTF8Char = '?'

);

DEFINE OPERATOR LOAD_OPERATOR

DESCRIPTION 'TERADATA PARALLEL TRANSPORTER INSERTER OPERATOR'

TYPE LOAD

SCHEMA *

ATTRIBUTES

(

VARCHAR PrivateLogName    = @PRIVLOG,

VARCHAR TargetTable       = @TGTTABLE,

VARCHAR LogTable          = @LOGTABLE,

VARCHAR ErrorTable        = @ERRTABLE,

VARCHAR ErrorTable1       = @ERRTABLE1,

VARCHAR ErrorTable2       = @ERRTABLE2,

VARCHAR DropMacro         = 'Yes',

INTEGER MaxSessions       = 120 ,

INTEGER MinSessions       = 1 ,

VARCHAR TdpId             = @TDPID,

VARCHAR UserName          = @USRNAME,

VARCHAR UserPassword      = @PWD,

INTEGER MaxDecimalDigits  = 38

);

DEFINE OPERATOR DDL_OPERATOR()

DESCRIPTION 'FOR DDL OPERATOR'

TYPE DDL

ATTRIBUTES

(

VARCHAR ARRAY ErrorList   = ['3706','3803','3807'],

VARCHAR DateForm          = 'IntegerDate' ,

VARCHAR TdpId             = @TDPID,

VARCHAR UserName          = @USRNAME,

VARCHAR WorkingDatabase   = @WORKINGDB,

VARCHAR UserPassword      = @PWD

);

STEP ddl_delete_operations

(

APPLY

('DELETE FROM '|| @TGTTABLE || ' ALL;'),

('DROP TABLE '|| @ERRTABLE1 || ';'),

('DROP TABLE '|| @ERRTABLE2 || ';'),

('DROP TABLE '|| @LOGTABLE || ';')

TO OPERATOR ( DDL_OPERATOR() );

);

STEP delimited_file_to_tera

(

APPLY ('INSERT into '||@INSERT)

TO OPERATOR (LOAD_OPERATOR[@inst])

SELECT * FROM OPERATOR (DATA_CONNECTOR_PRODUCER[@inst]);

);

);