TPT - source data enclosed with double quotes and delimited by pipe (|)

Tools & Utilities
Teradata Employee

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

1. TPT is the load/unload product that we would like everyone to use.

2. The table can have any data types, as long as a conversion from VARCHAR to that data type is supported by Teradata. There is no limitation that the column data types of a table being loaded with delimited data have to be all VARCHAR. (The schema in the script must be all VARCHAR, but that is not the same thing.)

3. We prefer that you not use FastLoad; please use TPT instead. FastLoad is a capped utility and no new features are being added to it.

4. I am not familiar with the Express version of Teradata on VMware. If the installation only allows 2 AMPs, then that probably cannot be increased. You should check the documentation to see if there is a method for changing the configuration.

5. FastLoad does not allow skipping a header row in a file with delimited records. TPT does. Please refer to #3.

6. Yes, it looks like TDLOAD will support the "Quoted Data" feature. You will need to specify a job variable file on the command line and, in that file, provide values for the pre-defined job variables for the feature. For quoted data, the job variable name is DCPQuotedData. For skipping a header row (or a set of header rows), the job variable name is DCPSkipRows (provide the number of rows to skip).
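For example, a job variable file for a pipe-delimited, quoted source with one header row might look roughly like the sketch below. DCPQuotedData and DCPSkipRows are the variables described above; the remaining variable names and all of the values are illustrative only and should be checked against the templates shipped with your version of TPT:

TargetTdpId          = 'mytdpid'
,TargetUserName      = 'myuser'
,TargetUserPassword  = 'mypassword'
,TargetTable         = 'mydb.mytable'
,SourceFileName      = 'mydata.txt'
,SourceTextDelimiter = '|'
,DCPQuotedData       = 'Yes'
,DCPSkipRows         = 1

The file would then be supplied on the command line with something like: tdload -j jobvars.txt my_quoted_load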

TDLOAD (also known as the TPT feature called EasyLoader) uses operator templates under the covers. These templates provide the pre-defined job variables to be specified by the user. If you look in the TPT install directory you will see a directory called "templates". There you can look at the template for each operator to see which attributes the operator supports and the corresponding job variable names to specify when using the TDLOAD feature.

Hope this helps!

-- SteveF
Enthusiast

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

Hi Steve,

Thanks for answering all my queries, and sorry for bothering you again and again.

I have a few more queries:

1. In the TPT script, why do we require the schema to be all VARCHAR ("The schema in the script must be all VARCHAR, but that is not the same thing") when we are specifying all the data types in the table creation? Why can't we define the other data types (other than VARCHAR) in the schema? Is it that TPT reads/understands the data from the file as VARCHAR only, and it is then converted into the required data types while the data is loaded?

2. Is columnar data storage in itself compression, or is compression something different?

3. Does compression only support a maximum of 256 distinct values? And is it required to specify those values? Why does Teradata not itself find and compress the distinct values in that column?

4. It would be great if you could give me a link or the steps/process to increase the number of AMPs. I would like to try it on my VMware version.

Regards

Atul

Teradata Employee

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

With delimited data (also known as CSV data), the data is all character, and the fields are varying in length. Thus, by definition, the data is VARCHAR.

If the user wants the utility to read in "integer", or "decimal", then that numeric data is provided in some type of binary format.

The schema and the table definition have nothing to do with each other (directly, as far as data processing is concerned). We will send VARCHAR data to the DBS and Teradata will convert to the data types of the columns of the target table. TPT does not perform any data conversions.
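As a small illustration (the table, schema, and column names here are made up), the target table keeps its real data types while the script's schema describes the delimited record, which is all character data; the DBS converts each VARCHAR field to the column's type when the row is inserted:

CREATE TABLE mydb.sales
( sale_id   INTEGER,
  sale_date DATE,
  amount    DECIMAL(10,2)
);

DEFINE SCHEMA SALES_SCHEMA
( sale_id   VARCHAR(10),
  sale_date VARCHAR(10),
  amount    VARCHAR(12)
);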

Compression is handled inside the database and not by any of the client load/unload products.

I am not familiar with any process to increase the number of AMPs on your VM. You would have to get that information from some other source (if it is even possible).

-- SteveF
Enthusiast

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

Hi Steve,

I have been trying different loading techniques to load around 25 million rows into a table with around 110 columns.

I tried TPT scripts with TBUILD, mainly using FASTLOAD. I have several queries below; it would help me understand Teradata better if you could answer them:

1. Loading data into a table with all VARCHAR columns (same data) was slower than loading a table with proper data types, say INTEGER, FLOAT, CHARACTER, SMALLINT, etc. Is there any specific reason for this? Is it the correct behaviour, or am I doing something wrong? From my perspective, data conversion takes place in the case of the table with proper data types, so that should take more time, but this is not what actually happens.

2. Loading data from zipped files (around 50 zipped CSV files) took longer than loading from unzipped data files. Is this the correct behaviour, and if yes, what is the reason?

3. I tried loading the same data files as above, 25M records (50 files), using TPT MULTILOAD and TPUMP, but both hung, even after 6-8 hours, whereas FASTLOAD was averaging around 2.0-2.5 hours. Is this normal? Can't we load huge files/data using MULTILOAD and TPUMP? What is the best number of files/rows/amount of data for MULTILOAD and TPUMP?

4. Is there any way/option through which I can load all the data to only one AMP, so that the data is not distributed across multiple AMPs? Will it give better loading performance? Also, will it give better performance if this is a staging area and we then have to do an INSERT ... SELECT from this table into the target table (which is a two-step process)?

5. Is it better to load data using one session or with multiple sessions? Is only one session allowed per AMP?

6. Why are secondary indexes not allowed on a table while loading data?

7. I was using the flushing parameter in the TPT command TBUILD (-z 120). It actually made the load drastically slower. Is there any specific reason? Is it good to use or not? Does it flush the data that is read from the flat files to database blocks, or is it something else?

8. Why is loading to a columnar table (or column-partitioned table) not allowed? Are there restrictions in the architecture of Teradata, given that Teradata is competing with many others in the industry?

9. While creating a table, why is columnar partitioning not allowed with a Primary Index (PI)?

10. According to Teradata, a PI is automatically created on the first column when we don't define the PI in the CREATE TABLE statement. But when I tried to create a (row-partitioned) table with the syntax below:

create table on_time_performance (
   ..
   ..
)
PARTITION BY ( CASE_N( Year1 = 2009,
                       Year1 = 2010,
                       Year1 = 2011,
                       Year1 = 2012,
                       Year1 = 2013,
                       Year1 = 2014,
                       NO CASE, UNKNOWN )
);

It gave me the below error: *** Failure 3732 The facility of no primary index with row partitioning but no column partitioning has not been implemented yet.

Did it not create any PI automatically? And if it did, why was the above error thrown?

11. When we use delete from tduser.on_time_performance_d all;

is it equivalent to TRUNCATE in Oracle? Will it also change the high-water mark?

12. Is each AMP like a separate node which works independently with its own resources?

Regards

Atul

Enthusiast

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

Hi Steve,

Can you please help me out with the above?

Regards

Atul

Teradata Employee

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

I will not be able to address all of the questions, but will try my best:

1. When you say you are loading a Teradata table defined with all VARCHAR fields, does that mean you defined your table that way because your input data is delimited and thus treated as all VARCHAR? Or was this just an experiment to test performance with different table schemas? Also, when you say the performance is slower, do you mean from the client side or the DBS side?

If you are asking about loading performance with delimited data, it is slower than reading in (binary) data with native data types because of the parsing involved in processing delimited records.

If you are asking about Teradata RDBMS performance, I cannot comment on that.

2. Loading data from zipped files will not be as performant as unzipped files because of the time it takes to use the 3rd party tools to process the zipped files (probably some "unzipping" involved under the covers).

3. I cannot comment on why the Update or Stream operators would hang. That is a separate issue by itself that would have to be researched and diagnosed. TPT has no problems loading any size data file using any of the operators. If you would like information about Stream operator tuning parameters, that is an entirely separate (and long) subject and not appropriate for this particular thread.

4. I am not sure why you would want all of the data from a single file to go to a single AMP (data skew), but it is possible if you load to a NoPI table and use only one session in the job to load the data. Loading to a NoPI table is generally faster than loading to a PI table, but with only one session, I am not sure that the performance will be better than loading with multiple sessions (to a PI table).
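If you still want to experiment with that, a rough sketch (database, table, and column names are hypothetical) would be a NoPI staging table, loaded with the session count limited to one (see also the operator sketch under #5 below):

-- NoPI staging table: rows are not hashed to AMPs by a primary index
CREATE MULTISET TABLE mydb.stage_on_time
( flight_id  INTEGER,
  dep_delay  INTEGER
)
NO PRIMARY INDEX;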

5. The whole point about improving performance is parallelism. Loading with multiple sessions will always be better than loading with one session. When using the Load and Update operators, the maximum limit on sessions is one per "available" AMP.
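For reference, the session range is controlled by the MinSessions/MaxSessions attributes on the operator definition. A minimal sketch of a Load operator definition (the values shown are examples only):

DEFINE OPERATOR LOAD_OPERATOR
TYPE LOAD
SCHEMA *
ATTRIBUTES
(
  VARCHAR TdpId        = @TdpId,
  VARCHAR UserName     = @UserName,
  VARCHAR UserPassword = @UserPassword,
  VARCHAR TargetTable  = @TargetTable,
  VARCHAR LogTable     = @LogTable,
  INTEGER MinSessions  = 1,
  INTEGER MaxSessions  = 20  /* never more than one session per available AMP */
);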

6. Secondary indexes are disallowed only when using the Load and Update operators. You can always use the Stream (or SQL Inserter) operator to load data (through normal SQL sessions) into tables with secondary indexes. It is all about performance.
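In a script that uses the predefined templates, that simply means applying the rows through the $INSERTER (or $STREAM) template instead of $LOAD. A rough sketch, assuming the standard templates and the generated $INSERT DML:

DEFINE JOB load_table_with_indexes
(
  /* SQL sessions can insert into a table that has secondary indexes */
  APPLY $INSERT TO OPERATOR ($INSERTER)
  SELECT * FROM OPERATOR ($FILE_READER);
);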

7. The -z command line option is not for "flushing"; it is for checkpointing. If you would like to create checkpoint intervals so that, if the DBS restarts, you do not have to restart from the very beginning, then you use the -z option. Checkpointing is an expensive action, though. The entire load process has to come to a stop while the data in the shared memory buffers is completely sent to the DBS, followed by a checkpoint set of actions with the DBS. How often you checkpoint depends on how much time you are willing to lose (to re-send data that was not checkpointed). It may also depend on the amount of data you are loading. Everyone's use of checkpointing is different. I know some customers who are loading transactional data and checkpoint every few seconds (they want the data available sooner), and there are some that might checkpoint every 15 or 30 minutes on very long-running jobs.
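For example (the script, variable file, and job names are placeholders), a 120-second checkpoint interval is requested on the tbuild command line like this:

# take a checkpoint every 120 seconds; after a restart only un-checkpointed data is re-sent
tbuild -f load_job.txt -v jobvars.txt -z 120 my_checkpointed_job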

8. Loading to columnar tables is allowed with TPT operators that use normal SQL sessions (Stream, SQL Inserter).

9. I cannot answer this question. It is a DBS internal issue and not my area.

10. I cannot answer this question. It is a DBS issue and not my area.

11. I do not know the answer to this question.

12. No, each AMP is not like a separate node. But it does work independently and has its own resources.

-- SteveF
Not applicable

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

Can someone share a PPT covering the basics of Teradata? One with diagrams would be more useful.

Enthusiast

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

TPT - source data enclosed with double quotes and delimited by comma (,).

I have to load data from a CSV file into a table, and the date columns are coming in with double quotes. How do I load those records into the table using TPT?

Below is the sample source data for the Test date column: "2019-06-24 08:37:17". I am using the parms below.

,DCPQuotedData='optional'
,DCPReaderOpenQuoteMark='yes'
,DCPReaderCloseQuoteMark='yes'

,SrcQuotedData='optional'
,SrcReaderOpenQuoteMark='yes'
,SrcReaderCloseQuoteMark='yes'

 

Error: Delimited Data Parsing error: Column length overflow(s) in row 1.

 

If I remove the quotes from the file, then I am able to load the data.

 

Can anyone help to resolve this?

 

 

Teradata Employee

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

Please start a new topic if you have a different question. 

OpenQuoteMark supplies the actual character used for quoting, not Yes/No (and you can omit CloseQuoteMark if it's the same as OpenQuoteMark).

 

,DCPReaderOpenQuoteMark='"'
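Putting that correction together with the attribute already shown above, a minimal sketch for optionally quoted fields (using only the names that appear earlier in this thread) would be:

,DCPQuotedData='Optional'            /* fields may or may not be quoted */
,DCPReaderOpenQuoteMark='"'          /* the actual quote character */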

 

Enthusiast

Re: TPT - source data enclosed with double quotes and delimited by pipe (|)

Hi Fred,

Thanks for helping me. I am getting double quotes for only a few columns in the file, not all of them. A few columns come with quotes and the rest come without quotes. How do I handle this scenario? I tried using the below but it did not work; I think it is expecting quotes for all the columns.