Parallelizing CSV generation in TPT while maintaining export order

Tools & Utilities

Parallelizing CSV generation in TPT while maintaining export order

Hi, 

 

We have a need to generate CSVs from the results of a query and need the order of the records in the CSV in a particular order. 

The query pulls records from one table but we need the records for a given ID field to be together in the CSV. We are doing that by specifying an orderby in the SQL query in the TPT script and this works fine functionally. 

 

We're finding there was just one executor process that was starved for cpu at 100% constantly and wanted to parallelize the Data Connector Consumer operator that writes out the CSVs ( mutiple instances of the Producer operator had no effect ). Towards this end we specified 4 instances of the Data Connector operator but this seems to be causing the order by to get messed up in the output CSVs. 

 

How can we achieve the objective of being able to use the other CPUs available on the client machine and produce the output CSVs faster and still maintain order of the ID field within each CSV ?  We're using TPT utils v15.10.

 

TIA for your attention. 

 

4 REPLIES
Teradata Employee

Re: Parallelizing CSV generation in TPT while maintaining export order

I don't think you can do this directly with the Data Connector operator.

 

If you use the -C command-line option, TPT will send data blocks to the Data Connector Consumer instances round-robin (so at least the output files will be approximately the same size).

Otherwise, blocks are sent to the first instance that isn't busy, i.e. to the second instance only when the first instance is already busy, to the third when both first and second are busy, etc.

In either case, the records within each file will be in the same relative order as they were exported, but records for the same key/group may be split across files.

 

One approach might be to use the -C option, then merge the resulting files (though you may find the merge has the same limitation). Or you could run multiple jobs in parallel, each extracting a subset of data (based on the sort keys), and concatenate the results.

Teradata Employee

Re: Parallelizing CSV generation in TPT while maintaining export order

Another thought: You might consider having the database do the conversion to text (in parallel) so there is less work left for the Data Connector operator to do on the client.

 

The CSV table function would be one way. You'd probably want to create a fixed-length sort key as the first column in the resulting VARCHAR, to enable ordering after the conversion. (Then you could use SUBSTRING to separate the key from the data to be exported.)

Re: Parallelizing CSV generation in TPT while maintaining export order

Hi Fred, 

 

Thanks for your response. 

 

Sorting outside in Unix is not an option as we tried it and it was extremely slow for our data volumes. It seems like from your replies, our only option is to have multiple TPT instances extracting different ranges of records rather than using Data Connector Instances to parallelize the writing out as we need to preserve order for the sort key within each extract file. 

 

Could you please post a link to the CSV table function you referred to ? It might help to alleviate the load on the Data Connector process and generally reduce the time for each TPT extract individually. 

 

 

Thanks for your help. 

 

 

Highlighted
Teradata Employee

Re: Parallelizing CSV generation in TPT while maintaining export order

For documentation on the CSV table function, you can use this link to search docs.teradata.com.