Use TPT to load a million flat files

Enthusiast


Hi,

I am facing a unique challenge: I need to process a million flat files for ELT.

I am using TPT to leverage parallelism and optimized loading; however, the acquisition phase is taking longer than expected.

I am using a batch directory scan with 2 file reader instances. The process starts well and reads ~1,000 files in under 30 seconds; however, it slows down once it crosses ~3,000 files.

The average file size is 500 KB.

I need input on how this can be optimized further to reduce the overall time.
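For context, the setup described above corresponds roughly to a job script like the following sketch. The schema, directory path, credentials, and table names are illustrative placeholders, not the poster's actual script:

```
DEFINE JOB LOAD_FLAT_FILES
DESCRIPTION 'Load delimited flat files via DataConnector + Load'
(
  DEFINE SCHEMA FILE_SCHEMA
  (
    COL1 VARCHAR(100),
    COL2 VARCHAR(100)
  );

  DEFINE OPERATOR FILE_READER
  TYPE DATACONNECTOR PRODUCER
  SCHEMA FILE_SCHEMA
  ATTRIBUTES
  (
    VARCHAR DirectoryPath = '/data/inbound/',
    VARCHAR FileName      = '*',          /* batch directory scan: all files */
    VARCHAR Format        = 'Delimited',
    VARCHAR TextDelimiter = '|',
    VARCHAR OpenMode      = 'Read'
  );

  DEFINE OPERATOR LOAD_OP
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES
  (
    VARCHAR TdpId        = 'mytdpid',
    VARCHAR UserName     = 'myuser',
    VARCHAR UserPassword = 'mypassword',
    VARCHAR TargetTable  = 'MyDB.Target',
    VARCHAR LogTable     = 'MyDB.Target_log',
    VARCHAR ErrorTable1  = 'MyDB.Target_e1',
    VARCHAR ErrorTable2  = 'MyDB.Target_e2'
  );

  APPLY ('INSERT INTO MyDB.Target (:COL1, :COL2);')
  TO OPERATOR (LOAD_OP[1])
  SELECT * FROM OPERATOR (FILE_READER[2]);  /* [2] = two reader instances */
);
```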

Thanks

11 REPLIES
Teradata Employee

Re: Use TPT to load a million flat files

Have you tried using more than 2 instances of the file reader?

What version of TPT are you using?

-- SteveF
Enthusiast

Re: Use TPT to load a million flat files

I tried using up to 4 instances of the file reader.

Version 13.10

Teradata Employee

Re: Use TPT to load a million flat files

This is a bug we had previously discovered; it is being fixed in an upcoming efix/patch.

The version will be 13.10.00.12.

It will take about 4-6 weeks to appear on the Teradata At Your Service patch site.

-- SteveF
Enthusiast

Re: Use TPT to load a million flat files

Thanks for the update.

However, I am curious: will processing continue at the same pace (~1,000 files in under 30 seconds) for a million files once this patch is installed?

Teradata Employee

Re: Use TPT to load a million flat files

We have not had any customer try to read 1 million files.

We have customers who are processing between 10,000 and 50,000 files.

Are you using the wildcard syntax for the FileName attribute for the DC operator?

-- SteveF
Enthusiast

Re: Use TPT to load a million flat files

Yes, we are using "*" for the FileName.

Teradata Employee

Re: Use TPT to load a million flat files

Are you checkpointing?

If so, how often?

You have all 1 million files in a single directory?

What is your row size (trying to get an idea how many rows per file)?

Are you using the Load operator?

How long are you *anticipating* the acquisition phase taking?

Is this a one-time job, or something that will need to be run repeatedly (e.g., daily, weekly, monthly)?

Again, we have not had anyone try to process this many files. Therefore, we cannot foresee the types of issues you might encounter.

Our file reader operator attempts to store all of the file names and their sizes in order to perform load balancing, so a job of this size needs a lot of internal memory.

I suspect you would do better with more than 2 instances of the file reader.

-- SteveF
Teradata Employee

Re: Use TPT to load a million flat files

I had a customer several years ago who had to load 25 files of 8 GB each. He found that the best performance came from using 12 file reader instances. YMMV (your mileage may vary); you will have to experiment.
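For reference, the instance count is just the bracketed number on the operator references in the APPLY statement, so experimenting with it is a one-line change. Operator and table names below are illustrative:

```
/* 12 file reader instances feeding 2 load instances -- adjust and re-test */
APPLY ('INSERT INTO MyDB.Target (:COL1, :COL2);')
TO OPERATOR (LOAD_OP[2])
SELECT * FROM OPERATOR (FILE_READER[12]);
```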

What is the name of your company?

-- SteveF
Enthusiast

Re: Use TPT to load a million flat files

Checkpointing: My understanding is that checkpoints come into play while applying rows to the database; I am not sure whether the file reader makes use of any checkpoint attribute. In any case, we are not specifying a checkpoint.
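(For what it's worth, in TPT the checkpoint interval is normally supplied at job-submission time via the tbuild command's -z option, in seconds, rather than as a reader attribute. A sketch, with an illustrative script and job name:)

```
tbuild -f load_job.txt -z 300 load_job_name   # take a checkpoint every 300 seconds
```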

1 million files in a single directory: No; we can distribute them across multiple directories on the same volume.

No. of rows per file: a wide range of values depending on the size of the file. 500 KB is just an average figure, roughly 100 rows per 500 KB file.

Are you using the Load operator?: We tried both the Load and the Stream operator.

How long are you *anticipating* the acquisition phase taking?: If we extrapolate the results, the whole process may take around 40-50 hours, which doesn't seem acceptable.

Frequency: monthly

We tried using up to 10 readers; however, throughput was still the same.