I am facing a unique challenge: we need to process a million flat files for ELT.
We are using TPT to leverage parallelism and optimized loading; however, the acquisition phase is taking longer than expected.
I am using a batch directory scan with 2 file reader instances. The process starts off well, reading ~1000 files in under 30 seconds, but it slows down once it crosses ~3000 files.
The average file size is 500 KB.
I need input on how this can be optimized further to reduce the overall time.
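For reference, here is a quick back-of-the-envelope sketch of what the acquisition phase would cost if the initial rate (~1000 files in 30 seconds) could be sustained for all one million files; the figures are taken directly from the observations above:

```python
# Projected acquisition time if the initial throughput (~1000 files
# per 30 s) were sustained across all 1M files.
TOTAL_FILES = 1_000_000
FILES_PER_BATCH = 1000
SECONDS_PER_BATCH = 30

rate = FILES_PER_BATCH / SECONDS_PER_BATCH   # ~33.3 files/s
total_hours = TOTAL_FILES / rate / 3600

print(f"{rate:.1f} files/s -> {total_hours:.1f} hours total")
# At a sustained ~33 files/s, 1M files would take roughly 8.3 hours.
```

This is the ideal case; the slowdown past ~3000 files is what pushes the real projection far beyond it.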
This is a bug that we had previously discovered and is being fixed in an upcoming efix/patch.
The version will be 13.10.00.12.
It will take about 4-6 weeks to appear on the Teradata At Your Service patch site.
Thanks for the update.
However, I am curious: will processing continue at the same pace (~1000 files in under 30 seconds) for a million files once this patch is installed?
We have not had any customer try to read 1 million files.
We have customers who are processing between 10,000 and 50,000 files.
Are you using the wildcard syntax for the FileName attribute for the DC operator?
Are you checkpointing?
If so, how often?
You have all 1 million files in a single directory?
What is your row size (trying to get an idea how many rows per file)?
Are you using the Load operator?
How long are you *anticipating* the acquisition phase taking?
Is this a one-time job, or something that will need to be run over and over again (e.g. daily, weekly, monthly, etc.)?
Again, we have not had anyone try to process this many files. Therefore, we cannot foresee the types of issues you might encounter.
Our file reader operator will attempt to store all of the file names and their sizes in order to perform load balancing; for a job of this size, that requires a lot of internal memory.
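To put a rough number on that memory cost, here is an illustrative estimate; the per-entry byte counts are assumptions for the sake of the sketch, not actual TPT internals:

```python
# Rough estimate of memory needed to track 1M file names and sizes.
# Per-entry byte counts are illustrative assumptions, not TPT internals.
NUM_FILES = 1_000_000
AVG_PATH_BYTES = 80      # assumed average full-path length
SIZE_FIELD_BYTES = 8     # 64-bit file size
OVERHEAD_BYTES = 32      # assumed per-entry bookkeeping overhead

total_mb = NUM_FILES * (AVG_PATH_BYTES + SIZE_FIELD_BYTES
                        + OVERHEAD_BYTES) / 1024 ** 2
print(f"~{total_mb:.0f} MB just for the file list")
```

Even with conservative assumptions, tracking a million entries runs to on the order of a hundred megabytes before any row data is read.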
I suspect you would do better with more than 2 instances of the file reader.
I had a customer several years ago who had to load 25 8GB files. He found that the best performance came from using 12 file reader instances. YMMV (your mileage may vary). You will have to experiment.
What is the name of your company?
Checkpointing: my understanding is that checkpoints come into play while applying rows to the DB; I am not sure whether the file reader makes use of any checkpoint attribute. In any case, we are not specifying any checkpoint.
1 million files in a single directory: no, we can distribute them across multiple directories on the same volume.
Rows per file: a wide range, depending on the size of the file; 500 KB is just an average figure, at roughly 100 rows per 500 KB file.
Are you using the Load operator?: We tried both the Load and Stream operators.
How long are you *anticipating* the acquisition phase taking?: Extrapolating the current results, the whole process may take around 40-50 hours, which is not acceptable.
We tried using up to 10 readers, but the throughput stayed the same.
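Extrapolating the other way, the projected 40-50 hour runtime implies the sustained rate has dropped well below the initial ~1000 files per 30 seconds; a quick check using the figures quoted in this thread:

```python
# What sustained throughput do the projected runtimes imply?
TOTAL_FILES = 1_000_000
initial_rate = 1000 / 30                 # ~33 files/s observed early on

for hours in (40, 50):                   # projected range from the post
    implied_rate = TOTAL_FILES / (hours * 3600)
    print(f"{hours} h -> {implied_rate:.1f} files/s "
          f"({initial_rate / implied_rate:.0f}x slower than the start)")
```

So the projected runtimes correspond to roughly 5.5-7 files per second sustained, a five- to six-fold drop from the initial rate, which is consistent with the slowdown observed past ~3000 files.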