I have a ~250 node Hadoop cluster containing a large data set that I want to move to Teradata as quickly as possible. The target Teradata system has ~100 (recent generation) nodes.
Is it conceivable to launch a fastload from each and every client in parallel, with a few sessions per fastload invocation, and have them all perform the first phase of a fastload into a single target table? The second phase of fastload (or, for a NOPI target table, the necessary "apply" work) would then run after all phase 1 tasks have completed.
Would be good to have a bit more background info on this.
Is it a one time task? Or a regular one?
Does this run during the normal workload or do you have an exclusive batch window for this?
How many files? How many rows? per hadoop node?
I am not sure that I understand your suggestion. One fastload can load one table, but it can load many files. If I understand your requirement correctly, then no, that would not work.
Also, the number of load utilities that can run in parallel is limited. You can change the limit, but I don't think running several hundred is possible or advisable. Keep in mind that a few of these tasks in parallel will already max out system resources.
Many TPT Stream or TPump jobs might be an option, maybe all loading into a single table.
I guess you already checked things like http://developer.teradata.com/extensibility/articles/hadoop-dfs-to-teradata
But I am not sure what the throughput will be.
As stated above some more info would be needed. We can also discuss off-line if you like.
In the past, before my exposure to Hadoop, I often found that a small number (~5) of fastload or fastexport client executions would saturate the CPU on a single client ("server class") machine.
I am curious: I can run one fastload from one client using up to one session per AMP (with the caveat that throughput plateaus after ~64 sessions). Why would it not be possible to launch the FASTLOAD program from a number of clients and have Teradata consider all the "related" fastload invocations to be a single load from the perspective of load slot utilization (and the subsequent need to enter phase 2 of fastload)?
That is, instead of:
Fastload on client A with n sessions connected to TeradataSystem loading table DB_NAME.TABLE_NAME.
a scenario like:
Fastload on client A with n/26 sessions connected to TeradataSystem loading table DB_NAME.TABLE_NAME.
Fastload on client B with n/26 sessions connected to TeradataSystem loading table DB_NAME.TABLE_NAME.
Fastload on client Z with n/26 sessions connected to TeradataSystem loading table DB_NAME.TABLE_NAME.
Using an external mechanism that confirms all the phase 1 activities completed successfully, start the END LOADING phase.
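To make the idea concrete, here is a minimal Python sketch of such an external coordinator. It is only an illustration of the barrier logic, not something FastLoad supports today: `phase1` and `end_loading` are hypothetical callables standing in for whatever would actually run the acquisition phase on a client and trigger END LOADING, and `split_sessions` is a made-up helper that divides the n sessions across the clients.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def split_sessions(total_sessions: int, clients: list[str]) -> dict[str, int]:
    """Divide total_sessions across the clients as evenly as possible."""
    base, extra = divmod(total_sessions, len(clients))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(clients)}


def run_coordinated_load(
    plan: dict[str, int],
    phase1: Callable[[str, int], bool],  # hypothetical: run phase 1 on one client
    end_loading: Callable[[], None],     # hypothetical: start phase 2 once
) -> bool:
    """Run phase 1 on every client in parallel; start phase 2 only if all succeed."""
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        # Leaving the "with" block waits for every phase 1 task to finish.
        results = list(pool.map(lambda item: phase1(*item), plan.items()))
    if all(results):
        end_loading()
        return True
    return False
```

For example, `split_sessions(64, ["A", "B", "C"])` gives client A 22 sessions and clients B and C 21 each. If any `phase1` call returns False, `end_loading` is never called; if one raises, the exception propagates to the caller.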
The question is really broader than Hadoop to Teradata. When we migrate from one Teradata system to another, we have to "manually" chunk up the largest tables (usually on the partitioned column) and run a set of parallel FEX/FDL's (to achieve a greater throughput and reduce the migration duration).
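As an illustration of that manual chunking, the following Python sketch splits an inclusive date range into contiguous pieces and renders a predicate for each, so every parallel export/load pair gets its own slice. The function names and the column name `txn_date` are hypothetical; the real partitioning column and chunk count depend on the table.

```python
from __future__ import annotations

from datetime import date, timedelta


def date_chunks(start: date, end: date, n_chunks: int) -> list[tuple[date, date]]:
    """Split the inclusive range [start, end] into contiguous inclusive chunks."""
    if n_chunks < 1 or end < start:
        raise ValueError("need n_chunks >= 1 and start <= end")
    total_days = (end - start).days + 1
    n = min(n_chunks, total_days)  # never produce an empty chunk
    base, extra = divmod(total_days, n)
    chunks = []
    cur = start
    for i in range(n):
        length = base + (1 if i < extra else 0)
        last = cur + timedelta(days=length - 1)
        chunks.append((cur, last))
        cur = last + timedelta(days=1)
    return chunks


def chunk_predicate(column: str, first: date, last: date) -> str:
    """Render a WHERE-clause fragment selecting one chunk (column name is hypothetical)."""
    return f"{column} BETWEEN DATE '{first}' AND DATE '{last}'"


if __name__ == "__main__":
    for first, last in date_chunks(date(2013, 1, 1), date(2013, 1, 10), 3):
        print(chunk_predicate("txn_date", first, last))
```

Each predicate would go into the SELECT of one export job, with the matching load job reading that job's output.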
Not quite sure what the question is here. But are you asking why we cannot have multiple "fastload" processes running from different client boxes at the same time, all loading the same Teradata target table, with Teradata considering that the same load job?
We can do that.
This is what TPT does. Now, granted, TPT only runs on a single SMP, but multiple "instances" of a single operator work in parallel to load a single table. And Teradata considers it a single load job and it uses a single utility slot.
Our ETL partners/vendors use TPTAPI to do what you are describing, because they have the parallel infrastructure to be able to handle the various operator instances across client nodes, all loading the same table.
We do it today.
Now, if you are asking for script-based TPT to be expanded to do that, all I can say is that it is on the roadmap. We are just not sure when we will get there.
Understood, thank you. I will work with our local TD account rep and our Hadoop distro reps (who represent a TD partner) to see if any plans exist to maximize integration between Teradata TPT and the Hadoop ETL tool (sqoop).
Is there any option to keep the Teradata '?' values for a field persistent in the sqoop import of a table?
That is, to prevent '?' values in a Teradata table's field from becoming 'null' during a sqoop import to HDFS.
This is an interesting topic. I need help with a similar problem. I am looking to extract data from Teradata into several files on unix nodes. Right now we are using TPT EXP to extract data that is on the order of terabytes. The data is chunked into pieces of several gigabytes based on date, and several fastexport jobs write the chunked data into several files on one unix node. We need a window of 20 hours and multiple fexp utilities on the server to achieve this. Now I would like to split the same TPT EXP job across multiple unix nodes, so that more data can be extracted with one fexp utility slot (to overcome the disk I/O bottleneck on the unix nodes). The data extracted by fastexport (or TPT EXP) would be written into multiple files on different unix nodes.