Please can you help and advise on the following:
We are in the process of exporting data from Teradata to Hadoop using TDCH (the Teradata Connector for Hadoop). We have successfully imported full tables from Teradata to Hadoop; however, where the Teradata tables are > 200 GB we want to import only the deltas on a daily basis.
We have changed our script to use SOURCEQUERY in place of SOURCETABLE and supplied SQL with a WHERE clause that selects only a subset of the data based on the processing date. We have also specified the method as split.by.hash; however, this is overridden by split.by.partition when the SOURCEQUERY parameter is used.
Using split.by.partition causes a staging table to be created in the database area that is the full size of the existing table. This is causing us issues, since we do not have the spare space to replicate the table in the database area, and our job therefore abends with "2644 - No more room in database".
Can anyone explain why the method split.by.hash cannot be used with SOURCEQUERY?
This is a sample script just to show the parameters we have used to invoke the IMPORT process using TDCH.
hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar terajdbc4.jar \
-D mapreduce.job.queuename=insights \
-url jdbc:teradata://tdprod/database=insight \
-username xxxxxxxx \
-password xxxxxxxxx \
-classname com.teradata.jdbc.TeraDriver \
-fileformat textfile \
-splitbycolumn address_id \
-jobtype hdfs \
-method split.by.hash \
-targetpaths hdfs://dox/user/user1/td_prd_addresses/mail_drop_date='2015-11-20'/ \
-nummappers 1 \
-sourcequery "select col1, col2, col3, col4 from EDWPRODT.ADDRESSES where mail_drop_date = '2015-11-20'"
When importing data into Hadoop from Teradata, if you use the "sourcetable" option in your TDCH job, you have the ability to use split.by.hash, split.by.value, or even split.by.amp. The moment you use the "sourcequery" option, TDCH creates a temporary table in the default database and uses split.by.partition, even if you specified another import method. This is done for performance reasons.
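To make the contrast concrete, here is the poster's own invocation reworked to use "sourcetable" instead of "sourcequery", which is the form in which the requested split.by.hash method is honored. This is a sketch only: STAGE_DB.ADDRESSES_DELTA is a hypothetical pre-staged delta table (it does not exist in the poster's environment), and the credentials, queue, and paths are the poster's placeholders.

```shell
hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar terajdbc4.jar \
-D mapreduce.job.queuename=insights \
-url jdbc:teradata://tdprod/database=insight \
-username xxxxxxxx \
-password xxxxxxxxx \
-classname com.teradata.jdbc.TeraDriver \
-fileformat textfile \
-splitbycolumn address_id \
-jobtype hdfs \
-method split.by.hash \
-targetpaths hdfs://dox/user/user1/td_prd_addresses/mail_drop_date='2015-11-20'/ \
-nummappers 1 \
-sourcetable STAGE_DB.ADDRESSES_DELTA
```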
In split.by.partition, each mapper starts a session for a subset of partitions from the partitioned source staging table that is created, with a partition key the same as the distribution key. In this method the "split by" option is moot and cannot be used. You might also have noticed that it uses a Random function. For data export from queries, this is the fastest method of data transfer.
With that said, if you do not want the job to create a staging table, you would have to create a BTEQ job that loads the result of the SQL into some database which has enough space, and then use the "sourcetable" option, with which you can use split.by.hash or split.by.amp.
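A minimal sketch of that BTEQ step, reusing the poster's own query: it materializes the daily delta into a database with free space, which TDCH can then read via "sourcetable". STAGE_DB is a hypothetical database name, and the logon string uses the poster's masked credentials; a real job would also need to drop or replace the table on each daily run.

```shell
bteq <<'EOF'
.LOGON tdprod/xxxxxxxx,xxxxxxxxx;
CREATE TABLE STAGE_DB.ADDRESSES_DELTA AS (
  SELECT col1, col2, col3, col4
  FROM   EDWPRODT.ADDRESSES
  WHERE  mail_drop_date = DATE '2015-11-20'
) WITH DATA;
.LOGOFF;
EOF
```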
Hope this helps.
When you use "sourcequery" and the number of mappers is > 1, by default it will take split.by.partition.
You can force the number of mappers to 1 (--num-mappers 1). This would force it to opt for split.by.value. That method does not create a temporary table, so your problem should be solved. split.by.value checks min(--split-by column) and max(--split-by column) and then distributes the value ranges to the mappers, of which here there would be one.
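The range-splitting idea above can be sketched with plain arithmetic. This is an illustration of how min/max range splitting works in general, not TDCH's actual code; min, max, and the mapper count are made-up example numbers.

```shell
# split.by.value takes MIN and MAX of the split-by column and hands each
# mapper a contiguous value range. Example values, not from a real table:
min=1
max=100
mappers=4
step=$(( (max - min + mappers) / mappers ))   # ceiling of range / mappers
i=0
while [ "$i" -lt "$mappers" ]; do
  lo=$(( min + i * step ))
  hi=$(( lo + step - 1 ))
  if [ "$hi" -gt "$max" ]; then hi=$max; fi
  echo "mapper $i scans values $lo..$hi"
  i=$(( i + 1 ))
done
```

With -nummappers 1 the loop runs once and the single mapper scans the whole min..max range, which is why no staging table is needed.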
Vaishak S Achar