TDCH Batch.Insert is faster than internal.fastload

Hello, I've searched all over but can't seem to find an answer to this question. We are trying to tune a TDCH transfer job from Hive to Teradata 15. Based on all the documentation, internal.fastload should perform better than batch.insert, but in our case it is taking almost twice as long to finish. Any help would be appreciated. Thanks!

Here's the code we used for batch.insert (no -method option is specified, so TDCH's default batch.insert method applies):

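# TDCH connector jar, plus the Hive/CDH client jars that go onto HADOOP_CLASSPATH and -libjars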
export USERLIBTDCH=/usr/lib/tdch/1.5/lib/teradata-connector-1.5.3.jar
export LIB_JARS=/etc/hive/conf.cloudera.hive,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/antlr-runtime-3.5.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/commons-dbcp-1.4.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/commons-pool-1.5.4.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/datanucleus-api-jdo-3.2.6.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/datanucleus-core-3.2.12.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/datanucleus-rdbms-3.2.12.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-cli-1.1.0-cdh5.7.5.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-exec-1.1.0-cdh5.7.5.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-jdbc-1.1.0-cdh5.7.5.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-metastore-1.1.0-cdh5.7.5.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/jdo-api-3.0.1.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/libfb303-0.9.2.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/libthrift-0.9.3.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-hcatalog-core-1.1.0-cdh5.7.5.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/avro-tools-1.7.6-cdh5.7.5.jar,/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/avro-1.7.6-cdh5.7.5.jar
export HADOOP_CLASSPATH=/etc/hive/conf.cloudera.hive:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/antlr-runtime-3.5.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/commons-dbcp-1.4.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/commons-pool-1.5.4.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/datanucleus-api-jdo-3.2.6.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/datanucleus-core-3.2.12.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/datanucleus-rdbms-3.2.12.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-cli-1.1.0-cdh5.7.5.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-exec-1.1.0-cdh5.7.5.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-jdbc-1.1.0-cdh5.7.5.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-metastore-1.1.0-cdh5.7.5.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/jdo-api-3.0.1.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/libfb303-0.9.2.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/libthrift-0.9.3.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/hive-hcatalog-core-1.1.0-cdh5.7.5.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/avro-tools-1.7.6-cdh5.7.5.jar:/opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3/jars/avro-1.7.6-cdh5.7.5.jar
hadoop jar $USERLIBTDCH \
com.teradata.connector.common.tool.ConnectorExportTool \
-Dmapreduce.job.queuename=QUEUENAME \
-libjars $LIB_JARS \
-url jdbc:teradata://FULLHOSTNAME/database=DBNAME \
-username ${TDCH_USER} \
-password ${TDCH_PASS} \
-jobtype hive \
-sourcedatabase HIVESOURCEDB \
-sourcetable HIVESOURCETABLE \
-nummappers 150 \
-targettable TDTARGET \
-fileformat orcfile

For internal.fastload we used the same command as above, changing only the following parameters (the assembled command is sketched after them):

-method internal.fastload
-nummappers 42
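
For completeness, the assembled internal.fastload command then looks like this (same exports and -libjars as above; I'm showing -method appended at the end, though its position shouldn't matter):

hadoop jar $USERLIBTDCH \
com.teradata.connector.common.tool.ConnectorExportTool \
-Dmapreduce.job.queuename=QUEUENAME \
-libjars $LIB_JARS \
-url jdbc:teradata://FULLHOSTNAME/database=DBNAME \
-username ${TDCH_USER} \
-password ${TDCH_PASS} \
-jobtype hive \
-sourcedatabase HIVESOURCEDB \
-sourcetable HIVESOURCETABLE \
-nummappers 42 \
-targettable TDTARGET \
-fileformat orcfile \
-method internal.fastload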

Not sure if it's relevant, but the internal.fastload job throws the following warnings in the log:

17/12/21 17:59:26 WARN mapred.ResourceMgrDelegate: getBlacklistedTrackers - Not implemented yet

17/12/21 17:59:28 WARN split.JobSplitWriter: Max block location exceeded for split: Paths:...;  splitsize: 13 maxsize: 10
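
In case it helps with reading the second warning: it appears to come from Hadoop's cap on the number of block locations recorded per input split (mapreduce.job.max.split.locations, which defaults to 10). I assume the warning itself could be silenced by raising that cap with an extra generic option alongside the existing -D setting, e.g.

-Dmapreduce.job.max.split.locations=20

though I don't know whether it has any bearing on the throughput gap.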

The target is a NOPI table on a 40-AMP system. When monitoring in Viewpoint, the internal.fastload job appears to run as a single-AMP operation. I also tried different batch sizes, from the default up to 200,000, but the speed stays essentially the same for both methods.
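
(For concreteness, the batch size is set, as I understand it, via TDCH's -batchsize option, e.g. appending

-batchsize 200000

to the batch.insert command above; 200,000 was the largest value we tried.)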

Is there anything we're doing wrong?
