Multithreaded/MultiProcess inserts in teradata (Sqoop with teradata)


Multithreaded/MultiProcess inserts in teradata (Sqoop with teradata)

Hi all,

I am trying to export a hive table to teradata using sqoop with teradata jdbc driver. The sqoop export works well only when I use a single mapper. With multiple mappers, the jdbc driver throws a '[SQLState 40001] Transaction ABORTed due to deadlock' error. This is consistently reproducible when the target teradata table does not have a primary index on it. When the target table has a primary index, the issue is generally reproducible by increasing the number of mappers (i.e. if it works with 4, it will most likely fail with 10 mappers). I am using the default values for sqoop.export.records.per.statement and sqoop.export.statements.per.transaction during sqoop export. Any idea what could be causing this? 

Just to check how teradata was behaving when inserts in the same table are done by multiple threads (or processes as in the case of multiple mappers with sqoop export), I created a multithreaded jdbc application and tried inserting some data into the target table. Each thread in this application created a new connection and did batch inserts (records.per.statement=100, statements.per.transaction=100) into the target table.

Below are my observations.

With NO PI table : I was able to consistenly reproduce the transaction deadlock issue (No duplicate records)

With PI table : Surprisingly, I was not able to reproduce it even with 12 threads each trying to insert 100000 rows. I was able to reproduce the issue when I tried to run multiple instances of this application (each instance running 4 threads). This happened when the inserted data had duplicate records.


1. What could be causing the above transaction deadlock issue?

2. Are JDBC batch insertions using multiple threads/process expected to work with Teradata?

3. I saw a few posts on the web suggesting to try lower down the sqoop export batch size (records.per.statement / statements.per.transaction) to make it work. But does that actually fixes the issue? I think it just reduces the chances of deadlock and there could be an impact on performance if we decrease the batchsize.