Fast insert of millions of rows


Hello,

 

My goal is to insert many rows (millions and more) into a Teradata table as fast as possible (this is a DWH/BI environment).

 

My first test was with the Teradata ODBC driver, which is pretty slow.
Next try was the .NET Data Provider for Teradata 16.20.02.00, which was a bit faster.

 

I then read about the FastLoad utility that Teradata provides.
But to use it I would have to launch it as an external process from within my program. Not so nice.

 

I also came across some method called "PutBuffer" which seems to be part of the Teradata Parallel Transporter API.
Is this faster than using the .NET Data Provider?

 

I'm looking for something similar to "System.Data.SqlClient.SqlBulkCopy", which is pretty fast but obviously only usable when your target database is MS SQL Server and not Teradata.

 

I'm coding in .NET 4.7 on Windows 10 (mainly C#, but any .NET language would be OK), but I could not find any description of what to install, which DLL to reference, etc.

 

Can someone please push me in the right direction?

 

Thanks,
steinbohrer.

 

5 REPLIES
Teradata Employee

Re: Fast insert of millions of rows

The TPT API can in theory be the fastest, but it is native C++ / unmanaged code, so it may be difficult to use from a .NET environment.

I would first check out the "Batch Update" feature of TdDataAdapter in the Teradata .NET provider and see if that is "fast enough" for your purpose. Sending the INSERTs as iterated requests provides a significant performance improvement.
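
A rough, untested sketch of that approach (assuming the provider's standard ADO.NET pattern; the connection string, table, and column names below are only placeholders, and the exact parameter setup may need adjusting per the provider documentation):

using System.Data;
using Teradata.Client.Provider;

static class BatchInsertDemo
{
    // Placeholder target table: MyDb.Target (c1 VARCHAR(50), c2 VARCHAR(50))
    public static void Run(string connectionString)
    {
        using (var conn = new TdConnection(connectionString))
        {
            conn.Open();

            var insert = new TdCommand("INSERT INTO MyDb.Target (c1, c2) VALUES (?, ?)", conn);
            insert.UpdatedRowSource = UpdateRowSource.None;   // no per-row result processing

            foreach (var col in new[] { "c1", "c2" })
            {
                var p = insert.CreateParameter();
                p.DbType = DbType.String;
                p.SourceColumn = col;                         // bind to the DataTable column
                insert.Parameters.Add(p);
            }

            var adapter = new TdDataAdapter { InsertCommand = insert };
            adapter.UpdateBatchSize = 5000;                   // > 1 sends the INSERTs as iterated requests

            var table = new DataTable("Target");
            table.Columns.Add("c1", typeof(string));
            table.Columns.Add("c2", typeof(string));
            for (int i = 0; i < 100000; i++)
                table.Rows.Add("value " + i, "other " + i);

            adapter.Update(table);                            // rows go out in batches of 5000
        }
    }
}

The UpdateBatchSize value of 5000 is just a starting point; the best value depends on your row size, so it is worth tuning.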

Re: Fast insert of millions of rows

Hi Fred,

 

Thanks for the suggestion.

 

I tried the "Iterated requests" feature by working with a TdDataAdapter / DataSet.
It is quite fast, but I don't think I can put this into production as it stands, because a DataSet holds all the data in memory.

 

As I already stated, we are working in the DWH/BI space and it is not uncommon for a data source to hold hundreds of millions of rows.
So it is simply not feasible to first copy all those rows into a DataSet and then write them back to the Teradata database.

 

Do you have an example of how to use this "iterated request" feature "on the fly", i.e. keeping only a limited number of rows in memory at a time rather than all of them?

 

Thanks for your support.

 

Best Regards,
Michael

Teradata Employee

Re: Fast insert of millions of rows

I don't have an example at the moment.

Can you Clear / Remove the first batch of rows from the DataSet, Add a new batch of rows into the DataSet, and Update again?
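
Roughly like this, for example, assuming the source can be exposed as an IDataReader and the adapter is set up with an INSERT command and UpdateBatchSize as in the earlier sketch (sketch only, untested):

using System.Data;
using Teradata.Client.Provider;

static class ChunkedCopy
{
    // Stream rows from any IDataReader into Teradata, chunkSize rows at a time,
    // so only one chunk is ever held in the DataTable.
    public static void CopyInChunks(IDataReader source, TdDataAdapter adapter,
                                    DataTable buffer, int chunkSize)
    {
        int inBuffer = 0;
        while (source.Read())
        {
            var row = buffer.NewRow();
            for (int i = 0; i < buffer.Columns.Count; i++)
                row[i] = source.GetValue(i);
            buffer.Rows.Add(row);

            if (++inBuffer == chunkSize)
            {
                adapter.Update(buffer);   // send this chunk as iterated INSERTs
                buffer.Clear();           // drop the rows; memory use stays flat
                inBuffer = 0;
            }
        }
        if (inBuffer > 0)
            adapter.Update(buffer);       // flush the final partial chunk
    }
}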

Re: Fast insert of millions of rows

Hi Fred,

 

Using the DataSet to copy the data in batches works. Thanks for the hint.

But it turns out not to be fast enough for our purposes.

We currently need to insert 1,000,000 rows (9 columns, short strings in each cell) within roughly 1 minute 30 seconds.

 

To get there, I implemented the producer/consumer pattern:
one producer reads the data source, and multiple consumers all read from that producer, each with its own database connection, and all of them insert into the same table.
Running one producer thread with 5 to 10 consumer threads, I can copy the 1,000,000 rows from the source into the Teradata table within our time limit.

That's great.
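
In outline it looks roughly like this (heavily simplified; BuildBatches and BuildInsertAdapter stand in for my actual reader and adapter setup, and a BlockingCollection is just one way of wiring the two sides together):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data;
using System.Threading.Tasks;
using Teradata.Client.Provider;

static class ParallelLoader
{
    public static void Run(string connectionString, int consumerCount)
    {
        // A bounded queue keeps the producer from running far ahead of the consumers.
        var queue = new BlockingCollection<DataTable>(boundedCapacity: 20);

        var producer = Task.Run(() =>
        {
            foreach (DataTable batch in BuildBatches(50000))
                queue.Add(batch);
            queue.CompleteAdding();               // lets the consumer loops finish
        });

        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Run(() =>
            {
                // Every consumer has its own connection and its own adapter.
                using (var conn = new TdConnection(connectionString))
                {
                    conn.Open();
                    TdDataAdapter adapter = BuildInsertAdapter(conn);
                    foreach (DataTable batch in queue.GetConsumingEnumerable())
                    {
                        adapter.Update(batch);    // iterated INSERTs for this chunk
                        batch.Dispose();
                    }
                }
            });
        }

        producer.Wait();
        Task.WaitAll(consumers);
    }

    // Placeholder: yields the source data as DataTable chunks of 'size' rows each.
    static IEnumerable<DataTable> BuildBatches(int size) { yield break; }

    // Placeholder: builds a TdDataAdapter with its INSERT command bound to 'conn',
    // as in the earlier sketch.
    static TdDataAdapter BuildInsertAdapter(TdConnection conn) { return new TdDataAdapter(); }
}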

 

But when I go beyond 1,000,000 rows, say 10,000,000, I get a run-time exception from the TdDataAdapter.Update method (see below).

 

Have you seen this error before?
Could you give me a hint on how best to tackle it?
Can I just catch it and retry? Or might part of the DataSet already be inserted, so that running the Update again would produce duplicates in the target table?

 

Thanks for your help.

 

Best Regards,
Michael


Teradata.Client.Provider.TdException
  HResult=0x80004005
  Source=.NET Data Provider for Teradata
  StackTrace:
   at System.Data.Common.DbDataAdapter.UpdatedRowStatusErrors(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount)
   at System.Data.Common.DbDataAdapter.UpdatedRowStatus(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount)
   at System.Data.Common.DbDataAdapter.Update(DataRow[] dataRows, DataTableMapping tableMapping)
   at System.Data.Common.DbDataAdapter.UpdateFromDataTable(DataTable dataTable, DataTableMapping tableMapping)
   at System.Data.Common.DbDataAdapter.Update(DataSet dataSet, String srcTable)
   at HighPerformanceInserts.LoadData.CopyData() in C:\Projects\HighPerformanceInserts\LoadData.cs:line 64
   at HighPerformanceInserts.LoadData.Start() in C:\Projects\HighPerformanceInserts\LoadData.cs:line 26
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()

Inner Exception 1:
SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

 

Teradata Employee

Re: Fast insert of millions of rows

I think you are right to be concerned; there doesn't seem to be enough information to know if some (or all) of the rows in the batch actually got applied.

 

I wonder whether you are running into a deadlock between multiple consumers / load sessions, and that is why the request times out.

 

In TPT LOAD and TPT UPDATE, all the utility sessions for a job share the same special table-level lock, but you cannot share locks across normal SQL sessions. Fewer instances and more frequent commits (with smaller batches) are the approaches typically used to reduce lock conflicts, but that runs counter to your need to load faster overall. For TPT STREAM with multiple instances (similar to your application), the "Serialize" option sends rows with the same PI to the same consumer (instead of simple round robin) to minimize locking. You might explore that idea; see the sketch below.
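
Applied to your producer, that would mean choosing the consumer from a hash of the primary index value instead of round robin, so rows with the same PI always go through the same session. A rough sketch (the per-consumer queues and the PI column name are assumptions about your code; any stable hash works, it does not have to match Teradata's row hash):

using System;
using System.Collections.Concurrent;
using System.Data;

static class PiRouter
{
    // Send each row to a fixed consumer chosen from its primary index value,
    // so rows with the same PI (and hence the same row hash) never end up on
    // two different sessions at the same time.
    public static void Route(DataRow row, BlockingCollection<DataRow>[] queues, string piColumn)
    {
        string piValue = Convert.ToString(row[piColumn]);          // assumes a single-column PI
        int target = (piValue.GetHashCode() & 0x7fffffff) % queues.Length;
        queues[target].Add(row);
    }
}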

 

Another option sometimes used by applications to avoid lock conflicts is to have each consumer load into its own intermediate table (often NoPI) and then combine all the data with an INSERT/SELECT at the end of the job.
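
The consolidation step could then look roughly like this (MyDb.Target and MyDb.Stage_n are placeholder names; each consumer would have filled its own MyDb.Stage_i beforehand):

using Teradata.Client.Provider;

static class Consolidate
{
    // Combine the per-consumer staging tables into the real target with one
    // INSERT/SELECT per staging table, all from a single SQL session.
    public static void Run(string connectionString, int consumerCount)
    {
        using (var conn = new TdConnection(connectionString))
        {
            conn.Open();
            for (int i = 0; i < consumerCount; i++)
            {
                string sql = "INSERT INTO MyDb.Target SELECT * FROM MyDb.Stage_" + i;
                using (var cmd = new TdCommand(sql, conn))
                {
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}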