Help with JDBC - addBatch using PreparedStatement for large amount of data

Ambassador

Re: Help with JDBC - addBatch using PreparedStatement for large amount of data


- For the INSERT performance, would you recommend the use of UNIQUE PRIMARY INDEX (code, user_id, timestamp) or simply using PRIMARY INDEX (code) should be sufficient?

As long as the number of rows per PI value is low, the duplicate-row checks are inexpensive; "low" usually means up to a few hundred rows.

For higher numbers you can switch to a MULTISET table, which avoids those checks (this is what other DBMSes implement; they don't have SET tables).
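As a minimal DDL sketch of the difference (table and column names are illustrative, not from this thread):

```sql
-- SET table (the default in Teradata session mode): exact duplicate rows
-- are rejected, which costs a duplicate-row check per PI value on INSERT.
CREATE SET TABLE events_set (
  code    INTEGER,
  user_id INTEGER,
  ts      TIMESTAMP(0)
)
PRIMARY INDEX (code);

-- MULTISET table: duplicate rows are allowed, so the check is skipped --
-- the behavior other DBMSes have by default.
CREATE MULTISET TABLE events_multiset (
  code    INTEGER,
  user_id INTEGER,
  ts      TIMESTAMP(0)
)
PRIMARY INDEX (code);
```

With many rows per PI value, the MULTISET variant makes batch INSERTs noticeably cheaper because no per-row duplicate comparison is needed.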

 


- From the very little I have learned by doing my homework, my approach to improve the SELECT performance would be to use a secondary index: INDEX (user_id, timestamp). 
This would not be ideal, as it is not unique and would still be an all-AMP operation, but at least it would not be a full-table scan. 

If the (user_id, timestamp) combination is used a lot you might implement it as PI.

 

The basic criteria for choosing the PI are:

  • ACCESS: the column(s) should be used a lot in joins (most important) and WHERE conditions (a bit less important), both based on equality.
  • DISTRIBUTION: the number of rows per value should not be too high (depending on the size of the table, up to the 10,000s).
  • VOLATILITY: the column(s) should be stable, i.e. hardly ever updated.

See Summary of Primary Index Selection Criteria

 

In fact, those rules are quite similar to the criteria for choosing Clustered Indexes in other DBMSes.

 

More info is needed:

- how many rows will be stored in that table (approx.) and the max number of rows per possible column (combination) in the PI?

- how is this table accessed, which column combination(s) are used in WHERE?

- are those columns updated?

 

For a large table you can also PARTITION it...

 

Enthusiast

Re: Help with JDBC - addBatch using PreparedStatement for large amount of data

The first part of your answer helped me a lot to understand more about PI definition.

 


If the (user_id, timestamp) combination is used a lot you might implement it as PI.

The reason I cannot use PRIMARY INDEX (user_id, timestamp) alone as the PI is that every time I perform the INSERT query I have 4M rows of data with the same user_id and same timestamp. Therefore, I do not have a uniform distribution of my data (one of the criteria for choosing the PI). This was actually my initial mistake (when I started this post) and the reason my batch insert was taking 15h, instead of 7 min. 

 

So, my choices were either UNIQUE PRIMARY INDEX (code, user_id, timestamp) or PRIMARY INDEX (code).

I guess everyone choosing a PI faces a trade-off decision. So, I will explain my reasoning here to get your insights.

My trade-off was between DISTRIBUTION and VOLATILITY only, because both of my options above were not very good in terms of ACCESS (only PRIMARY INDEX (user_id, timestamp) would be good for ACCESS, but, as I mentioned, it is so bad in terms of DISTRIBUTION that it was already eliminated as an option).

 

The combination (code, user_id, timestamp) is updated every time a new batch INSERT is issued, so it is not stable, while (code) is never updated after the first INSERT. With the combination (code, user_id, timestamp), I would have a single row per PI value. With (code), I may have from 5 to 10 rows per PI value. Based on this, (code) seems to be the best option for PRIMARY INDEX. 

 


More info is needed:

- how many rows will be stored in that table (approx.) and the max number of rows per possible column (combination) in the PI?

- how is this table accessed, which column combination(s) are used in WHERE?

- are those columns updated?


Based on my new PI definition, things change a bit. 

- the max number of rows in the table will be approximately 400M, with a maximum number of 10 rows per PI value. 

- the table is accessed using user_id and timestamp (that's why I wanted them as my PI, but then I have the skewness issues)

- (code) is not updated

 

I thought about PARTITION too. So, I was not sure if SI or PARTITION would be better to improve my SELECT queries. 

But from what I have read about partitions, they help more with range queries than equality queries. So, I will read more about partitions to see what advantages they would have over the SI in my case.

Thanks a lot for all the info and insights!

Ambassador

Re: Help with JDBC - addBatch using PreparedStatement for large amount of data


So, my choices were either UNIQUE PRIMARY INDEX (code, user_id, timestamp) or PRIMARY INDEX (code).

I guess everyone choosing a PI faces a trade-off decision. So, I will explain my reasoning here to get your insights.

My trade-off was between DISTRIBUTION and VOLATILITY only, because both of my options above were not very good in terms of ACCESS (only PRIMARY INDEX (user_id, timestamp) would be good for ACCESS, but, as I mentioned, it is so bad in terms of DISTRIBUTION that it was already eliminated as an option).

If (user_id, timestamp) is good for access, but bad for distribution, it would still be a candidate for partitioning.

 

The combination (code, user_id, timestamp) is updated every time a new batch INSERT is issued, so it is not stable, while (code) is never updated after the first INSERT. With the combination (code, user_id, timestamp), I would have a single row per PI value. With (code), I may have from 5 to 10 rows per PI value. Based on this, (code) seems to be the best option for PRIMARY INDEX. 

Ouch, if the timestamp is updated, it's very bad for both the PI and partitioning. Updating a (P)PI column means a new hash value, which is similar to a DELETE and re-INSERT of the row in a different place.

 

 

Based on my new PI definition, things change a bit. 

- the max number of rows in the table will be approximately 400M, with a maximum number of 10 rows per PI value. 

- the table is accessed using user_id and timestamp (that's why I wanted them as my PI, but then I have the skewness issues)

- (code) is not updated

 

I thought about PARTITION too. So, I was not sure if SI or PARTITION would be better to improve my SELECT queries. 

But from what I have read about partitions, they help more with range queries than equality queries. So, I will read more about partitions to see what advantages they would have over the SI in my case.

A Secondary Index is mainly useful for a selective WHERE condition, and requesting 4 million out of 400 million rows is not selective.

 

But partitioning is great for access based on both equality and range.

Sounds like the best PI might be (code, user_id), which results in a good distribution and is stable. But this PI is not useful for access, thus it should be PARTITIONed BY RANGE_N(user_id BETWEEN ....).

Now you still update lots of rows with a new timestamp, but this column is not part of the PPI.
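Putting the pieces together, the suggested design might look like the sketch below (table name, column types, and the partition bounds are placeholders, not the poster's actual DDL):

```sql
-- Sketch: MULTISET skips duplicate-row checks on the 4M-row batch INSERTs,
-- the PI (code, user_id) distributes well and is never updated, and row
-- partitioning on user_id lets equality predicates on user_id prune
-- partitions instead of scanning the full table.
CREATE MULTISET TABLE events (
  code    INTEGER NOT NULL,
  user_id INTEGER NOT NULL,
  ts      TIMESTAMP(0) NOT NULL
)
PRIMARY INDEX (code, user_id)
PARTITION BY RANGE_N(user_id BETWEEN 1 AND 1000000 EACH 10000);
```

A SELECT with WHERE user_id = ? AND ts = ? would then scan only one partition per AMP, and since ts is outside both the PI and the PPI, updating it does not relocate rows.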

 

 

Enthusiast

Re: Help with JDBC - addBatch using PreparedStatement for large amount of data

dnoeth, thanks a lot! You were extremely helpful! 

Everything you said makes sense to me now and I will work from there. 

I had only worked with Java up to now, but DB is very interesting and it has way more details than I thought.

I hope to learn and work more with it, so I will probably have more questions here soon.

 

Thanks once again for all the patience, info, and insights you gave! A free DB consulting service. :) I owe you one!

Thanks a lot!