How to read all data from a large table with minimum locking

Database
Enthusiast

How to read all data from a large table with minimum locking

Hi all, 

I'm new to TD and quite new to databases. 

I need to periodically scan all data in my TD. The goal is to make it fast and with minimum locking, so that other transactions don't have to wait, or their waiting time is minimized. I don't care about data consistency at all and allow dirty reads.

What I did so far was simply to iterate over the whole table, paginating it by 10000 rows, using something like:
LOCKING ROW FOR ACCESS SELECT TOP page_size * FROM table WHERE index_column > last_index_value ORDER BY index_column

This approach does not work for large tables with no unique index. I noticed that queries with an ORDER BY clause and a WHERE on the index column run very long - probably a full table scan (not really sure why). Such a query exceeds the spool space if the table is bigger than the maximum spool size.

 

I looked at the FastExport tool but I'm not sure how it handles locking - the documentation mentions that the lock is kept until all data from the query has been moved to spool, so I guess this is not the ideal way. Also, from what I understood, I would need to do the pagination myself.

 

I'm trying to think of any other approach (not necessarily a SQL query) to achieve that. The schemas of the tables to scan vary, and I'd like to create something that will suit all tables well.

Is there a way to do it well?

 

Thanks,

Marek

20 REPLIES
Apprentice

Re: How to read all data from a large table with minimum locking

Hi Marek,

Your query "LOCK ROW FOR ACCESS  TOP page_size SELECT * FROM table WHERE index_column > last_index_value ORDER BY index_column" will often do a full  table scan (FTS) each time you execute it UNLESS the 'index_column' is a numeric data type and/or this column is defined as the PPI of the table.

 

If that column is the Primary Index of the table then the query will be an FTS, all qualifying rows will go into spool and then the 'TOP page_size' number of rows will be returned to your application.

 

You've got the 'locking for access' clause which will avoid lock contention for you and anyone else, so that is good.
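Just to be explicit about placement (the table and column names below are placeholders, not your real ones): the locking modifier goes in front of the SELECT, and TOP goes right after the SELECT keyword, something like:

LOCKING ROW FOR ACCESS
SELECT TOP 10000 *
FROM my_db.my_table
WHERE index_column > 12345   -- last value seen on the previous page
ORDER BY index_column;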

 

If you're trying to return 'x' number of rows in each query and you really can take advantage of an index, then you might want to try the following approach (this assumes that you know which column(s) you're going to 'page' on - but it sounds like you've already done this):

SELECT databasename
      ,tablename
      ,ROW_NUMBER() OVER(ORDER BY databasename, tablename) AS seqnbr
FROM dbc.allrightsv
QUALIFY seqnbr MOD 100 = 0;

1) Decide which column(s) to use for 'paging' control. In this example I'm using 'databasename' and 'tablename'

2) Run the above query once and store the output in your application.

3) Each returned row gives you the data values for the start of a 'page' - in my example a page is 100 rows.

Assume the first three returned rows contain:

DatabaseName TableName
db1          tbl1
db8          tbl78
db10         tbl4

Your first page is all values between db1.tbl1 and db8.tbl78

Your second page is all values between db8.tbl78 and db10.tbl4

 

Yes this query will process all rows in the table, but the spool file will not be the size of the whole table. Teradata only puts the required columns into the spool file.
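To then fetch one 'page' of full rows, your application would run a query like the one below, plugging in the boundary values it stored (an illustrative sketch - the boundary condition is spelled out column by column rather than using a row-value comparison):

LOCKING ROW FOR ACCESS
SELECT databasename, tablename
FROM dbc.allrightsv
WHERE (databasename > 'db1' OR (databasename = 'db1' AND tablename >= 'tbl1'))
  AND (databasename < 'db8' OR (databasename = 'db8' AND tablename < 'tbl78'));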

 

HTH

Dave


Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Teradata Employee

Re: How to read all data from a large table with minimum locking

As Dave said, the access lock will not block anything else, and if you are only reading, then the lock will remain an access lock.  Also, the only way to avoid a table scan is to qualify the select on a column used to define a Partitioned Primary Index (PPI).  But we cannot assume that all tables are partitioned.
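For illustration, if a table did happen to be partitioned by a date column, qualifying on that column is what lets the optimizer read only the relevant partitions (the names here are made up):

LOCKING ROW FOR ACCESS
SELECT *
FROM my_db.sales_tbl          -- assumed to have a PPI defined on sale_date
WHERE sale_date BETWEEN DATE '2015-01-01' AND DATE '2015-01-31';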

"paginating the whole table by 10000 rows" - This may be practical in sequential databases, but there is hardly anything sequential about Teradata.  You must assume data is stored randomly over many processors (AMPs).

"probably whole table scan (not really sure why)" - Because data is scattered.  The table cannot be sorted and screened without reading all of it.

"Is there a way to do it well?" If you really need to "scan" it all, just do: select an-attribute-column from table-name where an-attribute-column = some-random-value.  If the attribute-column is not indexed, the whole table will be scanned.  You will only retrieve one column to spool, and the result set will have a smaller number of rows.

As someone new to Teradata, you might be interested to know there are classes at university.teradata.com.

Teradata Employee

Re: How to read all data from a large table with minimum locking

From your description, you seem to be building an OLTP application. Certain databases are optimized for building OLTP applications. DB2, for example, has two features that can optimize such use cases. DB2 has an OPTIMIZE FOR n ROWS clause when declaring a cursor for pagination. This tells the optimizer that not all rows from the result set will be retrieved, so the DB2 optimizer can choose to run the query incrementally. Secondly, DB2 has B-tree style indexes, so it can use them for traversing small amounts of data when required in a particular order.
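For example, in DB2 the cursor declaration can carry that hint directly (an illustrative sketch with made-up names; this is DB2 syntax, not Teradata):

DECLARE page_cur CURSOR FOR
  SELECT order_id, order_dt
  FROM orders
  WHERE order_id > ?
  ORDER BY order_id
  OPTIMIZE FOR 100 ROWS;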

 

Although Teradata is quite flexible, it is still primarily targeted at handling large amounts of data using set-based operations. Teradata, for example, can't join tables incrementally and return results before the entire join is evaluated. Teradata uses indexes mostly for filtering rather than ordering. (For the record, Teradata does have "value-ordered secondary indexes", but they are fitted on top of the Teradata File System, which is optimized for hash indexes, so they can only order by a single numeric or date column of no more than 4 bytes.) Also, FastExport is a utility that is designed to export large amounts of data faster; it will not really run your underlying query any faster.
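For what it's worth, such a value-ordered secondary index would be defined roughly like this (names are placeholders, and the ORDER BY VALUES column has to be a single numeric or date column of at most 4 bytes):

CREATE INDEX (txn_date) ORDER BY VALUES (txn_date)
ON my_db.my_table;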

 

Having said all that, you can still get decent performance if you carefully design your database and application. As others have suggested, if you are new to Teradata, try reading up on some of the basic Teradata concepts and features to help you get the most out of it.

 

Good luck.

Enthusiast

Re: How to read all data from a large table with minimum locking

@DaveWellman 


1) Decide which column(s) to use for 'paging' control. In this example I'm using 'databasename' and 'tablename'

2) Run the above query once and store the output in your application.

3) Each returned row gives you the data values for the start of a 'page' - in my example a page is 100 rows.

I want to write something that does not require a manual decision on the columns to be used for paging. I guess I could refer to the internal Teradata dictionary tables to fetch all information on indexes and partitioning methods, however I'm not sure how portable that would be across different Teradata versions.
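For example, I could probably pull the index definitions from the DBC views with something like the query below (names are placeholders, and I'd still need to verify which IndexType codes to treat as primary/partitioned - 'P' and 'Q' as far as I can tell):

SELECT ColumnName, IndexType, UniqueFlag, ColumnPosition
FROM dbc.IndicesV
WHERE DatabaseName = 'my_db'
  AND TableName = 'my_table'
ORDER BY IndexNumber, ColumnPosition;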

I also played with qualifying by ROW_NUMBER, but in my case it led to the same problem - exhausting the spool for very large tables.

Enthusiast

Re: How to read all data from a large table with minimum locking

@GJColeman

"probably whole table scan (not really sure why)" - Because data is scattered.  The table cannot be sorted and screened without reading all of it.

In fact Teradata does not behave as I expected. I thought that selecting the TOP N rows with WHERE index_column > some_value ORDER BY index_column would be optimized so that no full table scan is required (as theoretically it could be achieved just by comparing index values).

 


"Is there a way to do it well?" If you really need to "scan" it all, just do: select an-attribute-column from table-name where an-attribute-column = some-random-value.  If the attribute-column is not indexed, the whole table will be scanned.  You will only retrieve one column to spool, and the result set will have a smaller number of rows.

But how can I ensure that in this case I will eventually fetch all data from the table?

Enthusiast

Re: How to read all data from a large table with minimum locking

Although Teradata is quite flexible, it is still primarily targeted at handling large amounts of data using set-based operations. Teradata, for example, can't join tables incrementally and return results before the entire join is evaluated. Teradata uses indexes mostly for filtering rather than ordering. (For the record, Teradata does have "value-ordered secondary indexes", but they are fitted on top of the Teradata File System, which is optimized for hash indexes, so they can only order by a single numeric or date column of no more than 4 bytes.)

I'm aware that the type of queries I'm asking about doesn't suit Teradata well. This is certainly not the primary use of the database - it keeps running its normal operations, and I'm trying to create a side task that will periodically scan the whole content looking for potentially unwanted data.

Enthusiast

Re: How to read all data from a large table with minimum locking

@padhia @GJColeman @DaveWellman

Many thanks for your valuable input. I thought about it a bit over the weekend and came up with the following solution, which should suit my needs. For every table to scan I will run a pre-query to figure out how to query the data properly.

 

(PSEUDO CODE)
partitions ::= (SELECT DISTINCT PARTITION FROM table)
for part in partitions:
    ranges ::= (SELECT index_column, COUNT(*) AS cnt
                FROM table
                WHERE PARTITION = part
                GROUP BY index_column
                ORDER BY index_column)
    -- calculate index_column_min / index_column_max pairs so that the
    -- sum of cnt per range is, say, between 10000 and 15000
    for each such range:
        SELECT * FROM table
        WHERE PARTITION = part
          AND index_column BETWEEN index_column_min AND index_column_max
    end for
end for

The outer loop divides the whole problem according to the table's internal partitioning, so that each result set is smaller. I found that there is this virtual PARTITION column which I can use in the WHERE clause. Then I will prepare a set of ranges on index_column according to my needs. Finally, I will fetch all columns based on the current partition and current range.

It requires some data to be kept on the client side, but it seems it may help to produce smaller queries that will not exceed the spool. I may also detect the situation where, e.g., a certain index_column value occurs so many times that I would rather skip it.
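In concrete SQL the three steps would look roughly like this (placeholder names, and everything under an access lock):

-- 1) list the physical partitions
LOCKING ROW FOR ACCESS
SELECT DISTINCT PARTITION FROM my_db.my_table;

-- 2) value histogram for one partition, used to build ranges of roughly 10000-15000 rows on the client side
LOCKING ROW FOR ACCESS
SELECT index_column, COUNT(*) AS cnt
FROM my_db.my_table
WHERE PARTITION = 3
GROUP BY index_column
ORDER BY index_column;

-- 3) fetch one chunk
LOCKING ROW FOR ACCESS
SELECT *
FROM my_db.my_table
WHERE PARTITION = 3
  AND index_column BETWEEN 1000 AND 1999;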

Teradata Employee

Re: How to read all data from a large table with minimum locking

Always assume that ORDER BY requires a sort.  Data is not stored in any sequence, not even index data.  Also, there are several types of indexes, one of which is a Partitioned Primary Index.  The PARTITION value will have no meaning if there is no PPI.  (Note that PIs and PPIs do not store the index values anywhere - they are just hashing algorithms, not physical indexes, so reading primary index values requires reading the whole table.)
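A quick sanity check before relying on that pseudo-column: on a table without a PPI every row reports partition 0, so a query like this (placeholder table name) would come back with a single group:

LOCKING ROW FOR ACCESS
SELECT PARTITION, COUNT(*) AS row_cnt
FROM my_db.my_table
GROUP BY PARTITION
ORDER BY PARTITION;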

More importantly, why do you need to fetch all the rows? If you want to create a large flat file containing some or all of the columns from every row, the TPT Utility's Export Operator will do that far more efficiently. But then how do you know which rows are unwanted? You are not going to look at each one on your screen, are you? You must have some rules in mind for determining unwanted data; those rules can usually be expressed in SQL, and when they can't, you can write a UDF to analyze them. If you are planning to write code to analyze the data anyway, why not make it in the form of a UDF so you can analyze it in the database? Apologies for asking so many questions - apparently I cannot guess what the ultimate problem is that you are trying to solve...

Enthusiast

Re: How to read all data from a large table with minimum locking


GJColeman wrote:

More importantly, why do you need to fetch all the rows? If you want to create a large flat file containing some or all of the columns from every row, the TPT Utility's Export Operator will do that far more efficiently. But then how do you know which rows are unwanted? You are not going to look at each one on your screen, are you? You must have some rules in mind for determining unwanted data; those rules can usually be expressed in SQL, and when they can't, you can write a UDF to analyze them. If you are planning to write code to analyze the data anyway, why not make it in the form of a UDF so you can analyze it in the database? Apologies for asking so many questions - apparently I cannot guess what the ultimate problem is that you are trying to solve...


The goal is to create a database reader that will pass the data to a different module for processing. I just gave "filtering unwanted data" as an example; there might be more complex processing required. Unfortunately, using a UDF is not possible, as the processing rules may be complex and may be defined at runtime, and I don't see how to easily translate them to UDFs. I just need to read the data of the whole table (and eventually all tables, one by one), regardless of its size, without blocking other tasks (my task has the lowest priority) and while minimizing the risk of exceeding the assigned spool space. I need to read all rows, but the row order does not matter to me (e.g. every run of the program may return data in a different order).

I would need to have a look at the TPT Utility, but what I need is full automation, so interactive tools cannot be used. I also cannot save the whole table as a temporary file, as I assume that some tables might be huge. I'd rather not save anything on the client side, just keep reading the table chunk by chunk into memory and pass it on for processing.

For FastExport I saw some examples with named pipes, which looked promising. However, from what I understand, FastExport prepares the query internally in more or less the same way as a normal SELECT - so I cannot just ask it to get all data, as this would 1) probably lock the table for reading and 2) exceed the spool if the table is larger than the spool limit.
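For reference, the kind of FastExport script I was experimenting with looks roughly like this - writing to a named pipe so nothing large lands on the client disk (logon details, paths and table names are placeholders, and I haven't confirmed how it behaves on really big tables):

.LOGTABLE scratch_db.fexp_restartlog;
.LOGON tdpid/username,password;
.BEGIN EXPORT SESSIONS 4;
.EXPORT OUTFILE /tmp/table_pipe MODE RECORD FORMAT TEXT;
LOCKING ROW FOR ACCESS
SELECT * FROM my_db.my_table;
.END EXPORT;
.LOGOFF;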