Need input : How to avoid dirty read without waiting for the table to be refreshed


Many of our jobs need to wait for upstream tables that take a really long time to complete. We have SLAs to meet
for our jobs and are OK with data as of yesterday in these upstream tables.

Now, the problem is that the upstream tables are refreshed with a DELETE followed by an INSERT (in the same BTEQ).

delete from DBNAME.TABLENAME
where UPI_COLUMNNAME in
( select UPI_COLUMNNAME
from DBNAME.INCR_DELTA_TABLENAME
)
;

insert into DBNAME.TABLENAME
(
COLUMN1,
COLUMN2,
etc.
)
;

This presents a serious risk: while our ETL process is running, we may not see a batch of records in the table
(if the script that refreshes the upstream table is running concurrently). To avoid this, is it OK to rewrite the
upstream scripts as below (i.e., turn them into a multi-statement request)?

delete from DBNAME.TABLENAME
where UPI_COLUMNNAME in
( select UPI_COLUMNNAME
from DBNAME.INCR_DELTA_TABLENAME
)
; insert into DBNAME.TABLENAME
(
COLUMN1,
COLUMN2,
etc.
)
;

Could anyone let us know whether this would solve the problem? Please also suggest any other ways to mitigate it.

P.S.: We access the upstream tables through views that place an ACCESS lock on the rows.
Enthusiast

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

Your proposed solution wouldn't guarantee consistent data either. A dirty read is an uncommitted read: you can see changes to data before a transaction has completed (successfully or not). A multi-statement request means that each statement must complete successfully before the transaction is committed, but with an access lock you can still see those changes while they are happening.

One option is to use a kind of table swapping approach to ETL. Basically you'll need two copies of the table in question (A and B). A is being used by the job that is updating the data and B is being used by the job that is reading data. Once table A has been updated, you can swap the tables so that the job that is reading reads from A and the job that is writing writes to B. You'll never have to worry about getting incomplete or in-process data with this approach.

However, there are some complexities. You'll also want a process to re-sync the tables after they have been swapped so the next time your write job runs, it has the correct set of data to operate on. And you'll also need a process to handle swapping the tables (either through a view change or rename).

One way to make this easier is to use views to hide the multiple tables. That way your ETL jobs don't need to be aware of which table is which. They point to a view and the view definition can be changed to point to the other table.
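As a sketch of that last point, assuming hypothetical names (a view TABLENAME_V over two copies TABLENAME_A and TABLENAME_B), the switch on the read side might look like this:

```sql
-- Readers always reference the view; only its definition is switched.
-- All object names below are illustrative, not from the original post.
replace view DBNAME.TABLENAME_V as
locking row for access
select * from DBNAME.TABLENAME_A;

-- After the load into TABLENAME_B completes, re-point the view:
replace view DBNAME.TABLENAME_V as
locking row for access
select * from DBNAME.TABLENAME_B;
```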

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

Thanks for your reply. But I am confused here, not because of your reply, but because of the way BT/ET works differently from multi-statement requests.

Let's consider the following scenario (this is a little lengthy).

Create a two-column table:

CREATE TABLE dbname.btet_test
(
id INTEGER,
name CHAR(10)
)
PRIMARY INDEX ( id );

Define a view on top of this as below

replace view dbname.btet_test_v as locking row for access select * from dbname.btet_test;

Populate this table with say 100 rows.

Then, write a BTEQ as below

.logon logon credentials;
.maxerror 1
BT;
delete from dbname.btet_test;
insert into dbname.btet_test values (1,'abcdefgh');
......
......
insert into dbname.btet_test values (100,'abcdefgh');
ET;

Execute the BTEQ. While it is running, query the table through the view continuously (from Queryman). You'll see the table's record count drop to 0 (or single digits) and then rise back up.

Now, create another BTEQ & have the statements as below

.logon logon credentials;
.maxerror 1
delete from dbname.btet_test
;insert into dbname.btet_test values (1,'abcdefgh')
;......
;......
;insert into dbname.btet_test values (100,'abcdefgh')
;

Execute this BTEQ and, while it is running, query the table through the view (from Queryman). In this case you will not see the table's record count drop; at the end you'll simply see 100 records in the table. I thought I understood how BT/ET works, but after seeing how multi-statement requests work, I am confused. Any help here is highly appreciated.

Junior Contributor

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

In your first example, 103 requests are sent across the network, and most of the runtime is spent waiting for the result of the previous request before sending the next.
The MSR, by contrast, is a single request; you probably can't submit queries from SQL Assistant fast enough to capture an inconsistent state.

Run the test with some larger INSERT/SELECTs and you won't see a difference anymore.
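A larger test along those lines might look like the sketch below. The use of sys_calendar.calendar as a convenient large row source and the generated values are illustrative assumptions, not from the original post:

```sql
BT;
DELETE FROM dbname.btet_test;
-- One set-based INSERT/SELECT that runs long enough to let a concurrent
-- access-lock SELECT catch the table in a mid-transaction state:
INSERT INTO dbname.btet_test
SELECT ROW_NUMBER() OVER (ORDER BY calendar_date), 'abcdefgh'
FROM sys_calendar.calendar;
ET;
```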

Dieter

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

Thanks Dieter. I do see the difference with a larger set of data. So there is no way to avoid the dirty read without placing a read lock on the table, which we don't want to do. Is there any other way to solve the problem described in my original post?
Enthusiast

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

If you place an exclusive lock on the table for the duration of the delete/insert, it will ensure the table's consistency.
So:

locking DBNAME.TABLENAME for Exclusive
delete from DBNAME.TABLENAME
where UPI_COLUMNNAME in
( select UPI_COLUMNNAME
from DBNAME.INCR_DELTA_TABLENAME
)
; insert into DBNAME.TABLENAME
(
COLUMN1,
COLUMN2,
etc.
)
;

It must be done in an MSR (or a BT/ET block) to ensure no other request jumps in between the delete and the insert.
The user must have the DROP right on the table. Also, any long-running queries accessing the target table will block the update; you may want to use NOWAIT mode and check for the resulting error.
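A sketch of that NOWAIT variant is below. The column list is abbreviated as in the original post, and the assumption that the insert simply reloads the delta rows is mine:

```sql
-- Sketch only: the select list and insert semantics are assumptions.
locking table DBNAME.TABLENAME for exclusive nowait
delete from DBNAME.TABLENAME
where UPI_COLUMNNAME in
( select UPI_COLUMNNAME
  from DBNAME.INCR_DELTA_TABLENAME
)
; insert into DBNAME.TABLENAME
select * from DBNAME.INCR_DELTA_TABLENAME
;
-- If the lock cannot be granted immediately, the request aborts
-- (error 7423, "Object already locked and NOWAIT"); the job can
-- trap that error and retry later instead of queuing behind readers.
```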
Junior Contributor

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

If locking the table is not acceptable because it delays things too much, then mnylin's suggestion of view switching is the way to go:

Create a copy of the target table once.
Now you have mytab#1 and mytab#2.

REPLACE VIEW mytab AS SELECT * FROM mytab#1;

Before the load starts you do:
DELETE FROM mytab#2;
INSERT INTO mytab#2 SELECT * FROM mytab#1;
Then you load into mytab#2.

When the load is finished you switch to the newly loaded version:
REPLACE VIEW mytab AS SELECT * FROM mytab#2;
This only blocks briefly while the view is being replaced.

All queries against mytab starting before the replace will finish using mytab#1, new queries start using mytab#2.

The next load is done on mytab#1 and so on...
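The steps above could be collected into one load script, for example (a sketch; the actual load into mytab#2 is abbreviated):

```sql
-- Before the load: re-sync the standby copy with the current version.
DELETE FROM mytab#2;
INSERT INTO mytab#2 SELECT * FROM mytab#1;

-- ... load / apply deltas into mytab#2 here ...

-- After the load: switch readers to the newly loaded version.
REPLACE VIEW mytab AS SELECT * FROM mytab#2;
```

The next run does the same with the roles of mytab#1 and mytab#2 reversed.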

Disadvantage:
- needs twice the disk space to store the table copy, but you could do the delete immediately after the load finishes
Advantages:
- always consistent data
- the "old" table is still available in case anything goes wrong, a kind of hot backup

Just don't do a table rename, as that starts blocking again.

Dieter
Enthusiast

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

This is surprising. I thought all changes between BT and ET are committed only at ET. Why did the table show its row count dropping and rising instead of only the final result at ET?
Enthusiast

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

We want to avoid dirty reads, as we have hourly loads, and we need a few clarifications before implementing the view switch.

If a SELECT and the view-switch macro execute at the exact same microsecond, what will happen to the SELECT in the situations below?

  • Will the view definition always be available to the SELECT, or is there a situation where the SELECT must wait for the REPLACE VIEW (assuming there is no view-lock concept)?
  • Can the SELECT statement fail?

Does changing views risk locks on the system catalog tables?

Thanks for your advice!

Teradata Employee

Re: Need input : How to avoid dirty read without waiting for the table to be refreshed

REPLACE VIEW takes an exclusive rowHash lock on the dictionary. This can momentarily block parsing of a new SELECT request that references the view.

Conversely, any request already parsed using the old view definition has resolved to the internal ID of the old table (and does not hold a lock on the view).

Note that SELECT requests don't acquire locks until they actually begin executing steps, so it is possible to DROP a table while requests that would use it are still sitting in a delay queue; in that case, those requests will fail once they are dispatched.