Dual Active: ROI on system recovery automation?


Dual Active. What are we doing?

Automated recovery from a system going down, planned or not, is a complicated task because of the sheer number of states the logic executing on the systems can be in. The more complex your batch system, the less disciplined your use of a structured environment, and the larger the number of edge cases, the more herculean the task becomes. Rapid Application Development (RAD) methods amplify the problem considerably.

It is my assertion that, as an analytics industry, we’re doing this the wrong way. The typical environment grew up as a single image until the business became so dependent on it that its SLA requirements grew to the point of needing a dual active environment. The problem is that the environment now consists of hundreds of applications and batch systems, none of which understand asynchronous processing, and we never intended those applications to support dual or multi-stream processing with zero discrepancies.

Asynchronous processing is hard; just look at what any programmers’ magazine has published on multi-threading over the last 10 years. In Stevens’ Advanced Unix Programming bible, there is a 4-page lecture on the subject of race conditions alone, and even that assumes a shared memory environment, not one in which we have to build infrastructure to move data between systems AND track the state of that data on systems 1000 miles apart.
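To make the multi-system race concrete, here is a minimal sketch in Python. It is my own illustration, not taken from any real batch system: the two update functions and the starting value are hypothetical. It only shows that when two systems receive the same updates in a different order, with no shared state to arbitrate, they can end up disagreeing.

```python
def add_fee(balance):
    """Hypothetical update 1: add a flat fee of 10."""
    return balance + 10


def double(balance):
    """Hypothetical update 2: double the balance."""
    return balance * 2


def apply_all(start, updates):
    """Apply the updates in the order this system happened to receive them."""
    value = start
    for update in updates:
        value = update(value)
    return value


# Both systems get the same two updates, but delivery order differs.
system_a = apply_all(100, [add_fee, double])  # (100 + 10) * 2
system_b = apply_all(100, [double, add_fee])  # (100 * 2) + 10

print(system_a, system_b, system_a == system_b)
```

Nothing here is even concurrent; ordering alone is enough to create a discrepancy, which is why forcing sequence across two systems matters so much.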

Yes, you can use coding standards and discipline to force sequence, and this works up to a point. But when your batch environment consists of 50K chains of logic, or ~250K function units, is given care and feeding by 200 developers in 15 organizations, and those developers want to impose constant change with RAD methods, standards-based solutions become as difficult as debugging multi-system race conditions.

The only options in the current dual active environments are to a) make your read/write applications fully asynchronous, essentially embedding the state-handling logic in each application, or b) force your applications to update and track state in a centralized location. Both force you to rewrite all your applications, with some serious standards discipline required.
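As a rough sketch of what option b) asks of every application, consider the Python below. Everything in it is an assumption made for illustration: `StateRegistry`, the system names, and the step names are hypothetical and do not describe TMSM or any Teradata interface. The point is that each batch step has to report its completion to a central place, so that recovery can work out what a returning system missed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StateRegistry:
    """Hypothetical central store of completed batch steps per system."""

    completed: dict[str, list[str]] = field(default_factory=dict)

    def mark_done(self, system: str, step: str) -> None:
        # Every application must call this after each step it finishes.
        self.completed.setdefault(system, []).append(step)

    def missing_on(self, system: str, reference: str) -> list[str]:
        """Steps finished on `reference` but not on `system`, in run order."""
        done_here = set(self.completed.get(system, []))
        return [s for s in self.completed.get(reference, []) if s not in done_here]


registry = StateRegistry()
for step in ("load_orders", "load_customers", "build_aggregates"):
    registry.mark_done("system_a", step)

# system_b went down after its first step.
registry.mark_done("system_b", "load_orders")

replay = registry.missing_on("system_b", reference="system_a")
print(replay)  # ['load_customers', 'build_aggregates']
```

The registry itself is trivial. The cost is in the `mark_done` call: every one of the applications and batch chains has to be changed to make it, reliably, which is the rewrite I am worried about.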

If I boil down the basic problem of Dual Active, it is that we are trying to create a geographically global shared memory system to maintain state. This is a very complicated environment to build, requiring a great deal of coordination across many disparate organizational areas, and I assert that few companies have the discipline, technical expertise, or stomach to implement it.

TMSM is, to some extent, an effort to implement this kind of state tracking environment. I really hope the team is successful, but my fear is that the effort will turn into the state tracking environment of option b) above, which will force the re-architecture and recoding of all the applications touching a system, and we’re right back where I started...

We’ve taken several runs at this over the last five years, and we get better at it every time, but we never get close to the end state. I question the return on investment in implementing that last bit of nirvana, the ability to automate recovery from system up/down conditions. I certainly don’t have solutions, but I think we have to recognize there is a general problem before we start figuring out how to convert all this legacy, waterfall-based application infrastructure.

Next: Dual Loading or RAD on TD.

3 Comments
Enthusiast
Hi Michael

Is there any existing dual active TD system out there? Is eBay building the first dual active system of the TD world? :)
Enthusiast
Like most Dual Active implementations, we're on the journey. We Dual Load, but from a single source. We allow users to query both systems, but we do not abstract the two systems into a single face for users. Some applications hide the two systems, some not. Recovery from downtime is manual... no, we're far from the first...
akd
N/A
With dual loads, how do you fix bad data issues now that they are on both of your systems? Do you perform regular backups? Trying to get a feel for your BAR strategy.