Planning Extended Teradata System outages with Unity 15.10



Benjamin Franklin once observed "...in this world nothing can be said to be certain, except death and taxes". Planned downtime can easily be added to that list. It's a certainty that any given IT system will require some planned downtime to address security fixes, make configuration changes, or expand storage and compute capacity. Smart businesses stay ahead of these demands and promptly patch or upgrade their systems, but that often comes at a cost in downtime. Paradoxically, as business systems become more critical, this cost increases and becomes harder to justify, leading to pressure not to perform necessary updates, which puts those systems at risk. Fortunately, Teradata has solved this paradox with Unity.

Using Unity 15.10, it is possible to take planned Teradata system outages without a business impact. One of Unity’s core benefits is the ability to sustain long planned outages without taking applications offline. Since Unity virtualizes access across multiple Teradata systems, application workloads can continue to function when one of the Teradata systems is taken offline, because they are shifted to a second, synchronized Teradata system. This can be used in a variety of situations, from short outages for simple maintenance work or restarts to longer outages for Teradata system expansions, upgrades or even full replacements. One Teradata customer avoided a total of over 3.5 days of outage in one quarter while performing upgrades and expansions of their Teradata system, all without any business downtime. This tangible business value provides an obvious return on their investment.

Unity provides two elements that make this possible. First, it tracks a state for each Teradata system, which can be easily changed to control access to the system. The system state is used by both passive routing and managed routing. Using the HALT operation, you can close down and quiesce all sessions on a Teradata system. If a user’s routing rule allows them to fail over to a second system, they will do so when the system becomes OUT OF SERVICE. Second, Unity’s recovery of managed sessions is a powerful mechanism that allows Teradata systems to be re-synced automatically.

Unity 15.10 can manage outages on three levels: the entire system, a database, or an individual table. All three are useful in different situations. Most commonly, table or database outages are used to orchestrate daily activity for a specific application or business process. System-level outages are commonly taken during a major database upgrade, or to expand or even replace the underlying managed Teradata system. Since these activities normally take 2 to 4 days, they would be very disruptive without Unity to provide high availability.

Build experience before attempting an extended outage

The ability to sustain extended planned outages is one of the most compelling benefits Unity provides, so it is one of the driving factors for adding Unity to a multisystem environment. This does not mean that environments that are new to Unity should immediately attempt a multi-day extended outage as soon as they have Unity in production. Smart shops implement high availability systems carefully and methodically. It is essential that the entire multisystem environment is mature and stable before a multi-day outage is attempted. Operations staff need time, over a period of months, to build up experience with Unity and the environment before being pushed into a major exercise. Beyond operating and monitoring Unity, a typical system outage will normally involve a complex series of steps across all of the Teradata ecosystem products, so it’s important that the operations staff can operate Unity with full confidence and familiarity. As a best practice, organizations should have at least 6-12 months of operational experience before attempting an extended outage. To help build experience, Unity can first be used for shorter outages of a few hours.

Ecosystem planning, Backups and External access

Going into an extended outage, it’s important to develop a complete schedule that includes all of the components of the ecosystem, such as Data Mover, Ecosystem Manager, backups, Viewpoint, etc. In particular, backups must not be executed on the OUT OF SERVICE Teradata system during the outage period or while it is recovering. Attempting to run a backup will interrupt the recovery process and will also produce a backup that is inconsistent with the active state of the system.

Allowed Time for a Planned Outage

This chapter primarily deals with environments that use Unity’s managed routing to perform data synchronization across two or more Teradata systems. Unity also provides passive routing for reporting and sandbox workloads – but since passive routing doesn’t perform any data synchronization, there is no limit imposed by it on how long a system can be out-of-service, and generally no consideration or planning required for it. For managed sessions, the amount of time a Teradata system can be out-of-service for a planned outage is limited by the volume of data and SQL requests that are executed in managed sessions on the other, still active, Teradata systems.

Measuring the time allowed for a planned outage

In planning for an extended outage that will cover multiple days, it is essential to measure and understand the daily profile of load volumes. It’s tempting to compile data with a finer granularity, but that has little practical value when planning for a multi-day outage. Here’s an example profile taken from one Teradata customer:

There is no visible metric displayed on the Unity Viewpoint portlet that tracks daily recovery log or data space use. To collect this information, a sample script is attached that can be run on the second Unity server (the one with the standby sequencer) to capture this space usage. These metrics will be added as an enhancement to Unity in a future release.

While it’s a common and natural assumption that it is best to schedule an outage for the Teradata system on the weekend, the above profile clearly shows that the weekend actually has a heavier workload than the weekdays. If a two- or three-day outage is required, Tuesday or Wednesday might actually be the best day of the week to start it.

Having this profile is critical for planning, because it allows you to determine how much of the recovery log and data space will be consumed during the outage. In this example, if the outage started on Friday and lasted until Sunday, we would expect roughly 54.7 GB of recovery log space and 3.5 TB of data space to be used.
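The arithmetic behind such an estimate can be sketched in a few lines of Python. The daily figures below are hypothetical placeholders chosen for illustration (they are not the customer profile discussed above), and the function names are likewise assumptions; substitute the values measured in your own environment.

```python
# Hypothetical daily profile: (recovery log GB, /data TB) written per day.
# These numbers are illustrative placeholders, not measured values.
WEEK = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
DAILY_PROFILE = {
    "Mon": (12.0, 0.5),
    "Tue": (8.0, 0.5),
    "Wed": (8.0, 0.5),
    "Thu": (10.0, 0.5),
    "Fri": (15.0, 1.0),
    "Sat": (20.0, 1.0),
    "Sun": (20.0, 1.5),
}


def outage_usage(start_day, days, profile=DAILY_PROFILE):
    """Sum recovery log (GB) and data space (TB) over `days` consecutive days."""
    start = WEEK.index(start_day)
    log_gb = 0.0
    data_tb = 0.0
    for offset in range(days):
        day = WEEK[(start + offset) % len(WEEK)]
        gb, tb = profile[day]
        log_gb += gb
        data_tb += tb
    return log_gb, data_tb


def best_start_day(days, profile=DAILY_PROFILE):
    """Return the start day that consumes the least recovery log space."""
    return min(WEEK, key=lambda day: outage_usage(day, days, profile)[0])


if __name__ == "__main__":
    print(outage_usage("Fri", 3))   # usage for a Friday-to-Sunday outage
    print(best_start_day(3))        # cheapest day to start a 3-day outage
```

Comparing the totals against the recovery log size and the free /data space shows whether the planned window fits, and which start day leaves the most headroom.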

Maximizing the time allowed for a planned outage

The space available in the recovery log is much smaller (typically 100 to 200 GB) than the space available in the /data file system, which starts at 7 TB and can be expanded by adding Unity expansion servers. To maximize the recovery window available for outages, it is important to follow normal database best practices for loading data. Large volumes of data should be loaded via bulk load protocols like FastLoad, MultiLoad, TPT Load or TPT Update, etc. These protocols are designed for much higher data volumes than normal SQL. Using a bulk load protocol for these large loads is a best practice for Teradata that becomes even more important when used with Unity.

ETL developers will sometimes break this best practice (accidentally or out of laziness) when there is an existing load job that normally performs a trickle feed of daily data into the data warehouse. If they reuse the job, without converting it to a bulk load job, for a one-time, very large load of historical data, it can consume an unusually high amount of the recovery log and put the recoverability of the Teradata systems at risk. This is because it fills the Unity recovery log with data that should rightfully be stored in the much larger /data file system. This can drastically reduce the time that it is possible to sustain an outage on a Teradata system.

Safeguarding against Rogue Load Jobs

To protect against load jobs consuming too much of the recovery log (when they should instead use the bulk load space), Unity has several protection mechanisms that should be used. Note that these settings should be tuned based on the size of the recovery log, the required recovery window duration, and the volume of workloads going through Unity. The commonly recommended setting value is provided only as a rough guide.

Unity has two alarm thresholds that can raise an alert if too much of the recovery log is being used overall or by an individual process. To warn if an individual session is consuming too much of the recovery log, use these settings:

Name                                   | Description                                                    | Commonly recommended setting
RecoveryLogGrowthSessionAlertRate      | Recovery log (bytes) consumed in the last 60 minutes           | 5% of recovery log size
RecoveryLogGrowthSessionAlertThreshold | Recovery log (bytes) consumed over the lifetime of the session | 2.5% of recovery log size

To warn if all sessions are using too much of the recovery log, use:

Name                       | Description                                                                  | Commonly recommended setting
RecoveryLogGrowthAlertRate | Overall Recovery log (bytes) consumed by all sessions in the last 60 minutes | 10% Recovery Log size

Unity also has two limits that can be used to automatically kill a session that consumes too much of the recovery log.

Name                                  | Description                                                          | Commonly recommended setting
RecoveryLogGrowthSessionKillRate      | Recovery log (bytes) consumed in the last 60 minutes                 | 10% of recovery log size
RecoveryLogGrowthSessionKillThreshold | Total recovery log (bytes) consumed over the lifetime of the session | 5% of recovery log size
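Because the recommendations above are expressed as percentages of the recovery log size, it can help to convert them into byte values for a given log size. The following is a minimal sketch: the setting names and percentages come from the tables above, while the 100 GiB log size and the helper name are assumptions made for illustration. How the values are applied to the Unity configuration is not shown, and the exact unit each setting expects should be confirmed against the Unity documentation.

```python
GIB = 1024 ** 3

# Percentages of the recovery log size, taken from the tables above.
RECOMMENDED_PERCENT = {
    "RecoveryLogGrowthSessionAlertRate": 5.0,
    "RecoveryLogGrowthSessionAlertThreshold": 2.5,
    "RecoveryLogGrowthAlertRate": 10.0,
    "RecoveryLogGrowthSessionKillRate": 10.0,
    "RecoveryLogGrowthSessionKillThreshold": 5.0,
}


def settings_in_bytes(recovery_log_bytes, percents=RECOMMENDED_PERCENT):
    """Convert each percentage guideline into a byte count."""
    return {
        name: int(recovery_log_bytes * pct / 100)
        for name, pct in percents.items()
    }


if __name__ == "__main__":
    # Assumed recovery log size of 100 GiB; replace with the real size.
    for name, value in settings_in_bytes(100 * GIB).items():
        print(f"{name} = {value}")
```

Treat the output as a starting point only; as noted above, the values should be tuned to the recovery window and workload volume of the environment.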

Starting an extended outage

A planned system outage is started by performing a HALT on a Unity managed Teradata system. This can be done via Viewpoint or the unityadmin command line. The HALT operation will wait for a period of time (controlled by the config setting HaltTimeout) for in-flight transactions to finish on the Teradata system. During this time, new transactions are paused while the operation waits for current transactions to complete. If the timeout passes and the in-flight transactions have not completed, the HALT operation will fail. In this situation, the DBA can either retry the HALT and wait, or manually abort the in-flight transactions in Unity. To make this decision, it helps to have a sense of how long the transactions involved normally take. Data from the Teradata DBQL tables can be used to find this answer.

Increasing the HALT timeout will make it more likely the operation will succeed, but will also increase the time that new transactions are paused.

Safeguarding against accidental recovery

During an extended outage, there is a possibility that a network drop or another unexpected event might cause the system’s dispatcher processes to disconnect from and reconnect to the active Unity sequencer. When this happens, the sequencer will automatically begin recovery of the system and attempt to put it back in service prematurely.

This problem is easy to avoid if you shut down the dispatcher processes for the Teradata system once the system has been made OUT OF SERVICE. Each Teradata system has two dispatcher processes associated with it (an active and a standby). To shut down the dispatchers, log in (as root or using sudo) to the Unity servers and do the following:

1. Check which Unity server hosts the standby dispatcher for the system that is OUT OF SERVICE:

unityadmin> status;
Sequencer: region1_seq-unity1(active), Repository: unity1(active)
Sequencer: region2_seq-unity2(standby-synchronized), Repository: unity2(standby - synced)

Endpoint region1_ept: any(not listening):1025
Endpoint region2_ept: any(not listening):1025

System db1(unrecoverable) Tables: OOS 0, standby 0, unrecoverable 4, interrupted 0, restore 0, read-only 0, active 0
Dispatcher status: region1_dsp_db1(active) - up, region2_dsp_db1(standby) - up
Gateway status: db1 - db1cop1(up)

System db2(unrecoverable) Tables: OOS 0, standby 0, unrecoverable 4, interrupted 0, restore 0, read-only 0, active 0
Dispatcher status: region1_dsp_db2(standby) - up, region2_dsp_db2(active) - up
Gateway status: db2 - db2cop1(up)

2. Log in to the Unity server with the standby dispatcher and shut it down:

# unity stop [standby dispatcher process name]

3. Now log in to the Unity server where the dispatcher is ACTIVE and shut it down:

# unity stop [active dispatcher process name]

STOP! Make sure you ONLY shut down the specific dispatcher process on each Unity server. Do not shut down any other processes.

When the outage is complete and you are ready to bring the system back to the ACTIVE state, first start the active dispatcher, and then the standby dispatcher. Recovery of the system will start automatically as soon as the active dispatcher connects to the active sequencer.
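The check in step 1 can also be scripted. The sketch below parses the "Dispatcher status" line from saved `status;` output and returns the dispatcher process names in the order the steps above stop them (standby first, then active). It assumes the output layout matches the sample shown in step 1; the function names are hypothetical, and the script only reads text, it does not run any Unity command.

```python
import re

# Sample text copied from the status output shown in step 1.
SAMPLE_STATUS = """\
System db1(unrecoverable) Tables: OOS 0, standby 0, unrecoverable 4, interrupted 0, restore 0, read-only 0, active 0
Dispatcher status: region1_dsp_db1(active) - up, region2_dsp_db1(standby) - up
Gateway status: db1 - db1cop1(up)

System db2(unrecoverable) Tables: OOS 0, standby 0, unrecoverable 4, interrupted 0, restore 0, read-only 0, active 0
Dispatcher status: region1_dsp_db2(standby) - up, region2_dsp_db2(active) - up
Gateway status: db2 - db2cop1(up)
"""

# Matches e.g. "region1_dsp_db1(active) - up".
DISPATCHER_RE = re.compile(r"(\w+)\((active|standby)\)\s*-\s*(\w+)")


def dispatcher_roles(status_text, system):
    """Return {role: process_name} for one system, or {} if not found."""
    lines = status_text.splitlines()
    for index, line in enumerate(lines):
        if not line.strip().startswith(f"System {system}("):
            continue
        for follow in lines[index + 1:]:
            follow = follow.strip()
            if follow.startswith("System "):
                break  # reached the next system's section
            if follow.startswith("Dispatcher status:"):
                return {role: name
                        for name, role, _state in DISPATCHER_RE.findall(follow)}
    return {}


def stop_order(status_text, system):
    """Process names in shutdown order: standby first, then active."""
    roles = dispatcher_roles(status_text, system)
    return [roles["standby"], roles["active"]]


if __name__ == "__main__":
    print(stop_order(SAMPLE_STATUS, "db2"))
```

Reversing the returned list gives the start order used when the outage ends (active first, then standby).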

Monitoring the time remaining during an outage

While a Teradata system is in the disconnected, interrupted or unrecoverable state, Unity will keep track of how far behind it is, and the time left to return the system to the active state before it can no longer be recovered and will fall to the unrecoverable state.

You can monitor how far behind a system is through the Unity Viewpoint portlet:

…Or in the unityadmin command line, using the system status command:

---------------------------------------------
System ID              : 2
System TDP ID          : db2
Region ID              : 2
Region Name            : prod2
State                  : Disconnected
ETA till unrecoverable    : > 2 day(s)
  (with maximum workload) : 12 hour(s), 14 minute(s)
Log space remaining       : 91%

Time to Recover a System

When a planned or unplanned outage is complete, a DBA can initiate a full system recovery to bring the system back to the active state by replaying all the requests sent in managed sessions. How long this takes depends on a variety of complex factors, including the size of the Teradata system and the concurrency of the workloads. In a typical environment that includes a mixture of applications doing reads and writes, the recovery of a system should typically take less than the total time the system was down. This is because during recovery the system only needs to replay the write requests that it missed.

In a workload that is very write heavy, with little or no read traffic, the time taken for a system to recover might be longer, and may equal the time of the outage window. This is especially true if the client workload continues to drive the active systems at their full throughput while the recovering system is trying to catch up. While the system is recovering, Unity’s recovery log and data file system provide the recovering system extra capacity to sync up, since they are still storing incoming write requests at the top of the recovery queue as the recovering system is replaying them from the bottom of the queue.
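A simplified model makes this reasoning concrete. Assume, purely for illustration, that the active systems generate write work at a constant rate and that the recovering system replays it at a constant, higher rate; neither assumption comes from Unity itself, and real recovery times depend on the factors described above. The backlog built up during the outage then shrinks at the difference between the two rates.

```python
def recovery_hours(outage_hours, write_rate, replay_rate):
    """Estimate catch-up time under a constant-rate model.

    write_rate  -- work units per hour generated by the active systems
    replay_rate -- work units per hour the recovering system can replay
    Returns infinity when replay cannot outpace incoming writes.
    """
    if replay_rate <= write_rate:
        return float("inf")
    backlog = outage_hours * write_rate       # work queued during the outage
    return backlog / (replay_rate - write_rate)


if __name__ == "__main__":
    # Mixed workload: replay is three times faster than incoming writes.
    print(recovery_hours(48, write_rate=1.0, replay_rate=3.0))
    # Write-heavy workload: replay is only twice as fast as incoming writes.
    print(recovery_hours(48, write_rate=1.0, replay_rate=2.0))
```

In this model a 48-hour outage recovers in 24 hours when replay runs three times faster than the incoming writes, but takes the full 48 hours when it runs only twice as fast, which mirrors the write-heavy case described above.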

You can monitor the progress of recovery on the Unity Viewpoint portlet:

… or on the unityadmin command line, using the system status command. Note that the ETA is given in seconds.

---------------------------------------------
System ID              : 2
System TDP ID          : td2
Region ID              : 2
Region Name            : region1
State                  : Restore
System DBS Release     : 14.10.07.01
System DBS Version     : 14.10.07.01
ETA till active           : 7342
Percentage of Log replayed: 19

unityadmin>
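Since the ETA is reported in seconds, a small helper can make it easier to read. The sketch below parses the "name : value" lines of the sample output above and converts the ETA into hours, minutes and seconds. It assumes the output has been saved as text in the layout shown; the function name is hypothetical.

```python
from datetime import timedelta

# Sample text copied from the system status output shown above.
SAMPLE = """\
---------------------------------------------
System ID              : 2
System TDP ID          : td2
Region ID              : 2
Region Name            : region1
State                  : Restore
System DBS Release     : 14.10.07.01
System DBS Version     : 14.10.07.01
ETA till active           : 7342
Percentage of Log replayed: 19
"""


def parse_system_status(text):
    """Turn 'name : value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


status = parse_system_status(SAMPLE)
eta = timedelta(seconds=int(status["ETA till active"]))
summary = (f"{status['State']}: {status['Percentage of Log replayed']}% "
           f"replayed, ETA {eta}")
print(summary)  # Restore: 19% replayed, ETA 2:02:22
```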

As recovery progresses, you should see the number of tables left in the RESTORE state drop. It is normal to see recovery progress in spurts because of long-running load jobs that may cause it to appear like nothing is happening for long periods of time. However, over the course of hours, you should see the number of tables in the ACTIVE state grow as the number in the RESTORE state drops.

Occasionally, if there are issues that cause timeouts (again, because of long-running load jobs), you may see the entire system become interrupted. This is a normal part of recovery. After a period of time (controlled by the configuration setting RecoveryInterval), recovery will restart from the place where it left off, and you will see the number of interrupted tables quickly drop to its previous level.

Monitoring system recovery for issues following a long outage

Following an extended outage, it’s important to monitor the progress of the system recovery and respond to any alerts or interrupted sessions. This is because the IDs assigned to client sessions are reused by Unity, and it is these session IDs that are used to sequence requests during recovery. If an issue causes a request on a session ID to fail during recovery with an interrupt, that will block any later sessions in the recovery that reuse the same session ID. Consequently, it’s important to address any issues in the recovery process promptly, as they appear.
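The blocking effect described above can be illustrated with a toy model. This is only a sketch of the idea, not Unity's actual recovery algorithm: it assumes requests are replayed in order, and that once a request is interrupted, every later request carrying the same session ID is held until the problem is resolved. All names and IDs are made up for the example.

```python
def replay(requests, failing):
    """Toy replay loop.

    requests -- list of (request_id, session_id) in recovery order
    failing  -- set of request_ids that fail with an interrupt
    Returns (applied, held): request IDs replayed vs. held back.
    """
    blocked_sessions = set()
    applied, held = [], []
    for request_id, session_id in requests:
        if request_id in failing or session_id in blocked_sessions:
            # The failure, and anything later on the same session ID, waits.
            blocked_sessions.add(session_id)
            held.append(request_id)
        else:
            applied.append(request_id)
    return applied, held


if __name__ == "__main__":
    # Session ID 1001 is reused by later client sessions (requests 3 and 5).
    queue = [(1, 1001), (2, 1002), (3, 1001), (4, 1003), (5, 1001)]
    print(replay(queue, failing={1}))
```

A single unresolved failure on request 1 holds back requests 3 and 5 as well, which is why interrupts should be addressed as soon as they appear.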

This could happen for any number of reasons: for example, users mistakenly accessing one Teradata system directly (not through Unity), DDL that was run as part of the system maintenance but missed on the other systems, or any other human or process error. Here are some of the most common issues:

    • Database space issues
    • Missing database grants or users
    • Locked tables because of direct access to the Teradata system

These conditions are so common that DBAs should anticipate them before a long recovery and have a plan prepared to address them should they occur.

The best place to monitor issues that appear during recovery is on the interrupted session screen, or using the unityadmin command ‘session list interrupted’. The interrupted session screen divides sessions that are interrupted into ‘Root causes’ and ‘Secondary causes’. It is normal for sessions to occasionally become interrupted if they are waiting on other sessions to finish their recovery first. These sessions appear as ‘Secondary causes’ and no actions need to be taken to address them.

Skipping Requests

Most Teradata environments have 10,000 to 100,000 tables, so if 5 or 10 of them cannot be recovered automatically, it should not be a major concern. If for some reason a request repeatedly fails recovery, you can elect to SKIP the request. If there are tables involved, you should use the option to make the tables unrecoverable. It’s important to stay focused on recovering the entire system, rather than being concerned about a small number of tables that are unrecoverable.

Skipping tables

Alternatively, you can mark the tables unrecoverable to have them skipped by the recovery process. Following a prolonged outage, if you know there are data sync issues on specific tables introduced by changes made directly on the system, it’s preferable to deactivate those tables pre-emptively so that they are skipped in the recovery process, rather than have them cause interrupts during recovery.

Completing the outage

Once the outage is finished, you can list any tables that failed the recovery process using the unityadmin command:

unityadmin> object list unrecoverable;
  State of ds.customer(2064) on system 2 (td2): unrecoverable
  State of ds2.customerLog(2065) on system 2 (td2): unrecoverable
  State of ds2.customerLogDetail(5001) on system 2 (td2): unrecoverable
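To hand this list to a resync process, the output can be parsed into table names grouped by system. The following is a minimal sketch that assumes the output has been saved as text in the layout shown above; the function name is hypothetical, and the step that actually triggers the resync is deliberately left out.

```python
import re
from collections import defaultdict

# Sample text copied from the command output shown above.
SAMPLE = """\
unityadmin> object list unrecoverable;
  State of ds.customer(2064) on system 2 (td2): unrecoverable
  State of ds2.customerLog(2065) on system 2 (td2): unrecoverable
  State of ds2.customerLogDetail(5001) on system 2 (td2): unrecoverable
"""

UNRECOVERABLE_RE = re.compile(
    r"State of (?P<table>\S+)\((?P<object_id>\d+)\) "
    r"on system (?P<system_id>\d+) \((?P<tdpid>\w+)\): unrecoverable"
)


def unrecoverable_tables(text):
    """Return {tdpid: [database.table, ...]} for every unrecoverable table."""
    tables = defaultdict(list)
    for line in text.splitlines():
        match = UNRECOVERABLE_RE.search(line)
        if match:
            tables[match.group("tdpid")].append(match.group("table"))
    return dict(tables)


if __name__ == "__main__":
    for tdpid, names in unrecoverable_tables(SAMPLE).items():
        print(tdpid, names)
```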

As a final step, you can use Ecosystem Manager to validate that the tables are, in fact, out of sync, and then use Ecosystem Manager workflows to run Data Mover jobs to resync them as time permits. There is no great hurry, since business applications are still online and operating on the remaining active copies of the tables on the other Teradata systems.

Conclusion

Maintenance is a fact of life, but it doesn’t have to impact your business users. Unity provides an optimal way to accommodate planned outages without a cost to SLAs. This value will be obvious to both CIOs and CFOs.
