Businesses today depend on mission-critical applications for decision making, so continuity and availability are vital requirements for the success of Active Data Warehousing. Plans for recovering from any failure must therefore be built into the design and deployment of ETL jobs as early as possible. One of the main challenges for such a plan or ETL design is how to recover quickly from a failure. Recovery usually involves fixing the client or server systems, changing configuration parameters or system resources, restarting the interrupted jobs from their last checkpoints, and bringing the system back to normal without resorting to laborious manual effort or writing piecemeal recovery procedures. Jobs are often also required to catch up, so that transactions accumulated during the failure window can be applied quickly to the target systems. If such a recovery plan and its implementation are missing, or if they delay the recovery of the data warehouse, businesses that rely on timely delivery of critical information will be severely impacted.
To avoid such a breakdown, Teradata Parallel Transporter (TPT) provides a few unique features that speed up the recovery process without requiring changes to job scripts after a disaster strikes. Unlike the traditional utilities, in which checkpointing is optional, TPT runs all jobs in checkpoint mode by default. If a job fails, it can be restarted from the last checkpoint taken for that job.
This article will discuss:
A checkpoint is a job restart point created by a TPT job. If the job fails or is interrupted for any reason, it can be restarted from the checkpoint instead of from the beginning. The use of checkpoints can guarantee that most of the work performed by an interrupted job will not have to be redone, thus protecting a long-running job from having to be rerun from the start.
With TPT, every job takes at least two checkpoints: one before the data acquisition process begins (the Start-of-Data checkpoint) and one after the data acquisition process ends (the End-of-Data checkpoint). These two checkpoints guarantee that each TPT job can either restart automatically, without user intervention, or be restarted manually. In either case, the restart is based on the last checkpoint. Note that the time required for these two checkpoints is negligible, because no in-flight data buffers on the data streams need to be flushed for either checkpoint, and rows are committed to the target table(s) in only one of them, the End-of-Data checkpoint.
If a job fails before the Start-of-Data checkpoint (that is, before the "data acquisition" phase) because of a DBS "retry-able" error, such as a Teradata restart or a deadlock caused by resource contention, TPT automatically retries the job instead of terminating with an error exit code.
If a job fails after the End-of-Data checkpoint has been taken, the work accomplished between the two checkpoints does not have to be repeated by the restarted job. For example, once the producer operator has finished sending rows to the Load (or Update) operator, a later failure does not require the data to be re-sent. The Load (or Update) operator simply re-applies the data from its work table to the target table.
Although these two default checkpoints minimize the elapsed and CPU time spent on checkpointing, they offer only limited protection against redoing work. For example, if a job fails before the End-of-Data checkpoint is taken, the work accomplished after the Start-of-Data checkpoint is lost and the job must be restarted from the beginning. TPT therefore provides additional checkpoint capabilities so that users can define their own checkpoint frequency. Note that the more frequent the checkpoints, the less time it takes to recover a job, but the more time is spent taking checkpoints.
Time-driven checkpointing is specified through a tbuild command-line option. When you specify a checkpoint interval for a TPT job (with "tbuild -z nnn", where nnn is in seconds), the job takes a checkpoint each time the interval elapses in each of its data-movement job steps (job steps with producer and consumer operators). The smaller the checkpoint interval, the more checkpoints are taken. Frequent checkpoints guarantee that only a limited amount of work has to be redone if the job is interrupted and later restarted. However, a very short checkpoint interval can increase job running time, because the in-flight data buffers on the data streams between the producer and consumer operators must be flushed and committed to the target table(s) at each interval checkpoint. Choosing a checkpoint interval is therefore a trade-off between increased job run time and the potential reduction in repeated work if the job must be restarted.
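The trade-off can be made concrete with a rough back-of-the-envelope model. The sketch below is not part of TPT. It assumes, purely for illustration, that each interval checkpoint costs a fixed number of seconds and that a failure strikes, on average, halfway through an interval. The function names and all numbers are hypothetical.

```python
def checkpoint_overhead(job_seconds, interval_seconds, cost_per_checkpoint):
    """Total seconds spent taking interval checkpoints over the whole job."""
    checkpoints = job_seconds // interval_seconds
    return checkpoints * cost_per_checkpoint


def expected_rework(interval_seconds):
    """Average seconds of work redone after a failure (assumed mid-interval)."""
    return interval_seconds / 2


if __name__ == "__main__":
    job_seconds = 3600        # hypothetical one-hour load
    cost_per_checkpoint = 5   # hypothetical flush-and-commit cost, in seconds
    for interval in (60, 300, 900):
        overhead = checkpoint_overhead(job_seconds, interval, cost_per_checkpoint)
        rework = expected_rework(interval)
        print(f"-z {interval}: overhead={overhead}s, expected rework={rework}s")
```

Under these assumptions, shortening the interval reduces the expected rework in proportion but raises the total checkpoint overhead, which is the balance to weigh when choosing a value for the -z option.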
A TPT job can also be directed to take a checkpoint at any time through the External Command Interface, either explicitly with the JOB CHECKPOINT command option, or implicitly with either the JOB PAUSE or the JOB TERMINATE command option, which stops job execution after a checkpoint is taken.
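As a rough illustration of driving a checkpoint from outside a job, the sketch below checks a monitored condition and, when it is met, builds the external command to submit. It assumes that the External Command Interface is invoked with the twbcmd command in the form "twbcmd <job name> JOB CHECKPOINT"; verify the exact syntax against the documentation for your TPT release. The job name, the pending-file count, and the threshold are hypothetical, and the command is only built here, not executed.

```python
def should_checkpoint(pending_files, high_water_mark):
    """Return True once the monitored count reaches the high-water mark."""
    return pending_files >= high_water_mark


def checkpoint_command(job_name):
    """Build the external command (assumed syntax: twbcmd <job> JOB CHECKPOINT)."""
    return ["twbcmd", job_name, "JOB", "CHECKPOINT"]


if __name__ == "__main__":
    job_name = "daily_load"   # hypothetical job name
    pending_files = 120       # hypothetical count reported by a monitor
    if should_checkpoint(pending_files, high_water_mark=100):
        # A real procedure would hand this list to subprocess.run().
        print(" ".join(checkpoint_command(job_name)))
```

Keeping the decision separate from the command makes it easy to swap in a different trigger condition without touching the part that talks to TPT.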
Because the CHECKPOINT command is issued from outside the job, you can define the frequency of checkpoints to fit your requirements and take them whenever you want. You are not bound by the rules of interval checkpointing, which drives checkpoints at regular intervals throughout the job. With the CHECKPOINT external command, you can drive a checkpoint whenever a condition is met or a special event occurs. For example, you can set up a procedure to monitor the number of files to be processed by the Data Connector operator and drive a checkpoint when that number reaches a certain high-water mark. Here are some other conditions under which you may consider driving a checkpoint:
In addition to performing "active directory scan" with parallelism and scalability (please see the article "TPT Active and Batch Directory Scans"), the Data Connector operator also allows active files to be archived automatically once the data in the files has been committed to the Teradata database. To "commit" the data, the Data Connector operator initiates a checkpoint when all the rows in the files collected within a scan interval have been processed. This checkpoint is initiated automatically at the end of each processing cycle, hence the term "operator-initiated checkpoint".
The "operator-initiated checkpoint" also works in an integrated manner with the other checkpoint options, the user-driven checkpoint and the time-driven checkpoint. As a consequence, checkpoints can be taken based on the following criteria:
With these checkpoints, the data targets can be maintained with high data consistency and maximum data currency based on user requirements.
An automatic restart means a job can restart on its own, without manual resubmission of the job. With the Start-of-Data and End-of-Data checkpoints, a job can automatically restart itself when a "retry-able" error occurs (such as a database restart or deadlock) before, during, or after the loading of data. Consider the following when dealing with automatic restarts:
If a job fails and terminates, manual restart is accomplished by resubmitting the same job with the original job-launching command. By default, all TPT jobs are checkpoint restartable using one of the two checkpoints at Start-of-Data and End-of-Data.
TPT also provides recovery across job steps within a job. In other words, if a job has multiple steps, a checkpoint is created for each successful step, allowing the job to restart from the failed step and skip the steps that already succeeded. For example, if a step creates or drops tables before the data loading step begins, and the job fails in the data loading step, the restarted job resumes from the data loading step without recreating or dropping the tables. This is in contrast with some of the utilities such as FastLoad, whose scripts might contain DROP TABLE and CREATE TABLE statements and therefore cannot be reused across restarts, because those DDL statements would be re-issued.
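The step-level restart behavior can be pictured with a small conceptual model. The sketch below is not how TPT is implemented. It simply assumes that a record of completed steps survives a failure, and all step names are hypothetical.

```python
def run_job(steps, completed, fail_at=None):
    """Run steps in order, skipping those already recorded as completed.

    steps     -- ordered list of (name, action) pairs
    completed -- set of step names that finished in an earlier run
    fail_at   -- name of a step that should fail, to simulate an interruption
    Returns the list of step names executed in this run.
    """
    executed = []
    for name, action in steps:
        if name in completed:
            continue                  # finished earlier: skipped on restart
        if name == fail_at:
            raise RuntimeError(f"step {name} failed")
        action()
        executed.append(name)
        completed.add(name)           # stands in for the per-step checkpoint
    return executed


if __name__ == "__main__":
    steps = [("setup_tables", lambda: None), ("load_data", lambda: None)]
    completed = set()
    try:
        run_job(steps, completed, fail_at="load_data")   # first run fails
    except RuntimeError:
        pass
    print(run_job(steps, completed))                     # restart
```

In this model the restarted run executes only load_data, because setup_tables was recorded as completed before the failure, so its DDL is not issued a second time.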
Job checkpoint files are automatically created by TPT and deleted if the job completes without an error. However, you will need to remove checkpoint files before they are automatically deleted if you want to do either of the following:
TPT provides a special command for users to remove checkpoint files on Windows and UNIX based on either the user ID or job name.
If the "tbuild" command specifies a job name, the "twbrmcp <job name>" command can be used. If the "tbuild" command does not specify a job name, the "twbrmcp <user ID>" command can be used. On z/OS, checkpoint files can be deleted through the MVS/ISPF facility.
If you want to delete checkpoint files manually, one of the following commands can be used:
If you want to delete checkpoint files from a user-defined directory, one of the following commands can be used:
On UNIX:
rm <user-defined directory>/<job-name>.*
rm <user-defined directory>/<user-id>.*
On Windows:
del <user-defined directory>\<job-name>.*
del <user-defined directory>\<user-id>.*
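Where the same cleanup has to run unchanged on both UNIX and Windows, the rm and del commands above could be replaced by a small script. The following is a minimal sketch that applies the same "<job-name>.*" or "<user-id>.*" pattern to a user-defined checkpoint directory. The function name and the file names in the demonstration are hypothetical.

```python
import glob
import os
import tempfile


def remove_checkpoint_files(directory, prefix):
    """Delete every file in directory named <prefix>.* and return their names."""
    pattern = os.path.join(glob.escape(directory), glob.escape(prefix) + ".*")
    removed = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(os.path.basename(path))
    return removed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory with hypothetical file names.
    with tempfile.TemporaryDirectory() as directory:
        for name in ("myjob.one", "myjob.two", "otherjob.one"):
            open(os.path.join(directory, name), "w").close()
        print(remove_checkpoint_files(directory, "myjob"))
        print(os.listdir(directory))
```

Only files whose names begin with the given prefix followed by a dot are removed, so checkpoint files that belong to other jobs in the same directory are left in place.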