Teradata Parallel Transporter Checkpoints and Restartability

With today's businesses directly tied to mission-critical applications for decision making, continuity and availability are vital requirements for the success of Active Data Warehousing. As such, plans for recovery from any failure must be introduced into the design and deployment of ETL jobs as early as possible. One of the main challenges for such a plan or ETL design is how to quickly recover from a failure. This usually involves fixing the client or server systems, changing configuration parameters or system resources, restarting the interrupted jobs based on their last checkpoints, and bringing the system back to normal without resorting to rigorous manual efforts or writing piecemeal recovery procedures. Most of the time, jobs may also be required to perform catch-up so that transactions that were accumulated during the failure window can be quickly applied to the target systems. If such a recovery plan and the associated implementation are missing or cause delay to the recovery of the data warehouse, businesses that rely on timely delivery of critical information would be severely impacted.

To avoid such a breakdown, Teradata Parallel Transporter (TPT) provides a few unique features that allow you to speed up the recovery process without resorting to changing job scripts after a disaster strikes. As opposed to the optional checkpointing approach in traditional utilities, TPT runs all jobs in checkpoint mode by default; and if any one of them fails, it can restart based on the last checkpoint taken for the job.

This article will discuss:

  • Checkpoint processing, including the default Start-of-Data and End-of-Data checkpoints, time-driven checkpoints, user-driven checkpoints, and operator-initiated checkpoints
  • Restarting a job from a failure, both automatically and manually
  • Removing checkpoint files

Checkpoint Processing

A checkpoint is a job restart point created by a TPT job, which allows the job, should it fail or be interrupted for any reason, to be restarted from the checkpoint instead of from the beginning of the job. The use of checkpoints can guarantee that most of the work performed by an interrupted job will not have to be redone, thus protecting a long-running job from having to be rerun from the beginning.

With TPT, every job is required to take at least two checkpoints: one before the data acquisition process begins (the Start-of-Data checkpoint) and one after the data acquisition process ends (the End-of-Data checkpoint). These two checkpoints guarantee that each TPT job can either restart automatically without user intervention or be restarted manually. In either case, the restart is based on the last checkpoint taken. Note that the time required to take these two checkpoints is negligible, because no in-flight data buffers on the data streams need to be flushed for either checkpoint, and rows are committed to the target table(s) only at the End-of-Data checkpoint.
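
For example, no checkpoint-related options are needed for these two default checkpoints to be taken; a minimal job launch is enough. The script file name and job name below are hypothetical placeholders:

tbuild -f load_customers.tpt customer_load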

If a job fails before the Start-of-Data checkpoint (that is, before the data acquisition phase) due to a "retry-able" database error, such as a Teradata Database restart or a deadlock on resource contention, TPT automatically retries the failed job without terminating with an error exit code.

If a job fails after the End-of-Data checkpoint has been taken, the work accomplished between these two checkpoints does not have to be repeated by the restarted job. For example, if the producer operator has finished sending rows to the Load (or Update) operator, any failure after that point does not require the data to be re-sent; the Load (or Update) operator simply re-applies the data from its work table to the target table.

Although these two default checkpoints minimize the elapsed and CPU time spent on checkpointing, the protection they provide against redoing work is limited. For example, if a job fails before the End-of-Data checkpoint is taken, the work accomplished after the Start-of-Data checkpoint is lost and the data acquisition phase must be restarted from the beginning. Therefore, TPT provides additional checkpoint capabilities so that users can define their own checkpoint frequency. Note that the more frequently checkpoints are taken, the less work must be redone when a job is restarted, but the more time is spent taking checkpoints.

Time-driven Checkpoints

Time-driven checkpointing is specified through a tbuild command-line option. When you specify a checkpoint interval for a TPT job (through the "tbuild -z nnn" option, where nnn is a number of seconds), the job takes a checkpoint each time the specified interval elapses, in each of its data-movement job steps (job steps with producer and consumer operators). The smaller the checkpoint interval, the more checkpoints are taken. Frequent checkpoints guarantee that only a limited amount of work has to be redone if the job is interrupted and later restarted. However, specifying a very short checkpoint interval can increase job running time, because the in-flight data buffers on the data streams between the producer and consumer operators must be flushed and committed to the target table(s) for each interval checkpoint. As such, choosing a checkpoint interval is a trade-off between the cost of increased job run time and the potential reduction in repeated work if the job must be restarted.
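
For example, the following launch (reusing the hypothetical script and job name from the earlier sketch) takes a checkpoint roughly every 300 seconds in each data-movement job step:

tbuild -f load_customers.tpt -z 300 customer_load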

User-driven Checkpoints using the CHECKPOINT External Command

A TPT job can also be directed to take a checkpoint at any time through the External Command Interface, either explicitly with the JOB CHECKPOINT command option, or implicitly with either the JOB PAUSE or the JOB TERMINATE command option, which stops job execution after a checkpoint is taken.

Since the CHECKPOINT command is issued from outside the job, you can define the frequency of checkpoints to fit your requirements and execute these checkpoints at any time you want. You are not bound by the rules of interval checkpointing, which drives checkpoints at regular intervals throughout the job. With the CHECKPOINT external command, you can drive a checkpoint whenever a condition is met or a special event occurs. For example, you can set up a procedure to monitor the number of files to be processed by the Data Connector operator and drive a checkpoint when that number reaches a certain high-water mark. Here are some other conditions under which you might consider driving a checkpoint (a command sketch follows the list):

  • When you want to take a checkpoint at a specific time.
  • When the number of rows loaded since the last checkpoint goes beyond a high-water mark.
  • When you want to terminate a job in a graceful manner and restart it later.
  • When you want to switch some of the load protocols (e.g. switching from the Stream operator to the Update operator for "catch-up" purposes) so that the new protocol can pick up from where it left off based on the most recent checkpoint.
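
As a rough sketch of how such a checkpoint might be driven from outside the job, the commands below assume the TPT external command program twbcmd and a job identifier such as customer_load-123 reported at job launch; both the program name and the identifier format are assumptions to be confirmed against the documentation for your TPT version:

twbcmd customer_load-123 JOB CHECKPOINT

twbcmd customer_load-123 JOB PAUSE

twbcmd customer_load-123 JOB TERMINATE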

Operator-initiated Checkpoints

In addition to performing "active directory scan" processing with parallelism and scalability (see the article "TPT Active and Batch Directory Scans"), the Data Connector operator can archive active files automatically once the data in those files has been committed to the Teradata database. To "commit" the data, the Data Connector operator initiates a checkpoint when all the rows in the files collected within a scan interval have been processed. Because this checkpoint is initiated automatically at the end of each processing cycle, it is called an "operator-initiated checkpoint".
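
As a rough illustration only, a Data Connector producer configured for active directory scan with automatic archiving might be defined along the lines of the sketch below. The directory paths and values are hypothetical, and the scan and archive attribute names shown (VigilWaitTime, VigilElapsedTime, ArchiveDirectoryPath) are assumptions that should be checked against the reference for your TPT version:

DEFINE OPERATOR FILE_SCANNER
DESCRIPTION 'Active directory scan with automatic archiving'
TYPE DATACONNECTOR PRODUCER
SCHEMA CUSTOMER_SCHEMA
ATTRIBUTES
(
  VARCHAR DirectoryPath        = '/data/incoming',   /* directory to scan (hypothetical path)   */
  VARCHAR FileName             = '*.txt',            /* wildcard selecting the active files     */
  VARCHAR Format               = 'Delimited',
  VARCHAR TextDelimiter        = '|',
  VARCHAR ArchiveDirectoryPath = '/data/archive',    /* committed files are moved here          */
  INTEGER VigilWaitTime        = 60,                 /* assumed: wait time between scans        */
  INTEGER VigilElapsedTime     = 480                 /* assumed: how long to keep scanning      */
);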

The "operator-initiated checkpoint" also works in an integrated manner with the other checkpoint options, the user-driven checkpoint and time-driven checkpoint. As a consequence, checkpoints can be taken in the following criteria:

  • For each file to be processed (operator-initiated checkpoint)
  • For a specific number of files to be processed (operator-initiated checkpoint)
  • For a set of files to be processed within an "active directory scan" cycle (operator-initiated checkpoint)
  • For a portion of a file to be processed (time-driven or user-driven checkpoint)

With these checkpoints, the data targets can be maintained with high data consistency and maximum data currency based on user requirements.

Restarting a Job from a Failure

Automatic Restart

An automatic restart means a job can restart on its own, without manual resubmission of the job. With the Start-of-Data and End-of-Data checkpoints, a job can automatically restart itself when a "retry-able" error occurs (such as a database restart or deadlock) before, during, or after the loading of data. Consider the following when dealing with automatic restarts:

  • Jobs can automatically restart as many times as is specified by the value of the RETRY option of the TPT job-launching command. By default, a job can restart up to five times.
  • If no checkpoint interval is specified for a job, and the job fails during processing, the job restarts either at the Start-of-Data checkpoint or the End-of-Data checkpoint depending on which one is the last recorded checkpoint in the checkpoint file.
  • To avoid reloading data from the beginning (especially for a long running job), specify a checkpoint interval when launching a job so the restart can be done based on the most recent checkpoint taken.

Manual Restart

If a job fails and terminates, manual restart is accomplished by resubmitting the same job with the original job-launching command. By default, all TPT jobs are checkpoint restartable using one of the two checkpoints at Start-of-Data and End-of-Data.
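
For example, if the interrupted job had been launched with the hypothetical command shown earlier, re-issuing exactly the same command restarts it from its last recorded checkpoint, because the job name (and therefore the checkpoint file names) match:

tbuild -f load_customers.tpt -z 300 customer_load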

TPT also provides recovery across job steps within a job. In other words, if a job has multiple steps, a checkpoint is recorded for each successful step, allowing a restarted job to skip the steps that already succeeded and resume from the failed step. For example, if a job has a step that creates or drops tables before the data loading step begins, and the job fails in the data loading step, the restart resumes from the data loading step without re-creating or re-dropping the tables. This is in contrast to utilities such as FastLoad, whose script might contain DROP TABLE and CREATE TABLE statements and therefore cannot be reused across restarts, because those DDL statements would be re-issued.
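
The skeleton below sketches how such a multi-step job might be structured; the object names are hypothetical, and the template operators ($DDL, $LOAD, $FILE_READER) are assumed to be available and would need job-variable and attribute setup that is not shown. If the load step fails, a restart skips the already-completed setup step:

DEFINE JOB customer_load
DESCRIPTION 'Set up and load a customer table'
(
  STEP setup_tables
  (
    /* DDL work; a restart does not re-run this step once it has succeeded */
    APPLY
      ('DROP TABLE targetdb.customer;'),
      ('CREATE TABLE targetdb.customer (cust_id INTEGER, cust_name VARCHAR(100));')
    TO OPERATOR ($DDL);
  );

  STEP load_data
  (
    /* a failure in this step restarts here, from the last checkpoint */
    APPLY ('INSERT INTO targetdb.customer (:cust_id, :cust_name);')
    TO OPERATOR ($LOAD)
    SELECT * FROM OPERATOR ($FILE_READER);
  );
);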

Removing Checkpoint Files

Job checkpoint files are created automatically by TPT and deleted when the job completes without error. If a job is interrupted, however, its checkpoint files remain on disk, and you must remove them yourself if you want to do either of the following:

  • Rerun an interrupted job from the beginning, rather than restart it from the last checkpoint taken before the interruption.
  • Abandon an interrupted job and run another job whose checkpoint files would have the same names as the existing ones, either because it uses the same job name or because it relies on the default checkpoint file names derived from the logon user ID.

TPT provides a special command for users to remove checkpoint files on Windows and UNIX based on either the user ID or job name.

If the "tbuild" command specifies a job name, the "twbrmcp <job name>" command can be used. If the "tbuild" command does not specify a job name, the "twbrmcp <user ID>" can be used. For z/OS, the deletion of checkpoint files can be done through the MVS/ISPF facility.

If you want to delete checkpoint files manually, one of the following commands can be used:

On UNIX:

rm $TWB_ROOT/checkpoint/<job-name>.*

rm $TWB_ROOT/checkpoint/<user-id>.*

On Windows:

del %TWB_ROOT%\checkpoint\<job-name>.*

del %TWB_ROOT%\checkpoint\<user-id>.*

If you want to delete checkpoint files from a user-defined directory, one of the following commands can be used:

On UNIX:

rm <user-defined directory>/<job-name>.*

rm <user-defined directory>/<user-id>.*

On Windows:

del <user-defined directory>\<job-name>.*

del <user-defined directory>\<user-id>.*

2 REPLIES

Re: Teradata Parallel Transporter Checkpoints and Restartability

Clearing the checkpoint files, while useful information, doesn't enable the user to load data into the table. Would you please outline what is necessary to reset the table for loading after the checkpoint files are removed? In the FastLoad days, running a script like the following cleared the load lock on the table instead of having to do a CREATE TABLE:
LOGON localtd/,;
begin loading MyTable errorfiles MyTable_ET, MyTable_UV
checkpoint 0 ;
end loading;
logoff;

What's the trick to this in TPT?

Re: Teradata Parallel Transporter Checkpoints and Restartability

I'm ok if someone drops linzhixiang's insightful response to my question.