A high-availability system must identify and correct errors, exceptions and failures in a timely and reliable manner to meet challenging service level objectives. The Teradata database, together with the utilities and components used to load and access data, provides the capabilities needed to implement reliable error and exception handling. Combined with a well-designed high-availability architecture, these capabilities allow a Teradata Active Enterprise Intelligence (AEI) system to meet the service level objectives required to support mission-critical business processes.
This series of articles focuses on the error handling functionality and restart capabilities of the Teradata database and the Teradata data load utilities.
Examples of error and restart handling for the following utilities are included in this series:
In most cases, a load job for any of these utilities can be restarted once the underlying problem has been corrected, whether the interruption was caused by a database failure and reset, a load process error or failure, or a failure of the load utility client platform.
An Active Enterprise Intelligence Ecosystem consists of not only the Teradata database subsystem but all of the surrounding subsystems that require service from the database subsystem and provide services to the database subsystem. The following diagram illustrates the major subsystems in an AEI Ecosystem.
Each subsystem must have error and failure handling capabilities. This series covers the capabilities and approaches for handling database, data integration and application integration errors and failures, with a focus on handling errors, exceptions and failures in a Data Integration subsystem that is implemented with Teradata load utilities.
A system can encounter multiple types of errors and failures. Error and failure handling processes must execute the following steps:
The first step is to identify that an error or failure has occurred. In most cases a failure of the database will be detected by either a data load client or a data access client during a database operation, and the database client must then classify and report the error. In other cases the client itself may fail during a database operation. In every case the error must be classified.
Errors in the database or a database client can be classified as:
The ability to restart Teradata load utility operations that are interrupted by an error or exception is a key feature of these utilities. Care must be taken to identify the error properly to determine whether a restart is possible and to configure the utility scripts properly for restarts.
It is critical that errors and exceptions be identified and classified. This is necessary to determine whether the problem can be isolated and the system returned to service automatically using Teradata high availability features or if the error requires intervention. In the case of intervention the type of error will determine whether the intervention is by a DBA, programmer or Teradata support personnel.
Teradata load utilities have robust error handling and restart capabilities. A database operation that is interrupted by a system error or by excessive data errors can be restarted rather than completely re-executed. A data integration operation may be interrupted by database errors, script errors, data errors or failures of a client process or platform. The Teradata load utilities provide the capability to restart a load operation in all of these cases.
Data integration operations are usually implemented as jobs that contain a series of steps. The steps may be defined in a single Teradata load utility script or as a series of steps that execute load utilities or SQL scripts. In each case the job should be step restartable: after an error, the job can be restarted at the step that was executing when the error occurred. In addition, each step should support being either restarted or re-run. A restarted step does not re-execute the database interactions that completed before the error. A re-run step re-executes all database interactions in that step and then continues with the succeeding steps.
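The difference between restarting and re-running a step can be illustrated with a minimal Python sketch of a job runner. It is not taken from any Teradata utility or scheduler: the `run_job` function, the JSON state file and the `resume` flag passed to each step are all hypothetical names chosen for illustration, and a real step would typically invoke a load utility script rather than a Python callable.

```python
import json
import os
from typing import Callable, List, Tuple

# Hypothetical step type: a name plus a callable that receives resume=True
# when it should continue from its own checkpoint (restart) and resume=False
# when it should execute from the beginning (first run or re-run).
Step = Tuple[str, Callable[[bool], None]]


def _save_state(path: str, state: dict) -> None:
    """Write the job state via a temporary file so a crash cannot truncate it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def run_job(steps: List[Step], state_path: str, rerun: bool = False) -> None:
    """Run steps in order, skipping steps that completed in an earlier attempt.

    If a step raises, the state file still names it as in progress, so the
    next call starts at that step. With rerun=False the step is asked to
    resume; with rerun=True it is executed from scratch.
    """
    state = {"completed": [], "in_progress": None}
    if os.path.exists(state_path):
        with open(state_path) as fh:
            state = json.load(fh)

    for name, action in steps:
        if name in state["completed"]:
            continue  # finished before the error occurred
        resume = state["in_progress"] == name and not rerun
        state["in_progress"] = name
        _save_state(state_path, state)  # record the step before executing it
        action(resume)  # an exception here leaves the state file in place
        state["completed"].append(name)
        state["in_progress"] = None
        _save_state(state_path, state)

    # The whole job succeeded, so the next scheduled run starts fresh.
    if os.path.exists(state_path):
        os.remove(state_path)
```

Calling `run_job(steps, "job.state")` a second time after a failure resumes the interrupted step, while `run_job(steps, "job.state", rerun=True)` re-executes that step in full. In both cases the steps that had already completed are skipped.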
Jobs can be defined and executed using job scheduling and control systems (e.g. Control-M, Tivoli, AutoSys), data integration systems (e.g. Informatica, Oracle ODI, DataStage), or scripts (e.g. shell scripts, Perl). In each case the job control mechanism should allow a job to be started at the step that was executing when an error occurred, and it should allow that step to be either restarted or re-run.
Teradata utilities allow steps defined within the utility scripts to start at the step where an error occurred or to be rerun completely. They also allow a job interrupted by a database reset to continue automatically once the database is back in service.
The utilities allow checkpoints to be established to prevent the re-execution of database operations that have completed before an error caused an interruption.
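The general idea behind a checkpoint can be sketched in a few lines of Python. This is only a conceptual illustration and does not describe how the Teradata utilities implement checkpoints internally: `load_with_checkpoints`, the `write_batch` callback and the checkpoint file are hypothetical, and the sketch assumes the input can be re-read in the same order after a failure.

```python
import os
from typing import Callable, Iterable, List


def _write_checkpoint(path: str, rows_done: int) -> None:
    """Persist the number of rows that have been written so far."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        fh.write(str(rows_done))
    os.replace(tmp_path, path)


def load_with_checkpoints(
    rows: Iterable,
    write_batch: Callable[[List], None],
    checkpoint_path: str,
    checkpoint_every: int = 1000,
) -> int:
    """Send rows in batches, recording a checkpoint after each batch.

    On a restart, rows counted in the checkpoint file are skipped instead
    of being sent again. Returns the total number of rows recorded as done.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = int(fh.read().strip() or 0)

    batch: List = []
    for index, row in enumerate(rows):
        if index < done:
            continue  # written before the interruption
        batch.append(row)
        if len(batch) == checkpoint_every:
            write_batch(batch)  # hypothetical callback that sends the rows
            done += len(batch)
            _write_checkpoint(checkpoint_path, done)
            batch = []

    if batch:  # final partial batch
        write_batch(batch)
        done += len(batch)
        _write_checkpoint(checkpoint_path, done)
    return done
```

In this sketch the write and the checkpoint update are separate operations, so a failure between them would cause one batch to be sent twice on restart. A smaller `checkpoint_every` value reduces the work repeated after a failure at the cost of more frequent checkpoint writes.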
Each of the load utilities allows parameters to be set that control:
All of the Teradata load utilities provide a return or completion code to the operating system or program that executed the utility. All job control definitions should check the return code and take the appropriate action based on its value. The Teradata utilities return the following completion codes:
In each case the specific database error must be found in the output log of the load utility to determine the next course of action (fix, restart or re-run). A job control system can parse the output logs to determine and classify the database errors, although that is beyond the scope of this series of articles. In all cases the underlying error must be identified, classified and fixed, and then the job restarted at the appropriate step.
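A job control script might check the completion code along the following lines. This is a minimal Python sketch under stated assumptions: `run_step`, the action names it returns and the `warning_codes` parameter are hypothetical, and the set of codes to treat as warnings should be taken from the documentation of the utility being run rather than from this example. The only convention the sketch relies on is that a return code of zero indicates normal completion.

```python
import subprocess
from typing import FrozenSet, Sequence


def run_step(
    command: Sequence[str],
    log_path: str,
    warning_codes: FrozenSet[int] = frozenset(),
) -> str:
    """Run one job step, capture its output in a log, and pick an action.

    command       -- the utility invocation, e.g. a load utility and its script
    log_path      -- file that receives the utility's output for later review
    warning_codes -- non-zero codes the job may tolerate (an assumption that
                     must be set per utility from its documentation)
    """
    with open(log_path, "w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)

    if result.returncode == 0:
        return "continue"
    if result.returncode in warning_codes:
        return "continue_with_warning"
    # Any other code stops the job: the specific database error has to be
    # found in the log before deciding to fix, restart or re-run the step.
    return "stop_and_inspect_log"
```

The function returns only a coarse decision. Deciding between a restart and a re-run still depends on the specific error recorded in the log file.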
In the following series of articles we will describe how to use the features of Teradata load utilities to handle database and client errors and failures.