What are some simple suggestions for avoiding Batch failures?

CIM, RTIM, TCIS
Customer Interaction Manager, Real-Time Interaction Manager, TD Channel Integration Services
Teradata Employee

Teradata Employee

Re: What are some simple suggestions for avoiding Batch failures?

No one likes to deal with a failed CIM job.  Nevertheless, failed jobs can and do occur.  While it is impossible to prevent all failed jobs, here are some suggestions that might help avoid at least some failures:

 

  • Test all changes in a test environment before implementing them in a production environment.  This covers any change to the CIM environment, such as metadata mapping changes, JDBC driver updates, application server configuration changes, etc.  Ensure that the test environment is properly synchronized with production.
  • Shut down the CIM application during planned database outages / maintenance (database backups, etc.).  Do not rely on CIM error handling features to recover from extended outages.  In general these are designed to handle transient, recoverable errors, and they are not a substitute for shutting down the application during planned outages.  CIM recovery features generally retry the operation for a specific period of time or a specific number of attempts.  Once those have been exhausted, the process will fail, and if this occurs while an outage is in progress, the failure state may not be properly recorded.  This can short-circuit other recovery features and prevent them from recovering the application properly following the application restart.
  • Black-out periods do not prevent all SQL submissions to the database.  To guarantee that no application SQL hits the database, shut down the application.
  • Address failed jobs in a timely manner.  Failed jobs can hold component locks that can cause other jobs to fail due to lock timeouts.  See KCS article  KCS004659 for more information about lock timeouts.
  • When there are many jobs, they can exceed the number of available worker threads.  This can lead to job misfires.  A misfire occurs when a job cannot be scheduled to run due to insufficient worker threads.  Once this delay exceeds a certain threshold, the application may skip / throw out the job run.  This can happen when a large number of communications have been scheduled to run around the same time of day.  In that case, consider setting up the communications as a group batch job (group jobs can process both single and multistep communications).  This allows numerous communications to be processed under a single group job, which can help avoid exhaustion of the scheduling worker threads.
  • If one uses both the nightly and intraday group batch jobs, one might consider enabling the DEL_LOCK_COMM_INTRA_DAY_BATCH setting.  These jobs share the same list of CIM components (communications), so if their run schedules overlap, they can easily encounter lock timeouts.  The DEL_LOCK_COMM_INTRA_DAY_BATCH setting can avoid this by forcing the intraday job to defer any conflicting locks to the other competing job.  For additional information on DEL_LOCK_COMM_INTRA_DAY_BATCH and other CIM settings, refer to Chapter 7 of the CIM Metadata guide.
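
To make the misfire point above more concrete, here is a small Python sketch of how a fixed pool of worker threads can cause misfires when many jobs are due at once.  This is not CIM's scheduler; the function name simulate_schedule, the time units and the threshold value are all made up for illustration, and the real thresholds and thread counts depend on your configuration.

```python
import heapq


def simulate_schedule(due_times, run_seconds, workers, misfire_threshold):
    """Toy model of a scheduler with a fixed worker pool (hypothetical, not CIM code).

    Each job wants to start at its due time and occupies one worker for
    run_seconds. A job that would have to wait longer than misfire_threshold
    for a free worker is skipped, which models a misfire.
    Returns (ran, misfired) as lists of job indices.
    """
    # Min-heap holding the time at which each worker next becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    ran, misfired = [], []
    # Handle jobs in due-time order; sorted() is stable, so ties keep input order.
    for job, due in sorted(enumerate(due_times), key=lambda pair: pair[1]):
        start = max(due, free_at[0])  # earliest moment a worker is available
        if start - due > misfire_threshold:
            misfired.append(job)      # waited too long: the run is thrown out
            continue
        heapq.heapreplace(free_at, start + run_seconds)
        ran.append(job)
    return ran, misfired


# Ten separate communications all due at the same time, three workers.
separate = simulate_schedule([0] * 10, run_seconds=60, workers=3, misfire_threshold=90)

# The same work as one group job that processes all ten in a single run.
grouped = simulate_schedule([0], run_seconds=600, workers=3, misfire_threshold=90)

print("separate jobs:", separate)
print("group job:", grouped)
```

With these made-up numbers, the first six separate jobs get a worker in time and the last four are skipped, while the single group job only ever needs one worker and never misfires.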