A reliable rebooting mechanism for Data Mover Services

Teradata Employee

Data Mover is a Teradata application that allows users to copy databases or tables between Teradata systems. It is a J2EE-based application composed of three major code components: the Client (command-line interface or Viewpoint portlet), the Daemon, and the Agent. The Daemon is the central piece of the application, and the Agent is a worker unit that does the actual job of moving data. Both the Daemon and the Agent are deployed on a managed Linux server. Two other components that Data Mover depends on are installed on the same box: the repository DBS, which stores Data Mover job data, and ActiveMQ, which serves as the messaging service provider.

The Problem

The four major components of Data Mover (Daemon, Agent, repository DBS, and ActiveMQ) are all installed on the same Linux machine. Each component is started by an init script in the /etc/init.d/ directory, so when the machine is rebooted for any reason, all four are started automatically. However, there are dependencies between them: the Daemon depends on the repository DBS, ActiveMQ, and network ports; the Agent depends on ActiveMQ and network ports. If a component it depends on is not yet started and ready, the Daemon or Agent fails to start.
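To make "not ready" concrete: a network dependency such as ActiveMQ is usable only once it accepts connections on its port. The following is a minimal sketch of such a readiness check, not code from Data Mover. The class name DependencyProbe, the method isPortOpen, and the host and port in main are all assumptions made for illustration (61616 is ActiveMQ's default OpenWire port; an actual installation may use another).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper (not part of Data Mover): checks whether a TCP endpoint accepts connections. */
final class DependencyProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, unreachable host, or timeout: treat as "not ready".
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed endpoint: a broker on localhost listening on ActiveMQ's default port.
        System.out.println("broker reachable: " + isPortOpen("localhost", 61616, 1000));
    }
}
```

A check like this only shows that the port is open at one moment in time. The service behind it may still be initializing, which is one reason to retry the whole startup rather than probe each dependency once.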

The Solution

There is no guarantee that the Data Mover services will be started and ready for use in the same order every time the server is rebooted. Instead of relying on start order, the Daemon and Agent can keep retrying their startup until the required resources become available. This approach works if the following key points are taken care of. The Daemon component is used as the example here.

  • Retry the startup if the previous attempt throws an exception related to resource availability.
  • Release the Spring application context and the communication ports that were acquired during the last unsuccessful attempt.
  • Print a user-friendly error message in the log file when an attempt fails.
  • Sleep for a short interval between a failure and the next retry.
  • Because the number of retries is unlimited, repeated failures could fill up the log file and disk space. To prevent this, raise the Java log level after a certain number of failures so that fewer lines are logged. Once the service starts successfully, revert to the original log level.

Here is sample Java code that implements the above solution.

    // The calls below match the log4j 1.x API (org.apache.log4j.Logger and Level).
    // ctx and socketWatchdog are declared outside the loop; originalLogLevel is
    // assumed to hold the root logger's level captured before the loop starts.
    boolean failed = true;
    int tryCount = 0;
    while (failed)
    {
        tryCount++;
        try
        {
            ctx = new ClassPathXmlApplicationContext(...);
            socketWatchdog = (SocketWatchdog) ctx.getBean("socketWatchdog");
            ...
            // Startup succeeded if execution reaches this point.
            failed = false;
            if (tryCount > 1)
            {
                logger.fatal("*********** Daemon started successfully after previous failures! "
                        + "Restoring logging level");
                // getRootLogger() is static, so call it on the class.
                Logger.getRootLogger().setLevel(originalLogLevel);
            }
        }
        catch (Exception e)
        {
            // Clean up so the next attempt does not hit resource conflicts.
            // Reset the references so a later failure does not clean up twice.
            if (ctx != null)
            {
                ctx.close();
                ctx = null;
            }
            if (socketWatchdog != null)
            {
                socketWatchdog.stopWatching();
                socketWatchdog = null;
            }
            // Log a readable message first, then the full stack trace.
            logger.fatal("************ Failed when trying to start up Daemon, will retry in "
                    + RETRY_INTERVAL + " seconds, fail count " + tryCount);
            logger.error("************ Failed due to error ************", e);
            if (tryCount == STOP_LOGGING_INTERVAL)
            {
                logger.info("Disabling normal logging (changing logging level to fatal) after "
                        + STOP_LOGGING_INTERVAL + " tries");
                Logger.getRootLogger().setLevel(Level.FATAL);
            }

            // Sleep before retrying (RETRY_INTERVAL is in seconds).
            // Thread.sleep can throw InterruptedException, which propagates to the caller.
            Thread.sleep(1000 * RETRY_INTERVAL);
        }
    }
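The same pattern can be reduced to a self-contained, runnable sketch. This is not the Data Mover implementation: it swaps log4j for java.util.logging so that it needs no external libraries, and the names RetryingStarter, StartAction, startWithRetry, retryMillis, and quietAfter are hypothetical. The start action stands in for creating the Spring context, and the cleanup action stands in for closing the context and stopping the socket watchdog.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of the retry pattern; uses java.util.logging instead of log4j. */
final class RetryingStarter {
    private static final Logger LOG = Logger.getLogger(RetryingStarter.class.getName());

    /** A startup step that may fail while its dependencies are not ready. */
    interface StartAction {
        void start() throws Exception;
    }

    /**
     * Calls start until it completes without throwing. After each failure it runs
     * cleanup, logs the error, and sleeps for retryMillis. After quietAfter failures
     * it raises the log level to SEVERE. Returns the number of attempts made.
     */
    static int startWithRetry(StartAction start, Runnable cleanup,
                              long retryMillis, int quietAfter) {
        Level originalLevel = LOG.getLevel(); // null means "inherit from parent"
        int tryCount = 0;
        while (true) {
            tryCount++;
            try {
                start.start();
                LOG.setLevel(originalLevel); // restore normal logging on success
                if (tryCount > 1) {
                    LOG.severe("Started successfully after " + (tryCount - 1) + " failed attempts");
                }
                return tryCount;
            } catch (Exception e) {
                cleanup.run(); // release whatever the failed attempt acquired
                LOG.severe("Start failed, will retry in " + retryMillis
                        + " ms, fail count " + tryCount);
                LOG.log(Level.WARNING, "Failed due to error", e);
                if (tryCount == quietAfter) {
                    LOG.info("Raising log level to SEVERE after " + quietAfter + " tries");
                    LOG.setLevel(Level.SEVERE);
                }
                sleep(retryMillis);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }

    public static void main(String[] args) {
        int[] failuresLeft = {3}; // simulate a dependency that is ready on the 4th try
        int attempts = startWithRetry(() -> {
            if (failuresLeft[0]-- > 0) {
                throw new IllegalStateException("broker not ready");
            }
        }, () -> { }, 10, 2);
        System.out.println("attempts = " + attempts); // prints "attempts = 4"
    }
}
```

Restoring the level on every success, rather than only after failures, keeps the helper safe to call more than once. The readable failure message is logged at SEVERE so that it still appears after the level has been raised, while the stack trace at WARNING is suppressed from that point on.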

This solution was implemented successfully in the Data Mover 13.10.00.03 release. With this retry mechanism, the Data Mover Daemon and Agent services keep retrying their startup until the resources they depend on become available. Both services start working properly as soon as the internal repository DBS and the ActiveMQ service are ready for use.
