System Down in Viewpoint

Viewpoint
Teradata Employee

Hi All,

We get a System Down status in Viewpoint for one of our DEV servers every Monday morning between 6:00 AM and 6:15 AM, even though the database is up and running without any issues.

I understand that a System Down condition in Viewpoint can be due to a number of factors; however, the typical known causes, such as

  • A canary query is enabled in the health equation and returns an error (such as a SQL parse exception or login failure).
  • The System Heartbeat canary query is enabled in the health equation and fails to complete in under 60 seconds.
  • Any other canary query is enabled in the health equation and fails to complete in under 30 minutes.

have all been ruled out. The only heartbeat query is a typical select from dbcinfo, which completes in well under a second. No impact has been noted on anything else on the system. Could it be network related? If it is, shouldn't that be impacting other queries on the box?
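For reference, the three conditions listed above can be sketched as a simple check. This is only an illustration of the rules as I understand them, not how Viewpoint implements them; `CanaryResult` and `is_system_down` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_LIMIT_S = 60          # System Heartbeat must finish in under 60 seconds
OTHER_CANARY_LIMIT_S = 30 * 60  # any other canary must finish in under 30 minutes


@dataclass
class CanaryResult:
    name: str                    # hypothetical label, e.g. "System Heartbeat"
    elapsed_s: float             # how long the query took, in seconds
    error: Optional[str] = None  # e.g. a SQL parse exception or login failure


def is_system_down(results):
    """Return True if any canary result meets one of the three conditions."""
    for r in results:
        if r.error is not None:
            return True
        if r.name == "System Heartbeat":
            limit = HEARTBEAT_LIMIT_S
        else:
            limit = OTHER_CANARY_LIMIT_S
        if r.elapsed_s >= limit:
            return True
    return False
```

Note that an error on its own is enough to report the system as down, however fast the query normally runs.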

The Viewpoint data is backed up at midnight.

What other things should I be looking at?

4 REPLIES
Teradata Employee

Re: System Down in Viewpoint

Anything that causes the heartbeat to fail or time out will result in a "down" status.

Since the timing seems to be predictable, I would look for something scheduled at the start time of the false "outage" report.

 

To answer your question, yes - it could conceivably be network related. And the specific network path for Viewpoint to DEV might not impact other users' access. But a regularly occurring network issue seems rather unlikely.

Teradata Employee

Re: System Down in Viewpoint

Thanks. I found the following error/exception in the dcs logs. The heartbeat does run every couple of minutes (as the logs show), though it might have carried on running after hitting these failures? We get several of these every other minute for a period of around 10 minutes. Any pointers on what this could mean? As already noted, its regularity (same day every week, same time window) makes a network issue less likely. Nothing else seems to be impacted on the Viewpoint or database side.

 

2019-01-28 06:15:53,468 ERROR [quartzScheduler_Worker-8] {Collector=canaryQueryCollector, ExecutionId=5543e16f-b830-434e-b77c-5ebb7a0c07fd, System=<DEV SYSTEM NAME>} collectors.CanaryQueryCollector.executeQuery(373) - Unable to execute canary query!

org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.01] [Error 1277] [SQLState 08S01] Login timeout for Connection to <DEV SYSTEM NAME> Mon Jan 28 06:15:53 GMT 2019 socket orig=<DEV SYSTEM NAME> cid=7fb8008c sess=0 java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:344)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at com.teradata.jdbc.jdbc_4.io.TDNetworkIOIF$ConnectThread.run(TDNetworkIOIF.java:1242)
Caused by: java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.01] [Error 1277] [SQLState 08S01] Login timeout for Connection to <DEV SYSTEM NAME> Mon Jan 28 06:15:53 GMT 2019 socket orig=ngmtdd01 cid=7fb8008c sess=0 java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:344)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at com.teradata.jdbc.jdbc_4.io.TDNetworkIOIF$ConnectThread.run(TDNetworkIOIF.java:1242)
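To pin down the exact failure window across several weeks, the "Unable to execute canary query!" entries could be counted per minute. This is a rough sketch that assumes every failure is logged in the same format as the ERROR line quoted above; `canary_failures_per_minute` is a hypothetical helper, not part of Viewpoint.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2019-01-28 06:15:53,468 ERROR [...] ... - Unable to execute canary query!
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ERROR .*"
    r"Unable to execute canary query!"
)


def canary_failures_per_minute(lines):
    """Count canary failure log lines, grouped by the minute they occurred."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            counts[ts.replace(second=0)] += 1
    return counts
```

Feeding it the lines of each week's log and comparing the minutes that come back should show whether the start time is as fixed as it appears.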

Teradata Employee

Re: System Down in Viewpoint

Socket connection timeouts can be among the most difficult things to diagnose. You don't have an error that would point to anything in particular, just no response at all for a relatively long time. You might need to monitor the Viewpoint server at the OS and/or network level at the time in question to isolate the problem.
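One low-tech way to do that monitoring is to run a timed TCP connect from the Viewpoint server during the problem window and log the result. A minimal sketch, assuming the database listens on the default Teradata port 1025; `probe_connect` is a hypothetical helper and the timeout value is arbitrary.

```python
import socket
import time


def probe_connect(host, port=1025, timeout_s=10.0):
    """Attempt one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection also performs the hostname lookup, so a slow
        # resolver shows up in the elapsed time as well.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals, and lookup failures
        return False, time.monotonic() - start, repr(exc)
```

Calling this every few seconds from shortly before 6:00 until after 6:15 and writing each tuple to a file with a timestamp would show whether plain TCP connects stall at the same moments as the canary query.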

 

One thought: If the Viewpoint server is configured to use DNS for hostname resolution, do DNS "not found" requests take extra long at that time of day/week? The default JDBC COP Discovery logic relies on promptly getting a failure response for a non-existent COP number.
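To check the DNS theory, the lookups could be timed directly. The sketch below assumes the usual COP naming scheme (the system name followed by `cop1`, `cop2`, and so on); `time_lookups` and `cop_names` are hypothetical helpers, and the resolver is injectable so the logic can be tried without a real DNS server.

```python
import socket
import time


def cop_names(base, count):
    """Build candidate COP hostnames: basecop1, basecop2, ..."""
    return [f"{base}cop{i}" for i in range(1, count + 1)]


def time_lookups(names, resolver=socket.gethostbyname):
    """Resolve each name once; return (name, elapsed_seconds, address_or_None)."""
    results = []
    for name in names:
        start = time.monotonic()
        try:
            addr = resolver(name)
        except OSError:  # socket.gaierror ("not found") is a subclass
            addr = None
        results.append((name, time.monotonic() - start, addr))
    return results
```

The interesting row is the first name that comes back as `None`: if that "not found" answer takes many seconds on Monday mornings but is instant the rest of the week, slow negative DNS responses would be a strong suspect.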

Teradata Employee

Re: System Down in Viewpoint

From experience I can confirm that in many cases we see the System Down state when the canary query fails to complete, for a variety of reasons (system heavily used, error in the canary query, etc.). One suggestion is to set up the System Down alert so that it fires only if the condition persists continuously for at least n minutes (4 or 5 minutes), in order to avoid false positive alerts.
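The idea behind that kind of alert, sketched as plain logic rather than as Viewpoint's own implementation (`down_long_enough` and the 300-second default are just illustrative):

```python
def down_long_enough(samples, min_down_s=300):
    """Decide whether an alert should fire.

    samples: chronological list of (timestamp_seconds, is_down) pairs.
    Returns True once the system has been continuously down for at
    least min_down_s seconds; any "up" sample resets the clock.
    """
    down_since = None
    for ts, is_down in samples:
        if not is_down:
            down_since = None
            continue
        if down_since is None:
            down_since = ts
        if ts - down_since >= min_down_s:
            return True
    return False
```

With heartbeat samples two minutes apart, a single missed heartbeat followed by a success never alerts, while three or more consecutive failures spanning five minutes do.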

Hope this helps, Daniele