Understanding the System Health Equation

Teradata Employee

The System Health Equation in Teradata Viewpoint is an excellent way to quickly evaluate the overall health of a Teradata Database system.  The equation can incorporate metrics surrounding a variety of aspects of the system, including CPU, skew, AWTs, disk space, query activity, and node and VPROC status.  It is highly customizable so the equation can be tuned to accurately report the health based on what the DBA deems appropriate for each separate Teradata Database system.  This blog entry will explain how the equation works, how it can be customized, and a few common gotchas around situations when the equation might show a status of down or unknown.

The System Health Equation is composed of degraded and critical thresholds for a set of metrics that cover a variety of different aspects of a Teradata Database system.  When data for all of the metrics in the equation is available and being collected properly, the equation calculates the health as either HEALTHY, DEGRADED, or CRITICAL.  (Keep in mind that as of Teradata Viewpoint 13.11, the name of each health state can be customized in the Teradata Systems portlet.)  If at least one metric exceeds its degraded threshold, then the health is said to be DEGRADED.  Likewise, if at least one metric exceeds its critical threshold, then the health is said to be CRITICAL.  If all metrics are below their degraded threshold, then the system is HEALTHY.
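To make the rule concrete, here is a minimal sketch in Python.  It is not Viewpoint's actual implementation: the function names are hypothetical, and it assumes that "exceeds" means strictly greater than the threshold and that the worst state of any single metric becomes the system health.

```python
from enum import IntEnum


class Health(IntEnum):
    # Ordered so that a "worse" state compares greater than a "better" one.
    HEALTHY = 0
    DEGRADED = 1
    CRITICAL = 2


def metric_health(value, degraded, critical):
    """Classify one metric against its two thresholds (assumed: strictly greater)."""
    if value > critical:
        return Health.CRITICAL
    if value > degraded:
        return Health.DEGRADED
    return Health.HEALTHY


def system_health(metrics):
    """metrics: iterable of (value, degraded_threshold, critical_threshold).

    The overall health is the worst health of any single metric.
    """
    return max(
        (metric_health(v, d, c) for v, d, c in metrics),
        default=Health.HEALTHY,
    )
```

For example, `system_health([(10, 50, 80), (60, 50, 80)])` evaluates to `Health.DEGRADED`, because the second metric is above its degraded threshold of 50 but not above its critical threshold of 80.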

The equation only takes into account metrics that are set to an "Enabled" state in the Teradata Systems portlet.  Any metrics that are set to a "View Only" state will be displayed in the System Health portlet, but not considered when determining the overall health.  If a metric is set to a "Disabled" state, then it won't be displayed in the System Health portlet nor will it be included in the calculation of the system health.  At the very bottom of the thresholds settings for the System Health equation are all of the canary queries that have been configured for this Teradata Database system.  A threshold is automatically added to the equation for each canary query, though the threshold is always added with the state as "Disabled".  The state for the canary query threshold can of course be enabled later if the DBA wishes to include that canary query in the equation.
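The three states can be pictured as two filters over the configured metrics.  This is only an illustration: the dictionary layout, the function names, and the canary query name are made up for the example.

```python
def counted_metrics(states):
    """Metrics that take part in the health calculation."""
    return sorted(name for name, state in states.items() if state == "Enabled")


def displayed_metrics(states):
    """Metrics that are shown in the System Health portlet."""
    return sorted(
        name for name, state in states.items() if state in ("Enabled", "View Only")
    )


# Hypothetical configuration; a newly added canary query starts out Disabled.
states = {
    "CPU": "Enabled",
    "Node CPU Skew": "View Only",
    "Canary: nightly_check": "Disabled",
}

print(counted_metrics(states))    # only the Enabled metric
print(displayed_metrics(states))  # the Enabled and View Only metrics
```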

Because a single metric exceeding its threshold is enough to move the system into a degraded or critical state, it's important to carefully choose the thresholds that are set in the Teradata Systems portlet.  When adding a new Teradata Database system to Teradata Viewpoint, knowledge of the system's performance is necessary to select the correct values.  A DBA might consult DBQL, ResUsage, or some other reporting mechanism to determine the correct initial values.  After a system is being monitored in Teradata Viewpoint, reviewing the Productivity portlet or one of the trend reporting portlets such as Capacity Heatmap, Metrics Analysis, or Metrics Graph should give a good indication of the typical range of values for these metrics over a given time period.

There are two additional health states that can be reported by the System Health Equation: DOWN and UNKNOWN.  These states can be puzzling, so here's the list of situations in which they can be encountered.

A Teradata Database system is considered DOWN in the health equation if any of the following conditions occurs:

  • A canary query is enabled in the health equation and returns an error (such as a SQL parse exception or login failure).
  • The System Heartbeat canary query is enabled in the health equation and fails to complete in under 60 seconds.
  • Any other canary query is enabled in the health equation and fails to complete in under 30 minutes.
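The DOWN conditions above can be sketched as a single check over the most recent canary query results.  The `CanaryResult` type and its fields are hypothetical; only the 60 second and 30 minute limits come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_LIMIT_S = 60       # System Heartbeat must finish in under 60 seconds
CANARY_LIMIT_S = 30 * 60     # any other canary query must finish in under 30 minutes


@dataclass
class CanaryResult:
    name: str
    enabled_in_equation: bool
    elapsed_seconds: float
    error: Optional[str] = None   # e.g. a parse exception or login failure
    is_heartbeat: bool = False


def is_down(results):
    """Return True if any canary query enabled in the equation errored or timed out."""
    for r in results:
        if not r.enabled_in_equation:
            continue  # queries outside the equation never cause DOWN
        if r.error is not None:
            return True
        limit = HEARTBEAT_LIMIT_S if r.is_heartbeat else CANARY_LIMIT_S
        if r.elapsed_seconds >= limit:
            return True
    return False
```

Note that a failing canary query that is not enabled in the equation is skipped entirely, which matches the first condition in each bullet.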

A Teradata Database system is considered UNKNOWN in the health equation if any of the following conditions occurs:

  • The System Stats collector has not completed execution in the last 48 hours.  (This might occur if the collector has been disabled or in rare cases if one of the APIs that this collector calls never returns and hangs the collector.)
  • A canary query that is enabled in the health equation has not completed execution in the last 24 hours.  (This might occur if the canary query is disabled in the Canary Queries section of the Teradata Systems portlet, the query has repeatedly returned an error, or it is configured to run during a custom time range such that the query doesn't execute for intervals of greater than 24 hours.)
  • The Teradata Database system returns data in the MONITOR PHYSICAL RESOURCE API where there are no nodes that are up and also have at least one running AMP.  (This situation should be extremely rare and would likely coincide with a severe problem on the Teradata Database system.)
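The UNKNOWN conditions can be sketched the same way.  The argument shapes here are assumptions made for the example (timestamps of the last completed runs, and a list of `(is_up, running_amps)` pairs per node); the 48 hour and 24 hour windows are the ones listed above.

```python
from datetime import datetime, timedelta

STATS_MAX_AGE = timedelta(hours=48)    # System Stats collector window
CANARY_MAX_AGE = timedelta(hours=24)   # window for canary queries in the equation


def is_unknown(now, last_stats_run, enabled_canary_runs, nodes):
    """Return True if any of the UNKNOWN conditions holds.

    last_stats_run:      datetime of the last completed System Stats run, or None
    enabled_canary_runs: {query name: last completed run, or None} for queries
                         that are enabled in the health equation
    nodes:               list of (is_up, running_amps) pairs
    """
    if last_stats_run is None or now - last_stats_run > STATS_MAX_AGE:
        return True
    for last_run in enabled_canary_runs.values():
        if last_run is None or now - last_run > CANARY_MAX_AGE:
            return True
    # At least one node must be up and have at least one running AMP.
    if not any(is_up and amps >= 1 for is_up, amps in nodes):
        return True
    return False
```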

Based upon feedback from several different customers, Viewpoint 14.01 contains a new feature to address the four skew metrics that are part of the System Health Equation.  When the utilization of a Teradata system starts to drop down below 25%, it becomes more likely that the remaining work will be skewed to a certain extent.  What might be an intolerable level of skew when a system is fully utilized becomes much less of an issue when the overall utilization is lower.  With that in mind, the System Health Equation in Viewpoint 14.01 contains two new settings to ignore skew as system utilization drops: one for the CPU skew metrics and one for the I/O skew metrics.

When the first option is activated, the Node CPU Skew and AMP CPU Skew metrics will be set to View Only when the overall CPU utilization falls below the specified value (50% in this example).  This means that these two metrics will not be used to calculate the health of the system.

For example, assume you have CPU utilization of 12%, and all metrics are in the healthy range, except for Node CPU Skew, which is in the degraded range.  In previous versions of Viewpoint, the Node CPU Skew metric would cause the health of the system to be reported as Degraded.  However, with this new setting enabled, the health of the system would still be reported as Healthy.  If the CPU utilization were to go back above 50% and the Node CPU Skew were to remain in the degraded range, then the health would change to Degraded.

The same logic applies when the second option is activated, although this option controls the I/O skew metrics: Node I/O Skew and AMP I/O Skew.
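Here is a small sketch of how these two settings behave, using the example above.  The function and parameter names are hypothetical, and it assumes the second option compares against overall I/O utilization, which mirrors the CPU case but is not spelled out above.

```python
CPU_SKEW = ("Node CPU Skew", "AMP CPU Skew")
IO_SKEW = ("Node I/O Skew", "AMP I/O Skew")


def apply_skew_floors(states, cpu_util, io_util, cpu_floor=None, io_floor=None):
    """Return a copy of `states` with skew metrics demoted to View Only
    while utilization is below the configured floor (None means the option is off).
    """
    result = dict(states)
    for names, util, floor in (
        (CPU_SKEW, cpu_util, cpu_floor),
        (IO_SKEW, io_util, io_floor),
    ):
        if floor is not None and util < floor:
            for name in names:
                if result.get(name) == "Enabled":
                    result[name] = "View Only"
    return result
```

With `cpu_floor=50` and CPU utilization at 12%, Node CPU Skew is treated as View Only and a degraded skew value no longer changes the health.  Once CPU utilization goes back above 50%, the metric counts again.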

Hopefully this post helps to explain the System Health Equation and will serve as a good point of reference should the health ever change to DOWN or UNKNOWN.

16 Comments
Teradata Employee
Just wanted to post a quick follow up discussing the skew metrics that are a part of the System Health equation. I received this question during the Viewpoint Ask the Experts forum at Partners and also via a comment on another post on Developer Exchange.

The skew metrics in the System Health equation allow you to change the health of the system when the CPU or I/O skew exceeds a certain threshold. In a massively parallel architecture such as Teradata it's extremely important to monitor skew to ensure that the resources in the system are being effectively utilized to deliver the expected performance from the overall system. Having skew on the system could be indicative of queries that need to be rewritten, poor data distribution across the AMPs, or a variety of other situations.

As the amount of work on a system decreases, skew becomes a less important metric and can also become misleading. If only a single query or two are executing on the system, it's very possible that this extremely small amount of work will not be evenly distributed throughout the system. This causes the skew to rise to values that might very well exceed the thresholds set in the System Health equation. When this occurs, the health of the system in Viewpoint could be reported as Critical or Degraded. However, most DBAs would agree that having, for example, high CPU skew when the overall CPU use is just a few percent of the system is not a problem.

Today, Viewpoint does not account for the overall CPU or I/O use when it changes the system health for high skew. However, we understand the problem and have an open enhancement request to add a feature to only degrade the health for a skew metric when the overall value of the skewed metric is above a certain value. For instance, you would be able to say that CPU skew would only be considered as part of the health equation when the overall system CPU was greater than 10% (or whatever value you specify).

Hopefully this helps to clarify how the skew metrics are considered in the system health equation in Viewpoint, as well as the current thought on the direction we plan to take to address artificially high skew values in the future. If you have any questions or comments please feel free to post them here.
Enthusiast
Thanks for your insight regarding the system health equation in Viewpoint, but I have a question that is related to your comment above. Oftentimes, I notice that either AMP CPU Skew or AMP IO Skew exceeds its threshold even if there are NO active sessions except my own Viewpoint session. What could be the main reason behind this? I'm thinking that maybe there are some background processes which are not visible through Viewpoint. If so, what are these processes or sessions and how can we check them out? Your response is much appreciated. Thanks!
Enthusiast
I experience the same quite often. Today, for instance, there were only idle sessions open, yet AMP IO/CPU and Node IO/CPU skew were > 90% in some combination nearly all day. Also, many nodes reported disk/CPU usage and disk I/O > 50% throughout the day. Any thoughts would be greatly appreciated.
Teradata Employee
CPU skew is the CPU on the hottest AMP divided by the average CPU. Since this is a ratio, as the denominator (in this case the average CPU) decreases, the CPU skew increases. As the average CPU starts to approach 0, it's very easy to get a high CPU skew value. So, what is being described in these comments is the expected behavior. However, as I mentioned in the comment above, the Viewpoint team is investigating how we can incorporate the average CPU into the system health equation to prevent a high skew from changing the health state when the average system CPU is low.
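As a small illustration of the ratio described in this reply (a sketch only, with made-up per-AMP CPU numbers and a hypothetical function name), the same formula gives a modest value on a busy system and a large one on a nearly idle system:

```python
def cpu_skew(amp_cpu):
    """CPU on the hottest AMP divided by the average CPU across AMPs."""
    average = sum(amp_cpu) / len(amp_cpu)
    if average == 0:
        return 0.0  # assumption: report no skew when nothing is running
    return max(amp_cpu) / average


busy = [80.0, 82.0, 79.0, 85.0]   # all AMPs working hard
idle = [0.1, 0.1, 0.1, 2.0]       # one AMP doing a little background work

print(cpu_skew(busy))  # close to 1: work is evenly spread
print(cpu_skew(idle))  # several times larger, despite almost no total CPU use
```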
Junior Supporter
We frequently get a "Canary Query Read timed out" error message.
Because of this error we get false System Down alerts.
Is there a way to fix or avoid these false alerts?

In our System Health setup we have the 'System Heartbeat' disabled. However, under the canary query setup we have enabled a system heartbeat, "Sel * from dbc.dbcinfo".
It is set to collect data every 2 minutes. So my question is: does this canary query still affect the health (down/healthy) shown in the System Health portlet?
Teradata Employee
Only canary queries that are enabled in the System Health equation will cause the system to be reported as down if they fail. If you don't want this to occur, then make sure those queries are set to View Only or Disabled in the equation.
Enthusiast

Hello Steve,

In the Teradata Viewpoint System Health portlet, I have on multiple occasions encountered high CPU skew or high I/O (more than 80%) when the number of active sessions is zero. How is this possible? System-mode activities should not contribute that much skew or I/O activity.

I saw the response above and understood the concept: as the average CPU moves gradually toward 0, the CPU skew becomes very high. Having said that, with no active sessions, why is the CPU on the hot AMP high enough to make the CPU skew exceed its threshold? Are there kernel-mode operations that are this intensive?

Enthusiast

Also, regarding the "Down" state: we have a canary query enabled.

Having said that, I have seen the system reported as "Down" in AMP/PE/node down scenarios. Could you confirm whether this is also a factor, in addition to the reasons mentioned above for the "Down" condition?

Teradata Employee

With respect to the high skew with no sessions, there is always going to be a certain amount of activity on the nodes from various background processes that are part of the database software.  This work could definitely cause the high skew you are seeing even with no sessions logged on.

The number of down components does not lead to a DOWN system health state.

Enthusiast
Thanks, Steve, for the response.

Having said that, if a component is down, what will the system's status be in the System Health portlet?
Teradata Employee

The number of down components works just like all of the other values in the system health equation in that the status is determined by the thresholds that are configured for that metric.

Enthusiast

Steve,

Regarding your response that high AMP I/O skew is still visible even with no sessions: when sessions are logged on or users are running queries, does that mean the AMP I/O skew is the sum of background processes and foreground processes?

Isn't this information misleading? So I am to assume that the only values of concern, with respect to active sessions, are the CPU skew and I/O skew?

And if the AMP I/O skew is high even when no sessions are present it can be attributed to the background processes?

Teradata Employee

In the System Health portlet and other places in Viewpoint, the total CPU consumption of the system and the nodes of the system are reported.  This data comes from the PM/API MONITOR PHYSICAL RESOURCE call.  I believe this includes both CPU consumed while servicing sessions as well as other work that is not related to the actual servicing of the sessions.  There will always be a small amount of overhead in running the OS and all the software that comprises the database.  However, you don't want to exclude this work from CPU metrics as you want to know when the system reaches its capacity (i.e. 100%).

I think the point to take away from the skew metrics is that in an MPP system like Teradata skew is not an important metric when the overall utilization is very low.  At low utilization, it becomes less likely and in some cases impossible to make the system completely parallel.

Enthusiast

Thanks for the prompt response!

Cheers

Enthusiast

Thank you for the nice article. However, I have a question regarding the health condition.

If I have System Heartbeat enabled in the health equation but the canary query is disabled in the Teradata Systems portlet, will System Heartbeat be accounted for in the health condition?

Teradata Employee

If you enable System Heartbeat in the health equation but disable the execution of the query, the system health will eventually be calculated as UNKNOWN.  Viewpoint uses the last System Heartbeat execution time during the previous 24 hours when calculating system health.  Therefore, once the system heartbeat has not been collected for at least 24 hours, the health will start being reported as UNKNOWN.  While there might be some very short term reasons to disable a canary query that's part of the system health equation, I would not recommend doing so for any extended period of time.