The System Health Equation in Teradata Viewpoint is an excellent way to quickly evaluate the overall health of a Teradata Database system. The equation can incorporate metrics surrounding a variety of aspects of the system, including CPU, skew, AWTs, disk space, query activity, and node and VPROC status. It is highly customizable so the equation can be tuned to accurately report the health based on what the DBA deems appropriate for each separate Teradata Database system. This blog entry will explain how the equation works, how it can be customized, and a few common gotchas around situations when the equation might show a status of down or unknown.
The System Health Equation consists of degraded and critical thresholds for a set of metrics that cover different aspects of a Teradata Database system. When data for all of the metrics in the equation is available and being collected properly, the equation calculates the health as either HEALTHY, DEGRADED, or CRITICAL. (Keep in mind that as of Teradata Viewpoint 13.11, the name of each health state can be customized in the Teradata Systems portlet.) If at least one metric exceeds its critical threshold, then the health is CRITICAL. If no metric exceeds its critical threshold but at least one exceeds its degraded threshold, then the health is DEGRADED. If no metric exceeds its degraded threshold, then the system is HEALTHY.
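The rule above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not Viewpoint's actual implementation; the metric names, the threshold values, and the `Threshold` and `system_health` names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    degraded: float
    critical: float


def system_health(values, thresholds):
    """Return HEALTHY, DEGRADED, or CRITICAL for the given metric values.

    Assumes a value is present for every metric in `thresholds`; the DOWN
    and UNKNOWN states are deliberately not modeled here.
    """
    health = "HEALTHY"
    for name, limit in thresholds.items():
        value = values[name]
        if value > limit.critical:
            # One metric over its critical threshold decides the outcome.
            return "CRITICAL"
        if value > limit.degraded:
            health = "DEGRADED"
    return health


# Hypothetical metrics and thresholds, for illustration only.
thresholds = {
    "node_cpu_skew": Threshold(degraded=30.0, critical=60.0),
    "awt_in_use": Threshold(degraded=50.0, critical=62.0),
}

print(system_health({"node_cpu_skew": 35.0, "awt_in_use": 20.0}, thresholds))
```

With these made-up numbers the call prints DEGRADED, because one metric is above its degraded threshold and none is above its critical threshold. A value exactly equal to a threshold does not count as exceeding it in this sketch.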
The equation only takes into account metrics that are set to an "Enabled" state in the Teradata Systems portlet. Any metrics that are set to a "View Only" state will be displayed in the System Health portlet, but will not be considered when determining the overall health. If a metric is set to a "Disabled" state, then it will neither be displayed in the System Health portlet nor be included in the calculation of the system health. At the very bottom of the threshold settings for the System Health Equation are all of the canary queries that have been configured for this Teradata Database system. A threshold is automatically added to the equation for each canary query, though it is always added in the "Disabled" state. The DBA can change that state later to include the canary query in the equation.
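To make the three states concrete, here is a small Python sketch of how they affect what is displayed and what is counted. The `MetricState` enum, the helper functions, and the metric and canary query names are hypothetical and exist only for this example; they are not part of any Viewpoint API.

```python
from enum import Enum


class MetricState(Enum):
    ENABLED = "Enabled"
    VIEW_ONLY = "View Only"
    DISABLED = "Disabled"


def displayed_metrics(states):
    """Metrics shown in the System Health portlet: everything except Disabled."""
    return [name for name, state in states.items()
            if state is not MetricState.DISABLED]


def equation_metrics(states):
    """Metrics that count toward the overall health: Enabled only."""
    return [name for name, state in states.items()
            if state is MetricState.ENABLED]


def add_canary_query(states, query_name):
    """Register a canary query threshold; it always starts out Disabled."""
    states.setdefault(query_name, MetricState.DISABLED)


# Hypothetical configuration for one system.
states = {
    "cpu_utilization": MetricState.ENABLED,
    "node_cpu_skew": MetricState.VIEW_ONLY,
    "disk_space": MetricState.DISABLED,
}
add_canary_query(states, "heartbeat_canary")

print(displayed_metrics(states))  # ['cpu_utilization', 'node_cpu_skew']
print(equation_metrics(states))   # ['cpu_utilization']
```

The canary query appears in neither list until its state is changed, which mirrors the behavior described above.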
Because a single metric exceeding its threshold is enough to move the system into a degraded or critical state, it's important to carefully choose the thresholds that are set in the Teradata Systems portlet. When adding a new Teradata Database system to Teradata Viewpoint, knowledge of the system's performance is necessary to select the correct values. A DBA might consult DBQL, ResUsage, or some other reporting mechanism to determine the correct initial values. Once a system is being monitored in Teradata Viewpoint, reviewing the Productivity portlet or one of the trend reporting portlets such as Capacity Heatmap, Metrics Analysis, or Metrics Graph should give a good indication of the typical range of values for these metrics over a given time period.
There are two further health states that the System Health Equation can report: DOWN and UNKNOWN. These states can be puzzling, so here is the list of situations in which they can be encountered.
A Teradata Database system is considered DOWN in the health equation if any of the following conditions occurs:
A Teradata Database system is considered UNKNOWN in the health equation if any of the following conditions occurs:
Based upon feedback from several different customers, Viewpoint 14.01 contains a new feature to address the four skew metrics that are part of the System Health Equation. When the utilization of a Teradata system starts to drop below 50%, it becomes more likely that the remaining work will be skewed to a certain extent. What might be an intolerable level of skew when a system is fully utilized becomes much less of an issue when the overall utilization is lower. With that in mind, the System Health Equation in Viewpoint 14.01 contains the following new settings to ignore skew as system utilization drops.
When the first option is activated, the Node CPU Skew and AMP CPU Skew metrics are treated as View Only whenever the overall CPU utilization falls below the specified value, in this case 50%. This means that these two metrics are still displayed but are not used to calculate the health of the system.
For example, assume you have CPU utilization of 12%, and all metrics are in the healthy range, except for Node CPU Skew, which is in the degraded range. In previous versions of Viewpoint, the Node CPU Skew metric would cause the health of the system to be reported as DEGRADED. However, with this new setting enabled, the health of the system would still be reported as HEALTHY. If the CPU utilization were to go back above 50% and the Node CPU Skew were to remain in the degraded range, then the health would change to DEGRADED.
The same logic applies when the second option is activated, although this option controls the I/O skew metrics: Node I/O Skew and AMP I/O Skew.
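The effect of these settings can be sketched as a small Python function that demotes a group of skew metrics to View Only while utilization is below the configured value. This is a rough illustration of the behavior described above, not Viewpoint code; `apply_skew_rule`, the metric names, and the plain-string states are assumptions made for the example.

```python
def apply_skew_rule(states, utilization, floor, skew_metrics):
    """Return a copy of `states` with the given skew metrics demoted to
    "View Only" while `utilization` is below `floor` (both in percent)."""
    adjusted = dict(states)
    if utilization < floor:
        for name in skew_metrics:
            # Only Enabled metrics change; Disabled ones stay hidden.
            if adjusted.get(name) == "Enabled":
                adjusted[name] = "View Only"
    return adjusted


# Hypothetical metric names and states.
states = {
    "cpu_utilization": "Enabled",
    "node_cpu_skew": "Enabled",
    "amp_cpu_skew": "Enabled",
}
cpu_skew = ["node_cpu_skew", "amp_cpu_skew"]

# CPU utilization of 12% with the option set to 50%: skew is ignored.
print(apply_skew_rule(states, 12.0, 50.0, cpu_skew))

# Utilization back above 50%: the skew metrics count again.
print(apply_skew_rule(states, 55.0, 50.0, cpu_skew))
```

The second option would be modeled the same way by passing the two I/O skew metrics along with whichever utilization figure that option compares against, which this post does not spell out.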
Hopefully this post helps to explain the System Health Equation and will serve as a good point of reference should the health ever change to DOWN or UNKNOWN.