As part of the Viewpoint 15.00 release, the Viewpoint team built a brand new version of the Node Resources portlet. The primary purpose of this portlet continues to be to identify skew on a Teradata Database system. The original incarnation of this portlet required a fair amount of manual intervention in order to achieve this goal. The new version of this portlet includes a simpler user interface and a new algorithm to identify skewed resources (or “outliers”) automatically.
Since the Teradata Database is a massively parallel architecture, it’s important that all of the units of parallelism are performing approximately the same amount of work. If some of the nodes or VPROCs within the system are performing too much or too little work when compared with the system-wide average, this is called skew. When work for a specific query is skewed, the query isn’t taking full advantage of the power of the system, and therefore doesn’t complete as quickly as possible. When the work on nodes or VPROCs is skewed, this can affect the performance of the system and also reduce the effective capacity of the system.
There are three primary enhancements to the Node Resources portlet. The first is the use of a histogram to visually display the data distribution for a particular metric. The automatic calculation of “outliers” based upon the data distribution is the second improvement. The final significant change is the ability to analyze the data over a time range instead of just the last sample of data.
The visualization in the previous version of this portlet depicted a square for each node or VPROC on the system. For larger systems it was hard to see all the squares on a single screen, and this representation of the data didn’t really add much insight into the actual data for a particular metric. The new version of the portlet instead uses a histogram to plot the data for the selected metric. The histogram contains 20 buckets of equal size, and the height of each bar represents the number of nodes or VPROCs that fall into each bucket or range.
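To make the bucketing concrete, here is a minimal sketch of how per-node values could be sorted into 20 equal-width buckets. It is an illustration only, not the portlet's actual code: the function name `bucket_counts` is hypothetical, and it assumes the bucket range runs from the smallest to the largest observed value, which the portlet may or may not do.

```python
def bucket_counts(values, num_buckets=20):
    """Count how many values fall into each of num_buckets equal-width ranges.

    Assumption: the overall range spans min(values)..max(values).
    """
    if not values:
        return []
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / num_buckets
    counts = [0] * num_buckets
    for value in values:
        if width == 0:
            # Every resource reports the same value, so use the first bucket.
            index = 0
        else:
            # The maximum value would land one past the end, so clamp it.
            index = min(int((value - lowest) / width), num_buckets - 1)
        counts[index] += 1
    return counts
```

Each entry in the returned list corresponds to the height of one bar in the histogram.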
The red bars in the histogram represent the buckets that contain “outliers”, which are nodes or VPROCs that are significantly skewed. Outliers are calculated as resources whose value falls more than 1.5 times the interquartile range above the third quartile or below the first quartile. This is a standard statistical technique for finding outliers in a data set. In this way, the portlet automatically identifies any nodes or VPROCs that are significantly skewed for the selected metric. For a system that is working in a reasonably parallel fashion, you may well see no outliers in the histogram. If the histogram does show outliers, you might want to investigate further to discover the cause of the skew on your system.
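The interquartile range rule can be sketched in a few lines of Python. This is the textbook form of the test, not necessarily the exact computation inside the portlet: the function name `find_outliers`, the sample node names, and the choice of the "inclusive" quartile method are all assumptions made for illustration.

```python
from statistics import quantiles

def find_outliers(metric_by_resource, k=1.5):
    """Return the resources whose value lies beyond k * IQR from the quartiles."""
    values = sorted(metric_by_resource.values())
    if len(values) < 2:
        return {}  # quartiles are not meaningful for fewer than two points
    # Assumption: inclusive quartiles; other quartile conventions shift the fences slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return {
        name: value
        for name, value in metric_by_resource.items()
        if value < low_fence or value > high_fence
    }

# Hypothetical CPU usage (percent) per node.
cpu_by_node = {
    "1-01": 10, "1-02": 11, "2-01": 12, "2-02": 12,
    "3-01": 13, "3-02": 13, "4-01": 14, "4-02": 40,
}
print(find_outliers(cpu_by_node))  # {'4-02': 40}
```

In this example the quartiles are 11.75 and 13.25, so the fences sit at 9.5 and 15.5, and only node 4-02 falls outside them.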
The third significant change is the ability to analyze up to an hour’s worth of data in this portlet. In Viewpoint 14.10 and earlier, the Node Resources portlet only reported data for the last sample period. That sample typically covered a minute or less of elapsed time on your system, which is too short a period to reliably discover significant skew. The new version of the portlet lets you choose the last collection time as before, or an aggregation of 5, 15, 30, or 60 minutes of data.
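As a rough illustration of what aggregating over a time range means, the sketch below averages each resource's samples over a window. It rests on two assumptions that the portlet may not share: that the window covers the most recent minutes, and that the aggregation is a simple mean. The function name `average_over_window` and the `(timestamp, resource, value)` sample layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def average_over_window(samples, now, minutes):
    """Average each resource's samples taken within the last `minutes` minutes.

    samples: iterable of (timestamp, resource_name, value) tuples.
    """
    cutoff = now - timedelta(minutes=minutes)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, resource, value in samples:
        # Keep only samples that fall inside the window.
        if cutoff < timestamp <= now:
            totals[resource] += value
            counts[resource] += 1
    return {resource: totals[resource] / counts[resource] for resource in totals}
```

Averaging over a longer window smooths out a single unusual sample, which is why a short window alone is a weak basis for diagnosing skew.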
While viewing the main screen of the portlet, you can click on any of the bars in the histogram to drill down and view the data for just the nodes or VPROCs in that particular bucket. From the main screen you can also click the “Down” or “Outliers” bubbles to change the filter for the data grid so that only those particular resources are displayed. You can click on any of the rows in either of the data grids to drill down to a detail screen that displays all of the metrics for that particular node or VPROC. The detail screen is different for nodes, AMPs, PEs and other VPROC types so that only the applicable metrics for that particular resource are displayed.
This new version of Node Resources should make it much simpler to monitor and identify potential skewing issues across the nodes and VPROCs of your Teradata Database system.
Note that the Node Resources portlet only applies to Teradata Database systems, whereas the Node Monitor portlet provides monitoring for Aster or Hadoop system nodes.
Viewpoint 15.00 went GCA on April 9. Details on the release are available here: http://developer.teradata.com/viewpoint/articles/teradata-viewpoint-15-00-release-article
I was told by the VP Engineering team that they opened a JIRA based on a suggestion box note that I dropped back in 2013 about the node down alert. I am pasting the content here. I just wanted to check whether it is fixed in VP 15.
This is about the NODE DOWN alert in Viewpoint.
We recently had a node down (3-02) issue on our production box, and we received an alert for it. After a couple of days, another node (4-02) went down and the first node (3-02) came back online, but VP still showed the same alert, with an email subject like this: [Alert] PRODUCTION - Node Down (Source: Viewpoint, Type: Node)
****Body of the email looks like this****
Event Timestamp: 2013-02-13T14:02:11.640-07:00
Network A Use=
Disk Outage Request Average=
Average Node CPU Usage=14.761381
Average Node Disk Usage=97.31792
System Network A Use=58.241455
Node CPU Skew=3.1299293
Description: (Status = D)
***My question is***,
Is there any way we can see the node name (or number) in the email, so that we can have some idea about the exact node which is down currently on the system.
Something like this in the subject line: "PRODUCTION - Node Down (Source: Viewpoint, Type: Node-2-03)" instead of "PRODUCTION - Node Down (Source: Viewpoint, Type: Node)"
The subject line for node alert emails is still the same in Viewpoint 15.00. Even though the Node Resources portlet changed, the underlying alerting infrastructure didn't really change much as part of this release.
How about the status of disks (up or down)? I currently have to use sysval to see which disk is down. Having to go to two places takes more time, and I would like one-stop shopping if possible.
The "Viewpoint" user collects the necessary information from the Teradata systems as part of the VP DCS services. So, is it safe to assume that a Teradata customer will always see a Viewpoint user among the active sessions in the Query Monitor portlet across all the VP-managed TD systems? If yes, what would be the major factors behind high resource usage (CPU, I/O, etc.) by the VP user?
The "Viewpoint" user collects data by running SQL (DBC/SQL partition) and running PM/API commands (MONITOR partition). You could see up to 4 connections on the DBC/SQL partition and 2 connections on the MONITOR partition due to the use of connection pooling. These connections will certainly be active at times when collection is occurring, but might not always be active. If you see high CPU or I/O usage from this user, I'm assuming it will be on the DBC/SQL partition. If this does occur, you can use DBQL to determine which query is causing the higher than expected utilization and open a support incident. All of the testing and validation we have done with customers indicates Viewpoint should use less than 1% of the system resources (and hopefully less than 0.5%) to collect the data it needs.