While working on capacity planning for our Teradata Appliances, I was concerned that IO capacity might be the constrained resource, rather than CPU or storage. I needed a way to monitor and forecast IO utilization, and bandwidth consumption was not an effective metric due to its sensitivity to our highly volatile ratio of reads to writes. I researched the Teradata Forum and was able to develop a simple way to monitor IO percent busy in Teradata Appliances (this method also works for Teradata Enterprise class systems). Sample SQL is below.
However, now that I can track IO percent busy, I do not know at what point IO is so busy that performance is adversely affected. Questions:
1. Does anyone know a value for IO percent busy that a Teradata system should not exceed (i.e., the threshold limit)?
2. If I am approaching the threshold limit for IO percent busy, how should I determine the number of additional nodes for capacity planning purposes?
SAMPLE SQL FOR TRACKING IO PERCENT BUSY:
SELECT
 'System_ID' AS TD_System /* identify your Teradata system */
,EXTRACT(HOUR FROM a.TheTime) AS TheHour
,EXTRACT(MINUTE FROM a.TheTime) AS TheMinute
,SUM(LdvOutReqTime) AS Total_LdvOutReqTime
,COUNT(LdvOutReqTime) AS Num_LdvOutReqTime
,(Total_LdvOutReqTime / (600 * Num_LdvOutReqTime)) / 100 AS PCT_IO_BUSY /* calculation of % IO Busy; 600 = Secs per the WHERE clause */
FROM PDCRINFO_bk.ResUsageSldv_Hst a
INNER JOIN PDCRINFO.Calendar c
   ON a.TheDate = c.calendar_date
INNER JOIN PDCRINFO.Shifthour sh
   ON TheHour = sh.shifthour
WHERE Secs = 600
AND NodeID NOT IN (
   /* list the parsing engine nodes and exclude them from the calculation */
   )
AND TheDate BETWEEN 'start_date' AND 'end_date' /* enter the start and end dates for analysis */
AND c.Calendar_date BETWEEN 'start_date' AND 'end_date' /* enter the start and end dates for analysis */
GROUP BY 1,2,3
ORDER BY 1,2,3
These are certainly challenging questions to answer. As you point out, I/O capacity is difficult to calculate due to variations in the mix of reads and writes, so any approach is only an estimate and subject to a wide margin of error given the variables involved. There are also many factors to consider in capacity planning when identifying I/O-bound workloads.
My short answer to number 1 is that for a Teradata system, or any platform for that matter, the entire available capacity (CPU and I/O) should be used as efficiently as possible; otherwise resources are left idle. The key word is "efficiently": there should be no oversubscription of work as utilization approaches 100%, and there should still be room for growth. So the threshold will depend on your growth trend and on the lead time needed to grow the system, such that when you hit the limit there is still time to tune or expand the platform.
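As a rough illustration of that idea (the growth model and numbers below are my own assumptions, not Teradata guidance), you can back-solve the utilization level that should trigger an expansion from your growth rate and procurement lead time:

```python
def trigger_threshold(monthly_growth_pct, lead_time_months, ceiling_pct=100.0):
    """Utilization (%) at which expansion must start so the system does not
    exceed ceiling_pct before new capacity arrives. Assumes compound monthly
    growth -- an assumption for illustration, not a Teradata rule."""
    g = 1.0 + monthly_growth_pct / 100.0
    return ceiling_pct / (g ** lead_time_months)

# Hypothetical example: IO percent busy grows 3% per month and it takes
# 6 months to order, install, and migrate onto additional nodes.
print(round(trigger_threshold(3.0, 6), 1))  # ~83.7 -> act once past ~84% busy
```

With zero growth the trigger is simply the ceiling itself; the faster the growth or the longer the lead time, the earlier you have to act.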
My short answer to number 2 is to first determine the per-node I/O bandwidth and then estimate the workload's I/O requirements in terms of I/O operations per second (IOPS) or MB/s. I noticed you used the outstanding-request time per logical device (LdvOutReqTime). In Teradata 14.10, new spare columns were introduced in SPMA to begin tracking I/O Token Allocation (IOTA), which measures the full I/O potential of a device. In Teradata 15.0, three IOTA columns were added to the SLDV ResUsage table: FullPotentialIOTA, CodPotentialIOTA, and UsedIOTA.
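As a sketch of the node-count arithmetic once you have those two figures (all the bandwidth and headroom numbers below are made-up placeholders; substitute values measured on your own system):

```python
import math

def additional_nodes(projected_peak_mbps, per_node_mbps, current_nodes,
                     target_utilization=0.8):
    """Estimate extra nodes needed so the projected peak I/O demand stays
    under target_utilization of total bandwidth. Every input here is an
    assumption you must measure or forecast for your own platform."""
    needed = math.ceil(projected_peak_mbps / (per_node_mbps * target_utilization))
    return max(0, needed - current_nodes)

# Hypothetical example: 12,000 MB/s projected peak demand, 1,500 MB/s of
# usable bandwidth per node, 8 nodes today, 80% target utilization.
print(additional_nodes(12000, 1500, 8))  # 12000/(1500*0.8) = 10 nodes -> 2 more
```

The target-utilization factor is the same headroom idea from answer 1: sizing to 100% of bandwidth leaves no room for read/write mix swings or growth.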
My longer answer:
There are other considerations in capacity planning beyond device I/O: the fluctuation of reads vs. writes, the usage of cache memory, and the concurrency of operations all need to be considered. I/O is the process of moving data in and out of storage, where that storage may be a hard disk, solid state disk, flash memory, USB drive, or other media on which data can be stored and retained. The speed of I/O is typically measured in I/O operations per second (IOPS) and is directly impacted by the transfer speed. For example, hard disk access time is typically measured in milliseconds, and writing data is slower than reading it. So when many processes are executing on a platform, some reading data and some writing it, the achievable IOPS, and therefore the I/O capacity, will fluctuate from minute to minute, unlike the CPU, which can execute a roughly constant number of instructions per second. IOPS can be converted to MB/s by multiplying IOPS by the amount of data transferred per I/O operation. By comparison, flash memory access time is measured in microseconds, and Dynamic Random Access Memory (DRAM), typically referred to simply as memory, in nanoseconds.
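The IOPS-to-MB/s conversion mentioned above is simple arithmetic; here it is as a minimal sketch (the I/O size is just an illustrative assumption — use the average transfer size observed on your system):

```python
def iops_to_mbps(iops, io_size_kb):
    """Convert IOPS to MB/s given the average data moved per I/O operation.
    Uses 1 MB = 1024 KB."""
    return iops * io_size_kb / 1024.0

# Hypothetical example: 20,000 IOPS at an average of 128 KB per operation.
print(iops_to_mbps(20000, 128))  # 2500.0 MB/s
```

Note that the same MB/s figure can correspond to very different IOPS depending on transfer size, which is one reason bandwidth alone was a misleading metric for the read/write-volatile workload in the question.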
In a Teradata appliance, hard disk I/O is also assisted by the File Segment Cache (FSG), and potentially by Teradata Intelligent Memory (TIM) or In-Memory Optimization (TD 15.10), both of which will also alter the effective I/O bandwidth and throughput.
Different workloads will impact I/O capacity differently during the periods in which they are active on the appliance. If you have Performance Data Collection and Reporting (PDCR), there are several reports that provide views of I/O usage in terms of MB/s and reads vs. writes. I/O bandwidth will be much higher in a workload period with more reads than writes since, as mentioned earlier, writes are slower than reads. Workloads will also experience natural growth as time goes by, and new projects will introduce new workloads that also impact capacity planning. You can use a linear regression model to forecast natural growth; forecasting new workloads involves communicating with development teams, conducting testing, and measuring the consumption of those workloads.
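A minimal sketch of that linear-regression forecast, using ordinary least squares over historical monthly readings (the history values are invented for illustration; a real forecast should also validate how well a straight line actually fits your data):

```python
def linear_forecast(monthly_io_pct, months_ahead):
    """Fit an ordinary least-squares trend line to a series of monthly
    IO-percent-busy readings and extrapolate months_ahead past the last one.
    A toy model of 'natural growth' only."""
    n = len(monthly_io_pct)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_io_pct) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, monthly_io_pct))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + months_ahead)

# Hypothetical history: IO percent busy climbing ~2 points per month.
history = [50, 52, 54, 56, 58, 60]
print(round(linear_forecast(history, 6), 1))  # 72.0 -> projection 6 months out
```

Comparing that projection against the threshold from answer 1 tells you how many months of runway remain before tuning or expansion is needed.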
Also consider the peak periods in the distribution of workloads. Peak periods are when the most queries are executing concurrently, pushing the platform toward capacity, whereas other periods may be far less consumed.
Another factor is compression. If you suspect or have identified that I/O is the constrained resource and there is plenty of CPU available, then both Block Level Compression (BLC) and Multi-Value Compression (MVC) can be useful in reducing the number of I/O blocks that need to be transferred. Although CPU usage may increase, as long as it stays below 100% you may actually be balancing the resource consumption of the platform. Another measure to consider is the CPU-to-I/O ratio, which helps determine the balance of CPU and I/O consumption.
Yet another factor is using the columnar feature for candidate tables that could benefit, meaning tables with wide rows in which only a few columns are most frequently retrieved in large volumes. This essentially reduces the I/O blocks to transfer since the entire row does not need to be retrieved. For example, if the row size is 256 bytes and the most frequently retrieved columns comprise only 52 bytes, then queries against a columnar table that touch only those columns would eliminate roughly 80% of the bytes that would otherwise need to be read, which can be a significant I/O saving when retrieving millions of rows.
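The savings arithmetic in that example works out as follows (a simplified sketch that ignores column-container overhead, which real column partitions do add):

```python
def columnar_io_savings(row_bytes, needed_bytes):
    """Fraction of row bytes that columnar storage avoids reading when a
    query touches only the needed columns. Ignores per-container overhead,
    so real savings will be somewhat lower."""
    return 1.0 - needed_bytes / row_bytes

# The example from the text: 256-byte rows, only 52 bytes actually queried.
print(round(columnar_io_savings(256, 52) * 100))  # 80 (% of row bytes avoided)
```

Exact figure: 1 - 52/256 = 79.7%, which the text rounds to 80%.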
The bottom line is that there are a variety of factors to consider when measuring and efficiently using I/O bandwidth as part of capacity planning. Hope this helps.