Converting CPU Seconds, Old Node Type to New Node Type

Teradata Employee

Have you ever tried to figure out ahead of time how many CPU seconds an application would require after upgrading to a new hardware platform?  I talked about one approach to solving this problem at the recent Partners Conference in San Diego, and would like to share it with you.

First, there are a couple of assumptions we need to agree upon when it comes to converting CPU seconds from one node generation to another:

  1. More powerful CPUs will use fewer CPU seconds to perform the same work
  2. Different node types may have different numbers of CPUs

So when it comes to making a CPU-second conversion, you will want to consider both the power differences of the nodes and any difference in the number of CPUs per node.

Most of us are used to CPU seconds, like everything else around us, becoming more powerful with each new generation.  But that has changed recently with some of the new chip architectures.  With technologies such as multi-core designs, in which multiple cores exist within each chip, and hyper-threading, in which multiple threads share a single core, inter-core communication and resource sharing reduce the strength of each individual CPU second.

It is important not to jump to the conclusion that this somehow means a step backwards to weaker nodes. As long as you are combining a greater number of CPUs (cores) on the node, even if each is slightly weaker, their larger numbers will suffice to make the node more powerful.  More CPUs per node make for more powerful nodes, even with less powerful CPUs.
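For example (with invented numbers): if each of 8 CPUs on an old node delivers a relative power of 1.0, the node delivers 8 units of power in total; if each of 16 CPUs on a new node delivers only about 0.87 because of sharing overhead, the node still delivers 16 × 0.87 ≈ 13.9 units, making it nearly 75% more powerful overall.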

An important player in translating CPU seconds across different node types is TPerf.  The different TPerf ratings of the two nodes, the old, slower node and the new, faster node, can help do the conversion.  TPerf is a measure of Teradata node throughput potential that uses the 5100M from 1998 as a baseline.  If your node's TPerf rating is 5, it means your node is 5 times more powerful than the 5100M node.  TPerf is an indicator, not a predictor, of performance change.

Standard TPerf, the one that most of you are familiar with, represents the entire node, including the disk sub-system.   There is a variant of TPerf, called “Baseline Uninhibited” TPerf, that represents raw CPU power only.   Because it focuses solely on CPU differences, it is best to use baseline uninhibited TPerf when converting CPU seconds from one node type to another.

Below is a graphic that illustrates the two steps you need to take to do this CPU-seconds conversion.  This example assumes that the number of CPUs per node is the same.  However, both Step 1 and Step 2 below factor in the number of CPUs, so that we can build a formula that will be useful in cases where the number of CPUs might be different.  The example uses artificial TPerf numbers just for the sake of making the math easy.  Basically, what is being shown is that if your new node has double the TPerf rating and the same number of CPUs, an application's CPU seconds will be cut in half on the new platform.  With twice the power, it takes half the effort to do the same work.
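In words, using artificial numbers (say, a baseline uninhibited TPerf of 5 for the old node and 10 for the new node, 8 CPUs on each, and an assumed 10,000 CPU seconds on the old node):

Step 1: Divide each node's baseline uninhibited TPerf by its number of CPUs to get the relative power of one CPU second: 5 / 8 = 0.625 on the old node, 10 / 8 = 1.25 on the new node.

Step 2: Multiply the old CPU seconds by the ratio of old to new per-CPU power: 10,000 × (0.625 / 1.25) = 5,000 CPU seconds on the new node, half the original.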


This next graphic uses the same example, but combines Steps 1 and 2 into a single formula by moving some of the factors around.  Here we have produced a single formula that can be used when converting CPU seconds from one node type to another.
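As a rough sketch in Python (the function and argument names are invented for illustration), the combined formula looks like this:

# A sketch of the combined conversion, using baseline uninhibited TPerf
# ratings (raw CPU power only) for both node types.
def convert_cpu_seconds(old_cpu_sec, old_tperf, new_tperf, old_cpus, new_cpus):
    # Step 1 folded in: per-CPU power is TPerf divided by the CPU count.
    # Step 2 folded in: scale by the ratio of old to new per-CPU power.
    return old_cpu_sec * (old_tperf / old_cpus) / (new_tperf / new_cpus)

# The artificial example above: double the TPerf, same number of CPUs.
print(convert_cpu_seconds(10000, old_tperf=5, new_tperf=10, old_cpus=8, new_cpus=8))   # 5000.0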


This next graphic takes this formula and applies it to actual node types with real-world TPerf numbers, the 5400 and the 5450.  Both of these node types have the same number of CPUs per node.


Since we already have a formula with some flexibility built in, we can see how it works in the case where the number of CPUs has increased in the new node type.  In the example below, CPU seconds from an application running on a 5550 node are converted to CPU seconds for the same application running on a 5600 node. 
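To give a feel for the arithmetic, the sketch function above reproduces the result discussed in the comments below (roughly 10,000 CPU seconds on an 8-CPU 5550 becoming roughly 11,535 CPU seconds on a 16-CPU 5600) when the baseline uninhibited TPerf ratio between the two node types is about 1.73.  The TPerf values here are hypothetical, back-derived from those numbers for illustration only, not official ratings:

# Hypothetical TPerf values in a ratio of about 1.73, chosen only so the
# arithmetic matches the 10,000 -> ~11,535 CPU-second example.
print(convert_cpu_seconds(10000, old_tperf=100, new_tperf=173.4, old_cpus=8, new_cpus=16))
# ~11,534 CPU seconds on the 5600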


Notice in the example shown above that the number of CPU seconds is somewhat greater on the new node type.  This is due to the increased number of CPUs on the 5600 and the sharing overhead that accompanies it.  Each CPU second does slightly less work, but since there are many more of them, the node itself is more powerful and the work will be accomplished in a shorter time.

Keep in mind that advancing technologies can introduce some variation in post-upgrade CPU usage numbers.   TPerf by itself can only provide a ball-park estimate, and is only a starting point.   Operating system changes can result in CPU being handled or reported differently.  Shifts in how the node is architected can also be a factor in  CPU usage.

But hopefully, this approach to converting CPU seconds can be a starting point for simple extrapolations and assessments when you are preparing for, or are in the midst of, a hardware change.

17 Comments
Enthusiast
Carrie, thanks again for the explanation; this is very helpful. Can you help clarify that even though the CPU time for servicing a request might increase with a newer node that has more CPUs, the wall time will reduce considerably? Using your example, the wall time on the old node = 10,000 / 8 (CPUs) = 1,250 seconds, and for the new node it will be 11,535 / 16 (CPUs) ≈ 721 seconds.
Teradata Employee
Hi TeraAbraham,

Yes, you are correct. That is a good point to mention.

But let me add that it might be difficult to observe this as precisely as your numbers indicate, because the work being measured would have to be the only thing demanding CPU from the node (no contention for any of the CPUs), and the work being measured would also have to be able to consume continuous CPU (no contention from other resources).

But you are looking at this the right way.

Thanks, -Carrie
Enthusiast
Carrie, Thanks for the clarification.
How would we measure the difference in cpu utlization when we move across platforms. Example, when we move an application from an Enterprise class node (5600h) to a 2580 Appliance. We see increase in AMPCputime, any assistance in understanding the correlation between the two types of systems would be very helpful.
Teradata Employee
The number of CPUs (8 cores) and the CPU power per node are identical on the 5600H and the 2580 platforms. So your CPU seconds should be worth the same value across node types: one CPU second on the 5600 platform should do the same work as one CPU second on the 2580.

Appliance nodes do not have TPerf ratings, and in areas other than CPU (such as the number of disks or AMPs per node, or the size of memory) the node does have differences from the similar EDW node. So there is no direct correlation between the performance of work on a 5600 and a 2580.

Under some conditions, a change in the total number of AMPs on a platform can impact the level of CPU that certain applications use. Any increase in the queries that end users submit after moving to the new platform will also increase CPU usage levels, as I am sure you know. Also, with hardware differences, including a change in the number of AMPs, query plans can change, and this can lead to more or less CPU being consumed. It's also possible that statistics are not thoroughly collected on one or the other platform.

There are numerous reasons why there could be somewhat of a difference in reported CPU comparing the two platforms, even when the CPU on the node is of the same power. If you conclude that your platform is performing in unexplainable ways, please contact the support center and discuss your concerns with them. They may be able to provide more specific information about what is happening in your particular case.

Thanks, -Carrie
Enthusiast
Hi Carrie. If I have a TASM rule in which work gets demoted after 1000 CPU seconds, would I adjust this value using this calculation?
Enthusiast

Hi all,

Recently we had a hardware upgrade on our system. Our CPUs have increased from 16 to 24, and our AMPs are now 920, up from 768.

But I have noticed that the AVG AMPCPUTime has increased. I compared the data sets for the 30 days before and after the upgrade, and our stats collection process hasn't changed since the upgrade.

Can someone explain why this AVG AMPCPUTime has increased, and whether it is a good sign or not? Currently I am confused about what verdict I should give to the client.

Teradata Employee

If your average CPU time has increased after an upgrade, there are several factors that could be responsible:

- While you may have more CPUs per node after the upgrade, it is possible that each CPU by itself is less powerful.   Under those conditions, more work could be accomplished system-wide with the new config, but each CPU second would do less work than before, so more CPU seconds would be required to complete a query.  You can check with your CS people at your site or talk to your account team to find out if your individual CPUs are less powerful on the new configuration.

- It is possible that more data has been added onto the platform since the upgrade so each individual query is accessing more data and doing more work.  To look into that, check the estimated rows processed per query in the explain, before and after, or the IO counts in DBQL.  Also you can cross check table sizes before and after the upgrade.

- It's possible that a different mix of queries is running since the upgrade took place, queries that are more complex and do more work.

- It is possible that workload management options have changed since the upgrade.  Perhaps queries that were poorly written before the upgrade were being aborted or rejected before they started to run, but that is no longer happening post-upgrade.  Check all workload management settings for changes such as that.

- Query plans may be different after the upgrade. It is possible that some queries now have plans that are not optimal and are resulting in more work being done in the database. The optimizer uses the hardware configuration as input into creating the plan.  Changes to the number of AMPs can lead to different join geographies and sequences.  Statistics need to be recollected after the upgrade to ensure you are getting the best possible plans.  If there were gaps in the statistics collection strategy before the upgrade, it is possible that after the upgrade those gaps are causing more problems because of the hardware changes.  You can look for ways to improve statistics collections so they are more aligned with standard recommendations.

There may be other causes for what you are seeing.  Open an incident with the support center or bring in professional services expertise if more support is required in identifying the source of the discrepancy.

Thanks, -Carrie

Enthusiast

Hi Carrie,

Our prod system is approaching its capacity limits and we are looking to expand the system, but the question is how do I calculate the number of nodes or the amount of space that needs to be added to my existing system? Is there a formula to calculate this?

Could you please help me in calculating this based on overall CPU growth or overall data growth?

Thanks.

Teradata Employee

To know how many of the new nodes you need, you have to track past, current, and expected future CPU usage, and determine how much CPU you will require after adding new nodes. It is important to consider any planned new workloads and estimated changes in demand from current applications.  Then you can use the examples shown above to determine what the CPU usage you see today would be on the newer node types, as well as how much CPU you will need in the future to support growth and new applications.  From there, you should be able to extrapolate the number of nodes it would take to supply that much CPU power.
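As a much-simplified sketch of that extrapolation (all numbers, including the utilization target, are invented for illustration; real capacity planning involves many more inputs):

import math

# Hypothetical projected daily demand, already converted to new-node-type
# CPU seconds using the formula in the blog above, plus 30% growth.
projected_cpu_sec_per_day = 2_000_000 * 1.3
cpus_per_new_node = 16
target_utilization = 0.7    # plan to run the nodes below 100% busy

# Each node can supply at most cpus * 86,400 CPU seconds per day.
capacity_per_node = cpus_per_new_node * 86_400 * target_utilization
print(math.ceil(projected_cpu_sec_per_day / capacity_per_node))    # 3 nodes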

Space is similar: how much space did you use in the past, how much are you using in the present, and, based on trending and on new applications or increased demand, how much will you need in the future?

If you are confused about this process, I'd suggest you contact someone in Teradata Professional Services for assistance in understanding capacity planning.  It can be tricky, and every site is different.

You can also leave a question on Teradata Forum, looking for other sites' experiences in this area.  If you have access to past Partners Conference presentations, that's another good source for real-world experiences.

Thanks, -Carrie


Enthusiast

Hi Carrie,

Kindly advise on how to calculate the TPERF of a system from the RESUSAGE tables.

Thanks in advance.

Teradata Employee

Subramanian,

I'm not familiar with the formula or calculations involved in determining the TPERF value for a given hardware node.  I believe those are internally determined by Teradata Engineering.

The best way to get the TPERF rating for your nodes is to talk to your Teradata account team or your Customer Support representative.

Thanks, -Carrie

Enthusiast

Hello Carrie,

Recently, I have noticed the CPUTime column is not available in the 14.10 ResUsageSPS table (it was there in the 13.10 SPS table).

Please have a glance at whether my understanding is accurate on the aspects below:

1) On 14.10, there is still a CPUTime column available, but the way to access it is via ResSPSView instead of the ResUsageSPS table.

A second way is to simply sum up the CPUUServ and CPUUExec columns to get CPUTime. That way, instead of summing up all six columns related to CPU utilization (AWT, dispatcher, miscellaneous, etc.), only two columns are needed, for less complexity.

Because CPUUServ = (CPUUServAWT + CPUUServDisp + CPUUServMisc) and
CPUUExec = (CPUUExecAWT + CPUUExecDisp + CPUUExecMisc)

2) Gap in 14.10:

Until the 13.10 version, the CPUTime column is defined as “Milliseconds of CPU time consumed by all tasks.” But as per the 14.10 view on WISEPROD, the sum of (CPUUExec + CPUUServ) gives us CPUTime in centiseconds, because both columns are expressed in centiseconds.

So in order to be consistent with 13.10, the CPUTime column should have been represented as below to give the result in milliseconds. Otherwise users will notice the difference.

((CPUUServ + CPUUExec)/10) AS CpuTime   instead of   (CPUUServ + CPUUExec) AS CpuTime

Note: 1 centisecond = 10 milliseconds

Teradata Employee
Geeta,

The reason CpuTime was removed is that it was duplicating the information provided by the sum of the other fields.  We were reducing the cost of the SPS table by removing unnecessary columns.

The sum is available in the view as CpuTime.

I believe your assumptions in your first question are correct.

In your second question, is it possible you are comparing CPU from SPMA and CPU from SPS?  All the CPU columns in the SPMA table are reported in centiseconds.  All the CPU columns in the SPS table are reported in milliseconds.

If you are comparing the CPU across the two different tables, you will need to divide the CPU usage in the ResUsageSPS table by 10 to make it comparable to the CPU in the ResUsageSPMA table.
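For example (just the unit conversion, with an invented value): 2,500 milliseconds of CPU reported by ResUsageSPS corresponds to 250 centiseconds in ResUsageSPMA terms:

sps_cpu_millisec = 2500        # an invented CPU value from ResUsageSPS, in milliseconds
print(sps_cpu_millisec / 10)   # 250.0 centiseconds, comparable to ResUsageSPMA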

Thanks, -Carrie

Enthusiast

Thank you for your comments Carrie,

On my 2nd question, here is what i am trying to understand.

Until the 13.10 version, I used to see the CPUTime column result in milliseconds (which is also mentioned in the Resource Usage Macros and Tables 13.10 manual under the CPUTime column).

Whereas in 14.10, as the CPUTime column is removed, I am depending on ResSPSView, or in some queries I am summing up the (CPUUServ + CPUUExec) columns wherever I need CPUTime. But as both the CPUUServ and CPUUExec columns are expressed in centiseconds in 14.10, I should divide the result of (CPUUServ + CPUUExec) by 10 in order to get CPUTime in milliseconds.

Teradata Employee

Geeta,

In 13.10 the CPUTime is provided in the SPS table and is reported in milliseconds.  This column is removed from the table on 14.0 and up.

In 14.10, the CPUTime is provided in the SPS view and is also reported in milliseconds.  Therefore the CPUTime reporting units are the same on 13.10 and 14.10 and there is no need to convert the time units on 14.10.

This is how CPUtime is computed in the view in 14.10:

(CPUUServAWT + CPUUServDisp + CPUUServMisc) as CPUUServ,

(CPUUExecAWT + CPUUExecDisp + CPUUExecMisc) as CPUUExec,

(CPUUServ + CPUUExec) as CpuTime,

CPUUServ[AWT|Disp|Misc] are in milliseconds, and CPUUExec[AWT|Disp|Misc] are in milliseconds.  Therefore CPUTime is in milliseconds in 14.10.  You can check the 14.10 ResUsage manual, in the chapter on the SPS table, to validate this.

Thanks, -Carrie

Enthusiast

Carrie,

I am seeing both the documentation and the view definitions reflecting the units as "centiseconds" in 14.10.  That is exactly where my confusion began.

I am sending the details in an email; please review and let me know whether I am checking something else or the wrong documents.

Teradata Employee

Geeta,

The definitions you show are taken from the ResUsageSPMA table (Chapter 6 in the ResUsage manual), which reports node-level metrics.  They are not taken from the chapter on the ResUsageSPS table.

The ResUsageSPMA table expresses CPUUServ and CPUUExec in centiseconds.  This is true for both 14.0 and 14.10.

However, your question on DevX concerned the ResUsageSPS table (Chapter 11 in the ResUsage manual), which covers workload-level statistics.  Here is how the DevX thread started off:

"Recently, I have noticed the CPUTime Column is not available on 14.10 ResUsageSPS table (which was there in 13.10 SPS Table).  Please have a glance on my understanding is accurate or not on below aspects…"

The ResUsageSPS table uses milliseconds for CPUUServ and CPUUExec, both in 14.0 and 14.10.

As I said in my blog response, all CPU metrics in the ResUsageSPS table are in milliseconds, both in 14.10 and prior to 14.10.  Since there is no change in how CPU metrics are expressed in the ResUsageSPS table going to 14.10, you do not need to do anything differently when you manipulate those columns once you are on 14.10.

The views on the SPS table in both 14.0 and 14.10 use identical calculations for the CPUPct field, so I don't see any differences there either.

Thanks, -Carrie