Controlling the Flow of Work in Teradata

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Teradata Employee

When you push too much paper into your shredder, it jams; when you pour too much coffee into your cup, it overflows; when you eat too much food, you…well, you know. Everything’s got limits.

One thing that grabbed my attention when I first joined Teradata back in '88, and that I still find striking today, is how the database manages the level of work that enters the system. Managing the flow of work inside Teradata is decentralized, low-effort, and scalable, exhibiting control at the lowest possible level—at the AMP.

Each AMP, Teradata’s parallel unit, keeps track of how much work it has accepted, how much work it has active, and how much work is queued up waiting to run. When specific internal limits are reached on a given AMP, that AMP will close the door to accepting new work, giving the AMP time to catch its breath and work off the work it already has accepted.

When an AMP temporarily closes the door to new work, that AMP is in a state that we call “flow control.” When in a state of flow control, which often lasts a fraction of second, that AMP will turn away newly-arriving messages. Arriving work can be user work that the dispatcher has sent to the AMP for processing. Arriving work could also be rows being redistributed from other AMPs on behalf of a user query, or internal database work that has to be done.

Consider the example of user work arriving from the dispatcher. This work-to-be-done is in the form of a message that usually represents a single optimized query step. If the step is an all-AMP step then the message is sent to all AMPs. When the message arrives on an AMP it attempts to acquire an AMP worker task (AWT), in order to begin executing on the AMP. The work message contains all the context of what needs to be done in the AMP, and information about what the data sources are, but the message needs an AMP worker task, which acts as the engine to do the actual work.

Some AMPs may have available AWTs and when the query step arrives on those AMPs it can begin executing immediately. Other AMPs may not have an AWT available because all of their AWTs are busy servicing other work. It is acceptable in Teradata for some AMPs to begin working on a step while other AMPs queue up the message so that it can be worked on when an AWT on the that AMP becomes available. Unnecessary message-passing and coordination between AMPs is eliminated using this approach.

The following graphic illustrates that each AMP is working on its portion of a query step independently from the other AMPs.   AMP 0 and AMP 2 have AWTs available and begin work immediately at Time 1.   At Time 2, AMP 1 is able to get an AWT for this work and starts working, even though by that time AMP 2 has completed its work and released its AWT.

In the case where the message has to be queued up and waits for an AWT, this wait can be very short or not so short, based on how long the work currently running (and tying up AWTs) takes to complete. Each AMP has its own message queue to hold these waiting messages during the times when all their AWTs are busy.

The message queue on each AMP has a limit of how many messages it can hold for each work type. Having such a limit helps to keep the memory requirements of an AMP at a reasonable level. A “work type” is a category of work message having to do with the importance of the arriving work in completing work already underway. One work type, for example, is WorkNew, which represents work messages that come directly from the dispatcher, usually representing new user-initiated work.   New work coming into an AMP is considered the less important than other work types because it does not contribute to completing already-started work, rather it starts something new.

Each work type has its own flow control gates that keep track of how many AMP worker tasks are in-use supporting that category of work. These flow control gates also keep a count of how many messages of that work type are waiting in the message queue on that AMP.

When the specified limit of messages on the message queue for an AMP is reached for a given work type, it’s flow control gate temporarily closes and any additional messages of that work type arriving on the AMP are not accepted.

The dispatcher knows when messages are no longer able to be fully accepted by all AMPs, and will retry such a message, multiple times if necessary, until there is an open place on the message queue of each AMP to receive it. If even one AMP is unable to accept an all-AMP message (either by providing an AWT or by queuing the message), then no AMP is allowed to accept that message. 

The graphic below shows a situation where a single AMP has its flow control gate closed for WorkNew messages, causing a new incoming message to be turned away.  That incoming message will be retried periodically by the dispatcher.

Having one or more AMPs in the state of flow control and having to retry messages is usually considered a sign of congestion, and there are recommended actions that can be taken to get AMPs back to a more normal state, and keep them there.  However, being in this state of flow control is just the high-end of a graceful, efficient, highly-integrated approach to managing the peaks and valleys of work that gets thrown at a typical Teradata data warehouse. Managing flow control in this manner is certainly a better approach than simply overflowing like my coffee cup, or jamming like my shredder.

25 Comments
Enthusiast
Nice article indeed...
to add further , alerts for AMP Worker Tasks generates from TDManager , eg -
Threshold Value: 5
Actual Value: 2
Meaning that the lowest threshold value for an AMP is 5 available AWT where as at present it has gone down to 2.
I'm I right in assuming that database is having lot many users throwing their queries and CPU/AMP usage is quite high at that time?
Or is it be due to skewness of data where a particular AMP is working extra and resulting in AWT alert?

Teradata Employee
If I am understanding you correctly, you are saying is that by the time you get an alert from Teradata Manager that your AMP worker task availability counts have reached some low threshold on at least one AMP (such as 5 in your case) that the actual AWT availability has gone down to 2. That could happen because Teradata Manager alert is triggered by a point-in-time event (only 5 AWTs remaining on one AMP). AWT in-use counts can change very quickly, so within less than 1 second from when that threshold was met there could be more AWTs being used. Teradata Manager only does an alert when the threshold is reached, it does not stop the available AWTs from going down even more.

CPU usage is usually higher when a higher number of AMP worker tasks are being used, because more work is active, but that is no always the case. The only way to know is to look at the CPU monitoring tools.

You are correct that skewed processing can be responsible for higher AWT in-use counts on some AMPs. That is often the case when a single AMP has more work to do on one query step. The skewed AMP will hold AWTs for the skewed query longer than other AMPs, and because of the higher demand for resources, will be slower to complete the non-skewed work, so that work may also holds its AWTs longer on that AMP. higher-than-average AWT in-use count. DBQL can be helpful in identifying which query was responsible for skewed IO or CPU consumption.
Hi Carrie,
Thanks for the useful and Interesting article on AWT. Couple of days ago I was running a query that returned 70 Million rows in the result set. I got the error message saying "Exception criteria exceeded: CPU Time, CPU Disk Ratio". After an hour i have kicked off the same query and got the result in 20 minutes.Not sure if the error message was caused by AWT. Need to do some research on DBQL way to go :).
Teradata Employee
Terasuda,

Both CPU time and CPU Disk Ratio are exceptions that can be defined on a workload in TASM. It would appear that your query exceeded one of those exceptions, which is why you got the error message back.

The CPU Time exception is based on how many CPU seconds have been consumed by the query. This exception is not caused by AMP worker tasks, or being short of AWTs.

The second exception, CPU-to-Disk Ratio, is determined by the number of CPU milliseconds that are being consumed for each I/O. It's intent is to detect queries that are extremely CPU-intensive, such as an unconstrained product join, by detecting how much CPU is consumed for each I/O. It is not caused by AWT exhaustion either. On the face of it, I don't see any direct relationship with getting the error you described and AWT levels.

The most common symptom of being out of AWTs from the query perspective is for a query to take a longer time than expected to complete. It takes longer because one or more of the query steps cannot get an AWT when that step is sent to the AMPS, and so the message representing that step has to wait in the message queue on one or more AMPs. This wait could be subsecond, a few seconds, or a few, or many minutes. This wait in the message queue lengthens the query's elapsed time, but doesn't impact the amount of CPU seconds consumed by the query, or its CPU-to-disk ratio.
Carrie, Thanks for the clarification
Enthusiast
Hi,

Thanks for providing such a great article to understand about AWTs. I have some douts, please help me resolve the same

-> Let's say i have 4 AMP system and Amp no 4 in the flow control situation and it shut down it's door to accept new request. now at that time , there is ALL amp request comes so in situaltion, wether amps 0,1,2 will accept the request and only amp 3 will reject the request or all amps will reject the request?

-> Also, How this works in the case of receiver Spawn work type(received by another AMP) who is in the flowcontrol state. so will both receiver and sender will queue up?

Regards
Pinal Patel

->
Teradata Employee
Hi Pinal,

In order to address your comments, I have put out a new blog posting that addresses flow control for spawned messages and how that all works.

Please see my latest blog posting, and hopefully that will answer your questions. It is titled Flow Control for Spawned Work.

Thanks, -Carrie
Enthusiast
Hi Carrie,

Could you please suggest any article written so far regarding CPU utilization of a query? I would like to know what is the unit of CPU consumption and what is its significance in terms of performance of query and all. Not sure, if it's the correct platform to ask but my desperation know aforementioned things made me post this :-).

Thanks!
Teradata Employee
CPU seconds consumed by a query is reported in DBQL (Database Query Log). It is reported in both AMP CPU seconds and Parser CPU seconds. See the DBQL chapter in the Database Administration manual for details on using and interpreting information in DBQL.

CPU usage is a contributing factor in the total elapsed time of a query, and therefore its performance. The more CPU consumed by the query, the longer it will tend to run (depending on its priority, current platform busy-ness etc.).

The Performance Management manual has an entire chapter on SQL and performance, and goes into quite a bit of detail about things that can use result in a query using more CPU.

Thanks, -Carrie
Enthusiast
Hi Carrie,

In order to identify whether the messages are waiting on the vproc or not, we would like to enable the MessageCount metric to identify the flow control. But i want to understand the way to set the number for that alert based on the message count for different systems.
We have DEV(4 node), TEST (4 node), Prod(8+8 (active and standby)), PreProd (9+9(active and standby))
For ex: If the system detects MESSAGE COUNT>100 on Prod then the DBA will receive an alert on Flow control. But i dont think so it is the right number for DEV and TEST.
Can you please share your thoughts on this?

Thanks,
GeethaReddy
Teradata Employee
Geetha,

Yes, you can define alert thresholds differently for different systems. Viewpoint will handle that. But I think your question is what threshold to use for different systems.

First, alerting when the message count reaches 100 in your production system seems like a high threshold to me. At that point I would expect the system to be congested and not working optimally. It is better to catch AMP worker task exhaustion when it begins, when there are a low number of messages waiting, like 10 or 15. It is even better to be alerted when AMP worker tasks have not yet been exhausted but are getting close.

There is no advantage and a number of disadvantages in tolerating even a small degree of flow control, so I would suggest you lower that number on all platforms. If you are using TASM, you might consider using the AMP worker task event to take action, or alert the DBA, when AWTs are close to gone on one AMP. With the AWT Event you can define a qualification time that will prevent flow control of short duration from contributing to an event. And with TASM events you can automate the change of workload management settings if you want, to reduce demand of new work temporarily until the system stablizes.

I can't advise you on where to set the threshold for a test or a dev system. That would depend how congested you want the system to get before you take action, and how much throughput you are willing to sacrifice in the process. Like I said earlier, it is better on all systems to catch the condition of running out of AWTs before it turns into flow control, and 100 messages waiting for an AWT on a 8+8 platform means that at least one AMP is already deep into flow control before you get alerted.

If you prefer using the Viewpoint alerts, I would set the threshold in the single digits on all platforms. If the alert gets triggered too often for you, then instead of changing the alert threshold, consider adding concurrency control mechanisms (such as throttles), or if they exist already, make the query limits lower so less work is introduced at the same time.

Thanks, -Carrie
Enthusiast
Great, thanks for your response Carrie. I will let you know the on this further.
Enthusiast
Thanks again Carrie for a great explanation of a confusing topic! I read a post somewhere along the way by someone who said he was using Viewpoint to generate reports on AWT usage and then reviewing those reports weekly to determine their tipping point. Do you know what/how I could use Viewpoint to generate such reports?
Thanks again,
Mike
Teradata Employee
Hi Mike,

In Viewpoint, I've only ever used the Metrics Analysis portlet and the Capacity Heatmap for viewing AWT levels. I am not familiar with any reports being generated from Viewpoint.

Try posting the question on Gary Ryback's blog, as he is one of the more knowledgable in the community about all things Viewpoint.

Thanks, -Carrie
Teradata Employee
The trend portlets (Capacity Heatmap and Metrics Graph) under the System metrics choice will plot out "AWT" (Average number of AMP worker tasks in use on each AMP) as well as "CPU AWT Processing". The Metrics Analysis portlet also allows display on these metrics. That's what you have for the metrics portlets. Node Resources provides a lot of information on AWTs too.
Enthusiast

hi Carrie/Gary,

I was looking at the metrics analysis portlet for wait IO. I can see AVG, MAX & MIN over there, I think i should consider the "AVG Wait CPU" as the right one instead of taking "MAX Wait CPU". I am trying to understand to match the VP display with SAR command output during that period. Please correct me if I am wrong.

And Gary, may be this question will fall under your expertise,

When i tried to export the data for one particular range, for ex: 07-19-2013 to 07-20-2013, it is exporting the data in to the .csv file, but when i open the file i can see the data for whole period from when we enabled the Viewpoint.

Thank you,

Geeta.

Teradata Employee

Geeta,

You're more likely to get a response if you post Viewpoint-related questions somewhere people knowedgeable about Viewpoint will see them.  I don't think Gary reads the comments on my blog postings down at this level.  

I'd suggest you re-post these questions on the Teradata forum, or on a blog that focuses on Viewpoint functionalities.  

Thanks, -Carrie

Enthusiast

I will do that Carrie, thank you.

Geeta.

Hi Carrie

Just curious to know, if the message queue lenght differ from site to site.

if yes, then what is the recommended limit for the message queue and how does the queue length affects the performance of any specific query or other running queries waiting for next available AWT.Can you help me understand considering messageOne queue in this case.

Thanks

Sumeet Shekhar

Teradata Employee

Sumeet,

Work message queue length is not considered a tunable.  The queue length may be different from one configuration to another and one site to another.   That difference is based on the number of nodes. 

For configurations with fewer than 16 nodes, the message queue length for each work type is 20; otherwise it is the number of nodes + 5.  The message queue length limit is for each work type individually, and few-AMP messages have their own flow control gates compared to all-AMP messages of the same work type. 

For example, if you have 20 nodes, the message queue length limit for all-AMP messages of the Work00 worktype will be 25.  The same limit of 25 would be in place for all-AMP messages of the Work01 work type.

I recommend sticking with the default queue length settings.  If you believe they should be something different, you would need to call the support center and ask for their help and advice.

A longer message queue can result in an AMP going into flow control less quickly because there will be more room on the queue for more messages before messages start to be retried.  On the other hand, longer message queues use more memory. And no matter what the message queue length, if you are out of AMP worker tasks, queries will be degraded anyway because they have to wait longer to get one and start work, either in the queue or doing retry logic.  

The workload management focus is better placed on preventing the exhaustion of AMP worker tasks so you never fill up the message queue.  Use the message queue as a safety valve should you ever need it, not as a regularly used staging area that often fills up.

Thanks, -Carrie 

Enthusiast

Hi Carrie,

Thanks for the nice article.

I have a question about Message Queue length.

You stated that Message queue length will be 20 for nodes <16 but I have seen more than 20 messages waiting on mailbox 2-11 using puma command.

How can that be possible ?

Thanks,

Vipin Gupta

Teradata Employee

Vipin,

One of the points I tried to make in the posting above had to do with the limit of 20 messages being associated to the work type:  "The message queue on each AMP has a limit of how many messages it can hold for each work type."

If user-initiated work is using 2 or 3 different work types (say worknew, work01, and work02), then it is possible to have up to 20 messages queued for each of those work types (up to 60 in combination).  In addition, it is also possible for some of the internal work types (work12, work13, for example) to queue up messages on the message queue as well.  So you could hypothetically have (20 x  number of work types) messages queued up on a single AMP.

Thanks, -Carrie 

Enthusiast

Thank you Carrie for clarifying my doubt. I think I overlooked "each worktype" phrase in explaination.

Enthusiast

Hello Carrie,

Thanks for such a nice article.

What is right balance between WorkOne and WorkNew? Is there any specific trend that we do expect? I have noticed that Work One is getting lot less AWT as compared to Work New. Is it expected?

Thanks,

Teradata Employee

There is usually more activity with WorkNew than with WorkOne AWTs.  Every request-step that executes requires a WorkNew AWT just to get started.   And if there are two or three parallel steps, each doing simple work like a table scan, a request could be using two or three WorkNew AWTs at the same time and no WorkOne AWTs. 

Only the steps that involve row redistribution, global aggregation or similar spawned work, require a WorkOne AWT.  That is usually a subset of the steps that require a WorkNew AWT.

It sounds like what you are seeing is normal behavior.  WorkNew inuse counts can be quite a bit higher than WorkOne's, it really depends on the kind of work that is running at the time.  In fact that is why there is a limit of 50 on how many WorkNew AWTs can be in use at the same time:  to prevent really high demand for WorkNew from taking almost all the AWTs and leaving very few for WorkOne.  So I wouldn't be concerned about your situation, as it sounds normal to me.   If you are hitting the limit of 50 on WorkNew that leaves 12 AWTs for WorkOne (9 in the unreserved pool and 3 in the WorkOne reserve pool).   I've seen that 50 WorkNew-to-12 WorkOne ratio show up from time to time and would consider it normal behavior.

Thanks, -Carrie