We have TASM setup to demote user queries from USERS-SHORT to USERS-MEDIUM and then USERS-LONG, after some CPU seconds consumed threshold. We are seeing a number of queries get sent to our USERS_SHORT workload, only to be demoted to USERS_MEDIUM and then USERS_LONG. By a number I mean around 150 queries an hour. Since these are end users and we can control what they write, I don’t know if it makes sense to adjust/raise the thresholds that dictate demotion and let them live in the higher priority workloads longer or not. Some of my concerns are:
Ø Is this demotion of so many queries taxing on the system?
Ø Does it make more sense to leave the threshold of USERS_SHORT at 2 CPU seconds and therefore preserve that slice for the truly short queries?
Ø Does our version of the database (V2R13.10.01.04b) have an issue with determining accurately what the query is going to use and therefore starting them in the wrong workload to begin with?
I remember reading that if you have a lot of demotion going on, you need to go back and tweak your WDs but I’m not sure what that would do, if the optimizer is the who decides what WD a query is a given.
Thanks for any insight you can offer.
One of the biggest issues with queries being demoted in this way is that queries are jumping the queue by starting in short. If lots of queries do this then you may end up with more queries running on the system than you want, and queries that are assigned to the long workload in the first place may not ever end up running, as the number of active queries in the workload may all be from demoted queries.
It sounds like the estimates for these queries are very inaccurate if TASM thinks they are under 2 cpu seconds and they end up in the long queue. Check the TDWM estimate columns in DBQL to compare the estimated resource with the actual resource used. If these are vastly different run an explain on the query and see if it looks sensible. You can also use rewind in viewpoint to compare estimated rows with actual rows in each step.
I suspect that statistics may be out of date, not collected, or worst of all collected on empty tables, making the estimates inaccurate.
I've seen a very busy system before where collecting stats made a query run take longer to run. Not because it made the query less efficient, but because TASM had a better estimate for the query so allocated it to the correct workload where it was queued. Before the stats were collected the query was less efficient, but as it ran immediately then to the user it completed in less time and therefore seemed better.
Thanks for the quick feedback. We have regular stats collection jobs in place and we validate the stats once a week to make sure they are all current. I'll check these specific queries to see if they are hitting tables that may not be picked up by our regular stats collection jobs.
We have a let's say "challenging" workload. Andy know's as we both worked on the same datawarehouse. As Andy says statistics are imperative and collection schedule is too. Collect only if 10% of the rows have changed, if you collect too many times then this will impact on performance. Collect too infequently or not at all then you will get a drift in how many queries are reclassfied and eventually if the statistics are so far out the Optimiser could pick a wholly inappropriate execution plan and consequently be classified incorrectly. As you say the exception criteria have to carefully considered so as not to interfer with correctly classified work in a lower priority workload.
I have done extensive analysis on TDWM estimates, classification and recleassification and have been able to reset the exception rules to demote based upon this data however it is constant challenge to get users to recollect stats at appropriate times.
Happy to share findings
I add my question to the existing post (by now in 2016 and working on V15.00 under SLES11).
The question of demotion and impact on system has still to be updated and enlighted:
We have an active ruleset with 15 workloads definitions and some exceptions especially dedicated to demotion of Workloads (all in Timeshare level).
Let's say "WD-Normal" being demoted in "WD-Demoted" when estimation of CPU is over 12 seconds,
and "WD-Demoted" being demoted in "WD-Low priority" when estimation of CPU is over 120 seconds.
The trick is that no session is ever opened in workload "WD-Demoted", due to the evaluation order of workloads.
So a "WD-Normal" session can turn first in "WD-Demoted" and next in "WD-Low priority", according to the Estimated Processing Time.
But the real consumption of CPU (AMPCPUTime) is often under estimation, so we observe demoted queries that should have kept their original workload.
Does that demotion can impact system and how ?
What about that "transition WD-Demoted" workload, fullfilled with demoted "WD-Normal" demoted queries only ?
Thanks for recommendations,