We have a bit of a debate on the best way to deal with skewed queries. On the one hand there is the argument for putting them into a 'penalty' partition in the priority scheduler so that their CPU is restricted. That way they run forever but supposedly consume little CPU and don't affect other queries.
The second opinion is that we should just cancel them.
Clearly there are arguments on both sides, but what frustrates me is that nobody ever seems able to explain and justify their standpoint.
From my experience the first method doesn't work, even if logically it sounds like it should. Can anyone shed any more light on this area?
Assigning skewed queries to an allocation group that has a very small CPU limit (or an Absolute scheduling policy if you are on an old release) will limit the impact on other work.
I don't think there is a black and white answer on this. I consider the "penalty" AG mainly a way to buy time so you can investigate further.
You have to be careful about canceling queries that appear to be skewed. For one thing, once a problem query starts overconsuming resources on an AMP or AMPs, other queries often "fall behind" on those AMPs; even after the problem query is out of the system, it may take a while for them to "catch up". So if you are looking at PM/API snapshots (e.g. in PMON), many queries may appear skewed for some period of time despite having balanced total resource usage.

There is also the question of how valuable the answer may be, and whether figuring out the issue, fixing the skew, and re-running will really return the answer any faster than just allowing it to finish. On the other hand, whenever you make a query run longer (by limiting its resource consumption), it ties up resources longer.
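To make the "appears skewed" point concrete, here is a minimal sketch of the kind of skew metric you might compute from per-AMP CPU snapshots. The function and the numbers are my own illustration, not anything PMON actually reports:

```python
def skew_factor(per_amp_cpu):
    """Hypothetical skew metric: max per-AMP CPU divided by the mean.

    A value near 1.0 means the query's work is evenly spread across
    AMPs; a large value means one AMP is doing most of the work.
    """
    avg = sum(per_amp_cpu) / len(per_amp_cpu)
    if avg == 0:
        return 1.0
    return max(per_amp_cpu) / avg

# Balanced query: every AMP did roughly 100 CPU-seconds.
balanced = [100, 98, 102, 100]
# Skewed query: one AMP did almost all the work.
skewed = [400, 5, 3, 4]

print(skew_factor(balanced))  # close to 1.0
print(skew_factor(skewed))    # roughly 3.9 - one hot AMP out of four
```

Note that a snapshot taken while other queries are still "catching up" on the hot AMP can show this kind of imbalance for queries whose cumulative usage is actually even, which is exactly why a single snapshot is a risky basis for canceling.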
Maybe you let the original query run in the "penalty" AG until you have successfully run a "fixed" version - at which point you shouldn't get many complaints about canceling the original one.
In this particular instance I'm really talking about large skewed queries rather than small ones, and for the sake of discussion I'm ignoring the business-value aspect of the query. Clearly there are times when, to get the answer you want, you end up with a skewed query.
I am using PMON to monitor for large and skewed queries. We know in this case what they look like, and when we see them come in against an already saturated box we see a big degradation in performance. It usually takes 30-odd minutes for the situation to return to normal, whether we cancel or penalise the query.
Does penalising the query at, say, 1% CPU really free up the AMP, or does it hold onto resources that affect the throughput of other queries? I.e. 1% would suggest 1 out of every 100 CPU cycles. My anecdotal evidence would be that we see it 'using' more resource than this.
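Some back-of-envelope arithmetic on what a 1% cap implies (the numbers below are purely illustrative, not from any real system): the CPU freed up is real, but the query's elapsed time stretches by the inverse of its share, and everything else it holds is held for that whole stretch.

```python
# Illustrative numbers only - not measurements from a real Teradata system.
remaining_cpu_seconds = 600   # CPU work the skewed query still needs on the hot AMP
uncapped_share = 0.50         # share of the AMP it would grab if left alone
penalty_share = 0.01          # 1% cap in the penalty allocation group

uncapped_runtime = remaining_cpu_seconds / uncapped_share   # 1200 s, about 20 min
penalised_runtime = remaining_cpu_seconds / penalty_share   # about 60000 s, ~16.7 h

print(uncapped_runtime / 60)     # 20.0 (minutes)
print(penalised_runtime / 3600)  # roughly 16.7 (hours)

# The 99% of CPU given back to other work is genuine, but the AMP worker
# task, spool space, and locks stay held for the entire penalised runtime -
# which is why a 1% query can still be felt on a saturated box.
```

That resource-holding effect, rather than CPU leakage past the cap, is the most likely explanation for a penalised query appearing to 'use' more than 1%.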
The query will hold AMP Worker Tasks, memory, locks, etc. and in a saturated system that can make a difference. But if it takes that long to re-stabilize, you may need to review your overall workload management strategy. The manuals and Orange Books have some good tips. For example, limiting concurrency for some workloads may "smooth out" resource usage and improve overall throughput.