You see, I kind of understood the following chart on MP-RAS. This is the FreeMem tab from the "Higa Charts" on an actual MP-RAS system on which I did some performance work. (Hopefully, everyone is familiar with the Higa Charts, which you can create using the views, macros and Excel chart created by Larry Higa. If not, let me know if I should write a blog about creating these charts to help get you started using these powerful tools.)
When I analyzed this chart, I concluded that this customer was occasionally running low on memory. As a general rule on an MP-RAS system, if your Minimum Free Memory (the black sections of this graph) frequently fell below 50 MB, it was time to start digging in to find out what was going on and what to do about it. The following graph frequently falls below 50 MB and sometimes approaches or hits zero.
The thought that memory was running low was corroborated by the Paging & Memory Allocation Fails graph. The large spikes in Mem Fails are another indication that we may be having memory issues. Again, this is on MP-RAS.
At this customer, we dug into the periods when Free Mem dropped and Mem Fails spiked, and found that these were times when a large number of load jobs were hitting the system all at once, and some of them were not specifying a Sessions parameter. Without a Sessions parameter, a load job defaults to one session per AMP.
Supplying a Sessions parameter and staggering the start of the load jobs helped improve the memory situation.
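To see why a missing Sessions parameter matters, here is a rough back-of-the-envelope sketch in Python. The job count, AMP count, and per-job cap are hypothetical numbers picked only for illustration, and the function name is my own. The only rule it encodes is the one described above: a load job with no Sessions parameter defaults to one session per AMP.

```python
def total_load_sessions(num_jobs, num_amps, sessions_per_job=None):
    """Sessions requested when num_jobs load jobs start at the same time.

    With no Sessions parameter, each job defaults to one session per AMP.
    Otherwise each job requests sessions_per_job sessions.
    """
    per_job = num_amps if sessions_per_job is None else sessions_per_job
    return num_jobs * per_job


# Hypothetical system and workload, for illustration only.
uncapped = total_load_sessions(num_jobs=20, num_amps=144)
capped = total_load_sessions(num_jobs=20, num_amps=144, sessions_per_job=8)

print("Sessions with no Sessions parameter:", uncapped)
print("Sessions with a cap of 8 per job:   ", capped)
```

The exact numbers do not matter. The point is that the default multiplies every concurrent job by the number of AMPs, so a handful of jobs starting together can request far more sessions than anyone intended.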
In other cases, the appropriate action may be to contact your CSR to consider a reduction in FSG Cache to give a little more memory back to the system.
At a high level, that is how I understood the process of looking for potential memory issues on MP-RAS.
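To put the MP-RAS rule of thumb in concrete terms, here is a minimal Python sketch that measures how often minimum free memory dips below 50 MB. The sample values and the function name are made up for illustration. In practice the numbers would come from your own resource usage data, and what counts as "frequently" remains a judgment call.

```python
LOW_FREE_MEM_MB = 50  # MP-RAS rule of thumb described above


def low_memory_share(min_free_mb_samples, floor_mb=LOW_FREE_MEM_MB):
    """Return the fraction of samples whose minimum free memory is below floor_mb."""
    samples = list(min_free_mb_samples)
    if not samples:
        return 0.0
    below = sum(1 for mb in samples if mb < floor_mb)
    return below / len(samples)


# Hypothetical minimum-free-memory readings in MB, one per logging interval.
samples = [120, 45, 30, 200, 0, 75, 48, 90]
share = low_memory_share(samples)
print(f"{share:.0%} of intervals fell below {LOW_FREE_MEM_MB} MB")
```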
Then, I ran the Higa charts on a Linux system. The following chart shows Free Memory. At this particular customer, Free Mem never fell below 11 GB! So, the Free Mem chart, on Linux, is no longer useful for determining if we are having memory issues. Linux manages memory differently than MP-RAS, and the measurement of Free Memory makes this very obvious.
So, we need to focus on Paging information. The following chart is one of the new ones from the TD12 version of the Higa Charts, showing Data Paging for a Linux system. It charts the number of 4K blocks swapped out per second (Ctxt PgWrts/Sec) and the number of 4K blocks swapped in per second (Ctxt PgRds/Sec). Excessive swapping indicates we may be running low on available free memory, forcing the system to swap pages out and back in over and over again.
The above chart shows there were certain times of day when memory was at a premium, so further investigation is necessary.
If you still aren't running Higa Charts, the following query provides similar information.
LOCKING dbc.resusagespma FOR ACCESS
SELECT
TheDate
,TheTime (FORMAT '99:99:99')
,NodeId
,MemCtxtPageReads/secs (named "pswpin/s")
,MemCtxtPageWrites/secs (named "pswpout/s")
,(MemCtxtPageReads+MemCtxtPageWrites)/secs (named "TotSwap/s")
FROM dbc.resusagespma
WHERE TheDate GT DATE-7
ORDER BY TheDate,TheTime,NodeId;
Ideally, having no swapping is best, of course. But when do you get concerned? The best information I have says that if you are exceeding 10 page swaps per second, you probably have an issue worth addressing for that period of time. Even that needs to be taken with a grain of salt. If it is happening at 2 AM when your batch jobs kick off, and you aren't seeing any negative impact, perhaps it isn't a concern. But if it happens Monday morning at 9 AM, at your busiest time of day, then it is probably more of an issue.
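If you export the query results to a file or script, the 10-swaps-per-second rule of thumb can be applied with something like the Python sketch below. Treat it as an illustration, not a tool: the sample rows are invented, the row layout simply mirrors the time, node, swap-in and swap-out columns of the query, and the 8 AM to 6 PM "business hours" window is my own assumption that you should adjust for your site.

```python
from datetime import time

SWAP_THRESHOLD = 10.0  # page swaps per second, the rule of thumb above


def flag_heavy_swapping(rows, threshold=SWAP_THRESHOLD):
    """Return (the_time, node_id, total_swaps_per_sec) for rows above threshold.

    rows: iterable of (the_time, node_id, swap_in_per_sec, swap_out_per_sec).
    """
    flagged = []
    for the_time, node_id, swap_in, swap_out in rows:
        total = swap_in + swap_out
        if total > threshold:
            flagged.append((the_time, node_id, total))
    return flagged


def in_business_hours(t, start=time(8, 0), end=time(18, 0)):
    """Assumed busy window; adjust start and end for your own workload."""
    return start <= t < end


# Invented sample rows, for illustration only.
sample = [
    (time(2, 0), "1-01", 9.0, 6.5),
    (time(9, 0), "1-01", 7.0, 8.0),
    (time(13, 0), "1-02", 1.0, 2.0),
]

for t, node, total in flag_heavy_swapping(sample):
    if in_business_hours(t):
        note = "investigate first"
    else:
        note = "review if users are affected"
    print(t.strftime("%H:%M"), node, total, note)
```

Splitting the flagged periods by time of day keeps the focus on the swapping that users are most likely to feel.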
There is also a chart for Code Paging In, as seen below. At the time of this writing, I do not have a good rule of thumb for when this becomes a concern, but I will post one when I find out.
Non-FSG Cache Memory Managed by the Operating System is Used For:
• Derived table optimization
• VPROC Tasks
• AMP Worker tasks
• PE Tasks
• Dictionary cache (PE)
• Request text (PE)
• Step text (PE)
• Space Accounting Cache (PE)
• Database Query logging (PE)
• Plastic/concrete step memory (PE)
• Session Info (PE)
• Hash join memory
• Row redistribution buffers
• Aggregation buffers
• Bynet/BNS page pool
• Other BNS memory (RCBs)
• All OS-level memory use
• Kernel Interrupt page pool
Running Low on Memory Can Lead To:
• Swapping -- Code and data pages are moved in and out of memory, and tasks must wait for their pages to be swapped back in before they can execute. In extreme cases, the entire system can be bottlenecked on swap I/Os to the root disk. This can hurt query performance, and tactical queries are extremely sensitive to this problem
• In extreme cases, Bynet failures, restarts, and system "pauses" can occur
First, I would strongly recommend you look for load utilities that are coming in all at once and grabbing too many sessions. Sometimes people forget the SESSIONS parameter, so each job takes the default of one session per AMP. Other times I have seen jobs fired off based on the time of day, all lumped together at exactly the same start time. In that case we were able to stagger the start times and eliminate a low-memory situation.
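Staggering is simple enough to sketch. The Python below spreads a batch of jobs that were all scheduled for the same moment across fixed intervals. The job names, the 2 AM start, and the 10-minute gap are all hypothetical. The right gap depends on how long each job holds its sessions, and your scheduler will have its own way of expressing this.

```python
from datetime import datetime, timedelta


def stagger_start_times(job_names, first_start, gap):
    """Give each job its own start time, spaced `gap` apart, in list order."""
    return {name: first_start + i * gap for i, name in enumerate(job_names)}


# Hypothetical jobs that were all originally scheduled for 02:00.
jobs = ["load_sales", "load_inventory", "load_customers"]
schedule = stagger_start_times(
    jobs,
    first_start=datetime(2024, 1, 1, 2, 0),
    gap=timedelta(minutes=10),
)

for name, start in schedule.items():
    print(name, start.strftime("%H:%M"))
```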
Reducing FSGCache is the primary method of freeing up memory. If you are seeing evidence of a possible shortage of Free Memory, you should start discussions with your CSR. Get input from CS about the need to reduce it, and where other customers are finding the sweet spot to be for FSGCache given the size of your system, the operating system you are using, and many other factors. It is impossible to provide an exact number without knowing more details about each individual system.
Please don't hesitate to post comments, thoughts and clarifications on this. And, as I get more information I will do the same.
And, I appreciate any other topic ideas you may have for me to research and write up.
This article was co-authored by Tom Greene. Thanks Tom!