Windows CPU monitoring with Splunk

I needed to get around a Nagios limitation: its polling interval is too slow for CPU utilization. Splunk provides robust tools for leveraging WMI to monitor Windows performance, so you think to yourself, “Great, let’s look at CPU usage per process on important servers,” and you bump right up into the catch. Windows perf counters come in two flavors, “raw” and “formatted”, and Splunk only reads the formatted ones. That’s fine for most things, but for CPU usage, the formatted data isn’t accurate on multi-core systems. This post covers how to use the raw data and “cook” the values to match what you’re used to seeing in Task Manager.


CPU utilization on multi-core systems

CPU utilization is measured by % Processor Time over a set period. Multi-threaded applications make utilization on multi-core systems more complicated than a simple 0-to-100 scale: during a time slice, several different threads may have had processor time on a core, or, more likely, different threads will have had different utilization on different cores. When calculating CPU usage on a Windows system, the maximum is actually 100% x the number of cores, so a process keeping two cores of a quad-core box fully busy shows up as 200% out of a possible 400%.


[Screenshot: per-core CPU usage averages]

The formatted CPU-usage-per-process counter doesn’t take this into account, so the results won’t be consistent with what you see in Performance Monitor or Task Manager.


Cooking the values

So what you actually have to do is take two samples of CPU utilization, determine the elapsed time between them, and calculate the difference. Thanks to a very detailed blog post by Greg Stemp at Microsoft, I worked out that the counter type is PERF_100NSEC_TIMER, making the formula:

100* (CounterValue2 - CounterValue1) / (TimeValue2 - TimeValue1)
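
If you want to sanity-check that formula before wiring it into Splunk, here’s a minimal sketch using | makeresults and made-up counter values (the numbers, and the 1/2 suffixes on the field names, are invented purely for illustration; both the counter and the timestamp are in 100-nanosecond units):

| makeresults
| eval PercentProcessorTime1=1000000000, Timestamp1=2000000000
| eval PercentProcessorTime2=1600000000, Timestamp2=2300000000
| eval cputime = 100 * (PercentProcessorTime2 - PercentProcessorTime1) / (Timestamp2 - Timestamp1)

That works out to 100 * 600,000,000 / 300,000,000 = 200, i.e. a process that kept two cores completely busy for the whole interval, which is exactly the over-100% behavior described above.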

I’ll get into applying that formula in Splunk in a minute, but first, getting the data in.


Get the data in

Like I said, Splunk has great first-class Windows support, including built-in WMI inputs. It can also monitor performance counters automatically, but since those inputs don’t support raw values, we’re going to use WMI. So, here it is:

[WMI:process]
index = test
interval = 10
server = RABBIT
wql = Select IDProcess,Name,PercentProcessorTime,Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process

This stanza goes in wmi.conf; it’ll appear in the GUI as a remote performance monitor input.
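
As far as I know, the server setting in wmi.conf also takes a comma-separated list, so if you want the same counters from several machines one stanza will do it (SILO and WEB01 below are just placeholder host names):

[WMI:process]
index = test
interval = 10
server = RABBIT, SILO, WEB01
wql = Select IDProcess,Name,PercentProcessorTime,Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process

Each event should come back with the host field set to the server it was polled from, which is what makes the farm-wide searches at the end of this post possible.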

Querying the data

And at long last, show me the chart! First off, a long Splunk query that is going to run off the end of the page (don’t worry, we’ll unpack it and go over it):

earliest=-3m index=test Name!=_Total Name!=Idle | reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name | eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS) | search cputime > 0 AND cputime < 400 |  timechart span=3 avg(cputime) by Name

The pieces explained:

Name!=_Total Name!=Idle
  • _Total and Idle make the time chart look bad, and a total can be added back more easily with an | addtotals (see the sketch after this list)
reverse
  • We need the results from oldest to newest, because the calculations are based on a “previous value”
streamstats current=f
last(PercentProcessorTime) as last_PercentProcessorTime
last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS
by Name
  • Streamstats pulls values from the previous event forward; current=f excludes the current event, so last() gives us the previous event’s values rather than our own.
  • last(PercentProcessorTime) and last(Timestamp_Sys100NS) pull the previous values forward and alias them
  • by Name groups the events together, so that you’re getting the “right” previous PercentProcessorTime. Without it, you would be getting values from some previous process
eval cputime = 100 *
(PercentProcessorTime - last_PercentProcessorTime) /
(Timestamp_Sys100NS - last_Timestamp_Sys100NS)
  • This is the math: the processor time the process consumed between the two samples, divided by the elapsed time between the samples, times 100.
search cputime > 0 AND cputime < 400
  • Throw out anomalous values. * 400 is 100 x the number of cores on this box. Adjust yours accordingly.
timechart span=3 avg(cputime) by Name
  • Make it pretty!

* PercentProcessorTime increments forever; if you plot it on a chart, it just rises over time. When running the math on a single process that’s fine, because the next value is always higher than the last, but there are issues around the transitions. If the last value of PercentProcessorTime for ProcessA is 40000 and the first value for ProcessB is 10000, the delta comes out to -30000 and throws the graphs way out of whack. The easiest thing to do is just drop those values. We’re going to pipe the values through an avg() in timechart anyway, so little gaps like that will get smoothed out.
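
If you do want the overall total back on the chart, one option (a quick sketch, not something I’ve tuned) is to let timechart build the per-process columns and then sum them per time slice by appending addtotals to the end of the search above:

... | timechart span=3 avg(cputime) by Name | addtotals fieldname=Total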

The results

I used a little app called CpuKiller to throttle the CPUs and make the results more obvious.


[Screenshot: Windows Performance Monitor; green = CPU Killer, red = VMware]


[Screenshot: the same data plotted in Splunk]

Note that splunkd doesn’t appear in the perfmon graph above because I didn’t include that process, but the WQL being run by Splunk watches all processes.
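
Going the other way, if you only care about a handful of processes, WQL also takes a where clause, so you can filter at the source instead of in the search. A sketch, with example process names (extra instances of the same executable show up as w3wp#1, w3wp#2, and so on, which is why I’d use LIKE rather than =):

wql = Select IDProcess,Name,PercentProcessorTime,Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process where Name LIKE "w3wp%" or Name LIKE "sqlservr%"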


Some of the fun things you can do with this are looking at processes from a number of servers across an entire farm (for example, your web servers), or plotting application response times against resource utilization. Yay for Splunk!
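
As a closing sketch of that farm idea (assuming every server’s stanza writes to the same test index), key the streamstats on host as well as Name so each machine’s counters are cooked independently, then build a combined series name to split the chart on:

earliest=-15m index=test Name!=_Total Name!=Idle
| reverse
| streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by host, Name
| eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
| search cputime > 0 AND cputime < 400
| eval series = host . ":" . Name
| timechart avg(cputime) by series

timechart only plots a limited number of series by default, so on a big farm you may need to raise its limit or narrow the search to the processes you actually care about.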