Windows CPU monitoring with Splunk
January 6, 2016
I needed to overcome the Nagios limitation of a slow polling interval for CPU utilization. Splunk provides robust tools for leveraging WMI to monitor Windows performance. So you think to yourself, “Great, let’s look at CPU usage per process on important servers”, and you bump right into the catch: Windows perf counters come in two flavors, “raw” and “formatted”, and Splunk only reads the formatted ones. That’s fine for most things, but for CPU usage the formatted data isn’t accurate on multi-core systems. This post covers how to use the raw data and “cook” the values to match what you’re used to seeing in Task Manager.
CPU utilization on multi-core systems
CPU utilization is measured by % Processor Time over a set period. With multi-threaded applications, utilization on a multi-core system isn’t a simple 0 to 100 scale. During a time slice, several different threads may have had processor time on a core, or more likely, different threads will have had different utilization on different cores. When calculating CPU usage on a Windows system, the maximum is actually 100% × the number of cores; a fully busy multi-threaded process on a four-core box, for example, can report 400%.

Per core CPU usage average
The formatted CPU usage per process counter doesn’t take this into account, and the results will not be consistent with what you see in Performance Monitor or Task Manager.
Cooking the values
So what you actually have to do is take two samples of CPU utilization, determine the elapsed time between them, and calculate the difference. Thanks to a very detailed blog post by Greg Stemp at Microsoft, I worked out that the counter type is PERF_100NSEC_TIMER, making the formula:
100* (CounterValue2 - CounterValue1) / (TimeValue2 - TimeValue1)
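For example, say a process’s PercentProcessorTime goes from 1,500,000,000 to 1,509,000,000 (a delta of 9,000,000 ticks of 100 ns each, i.e. 0.9 seconds of CPU) while Timestamp_Sys100NS advances by 100,000,000 ticks (10 seconds). The numbers are made up, but the math works out to:

100 * (9000000) / (100000000) = 9

so that process averaged 9% of one core over the sample window.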
I’ll get into applying that formula in Splunk in a minute, but first, getting the data in.
Get the data in
Like I said, Splunk has great first-class Windows support, with built-in WMI inputs. It can also monitor performance counters automatically, but since those inputs don’t support raw values, we’re going to use WMI. So, here it is:
[WMI:process]
index = test
interval = 10
server = RABBIT
wql = Select IDProcess,Name,PercentProcessorTime,Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process
The stanza goes in wmi.conf, and it’ll appear in the GUI as a remote performance monitor input.
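Once the input is enabled, a quick sanity-check search against the index configured above should show raw counter events arriving, with both the counter and the timestamp climbing on every 10-second sample:

index=test | head 5 | table _time Name PercentProcessorTime Timestamp_Sys100NS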
Querying the data
And at long last, show me the chart! First off, a long Splunk query that is going to run off the end of the page (don’t worry, we’ll unpack it and go over it):
earliest=-3m index=test Name!=_Total Name!=Idle | reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name | eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS) | search cputime > 0 AND cputime < 400 | timechart span=3 avg(cputime) by Name
The pieces explained:
Name!=_Total Name!=Idle
Filters out the _Total and Idle instances, which aren’t real processes and would swamp the per-process numbers.

reverse
Puts the events in oldest-first order, so streamstats walks each process’s samples in the order they were collected.

streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name
For every event, attaches the counter and timestamp from the previous sample of the same process; current=f keeps the current event out of that calculation.

eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
Applies the PERF_100NSEC_TIMER formula to each pair of samples.

search cputime > 0 AND cputime < 400
PercentProcessorTime increments forever; plot it on a chart and it just rises over time. When running the math on a single process that’s fine, since the next value is always higher than the last, but there are issues around the transitions. If the last value of PercentProcessorTime for ProcessA is 40000 and the first value for ProcessB is 10000, the delta is -30000 and it throws the graph way out of whack. The easiest thing to do is just drop those values; we’re piping everything through an avg() in timechart anyway, so little gaps like that get smoothed out. The upper bound is 100% × the number of cores (400 for a quad-core box).

timechart span=3 avg(cputime) by Name
Buckets the cooked values over time and charts the average per process.
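One optional tweak: the cooked value is on perfmon’s 0 to (100 × cores) scale. If you’d rather match Task Manager’s 0 to 100 scale, you can divide by the core count before charting. A minimal sketch, assuming a four-core box (swap in your own count):

| eval cputime_pct = cputime / 4

Drop that in just before the timechart and chart avg(cputime_pct) instead of avg(cputime).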
The results
I used a little app called CpuKiller to throttle the CPUs and make the results more obvious.

The same data plotted in Splunk. Note that Splunkd doesn’t appear in the perfmon graph because I didn’t include that process, but the WQL being run by Splunk watches all processes.
Some of the fun things you can do with this are looking at processes from a number of servers across an entire farm (for example, your web servers), or plotting application response times against resource utilization. Yay for Splunk!