Windows CPU monitoring with Splunk

I needed to overcome a Nagios limitation: its polling interval for CPU utilization is too slow. Splunk provides robust tools for leveraging WMI to monitor Windows performance. So you think to yourself: "Great, let's look at CPU usage per process on important servers," and you bump right up into the catch. Windows perf counters come in two flavors, "raw" and "formatted", and Splunk only reads the formatted ones. This is fine for most things, but for CPU usage the formatted data isn't accurate on multi-core systems. This post covers how to use the raw data and "cook" the values to match what you're used to seeing in Task Manager.

 

CPU utilization on multi-core systems

CPU utilization is measured as % Processor Time over a set period. On a multi-core system running multi-threaded applications, utilization isn't a simple 0-to-100 scale. During a time slice, several different threads may have had processor time on a core, or, more likely, different threads will have had different utilization on different cores. When calculating CPU usage on a Windows system, the maximum is actually 100% x the number of cores; a process that pegs two cores on a four-core box, for example, shows up as 200% out of a possible 400%.

 

[Screenshot: per-core CPU usage average]

The formatted CPU-usage-per-process counter doesn't take this into account, so the results will not be consistent with what you see in Performance Monitor or Task Manager.

 

Cooking the values

So what you actually have to do is take two samples of CPU utilization, determine the elapsed time between them, and calculate the difference. Thanks to a very detailed blog post by Greg Stemp at Microsoft, I worked out that the counter type is PERF_100NSEC_TIMER, making the formula:

100* (CounterValue2 - CounterValue1) / (TimeValue2 - TimeValue1)
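
A quick way to sanity-check that formula outside of Splunk is to sample the raw counters locally with PowerShell. This is just a sketch of the math, not part of the Splunk setup; the process name "notepad" is a placeholder for any running process:

# Two raw samples of one process, a couple of seconds apart ("notepad" is a placeholder)
$s1 = Get-CimInstance Win32_PerfRawData_PerfProc_Process -Filter "Name='notepad'" | Select-Object -First 1
Start-Sleep -Seconds 2
$s2 = Get-CimInstance Win32_PerfRawData_PerfProc_Process -Filter "Name='notepad'" | Select-Object -First 1

# PERF_100NSEC_TIMER: 100 * (counter delta) / (timestamp delta)
$counterDelta = $s2.PercentProcessorTime - $s1.PercentProcessorTime
$timeDelta = $s2.Timestamp_Sys100NS - $s1.Timestamp_Sys100NS
$cpu = 100 * $counterDelta / $timeDelta

"cpu = {0:N1}% (out of a possible {1}%)" -f $cpu, (100 * [Environment]::ProcessorCount)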

I’ll get into applying that formula in Splunk in a minute, but first, getting the data in.

 

Get the data in

Like I said, Splunk has great first-class Windows support, including built-in support for WMI inputs. It can also monitor performance counters directly, but since that input doesn't support raw values, we're going to use WMI. So, here it is:

[WMI:process]
index = test
interval = 10
server = RABBIT
wql = Select IDProcess,Name,PercentProcessorTime,TimeStamp_Sys100NS from Win32_PerfRawData_PerfProc_Process

This stanza goes in wmi.conf; it'll show up in the GUI as a remote performance monitoring input.

Querying the data

And at long last, show me the chart! First off, a long Splunk query that is going to run off the end of the page (don’t worry, we’ll unpack it and go over it):

earliest=-3m index=test Name!=_Total Name!=Idle | reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name | eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS) | search cputime > 0 AND cputime < 400 |  timechart span=3 avg(cputime) by Name

The pieces explained:

Name!=_Total Name!=Idle
  • _Total and Idle make the timechart look bad; if you want a total, it's easier to add it back later with | addtotals
reverse
  • We need the results from oldest to newest, because the calculations are based on a “previous value”
streamstats current=f
last(PercentProcessorTime) as last_PercentProcessorTime
last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS
by Name
  • streamstats carries values from previous events forward. current=f excludes the current event, so last() returns the previous event's value rather than the current one
  • last(PercentProcessorTime) and last(Timestamp_Sys100NS) pull the previous values forward and alias them
  • by Name groups the events together, so that you're getting the "right" previous PercentProcessorTime. Without it, you would be comparing against the previous event from some other process
eval cputime = 100 *
(PercentProcessorTime - last_PercentProcessorTime) /
(Timestamp_Sys100NS - last_Timestamp_Sys100NS)
  • This is the math: take the counter delta between the two samples, divide by the elapsed time between them, and multiply by 100 to get a percentage
search cputime > 0 AND cputime < 400
  • Throw out anomalous values. * 400 is 100 x the number of cores on this box. Adjust yours accordingly.
timechart span=3 avg(cputime) by Name
  • Make it pretty!

* PercentProcessorTime is a cumulative counter: it increments forever, so if you plot it on a chart it just climbs over time. When running the math on a single process that's fine, since the next value is always higher than the last. There are issues around the transitions between processes, though. If the last value of PercentProcessorTime for ProcessA is 40000 and the first value for ProcessB is 10000, the delta is -30000, and deltas like that throw the graphs way out of whack. The easiest thing to do is just drop those values, which is what the cputime > 0 filter does. We're going to pipe the values through an avg() in timechart anyway, so little gaps like that will get smoothed out.

The results

I used a little app called CpuKiller to throttle the CPUs and make the results more obvious.

 

[Screenshot: Windows Performance Monitor. Green: CpuKiller. Red: VMware.]

 

[Screenshot: the same data plotted in Splunk]

Note that splunkd doesn't appear in the perfmon graph because I didn't include that process, but the WQL that Splunk runs watches all processes.

 

Some of the fun things you can do with this are looking at processes from a number of servers across an entire farm (for example, your web servers), or plotting application response times against resource utilization. Yay for Splunk!

etcd 2.0 cluster with proxy on Windows Server 2012

I got interested in etcd while working on CoreOS for a work project. I'm no fan of Linux, but I like what etcd offers and how simple it is to set up and manage. Thankfully, the etcd team produces a Windows build with every release. I created a Chocolatey package and got it published, then used it to set up the cluster. The service is hosted by nssm, which also has a Chocolatey package and the extremely helpful nssm edit tool for making changes to the service config. Etcd has great documentation here.

Etcd is installed by running:

choco install -y etcd --params="<service parameters>"

where the service parameters are:

etcd --name "%COMPUTERNAME%" ^
--initial-advertise-peer-urls "http://%COMPUTERNAME%:2380" ^
--listen-peer-urls "http://%COMPUTERNAME%:2380" ^
--listen-client-urls "http://%COMPUTERNAME%:2379,http://127.0.0.1:2379" ^
--advertise-client-urls "http://%COMPUTERNAME%:2379" ^
--discovery "https://discovery.etcd.io/<your_token_here>" ^
--debug

Or you can install it in proxy mode by running:

choco install -y etcd --params="--proxy on --listen-client-urls http://127.0.0.1:2379 --discovery https://discovery.etcd.io/<your_token_here>"

Proxy mode is especially useful because it lets applications running on a machine be ignorant of the etcd cluster. Applications connect to localhost on a known port, and etcd in proxy mode manages finding and using the cluster. Even if the cluster is running on CoreOS, running etcd in proxy mode on Windows is a good way to help Windows apps leverage etcd.
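
As a quick smoke test from the Windows box (a sketch of my own, assuming the proxy is listening on the default client port 2379), you can talk to the cluster through localhost using the etcd v2 keys API:

# Write a key through the local proxy
Invoke-RestMethod -Method Put -Uri "http://127.0.0.1:2379/v2/keys/message" -ContentType "application/x-www-form-urlencoded" -Body "value=hello"

# Read it back
(Invoke-RestMethod -Uri "http://127.0.0.1:2379/v2/keys/message").node.value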

Watch a demo of the whole thing here:

Unattended Windows install after sysprep

I can't believe how hard this was to find. You can get an unattend.xml that installs Call of Duty, but you can't find one that just does as little as possible. This is it, for Windows Server 2012. I wanted to set up my base machine (which is very thin), sysprep it, and after cloning it, have it not ask any questions. If anyone finds this useful, please enjoy.

The reference for these files is here.

<!-- c:\Windows\System32\Sysprep\sysprep.exe /oobe /generalize /reboot /unattend:C:\unattend.xml -->
<unattend xmlns="urn:schemas-microsoft-com:unattend"> 
<settings pass="windowsPE"> 
      <component name="Microsoft-Windows-International-Core-WinPE" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64">
            <SetupUILanguage>
                  <WillShowUI>OnError</WillShowUI>
                  <UILanguage>en-US</UILanguage>
            </SetupUILanguage>
            <UILanguage>en-US</UILanguage>
      </component>
</settings>
<settings pass="specialize">
      <component name="Microsoft-Windows-Shell-Setup" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64">
            <ComputerName>*</ComputerName>
      </component>
</settings>
<settings pass="oobeSystem">
      <component name="Microsoft-Windows-Shell-Setup" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64">
            <OOBE> 
                  <HideEULAPage>true</HideEULAPage> 
                  <NetworkLocation>Work</NetworkLocation> 
                  <ProtectYourPC>3</ProtectYourPC>
                  <SkipMachineOOBE>true</SkipMachineOOBE> 
                  <SkipUserOOBE>true</SkipUserOOBE>
                  <HideLocalAccountScreen>true</HideLocalAccountScreen>
            </OOBE>
            <UserAccounts>
                <AdministratorPassword>
                    <Value><passwordgoeshere></Value>
                    <PlainText>true</PlainText>
                </AdministratorPassword>            
            </UserAccounts>
            
      </component>
</settings>
</unattend>

AD authentication for RavenDB

RavenDB 2.5.2750, IIS 7.0. Wow, I've started a lot of posts in the RavenDB Google group like that. This is what I've learned about RavenDB AD authentication.

  1. You need a valid commercial license.
  2. You need to enable both Windows authentication and Anonymous authentication in IIS.
  3. Your web.config must have Raven/AnonymousAccess set to None.
  4. Users need explicit access to the <system> database to create databases ("All" won't cut it).
  5. Putting users in Backup Operators allows backups (who knew).
  6. Local admins are always admins!

I’m not going to cover the commercial license bit, that’s easy enough.

Authentication modes

It seems obvious that you would enable Windows Authentication and disable Anonymous in IIS. Turns out, this is not the case. According to Oren:

Here is what happens.
We make a Windows Auth request to get a single-use token, then we make another request with that token, which is the actual request.
The reason we do this (for streaming as well as a bunch of other stuff) is that Windows auth can do crazy things like replay requests, and there is payload to consider.
You still keep Windows auth enabled, so that IIS will handle dealing with the creds, but Raven will set the header.

The web.config

The only thing that needs to be set in here is:
    <add key="Raven/AnonymousAccess" value="None" />
You can also set this to bypass authorization if you’re local to the box:
    <add key="Raven/AllowLocalAccessWithoutAuthorization" value="True" />

Permissions

Permissions are set in the system database –> Settings –> Windows Authentication. There are user and group tabs. Once you’ve added a group, push the little green plus icon to add new DBs to that user/group.
[Screenshot: Windows Authentication group permissions in the system database settings]
In Raven, all the DB permissions live in the system DB, not the DB itself.

Gotchas

Local admins

Yeah, that’s a few weeks we’ll never get back. Regardless of domain group membership, local admins on the server get admin access. That means if you put contoso\Everyone into SERVER\Administrators, then everyone in contoso gets admin access. Surprise!

Backup Operators

This is another loosely documented feature. If you want non-admin users to be able to run backups, put them in Backup Operators. Seems obvious, but it's not written down.

Testing

Raven has a special URL, https://ravendb.example.com/debug/user-info, which will present an authentication challenge and report the user's permissions. You'll get something like:
{"Remark":"Using anonymous user","User":null,"IsAdminGlobal":false,"IsAdminCurrentDb":false,"Databases":null,"Principal":null,"AdminDatabases":null,"ReadOnlyDatabases":null,"ReadWriteDatabases":null,"AccessTokenBody":null}
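
A quick way to hit that endpoint as the currently logged-on Windows user (just a sketch; swap in your own server URL):

# Call the debug endpoint with the current Windows credentials
Invoke-RestMethod -Uri "https://ravendb.example.com/debug/user-info" -UseDefaultCredentials | ConvertTo-Json -Depth 4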

Migrating from Linux to Windows

Five years ago, I set up a home server on Debian 5. Since then it's been upgraded to Debian 6, then Debian 7, been ported to newer hardware, and had drives added. Today I migrated it to Windows Server 2012 R2. After five years running on Linux, I had simply had enough. I'll get into the why in a little bit, but first, some how.

 

I didn’t have enough space anywhere to backup everything to external storage, so I did it on the same disk. The basic process was as follows:

  • Boot a Knoppix live DVD and use GNU Parted to resize the partition to make room for Windows
  • Install Windows, then install ext2fsd so I could mount the Linux partition
  • Move as much as I could from the Linux partition to Windows
  • Boot off Knoppix, shrink the Linux partition as much as possible
  • Boot back into Windows and expand that partition into the reclaimed space
  • Repeat until there was no EXT3 left

Needless to say it took a while, but it worked pretty well. Ext2fsd is pretty slick; I use it when I have to dual-boot as well, along with the NTFS driver on Linux to access my Windows partitions.

 

So now on to the why. Why would someone ever stop using Linux, when Linux is obviously "better", right? The fact is, Linux hasn't really matured in 5 years. Debian (and its illegitimate child, Ubuntu) is a major distro, but still there are no good remote/central management or instrumentation tools, file system ACLs are kludgy and not enabled by default, single sign-on and integrated authentication is non-existent or randomly implemented, and distros still can't agree on fundamental things like how to configure the network. It's basically a mess and I got tired of battling with it. Some of my specific beefs are:

  • VBox remote management is junk; Hyper-V is better in that space. VBox has a very low footprint and excellent memory management, but administering it remotely is not a strength.
  • Remote administration in general is better in Windows. An SSH prompt is not remote administration; being able to command a farm of web servers to restart in sequence, waiting for one to finish before starting the next, that is remote administration. Cobbling something together in perl/python/bash/whatever to do it is not the same. Also, Nagios is terrible. Yes, really, it is. Yes it is.
  • Linux printing is pretty terrible. I had CUPS set up, but had to use it in raw mode with the Windows driver on my laptop. Luckily Windows supports IPP; I never got print sharing going with Samba.
  • Just a general feeling of things being half finished and broken. Even in GParted, you have to click out of a text box before the OK button is enabled. Windows GUIs are better, and if you're a keyboard jockey like me, you can actually navigate the GUI really easily with some well-established keyboard shortcuts (except for Chrome, which is its own special piece of abhorrent garbage).

Linux is a tool, and like any tool it has its uses. I still run Debian VMs in Hyper-V for playing with PHP and Python, but even then, after development I would probably run those apps on Windows servers.

 

RavenDB operations practices – Part 1

At The Network, we use RavenDB, a document database written in C#. You can think of it as similar to MongoDB, but with an HTTP REST API. I'm not a developer at The Network, but I am responsible for operations of the GRC Suite, and that includes RavenDB. What follows is a collection of my notes and experience running RavenDB.

Running under IIS

We run RavenDB as an IIS application. It simplifies management and provides great logs and performance data.

Placement of RavenDB data

Do not keep RavenDB data under the website directory.

The default location of RavenDB data is ~\Databases, which places the data under the website. The problem with this is that when IIS detects a large number of changed static files (such as when you are updating indexes), it will recycle the application pool, and that causes all the databases to unload. So for this reason, we keep all our data, including the system database, outside the application directory.

Application pool settings

Do not let anything recycle the Application Pool

This means:

  • Disable automatic AppPool recycling (by time and memory usage)
  • Disable the WAS ping
  • Disable the idle timeout

Any of these settings that causes the AppPool to recycle will unload all your databases.
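
Here's one way to apply those settings with the WebAdministration module. This is a sketch: it assumes the application pool is named "RavenDB", so adjust the name to match yours.

Import-Module WebAdministration
$pool = "IIS:\AppPools\RavenDB"   # assumed pool name

# No scheduled, time-based, or memory-based recycling
Clear-ItemProperty $pool -Name recycling.periodicRestart.schedule
Set-ItemProperty $pool -Name recycling.periodicRestart.time -Value ([TimeSpan]::Zero)
Set-ItemProperty $pool -Name recycling.periodicRestart.memory -Value 0
Set-ItemProperty $pool -Name recycling.periodicRestart.privateMemory -Value 0

# No WAS ping, no idle timeout
Set-ItemProperty $pool -Name processModel.pingingEnabled -Value $false
Set-ItemProperty $pool -Name processModel.idleTimeout -Value ([TimeSpan]::Zero)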

Do not enable more than one worker process

This is the default setting and must not be changed. If there is more than one worker process, multiple instances of RavenDB will fight for access to the same ESENT database files.
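
To double-check that nobody has changed it (a sketch, using the same assumed pool name as above):

# Should report 1
(Get-ItemProperty IIS:\AppPools\RavenDB -Name processModel).maxProcesses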

Disable overlapping recycle

This is a great feature for websites, since it effectively lets the new process start handling requests while the old process is still completing existing ones. For RavenDB it's a bad thing, for the same reason as enabling more than one worker process. You want to avoid an AppPool recycle, but if one does happen, you don't want overlapping.

Disable shutdown Time Limit

Or at least increase it from the default of 90 seconds. This setting tells IIS to kill the process if it hasn't responded to a shutdown before the time limit expires. When shutting down, Raven cleanly stops and unloads its databases. If the process is killed instead, starting a DB will require recovery (done automatically), which just slows down the startup process.
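
Continuing the same sketch (again assuming the pool is named "RavenDB", with $pool set as above; the 20-minute limit is just an illustrative value):

# Disable overlapping recycles, and give Raven plenty of time to shut down cleanly
Set-ItemProperty $pool -Name recycling.disallowOverlappingRotation -Value $true
Set-ItemProperty $pool -Name processModel.shutdownTimeLimit -Value ([TimeSpan]::FromMinutes(20))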

Backing up RavenDB

RavenDB includes a number of backup options, including an internal backup scheduler and an external tool called Smuggler. We have hundreds of databases and needed to take backups every two hours, so we decided to use our SAN to take snapshots.

RavenDB is backed by ESENT, which has ACID transactions and robust crash recovery. Taking snapshots can leave the data in an inconsistent state, so the ESENT utility (esentutl) is used to clean up the DB. Three things must be done, in order:

  1. recover – which uses logs to return the DB to a clean shutdown state.
  2. repair – which cleans up corruption. Running repair before recover will result in data loss.
  3. defrag – compacts the database and also repairs indexes
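
rem 1. recover: replay the transaction logs to reach a clean shutdown state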
C:\Windows\system32\esentutl.exe /r RVN /l logs /s system /i
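rem 2. repair: clean up any remaining corruption (only after recover has run)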
C:\Windows\system32\esentutl.exe /p /o Data
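rem 3. defrag: compact the database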
C:\Windows\system32\esentutl.exe /d Data

Part 2

I’m currently working on clustering, sharding and authentication for RavenDB. I’ll post a part 2 when those are figured out.

Trusting self-signed certificates in Windows

I had a requirement for users on a computer to automatically trust that computer's self-signed certificate. I'm generating and managing these certificates with PowerShell. Googling left me with two general themes:

  1. Use some 3rd party app
  2. Terrible idea! Depraved hackers will eat your brain!

I didn’t want to use a 3rd party app, and I’m not scared of hackers. Eventually I got it, so I’m sharing it with the world.

New-SelfSignedCertificate -DnsName "*.mydomain.com" -CertStoreLocation Cert:\LocalMachine\My
$cert = Get-Item Cert:\LocalMachine\My\* | Where-Object {$_.Subject -eq "CN=*.mydomain.com"}
Export-PfxCertificate -Cert $cert -FilePath $ENV:TEMP\cert.pfx -Password (ConvertTo-SecureString "somepassword" -AsPlainText -Force)
Import-PfxCertificate -FilePath $ENV:TEMP\cert.pfx -Password (ConvertTo-SecureString "somepassword" -AsPlainText -Force) -CertStoreLocation Cert:\LocalMachine\TrustedPeople

New-SelfSignedCertificate is the cmdlet, but it won't let you put the certificate straight into LocalMachine\TrustedPeople. That's why you need to export it as a PFX and re-import it; when you import it, you can put it in TrustedPeople.
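
To confirm it worked, a quick sketch using the same example subject as above:

# The certificate should now show up in the TrustedPeople store
Get-ChildItem Cert:\LocalMachine\TrustedPeople | Where-Object { $_.Subject -eq "CN=*.mydomain.com" } | Format-Table Subject, Thumbprint, NotAfter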