etcd 2.0 cluster with proxy on Windows Server 2012

I got interested in etcd while working on CoreOS for a work project. I’m no fan of Linux, but I like what etcd offers and how simple it is to set up and manage. Thankfully, the etcd team produces a Windows build with every release. I created a Chocolatey package and got it published, then used it to set up the cluster. The service is hosted by nssm, which also has a Chocolatey package and the extremely helpful nssm edit tool for making changes to the service config. Etcd has great documentation here.

Etcd is installed by running:

choco install -y etcd --params="<service parameters>"

Where the service parameters are:

etcd --name "%COMPUTERNAME%" ^
--initial-advertise-peer-urls "http://%COMPUTERNAME%:2380" ^
--listen-peer-urls "http://%COMPUTERNAME%:2380" ^
--listen-client-urls "http://%COMPUTERNAME%:2379" ^
--advertise-client-urls "http://%COMPUTERNAME%:2379" ^
--discovery "<your_token_here>"

Or you can install it in proxy mode by running:

choco install -y etcd --params="--proxy on --listen-client-urls http://localhost:2379 --discovery <your_token_here>"

Proxy mode is especially useful because it lets applications running on a machine be ignorant of the etcd cluster. Applications connect to localhost on a known port, and etcd in proxy mode manages finding and using the cluster. Even if the cluster is running on CoreOS, running etcd in proxy mode on Windows is a good way to help Windows apps leverage etcd.
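To make that concrete: with the proxy listening on localhost:2379, an application only ever builds URLs against localhost and lets the proxy find the cluster. A minimal Python sketch of the idea (the key name is hypothetical; the /v2/keys path is the etcd 2.0 API):

```python
# Talking to etcd through a local proxy: the app only knows localhost:2379.
# The key name below is hypothetical.

def key_url(key, host="localhost", port=2379):
    """Build an etcd v2 keys API URL; the proxy forwards it to the cluster."""
    return "http://{0}:{1}/v2/keys/{2}".format(host, port, key.lstrip("/"))

print(key_url("/config/feature-flag"))
# prints http://localhost:2379/v2/keys/config/feature-flag

# The app would then GET or PUT that URL, e.g.:
#   import urllib.request
#   urllib.request.urlopen(key_url("/config/feature-flag")).read()
```

If the cluster moves or members change, nothing on the app server has to be reconfigured; only the proxy cares.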

Watch a demo of the whole thing here:


Unattended Windows install after sysprep

I can’t believe how hard this was to find. You can get an unattend.xml that installs Call of Duty, but you can’t find one that just does as little as possible. This is it, for Windows 2012. I wanted to set up my base machine (which is very thin), sysprep it, and after cloning it, have it not ask any questions. If anyone finds this useful, please enjoy.

The reference for these files is here.

<!-- c:\Windows\System32\Sysprep\sysprep.exe /oobe /generalize /reboot /unattend:C:\unattend.xml -->
<unattend xmlns="urn:schemas-microsoft-com:unattend">
  <settings pass="windowsPE">
    <component name="Microsoft-Windows-International-Core-WinPE" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64" />
  </settings>
  <settings pass="specialize">
    <component name="Microsoft-Windows-Shell-Setup" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64" />
  </settings>
  <settings pass="oobeSystem">
    <component name="Microsoft-Windows-Shell-Setup" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64" />
  </settings>
</unattend>

AD authentication for RavenDB

RavenDB 2.5.2750, IIS 7.0. Wow, I’ve started a lot of posts in the RavenDB Google group like that. This is what I’ve learned about RavenDB AD authentication.

  1. You need a valid commercial license.
  2. You need to enable both Windows authentication and Anonymous.
  3. Your web.config must have Raven/AnonymousAccess set to None.
  4. Users need explicit access to the <system> DB to create databases (all won’t cut it)
  5. Putting users in Backup Operators allows backups (who knew)
  6. Local admins are always admins!

I’m not going to cover the commercial license bit, that’s easy enough.

Authentication modes

It seems obvious that you would enable Windows Authentication and disable Anonymous in IIS. Turns out, this is not the case. According to Oren:

Here is what happens.
We make a Windows Auth request to get a single use token, then we make another request with that token, which is the actual request.
The reason we do this (for streaming as well as a bunch of other stuff) is that windows auth can do crazy things like replay requests, and there is payload to consider.
You still keep Windows auth enabled, so that IIS will handle dealing with the creds, but raven will set the header.

The web.config

The only thing that needs to be set in here is:
    <add key="Raven/AnonymousAccess" value="None" />
You can also set this to bypass authorization if you’re local to the box:
    <add key="Raven/AllowLocalAccessWithoutAuthorization" value="True" />


Permissions are set in the system database –> Settings –> Windows Authentication. There are user and group tabs. Once you’ve added a group, push the little green plus icon to add new DBs to that user/group.
In Raven, all the DB permissions live in the system DB, not the DB itself.


Local admins

Yeah, that’s a few weeks we’ll never get back. Regardless of domain group membership, local admins on the server get admin access. That means if you put contoso\Everyone into SERVER\Administrators, then everyone in contoso gets admin access. Surprise!

Backup Operators

This is another loosely documented feature. If you want non-admin users to be able to take backups, make them Backup Operators. Seems obvious, but it’s not written down.


Raven has a special URL that will present an authentication challenge and report the user’s permissions. You’ll get something like:
{"Remark":"Using anonymous user","User":null,"IsAdminGlobal":false,"IsAdminCurrentDb":false,"Databases":null,"Principal":null,"AdminDatabases":null,"ReadOnlyDatabases":null,"ReadWriteDatabases":null,"AccessTokenBody":null}

Why I like SNI

This is the case I made for reducing the complexity of multihomed web servers by running a multi-tenant configuration with SNI. Yes, I glossed over some areas, but the problems I describe and the proposed solution are correct.

When the internet was young, the topology was simple. A browser connected to a host, requested a resource, and the host returned it.
Hosting multiple sites on a single server meant assigning multiple IP addresses to that server. While this solution functions, it quickly becomes unsustainable.
Modern web applications depend on external resources like databases and storage. A server with multiple IP addresses may use any one of them to initiate connections. If even one of these addresses is misconfigured or blocked by an internal firewall, connections from that address will not reach the requested resource. The result can be frustrating, intermittent connectivity issues that are difficult to diagnose. Worse still, there is no set standard for tracking static IP addresses; it’s usually a spreadsheet. That means the IP you think you’re reserving for your website could already be in use by other servers, other hardware, or anything at all. The complexity is compounded by adding more servers and environments. No matter how careful you are with tracking and change management, the risk is real, and managing it is cumbersome and time consuming.
Fortunately HTTP 1.1 helped resolve that problem. By adding the host to the request header, a server could run multiple sites on a single IP, and let the web server direct traffic to the correct application.
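Concretely, two requests arriving at the same IP address can be routed to different sites purely on the Host header (the hostnames here are illustrative):

```
GET / HTTP/1.1
Host: www.contoso.com

GET / HTTP/1.1
Host: www.fabrikam.com
```

Same socket, same server, two different applications.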
It’s just not always that easy. Unencrypted traffic is inherently insecure. SSL, and now TLS, is used to encrypt traffic between the client and server. The HTTP payload, including the Host header, is encrypted, which means it can’t be used to direct traffic to the correct application. The host doesn’t know where to direct the request. What can be done? Certainly you want to avoid all the risk, complexity, and chaos of managing multi-homed servers, but the data still needs to be secure.
The answer is Server Name Indication or SNI. SNI adds the server name field to the TLS handshake so that the destination server knows precisely for which virtual host the traffic is intended. The server must still present a trusted certificate for the requested host, but it can now support multiple hosts. The resulting simplified network topology reduces the risk that a misconfiguration could impact access to internal resources.
SNI is a widely adopted proposed standard from the IETF, defined in RFC 3546 and RFC 6066. It is supported by modern operating systems, browsers, web servers, and network appliances.
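From the client side, SNI is just a hostname handed to the TLS stack before the handshake. A Python sketch (the hostname is illustrative; the connection itself is left commented out):

```python
import socket
import ssl

# Most modern TLS stacks support SNI; Python exposes a flag for it.
print(ssl.HAS_SNI)  # True on any modern build

ctx = ssl.create_default_context()

# Sketch (not executed here): server_hostname fills the SNI field in the
# ClientHello, so the server knows which certificate to present before
# anything is encrypted.
# with socket.create_connection(("www.example.com", 443)) as raw:
#     with ctx.wrap_socket(raw, server_hostname="www.example.com") as tls:
#         print(tls.version())
```

The server reads that field, picks the matching virtual host and certificate, and the handshake proceeds as usual.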

It is not supported by Internet Explorer running on Windows XP. Microsoft wrote its own implementation of SSL, which is good because Windows was not affected by Heartbleed or other vulnerabilities in OpenSSL, but it’s also a hindrance because the company chose not to back-port the feature to a decade-old version of Windows. Windows XP users can use Firefox or Chrome, which ship their own TLS stacks that support SNI, or upgrade to Windows Vista or higher, which does support SNI. There is simply no support for IE on XP, a 13-year-old, outdated, and unsupported operating system.
Hope is not lost, however. Using a modern load balancer, web applications can present multiple IP addresses to the public internet while using SNI for internal communication. This solution balances the need to support the widest possible user base without hampering developers and operations teams with an unwieldy and complex network topology.

The problem with moar testing

I’m going to start this off with a story, and when it’s done, we’ll come back to software testing.

The Boeing 737 is the most popular passenger aircraft in the world. The type first flew in 1967, and with four major revisions, is still being produced and flown today.

On March 3, 1991, United Airlines Flight 585, a 737-200, crashed in Colorado Springs, killing 25 people. The plane was on final approach into Colorado Springs when it suddenly banked hard, pitched nose down, and plunged into the ground at 245 MPH. Following the tragedy, the NTSB sifted through the rubble but was unable to determine the cause.

On September 8, 1994, USAir Flight 427, a 737-300, crashed near Pittsburgh, PA, killing 132 people. Again the NTSB, the premier accident investigation agency in the world, was stumped. Boeing 737s were falling out of the sky, and over 150 people were dead. On June 9, 1996, a third 737, Eastwind Airlines Flight 517, experienced the same hard bank as the other two planes. Miraculously, the pilot was able to recover the plane and land safely. The break for investigators was that the plane was intact and could be investigated.

The investigators focused on a piece of hydraulic equipment called the Power Control Unit (PCU), which is responsible for controlling the rudder in the plane’s vertical stabilizer. Undamaged, the unit was put through the standard battery of tests and performed flawlessly. Ultimately, a test was performed where the unit was chilled to -40 and tested using heated hydraulic fluid. Finally the unit jammed, which, had the unit been installed in an aircraft, would have pushed the rudder to its blowdown limit, enough to crash the plane.

The story here though isn’t the investigation or the tragic loss of life, but rather the story of all the testing that DID happen. The PCU had passed all its individual tests: operating under load, under certain temperatures, number of duty cycles, etc.; what developers would call “unit tests”. The PCU had also functioned properly during the mandatory pre-flight tests; what developers would call “integration tests”. Finally, the PCU had performed thousands of take-offs and landings on flight 585 prior to its crash; what developers would call in-production testing.

The airplane was built by Boeing, and it took a year from the prototype rolling out to the plane getting its type certification from the FAA. The PCU is built by Parker Hannifin, an industry leader in aerospace and hydraulics. So here is an airplane, built by established industry leaders, the model tested for a year and then essentially unchanged, subjected to a battery of individual component tests, flown for years without incident, and yet it still fell out of the sky. How? Simple: the edge case. Very hot fluid hitting a very cold part wasn’t something engineers expected, and it didn’t happen often, so it was never tested.

How does this apply to software development? Well, for starters, try telling product owners or company owners that the design is going to freeze for a year for extensive testing, and then will only change once per decade after that. Good luck selling it. Somehow software developers (and IT engineers) are expected to meet or exceed the level of testing reserved for the aerospace industry, and maintain a nearly constant rate of change, even when all that testing still doesn’t fully prevent plane crashes.

The moral of the story here is that no matter how much testing you do, it’s the edge cases that will get you: the infinite number of human decisions combined with environmental conditions that are totally impossible to predict. No one ever thought to consider the effect of thermal shock on a hydraulic valve, but it happened. I’m not suggesting that testing is pointless, just that there is no such thing as perfect testing. The only response is to investigate and to make changes to prevent it in the future. Software developers will get caught on the edge cases, but at least your website isn’t going to kill anyone.

Mayday S04E04 does a pretty good job of summarizing the Boeing 737 rudder issues, and Wikipedia has an article on the subject if you’re interested in learning more.

Migrating from Linux to Windows

Five years ago, I set up a home server on Debian 5. Since then it’s been upgraded to Debian 6, then Debian 7, been ported to newer hardware, and had drives added. Today I migrated it to Windows Server 2012 R2. After five years running on Linux, I had simply had enough. I’ll get into the why in a little bit, but first some how.


I didn’t have enough space anywhere to back up everything to external storage, so I did it on the same disk. The basic process was as follows:

  • Boot a Knoppix live DVD and use GNU Parted to resize the partition to make room for Windows
  • Install Windows, then install ext2fsd so I could mount the Linux partition
  • Move as much as I could from the Linux partition to Windows
  • Boot off Knoppix, shrink the Linux partition as much as possible
  • Boot back into Windows and expand that partition into the reclaimed space
  • Repeat until there was no EXT3 left

Needless to say it took a while, but it worked pretty well. Ext2fsd is pretty slick; I use it when I have to dual-boot as well, along with the NTFS driver on Linux to access my Windows partitions.


So now on to the why. Why would someone ever stop using Linux, when Linux is obviously “better”, right? The fact is, Linux hasn’t really matured in five years. Debian (and its illegitimate child, Ubuntu) is a major distro, but still there are no good remote/central management or instrumentation tools, file system ACLs are kludgy and not enabled by default, single sign-on and integrated authentication is non-existent or inconsistently implemented, and distros still can’t agree on fundamental things like how to configure the network. It’s basically a mess and I got tired of battling with it. Some of my specific beefs are:

  • VBox remote management is junk, Hyper-V is better in that space. VBox has a very low footprint and excellent memory management, but remote administering it is not a strength.
  • Remote administration in general is better in Windows. An SSH prompt is not remote administration; being able to command a farm of web servers to restart in sequence, waiting for one to finish before starting the next, that is remote administration. Cobbling something together in perl/python/bash/whatever to do it is not the same. Also nagios is terrible. Yes, really, it is. Yes it is.
  • Linux printing is pretty terrible. I had CUPS set up, and had to use it in raw mode with the Windows driver on my laptop. Luckily Windows supports IPP; I never got print sharing going with Samba.
  • Just a general feeling of things being half finished and broken. Even in gparted, you have to click out of a text box before the OK button is enabled. Windows GUIs are better, and if you’re a keyboard jockey like me, you can actually navigate the GUI really easily with some well-established keyboard shortcuts (except for Chrome, which is its own special piece of abhorrent garbage).

Linux is a tool, and like any tool it has its uses. I still run Debian VMs in Hyper-V for playing with PHP and Python, but even then, after development I would probably run these apps on Windows servers.


RavenDB operations practices – Part 1

At The Network, we use RavenDB, a document database written in C#. You can think of it as being similar to MongoDB, but with an HTTP REST API. I’m not a developer at The Network, but I am responsible for operations of the GRC Suite, and this includes RavenDB. What follows is a collection of my notes and experience running RavenDB.

Running under IIS

We run RavenDB as an IIS application. It simplifies management and provides great logs and performance data.

Placement of RavenDB data

Do not keep RavenDB data under the website directory.

The default location of RavenDB data is ~\Databases, which places the data under the website. The problem with this is that when IIS detects a large number of changed static files (such as when you are updating indexes), it will recycle the application pool, and that causes all the databases to unload. So for this reason, we keep all our data, including the system database, outside the application directory.
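In web.config this is a single setting; a sketch, with an example path (anything outside the site root works), using the Raven/DataDir key:

```xml
<!-- keep data out of the web root so file churn can’t trigger a recycle -->
<add key="Raven/DataDir" value="D:\RavenData\System" />
```

Tenant databases follow the same principle: keep their data directories off the website path too.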

Application pool settings

Do not let anything recycle the Application Pool

This means:

  • Disable automatic AppPool recycling (by time and memory usage)
  • Disable the WAS ping
  • Disable the idle timeout

Any of these settings, if left enabled, will cause the AppPool to recycle, and a recycle unloads all your databases.

Do not enable more than one worker process

One worker process is the default setting and must not be changed. If there is more than one worker process, the processes will fight for access to the ESENT database.

Disable overlapping recycle

This is a great feature for websites, since it effectively lets the new process start handling requests while the old process is still completing existing ones. For RavenDB it’s a bad thing, for the same reason as enabling more than one worker process. You want to avoid an AppPool recycle, but if one happens, you don’t want overlapping.

Disable the shutdown time limit

Or at least increase it from the default of 90 seconds. This setting tells IIS to kill a process if it hasn’t responded to a shutdown before the time limit expires. When shutting down, Raven is cleanly stopping and unloading databases. If the process is killed, starting a DB will require recovery (done automatically) which just slows down the startup process.
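All of the pool settings above can be scripted with appcmd instead of clicked through IIS Manager. A sketch, assuming the AppPool is named RavenDB (adjust the name and the timeouts to taste):

```
REM Run from an elevated prompt; "RavenDB" is an assumed AppPool name.
set APPCMD=%windir%\system32\inetsrv\appcmd.exe

REM no scheduled or memory-based recycling
%APPCMD% set apppool "RavenDB" /recycling.periodicRestart.time:00:00:00
%APPCMD% set apppool "RavenDB" /recycling.periodicRestart.privateMemory:0
REM no WAS ping, no idle timeout, a single worker process
%APPCMD% set apppool "RavenDB" /processModel.pingingEnabled:false
%APPCMD% set apppool "RavenDB" /processModel.idleTimeout:00:00:00
%APPCMD% set apppool "RavenDB" /processModel.maxProcesses:1
REM no overlapping recycle, and a generous shutdown window
%APPCMD% set apppool "RavenDB" /recycling.disallowOverlappingRotation:true
%APPCMD% set apppool "RavenDB" /processModel.shutdownTimeLimit:00:20:00
```

Scripting these also makes it easy to keep every node in a farm configured identically.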

Backing up RavenDB

RavenDB includes a number of backup options, including an internal backup scheduler and an external tool called Smuggler. We have hundreds of databases and needed to take backups every two hours, so we decided to use our SAN to take snapshots.

RavenDB is backed by ESENT, which has ACID transactions and robust crash recovery. Taking snapshots can leave data in an inconsistent state, and the esentutl utility is used to clean up the DB. Three things must be done, in order:

  1. recover – which uses logs to return the DB to a clean shutdown state.
  2. repair – which cleans up corruption. Running repair before recover will result in data loss.
  3. defrag – compacts the database and also repairs indexes
REM 1. recover: replay the logs (base name RVN) to reach a clean shutdown state
C:\Windows\system32\esentutl.exe /r RVN /l logs /s system /i
REM 2. repair: clean up any remaining corruption
C:\Windows\system32\esentutl.exe /p /o Data
REM 3. defrag: compact the database and rebuild indexes
C:\Windows\system32\esentutl.exe /d Data

Part 2

I’m currently working on clustering, sharding and authentication for RavenDB. I’ll post a part 2 when those are figured out.