AD authentication for RavenDB

RavenDB 2.5.2750 on IIS 7.0. Wow, I’ve started a lot of posts in the RavenDB Google group like that. This is what I’ve learned about RavenDB AD authentication.

  1. You need a valid commercial license.
  2. You need to enable both Windows authentication and Anonymous.
  3. Your web.config must have Raven/AnonymousAccess set to None.
  4. Users need explicit access to the <system> DB to create databases (access to all databases won’t cut it).
  5. Putting users in Backup Operators allows backups (who knew).
  6. Local admins are always admins!

I’m not going to cover the commercial license bit, that’s easy enough.

Authentication modes

It seems obvious that you would enable Windows Authentication and disable Anonymous in IIS. Turns out, this is not the case. According to Oren:

Here is what happens.
We make a Windows Auth request to get a single-use token, then we make another request with that token, which is the actual request.
The reason we do this (for streaming as well as a bunch of other stuff) is that Windows auth can do crazy things like replay requests, and then there is payload to consider.
You still keep Windows auth enabled, so that IIS will handle dealing with the creds, but Raven will set the header.

The web.config

The only thing that needs to be set in here is:
    <add key="Raven/AnonymousAccess" value="None" />
You can also set this to bypass authorization if you’re local to the box:
    <add key="Raven/AllowLocalAccessWithoutAuthorization" value="True" />

Permissions

Permissions are set in the system database -> Settings -> Windows Authentication. There are user and group tabs. Once you’ve added a group, push the little green plus icon to add new DBs to that user/group.
[Screenshot: group permissions in the Windows Authentication settings]
In Raven, all the DB permissions live in the system DB, not the DB itself.

Gotchas

Local admins

Yeah, that’s a few weeks we’ll never get back. Regardless of domain group membership, local admins on the server always get admin access to Raven. That means if you put contoso\Everyone into SERVER\Administrators, then everyone in contoso gets admin access. Surprise!

Backup Operators

This is another loosely documented feature. If you want non-admin users to be able to take backups, make them Backup Operators. Seems obvious, but it’s not written down.

Testing

Raven has a special URL, https://ravendb.example.com/debug/user-info, which will present an authentication challenge and report the user’s permissions. You’ll get something like:
{"Remark":"Using anonymous user","User":null,"IsAdminGlobal":false,"IsAdminCurrentDb":false,"Databases":null,"Principal":null,"AdminDatabases":null,"ReadOnlyDatabases":null,"ReadWriteDatabases":null,"AccessTokenBody":null}

Why I like SNI

This is the case I made to reduce the complexity of multihomed web servers by running a multi-tenant configuration with SNI. Yes, I glossed over some areas, but the problems I describe and the solution I propose are accurate.

When the internet was young, the topology was simple. A browser connected to a host, requested a resource, and the host returned it.
Hosting multiple sites on a single server meant assigning multiple IP addresses to that server. While this solution functions, it quickly becomes unsustainable.
Modern web applications depend on external resources like databases and storage. A server with multiple IP addresses may use any one of them to initiate connections. If even one of those addresses is misconfigured or blocked by an internal firewall, that address will not be able to reach the requested resource. The result can be frustrating, intermittent connectivity issues that are difficult to diagnose. Worse still, there is no set standard for tracking static IP addresses; it’s usually a spreadsheet. That means the IP you think you’re reserving for your website could already be in use by other servers, other hardware, or anything at all. The complexity is compounded by adding more servers and environments. No matter how careful you are with tracking and change management, the risk is real, and managing it is cumbersome and time-consuming.
Fortunately, HTTP/1.1 helped resolve that problem. By adding a Host header to the request, a server could run multiple sites on a single IP and let the web server direct traffic to the correct application.
It’s just not always that easy. Unencrypted traffic is inherently insecure, so SSL, and now TLS, is used to encrypt traffic between the client and server. But the HTTP payload, including the Host header, is encrypted, which means it can’t be used to direct traffic to the correct application. The host doesn’t know where to direct the request. What can be done? Certainly you want to avoid all the risk, complexity, and chaos of managing multi-homed servers, but the data still needs to be secure.
The answer is Server Name Indication or SNI. SNI adds the server name field to the TLS handshake so that the destination server knows precisely for which virtual host the traffic is intended. The server must still present a trusted certificate for the requested host, but it can now support multiple hosts. The resulting simplified network topology reduces the risk that a misconfiguration could impact access to internal resources.
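If it helps to see the mechanics, here is a rough client-side sketch in PowerShell (on .NET 4.5 or later, which sends the target host name as SNI during the handshake; the host name here is just an example):

$client = New-Object System.Net.Sockets.TcpClient("www.example.com", 443)
$ssl = New-Object System.Net.Security.SslStream($client.GetStream())
$ssl.AuthenticateAsClient("www.example.com")   # this host name goes into the ClientHello as SNI
$ssl.RemoteCertificate.Subject                 # the certificate the server chose for that host
$ssl.Dispose(); $client.Close()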
SNI is a widely adopted proposed standard from the IETF, originally defined in RFC 3546 and updated in RFC 6066. It is supported by modern operating systems, browsers, web servers, and network appliances.

 
It is not supported by Internet Explorer running on Windows XP. Microsoft wrote its own implementation of SSL/TLS (SChannel), which is good because Windows was not affected by Heartbleed or other vulnerabilities in OpenSSL, but it’s also a hindrance because the company chose not to backport the feature to a decade-old version of Windows. Windows XP users can use Firefox or Chrome, which ship their own TLS stacks that support SNI, or upgrade to Windows Vista or higher, which does support SNI. There is simply no support for IE on XP, a 13-year-old, outdated and unsupported operating system.
Hope is not lost, however. Using a modern load balancer, web applications can present multiple IP addresses to the public internet while using SNI for inside communication. This solution balances the need to support the widest possible user base without hampering developers and operations teams with an unwieldy and complex network topology.

The problem with moar testing

I’m going to start this off with a story, and when it’s done, we’ll come back to software testing.

The Boeing 737 is the most popular passenger aircraft in the world. The type first flew in 1967, and with four major revisions, is still being produced and flown today.

On March 3, 1991, United Airlines Flight 585, a 737-200, crashed in Colorado Springs, killing 25 people. The plane was on final approach into Colorado Springs when it suddenly banked hard, pitched nose down and plunged into the ground at 245 MPH. Following the tragedy, the NTSB sifted through the rubble but was unable to determine the cause.

On September 8, 1994, USAir Flight 427, a 737-300, crashed near Pittsburgh, PA, killing 132 people. Again the NTSB, the premier accident investigation agency in the world, was stumped. Boeing 737s were falling out of the sky, and over 150 people were dead. On June 9, 1996, a third 737, Eastwind Airlines Flight 517, experienced the same hard bank as the other two planes. Miraculously, the pilot was able to recover the plane and land safely. The break for investigators was that the plane was intact and could be examined.

The investigators focused on a piece of hydraulic equipment called the Power Control Unit (PCU), which is responsible for controlling the rudder in the plane’s vertical stabilizer. Undamaged, the unit was put through the standard battery of tests and performed flawlessly. Ultimately, a test was performed where the unit was chilled to -40° and tested using heated hydraulic fluid. Finally the unit jammed, which, had it been installed in an aircraft, would have pushed the rudder to its blowdown limit and crashed the plane.

The story here, though, isn’t the investigation or the tragic loss of life, but rather the story of all the testing that DID happen. The PCU had passed all its individual tests: operating under load, under certain temperatures, number of duty cycles, etc.; what developers would call “unit tests”. The PCU had also functioned properly during the mandatory pre-flight tests; what developers would call “integration tests”. Finally, the PCU had performed thousands of take-offs and landings on Flight 585 prior to its crash; what developers would call in-production testing.

The airplane was built by Boeing, and it took a year from the prototype rolling out to the plane getting its type certification from the FAA. The PCU is built by Parker Hannifin, an industry leader in aerospace and hydraulics. So here is an airplane, built by established industry leaders, the model tested for a year and totally unchanged, subjected to a battery of individual component tests, flown for almost a decade without incident, and yet it still fell out of the sky. How? Simple. The edge case. Very hot fluid hitting a very cold part wasn’t something engineers expected, and it didn’t happen often, so it was never tested.

How does this apply to software development? Well, for starters, try telling product owners or company owners that the design is going to freeze for a year for extensive testing, and then will only change once per decade after that. Good luck selling it. Somehow software developers (and IT engineers) are expected to meet or exceed the level of testing reserved for the aerospace industry, and maintain a nearly constant rate of change, even when all that testing still doesn’t fully prevent plane crashes.

The moral of the story here is that no matter how much testing you do, it’s the edge cases that will get you: the infinite number of human decisions combined with environmental conditions that are totally impossible to predict. No one ever thought to consider the effect of thermal shock on a hydraulic valve, but it happened. I’m not suggesting that testing is pointless, just that there is no such thing as perfect testing. The only response is to investigate and to make changes to prevent it in the future. Software developers will get caught on the edge cases, but at least your website isn’t going to kill anyone.

Mayday S04E04 does a pretty good job of summarizing the Boeing 737 rudder issues, and Wikipedia has an article on the subject if you’re interested in learning more.

Migrating from Linux to Windows

Five years ago, I set up a home server on Debian 5. Since then it’s been upgraded to Debian 6, then Debian 7, been ported to newer hardware and had drives added. Today I migrated it to Windows Server 2012 R2. After five years running on Linux, I had simply had enough. I’ll get into the why in a little bit, but first, some how.

 

I didn’t have enough space anywhere to backup everything to external storage, so I did it on the same disk. The basic process was as follows:

  • Boot a Knoppix live DVD and use GNU Parted to resize the partition to make room for Windows
  • Install Windows, then install ext2fsd so I could mount the Linux partition
  • Move as much as I could from the Linux partition to Windows
  • Boot off Knoppix, shrink the Linux partition as much as possible
  • Boot back into Windows and expand that partition into the reclaimed space (see the sketch after this list)
  • Repeat until there was no EXT3 left
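For the Windows-side expansion step, the Storage cmdlets in Server 2012 R2 can do it without the GUI. A minimal sketch, assuming the data volume is D: and the freed space sits immediately after it:

# Grow D: into the space reclaimed from the shrunken Linux partition (Storage module, Server 2012/Win8+)
$max = (Get-PartitionSupportedSize -DriveLetter D).SizeMax
Resize-Partition -DriveLetter D -Size $max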

Needless to say it took a while, but it worked pretty well. Ext2fsd is pretty slick; I use it when I have to dual-boot as well, along with the NTFS driver on Linux to access my Windows partitions.

 

So now on to the why. Why would someone ever stop using Linux, when Linux is obviously “better”, right? The fact is, Linux hasn’t really matured in five years. Debian (and its illegitimate child, Ubuntu) is a major distro, but there are still no good remote/central management or instrumentation tools, file system ACLs are kludgy and not enabled by default, single sign-on and integrated authentication is non-existent or randomly implemented, and distros still can’t agree on fundamental things like how to configure the network. It’s basically a mess and I got tired of battling with it. Some of my specific beefs are:

  • VBox remote management is junk; Hyper-V is better in that space. VBox has a very low footprint and excellent memory management, but remotely administering it is not a strength.
  • Remote administration in general is better in Windows. An SSH prompt is not remote administration; being able to command a farm of web servers to restart in sequence, waiting for one to finish before starting the next, that is remote administration. Cobbling something together in Perl/Python/Bash/whatever to do it is not the same. Also, Nagios is terrible. Yes, really, it is. Yes it is.
  • Linux printing is pretty terrible. I had CUPS set up, but had to use it in raw mode with the Windows driver on my laptop. Luckily Windows supports IPP; I never got print sharing going with Samba.
  • Just a general feeling of things being half finished and broken. Even in GParted, you have to click out of a text box before the OK button is enabled. Windows GUIs are better, and if you’re a keyboard jockey like me, you can actually navigate the GUI really easily with some well-established keyboard shortcuts (except for Chrome, which is its own special piece of abhorrent garbage).

Linux is a tool, and like any tool it has its uses. I still run Debian VMs in Hyper-V for playing with PHP and Python, but even then, after development I would probably run these apps on Windows servers.

 

RavenDB operations practices – Part 1

At The Network, we use RavenDB, a document database written in C#. You can think of it as being similar to MongoDB, but with an HTTP REST API. I’m not a developer at The Network, but I am responsible for operations of the GRC Suite, and this includes RavenDB. What follows is a collection of my notes and experience running RavenDB.

Running under IIS

We run RavenDB as an IIS application. It simplifies management and provides great logs and performance data.

Placement of RavenDB data

Do not keep RavenDB data under the website directory.

The default location of RavenDB data is ~\Databases, which places the data under the website. The problem with this is that when IIS detects a large number of changed static files (such as when you are updating indexes), it will recycle the application pool, and that causes all the databases to unload. So for this reason, we keep all our data, including the system database, outside the application directory.
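The system database location is controlled by the Raven/DataDir setting in web.config. A hedged example (the path is just a placeholder, and tenant databases can be pointed at their own path when you create them):
    <add key="Raven/DataDir" value="D:\RavenData\System" />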

Application pool settings

Do not let anything recycle the Application Pool

This means:

  • Disable automatic AppPool recycling (by time and memory usage)
  • Disable the WAS ping
  • Disable the idle timeout

Any of these settings which causes the AppPool to recycle will cause all your databases to unload.
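Here is a rough sketch of those settings using the WebAdministration module (it assumes an application pool named "RavenDB"; the same knobs are available in IIS Manager under Advanced Settings):

Import-Module WebAdministration
$pool = "IIS:\AppPools\RavenDB"                                                 # assumed pool name
Set-ItemProperty $pool -Name recycling.periodicRestart.time -Value "00:00:00"   # no timed recycle
Set-ItemProperty $pool -Name recycling.periodicRestart.memory -Value 0          # no virtual-memory-based recycle
Set-ItemProperty $pool -Name recycling.periodicRestart.privateMemory -Value 0   # no private-memory-based recycle
Set-ItemProperty $pool -Name processModel.pingingEnabled -Value $false          # disable the WAS ping
Set-ItemProperty $pool -Name processModel.idleTimeout -Value "00:00:00"         # never idle out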

Do not enable more than one worker process

A single worker process is the default setting and must not be changed. If there is more than one worker process, the processes will fight for access to the ESENT database.

Disable overlapping recycle

This is a great feature for websites, since it effectively lets the new process start handling requests while the old process is still completing existing ones. For RavenDB it’s a bad thing, for the same reason as enabling more than one worker process. You want to avoid an AppPool recycle, but if it happens, you don’t want overlapping.

Disable the shutdown time limit

Or at least increase it from the default of 90 seconds. This setting tells IIS to kill a process if it hasn’t responded to a shutdown before the time limit expires. When shutting down, Raven is cleanly stopping and unloading databases. If the process is killed, starting a DB will require recovery (done automatically) which just slows down the startup process.
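Continuing the sketch above, the matching settings for the worker process count, overlapping recycle, and the shutdown limit look roughly like this:

Set-ItemProperty $pool -Name processModel.maxProcesses -Value 1                    # one worker process only
Set-ItemProperty $pool -Name recycling.disallowOverlappingRotation -Value $true    # no overlapping recycle
Set-ItemProperty $pool -Name processModel.shutdownTimeLimit -Value "00:10:00"      # give Raven time to unload cleanly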

Backing up RavenDB

RavenDB includes a number of backup options, including an internal backup scheduler and an external tool called Smuggler. We have hundreds of databases and needed to take backups every two hours, so we decided to use our SAN to take snapshots.

RavenDB is backed by ESENT, which has ACID transactions and robust crash recovery. Taking snapshots can leave data in an inconsistent state, and the esentutl utility is used to clean up the DB. Three things must be done, in order:

  1. recover – which uses logs to return the DB to a clean shutdown state.
  2. repair – which cleans up corruption. Running repair before recover will result in data loss.
  3. defrag – compacts the database and also repairs indexes.
C:\Windows\system32\esentutl.exe /r RVN /l logs /s system /i
C:\Windows\system32\esentutl.exe /p /o Data
C:\Windows\system32\esentutl.exe /d Data

Part 2

I’m currently working on clustering, sharding and authentication for RavenDB. I’ll post a part 2 when those are figured out.

Splunk user agent string lookups with TA-browscap_express

I got a requirement to find out what browsers our clients are using. We run a SaaS product, and every client is clientname.ourdomain.com, so I could use the cs_hostname field in the log. Using a 3rd-party analytics tool was totally out of the question; all I had to go on were the IIS logs.

We’re already getting the IIS logs into Splunk, so with a bit of Googling I found the TA-browscap app by Dave Shpritz. It’s powered by the browscap project and it works. The problem is that the browscap file is now 18MB and searching it has become very slow. What started as a hack to cache matches in a separate file has turned into a total fork and rewrite of most of the app, and has become TA-browscap_express.

There are installation instructions on the application page at Splunk.com and in the GitHub repo, so I won’t rehash them here.

The Browscap file

The Browser Capabilities Project (browscap) is an effort to identify all known User Agent (UA) strings, which regrettably are a total mess. The project is active, and the data is accurate. They provide the data in a number of formats: the legacy INI file still used by PHP and ASP, a CSV file, and others. The CSV file is 18MB and 58,000 lines long.

The structure of the file is a name pattern for a UA string, followed by all the known properties. My UA string is:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0

And the matching name pattern is:

Mozilla/5.0 (*Windows NT 6.1*WOW64*) Gecko* Firefox/31.0*

The example above matches Firefox 31 on Windows 7 x64. Now here is an interesting challenge:

Mozilla/5.0 (*Windows NT 6.1*) Gecko* Firefox/31.0*

This name pattern matches Firefox 31 on Windows 7 x86. It also matches x64. If you take the first match, you’ll get the wrong information. To get an accurate lookup, you need to compare all 58,000 name patterns, and the longest one that matches is the most correct. As you can imagine, this is quite a challenge.
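To make the longest-match rule concrete, here’s a rough PowerShell illustration using the two patterns above (the real app is Python; this is just the idea):

$ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0'
$patterns = @(
    'Mozilla/5.0 (*Windows NT 6.1*WOW64*) Gecko* Firefox/31.0*',   # the x64 entry
    'Mozilla/5.0 (*Windows NT 6.1*) Gecko* Firefox/31.0*'          # the x86 entry, which also matches x64 UAs
)
# -like treats * as a wildcard; keep every pattern that matches, then take the longest (most specific) one
$patterns | Where-Object { $ua -like $_ } |
    Sort-Object Length -Descending |
    Select-Object -First 1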

Parsing the browscap.csv file

The TA-browscap app uses pybrowscap, which is a Python library for parsing and managing the browscap.csv file. The library returns an object with properties for all the fields in the browscap file. I didn’t want to check 58,000 name patterns every time, so I wanted the successful pattern as well. pybrowscap doesn’t provide it, and it’s actually hard to re-create because it uses Python’s built-in CSV parser.

The solution was to lift the core logic from pybrowscap and rewrite it myself, parsing strings as CSV data instead of files. The first thing you have to do is convert the name pattern into a regex, which is easy, then compare your challenge string against it. As I described above, after that you loop through every name pattern until you find the longest match.
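The wildcard-to-regex conversion itself is only a couple of lines; a sketch of the approach (not the app’s actual Python code):

$pattern = 'Mozilla/5.0 (*Windows NT 6.1*WOW64*) Gecko* Firefox/31.0*'
# Escape everything, then turn the browscap wildcards back into their regex equivalents
$regex = '^' + ([regex]::Escape($pattern) -replace '\\\*', '.*' -replace '\\\?', '.') + '$'
[regex]::IsMatch('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', $regex)   # True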

Knowing what to cache

The last entry in the file is “*”, which will match anything. It returns a set of properties called “Default Browser” where everything is false. The idea is that you’ll always get some response; you won’t get null. I didn’t want to cache these “Generic” or “Default” browsers, because once they’re in the cache, they’ll come up for every new UA string, and the data will be junk.

How it works

The app (TA-browscap_express) caches matched UA strings in a file and searches it first. During a query it also keeps matches in memory, using the memory cache before the disk cache. It also supports blacklisting obviously bad UA strings and storing the cache file on a network share to help with distributed search.

When a UA string is passed into the app, it runs through the following checks, in order (there’s a rough sketch of this flow after the list):

  1. Is it blacklisted? If yes, return default browser.
  2. Is it in the memory cache? If yes, return the entry.
  3. Is it in the browscap_lite.csv file? If yes, add to memory cache and return.
  4. Is it in browscap.csv? If yes, add to browscap_lite.csv, the memory cache and return.
  5. If totally unidentifiable, return the default browser.
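
A sketch of that flow (the helper names and variables here are hypothetical stand-ins, not the app’s real Python functions):

# Sketch only: Find-InBrowscapCsv, Add-ToLiteCache, $blacklist, $memoryCache and $DefaultBrowser are illustrative
function Resolve-UserAgent($ua) {
    if ($blacklist -contains $ua)      { return $DefaultBrowser }     # 1. blacklisted, don't bother looking
    if ($memoryCache.ContainsKey($ua)) { return $memoryCache[$ua] }   # 2. matched earlier in this search
    $hit = Find-InBrowscapCsv $ua 'browscap_lite.csv'                 # 3. on-disk cache of past matches
    if (-not $hit) {
        $hit = Find-InBrowscapCsv $ua 'browscap.csv'                  # 4. the full 58,000-line file
        if ($hit) { Add-ToLiteCache $hit }                            #    remember it for next time
    }
    if (-not $hit) { return $DefaultBrowser }                         # 5. unidentifiable
    $memoryCache[$ua] = $hit
    return $hit
}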

 

Browscap_lite.csv

The browscap_lite.csv file is the cache file which is checked before browscap.csv. It’s in the same format, and has the same fields. Matched UA strings are written to it.

The default location for the file is in the application\bin directory for the app. In a Splunk distributed environment, that’s not really a good idea. You never know which search head or indexer the app will run on, and you’ll end up rebuilding the cache over and over. The browscap_lookup.ini file lets you specify a different location for the file.

The blacklist

Some UA strings are junk and won’t ever be added to the browscap file. Others could be added, and I suggest you report any new strings to browscap.org, but it may take a while, or you may just not care. The blacklist.txt file is used to weed out garbage UA strings so that you don’t waste time trying to look them up in the browscap.csv file, only to get the “*” Default Browser.

 

The fields

The app returns all fields available in the browscap file. Two I find especially interesting are:

ua_comment – a combination of the browser name and version, so that Internet Explorer 11 becomes IE 11.

ua_platform – a combination of the operating system and version, so that Windows 7 becomes Win7.

 

There is one additional field which I added, ua_fromcache, which returns true, false, or blacklist, depending on where the data came from.

 

The demo

Sorry in advance for my droning voice, but if you want to see the setup and usage in action, check out my YouTube video.

 

Trusting self-signed certificates in Windows

I had a requirement for users on a computer to automatically trust that computer’s self-signed certificate. I’m generating and managing these certificates with PowerShell. Googling left me with two general themes:

  1. Use some 3rd party app
  2. Terrible idea! Depraved hackers will eat your brain!

I didn’t want to use a 3rd-party app, and I’m not scared of hackers. Eventually I got it working, so I’m sharing it with the world.

# Create the wildcard cert in the machine's Personal store
New-SelfSignedCertificate -DnsName "*.mydomain.com" -CertStoreLocation Cert:\LocalMachine\My
# Grab the cert we just created
$cert = Get-Item Cert:\LocalMachine\My\* | Where-Object {$_.Subject -eq "CN=*.mydomain.com"}
# Export it to a password-protected PFX...
Export-PfxCertificate -Cert $cert -FilePath "$ENV:TEMP\cert.pfx" -Password (ConvertTo-SecureString "somepassword" -AsPlainText -Force)
# ...and re-import it into TrustedPeople so the machine trusts it
Import-PfxCertificate -FilePath "$ENV:TEMP\cert.pfx" -Password (ConvertTo-SecureString "somepassword" -AsPlainText -Force) -CertStoreLocation Cert:\LocalMachine\TrustedPeople

New-SelfSignedCertificate is the cmdlet, but it won’t let you stick it in LocalMachine\TrustedPeople. This is why you need to export it as a PFX and re-import it. When you import it, you can put it in TrustedPeople.
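
As a quick sanity check, you can confirm the certificate actually landed in TrustedPeople:

# $cert still holds the certificate created above
Get-ChildItem Cert:\LocalMachine\TrustedPeople | Where-Object { $_.Thumbprint -eq $cert.Thumbprint }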