Why I like SNI

This is the case I made for reducing the complexity of multihomed web servers by running a multi-tenant configuration with SNI. Yes, I glossed over some areas, but the problems I describe and the solution I propose are sound.

When the internet was young, the topology was simple. A browser connected to a host, requested a resource, and the host returned it.
Hosting multiple sites on a single server meant assigning multiple IP addresses to that server. While this solution functions, it quickly becomes unsustainable.
Modern web applications depend on external resources like databases and storage. A server with multiple IP addresses may use any one of them to initiate connections. If even one of those addresses is misconfigured or blocked by an internal firewall, connections made from it will never reach the requested resource. The result is frustrating, intermittent connectivity issues that are difficult to diagnose. Worse still, there is no standard for tracking static IP addresses; it’s usually a spreadsheet. That means the IP you think you’re reserving for your website could already be in use by other servers, other hardware, or anything at all. The complexity is compounded by adding more servers and environments. No matter how careful you are with tracking and change management, the risk is real, and managing it is cumbersome and time consuming.
Fortunately, HTTP/1.1 helped resolve that problem. By requiring a Host header in every request, it lets a server run multiple sites on a single IP address and lets the web server direct each request to the correct application.
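To make that concrete, here is a minimal sketch in Python of two HTTP/1.1 requests that differ only in their Host header; the IP address, port, and hostnames are placeholders for illustration, not real sites.

```python
import socket

def fetch(host_header: str, ip: str = "203.0.113.10", port: int = 80) -> bytes:
    # Build a bare HTTP/1.1 request by hand so the Host header is visible.
    request = (
        "GET / HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((ip, port)) as sock:
        sock.sendall(request)
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)

# Same IP address, two different sites: the Host header does the routing.
# fetch("sitea.example")
# fetch("siteb.example")
```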
It’s just not always that easy. Unencrypted traffic is inherently insecure, so SSL, and now TLS, is used to encrypt traffic between the client and the server. The HTTP payload, including the Host header, is encrypted, which means the server can’t read it until the TLS handshake is complete, and it can’t complete the handshake without knowing which site’s certificate to present. The host doesn’t know where to direct the request. What can be done? Certainly you want to avoid all the risk, complexity, and chaos of managing multihomed servers, but the data still needs to be secure.
The answer is Server Name Indication, or SNI. SNI adds a server name field to the TLS handshake so that the destination server knows precisely which virtual host the traffic is intended for. The server must still present a trusted certificate for the requested host, but it can now support multiple hosts on a single IP address. The resulting simplified network topology reduces the risk that a misconfiguration could impact access to internal resources.
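As a rough illustration, Python’s standard ssl module exposes this on the client side: the server_hostname argument below is what populates the SNI field in the handshake. The hostname here is a placeholder.

```python
import socket
import ssl

hostname = "www.example.com"  # placeholder virtual host
context = ssl.create_default_context()

with socket.create_connection((hostname, 443)) as raw_sock:
    # server_hostname fills in the server_name (SNI) extension of the
    # ClientHello, so the server can pick the matching certificate before
    # any encrypted HTTP data is exchanged.
    with context.wrap_socket(raw_sock, server_hostname=hostname) as tls_sock:
        print(tls_sock.version())
        print(tls_sock.getpeercert()["subject"])
```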
SNI is a widely adopted proposed standard from the IETF, originally defined in RFC 3546 and carried forward in RFC 6066. It is supported by modern operating systems, browsers, web servers, and network appliances.

It is not supported by Internet Explorer running on Windows XP. Microsoft wrote its own implementation of SSL/TLS, which is good because Windows was not affected by Heartbleed or other vulnerabilities in OpenSSL, but it’s also a hindrance because the company chose not to backport SNI support to a decade-old version of Windows. Windows XP users can switch to Firefox or Chrome, which ship their own TLS stacks that support SNI, or upgrade to Windows Vista or higher, which does support it. There is simply no support for IE on XP, a 13-year-old, outdated, and unsupported operating system.
Hope is not lost, however. Using a modern load balancer, web applications can present multiple IP addresses to the public internet while using SNI for internal communication. This solution balances the need to support the widest possible user base without hampering developers and operations teams with an unwieldy and complex network topology.
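For the internal side, here is a minimal sketch of how one endpoint can serve several hostnames from a single address by switching certificates on the SNI value, using Python’s ssl.SSLContext.sni_callback. The hostnames, certificate files, and port are made up for illustration; a real deployment would of course use a proper web server or load balancer.

```python
import socket
import ssl

def make_context(certfile: str, keyfile: str) -> ssl.SSLContext:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile, keyfile)
    return ctx

# One certificate per virtual host, all served from one address (placeholder files).
contexts = {
    "sitea.example": make_context("sitea.pem", "sitea.key"),
    "siteb.example": make_context("siteb.pem", "siteb.key"),
}
default_ctx = contexts["sitea.example"]

def choose_cert(ssl_sock, server_name, _ctx):
    # server_name is the SNI value from the ClientHello; swap in the matching
    # context so the correct certificate is presented during the handshake.
    if server_name in contexts:
        ssl_sock.context = contexts[server_name]

default_ctx.sni_callback = choose_cert

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("0.0.0.0", 8443))
    listener.listen()
    conn, _addr = listener.accept()
    with default_ctx.wrap_socket(conn, server_side=True) as tls_conn:
        # tls_conn now carries whichever certificate choose_cert selected.
        tls_conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
```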

The problem with moar testing

I’m going to start this off with a story, and when it’s done, we’ll come back to software testing.

The Boeing 737 is the most popular passenger aircraft in the world. The type first flew in 1967, and with four major revisions, is still being produced and flown today.

On March 3, 1991, United Airlines Flight 585, a 737-200, crashed in Colorado Springs, killing 25 people. The plane was on final approach into Colorado Springs when it suddenly banked hard, pitched nose down, and plunged into the ground at 245 MPH. Following the tragedy, the NTSB sifted through the rubble but was unable to determine the cause.

On September 8, 1994, USAir Flight 427, a 737-300, crashed near Pittsburgh, PA, killing 132 people. Again the NTSB, the premier accident investigation agency in the world, was stumped. Boeing 737s were falling out of the sky, and over 150 people were dead. On June 9, 1996, a third 737, Eastwind Airlines Flight 517, experienced the same hard bank as the other two planes. Miraculously, the pilot was able to recover the plane and land safely. The break for investigators was that the plane was intact and could be investigated.

The investigators focused on a piece of hydraulic equipment called the Power Control Unit (PCU), which is responsible for controlling the rudder in the plane’s vertical stabilizer. Undamaged, the unit was put through the standard battery of tests and performed flawlessly. Ultimately, a test was performed in which the unit was chilled to -40 degrees and then fed heated hydraulic fluid. Finally the unit jammed, a failure which, had it happened in flight, would have pushed the rudder to its blowdown limit and crashed the airplane.

The story here, though, isn’t the investigation or the tragic loss of life, but rather the story of all the testing that DID happen. The PCU had passed all its individual tests: operating under load, at various temperatures, through a set number of duty cycles, and so on; what developers would call “unit tests”. The PCU had also functioned properly during the mandatory pre-flight tests; what developers would call “integration tests”. Finally, the PCU had performed thousands of take-offs and landings on the aircraft flying as Flight 585 prior to its crash; what developers would call in-production testing.

The airplane was built by Boeing, and it took a year from the prototype rolling out to the plane receiving its type certification from the FAA. The PCU is built by Parker Hannifin, an industry leader in aerospace and hydraulics. So here is an airplane, built by established industry leaders, the model tested for a year and then essentially unchanged, subjected to a battery of individual component tests, flown for almost a decade without incident, and yet it still fell out of the sky. How? Simple. The edge case. Very hot fluid flowing into a very cold part wasn’t something the engineers expected, and it didn’t happen often, so it was never tested.

How does this apply to software development? Well, for starters, try telling product owners or company owners that the design is going to freeze for a year for extensive testing, and then will only change once per decade after that. Good luck selling it. Somehow software developers (and IT engineers) are expected to meet or exceed the level of testing reserved for the aerospace industry, and maintain a nearly constant rate of change, even when all that testing still doesn’t fully prevent plane crashes.

The moral of the story here is that no matter how much testing you do, it’s the edge cases that will get you: the infinite number of human decisions, combined with environmental conditions, that are totally impossible to predict. No one ever thought to consider the effect of thermal shock on a hydraulic valve, but it happened. I’m not suggesting that testing is pointless, just that there is no such thing as perfect testing. The only response is to investigate failures and make changes to prevent them in the future. Software developers will get caught by edge cases, but at least your website isn’t going to kill anyone.

Mayday S04E04 does a pretty good job of summarizing the Boeing 737 rudder issues, and Wikipedia has an article on the subject if you’re interested in learning more.