Load testing fluentd with wrk2 and OpenResty

We’ve written some complicated transform rules for fluentd to add fields to the record, serialize it with Apache Avro, and dump it all into Kafka. That was fine, until some service owners said “Hey, we don’t want to learn this Avro stuff, just let us write to fluentd over UDP.” Management decided we should load test fluentd, so the ticket I got was:


“The test ideally should be UDP events sent to localhost at a rate of 1,500, 3,000 and 10,000 events per second.”


Great, so how do I manage that? A colleague told me about wrk2, which generates HTTP requests at precise rates, is fast, and has a low system impact. The thing is, it ONLY generates HTTP GET requests, and it’s written in C, which I wasn’t about to bother learning for this one-off effort.


Enter OpenResty. If you’re not familiar, OpenResty is nginx with a Lua interpreter built in, so you can write Lua scripts to handle requests. There are a lot of things OpenResty is wrong for, but this was one I thought it could do well: it’s fast, event-driven, and efficient. What does that have to do with “UDP events sent to localhost”? Well, I wrote just enough Lua to do that, based on the GET requests generated by wrk2. The Lua script is here if you want to have a look. It returns 503 if it can’t send the request. In the code, you’ll see a reference to “somefile.json”; that’s the mock payload which is sent to fluentd.
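The per-request logic is tiny: read the mock payload, fire it at fluentd as a UDP datagram, report failure if the send doesn’t go through. Here’s a sketch of that idea in Python for illustration (the port 5160 and the status-code mapping are assumptions, not the real config):

```python
import json
import socket

# Sketch of the per-request logic, assuming fluentd listens for UDP on
# localhost:5160 (hypothetical port); the payload comes from somefile.json.
def send_event(payload: bytes, addr=("127.0.0.1", 5160)) -> bool:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, addr)   # one datagram per incoming GET
        return True                  # the Lua handler returns 200 here
    except OSError:
        return False                 # ...and 503 here
    finally:
        sock.close()
```

Note that UDP is fire-and-forget: a successful sendto only means the kernel accepted the datagram, which is exactly why counting what comes out of Kafka (below) was the real verification.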


I started wrk2, generating 1,500 requests a second, with 10 threads and 10 connections to nginx, for 10 minutes:

./wrk -t10 -c10 -d600s -R1500 http://localhost:10082/udp


The results were interesting. fluentd is single-threaded, so it ran its core at 100% while it read UDP packets out of the kernel buffer, but it didn’t drop a single one. When I ran the test for 60 seconds, it generated 90,000 events, and sure enough I got 90,000 Kafka messages out the other end of the pipeline. Way to go, Ruby!



Upgrading chef server 11 to version 12

This is the story of how we upgraded Chef server 11 to Chef server 12, open source, on CentOS 6.


Step 0: Work off a snapshot of your server. I can’t stress this enough. You’re going to have issues. Use a snapshot.

The process we followed breaks down into three basic phases:

  1. Upgrade to the latest Chef Server 11
  2. Delete or fix old recipes
  3. Run the chef 12 upgrade

Because the old recipes that need fixing aren’t discovered until you attempt the process, I’ll lay out our challenges and solutions at the end.

Upgrade to latest Chef server 11

This was pretty easy. There is a link to the process here. Since they don’t include an example, the command to install the RPM is:

rpm -Uvh --nopostun /path/to/chef-server-<version>.rpm

After this, just follow their instructions. It went smoothly without issues, so I won’t rehash it.

Chef 12 upgrade

Chef 12 added a new feature allowing multiple organizations on a single server. The upgrade will prompt you to create an organization, and all your cookbooks, users, nodes, etc. will be re-created in it. We only have one organization, and it looks like knife and chef-client just “do the right thing”.

The upgrade documentation is here. When you run chef-server-ctl upgrade, the following happens:

  1. Chef 11 is started, and Chef 12 stopped
  2. All objects (cookbooks, roles, nodes, users, etc.) are downloaded
  3. Cookbooks are checked and updated as necessary. Undocumented, hard-to-troubleshoot exceptions are thrown.
  4. Chef 11 is stopped, and Chef 12 started
  5. All objects are restored using knife restore. More exceptions, probably.
  6. The upgrade is complete, but your server cert is not copied over, so all your nodes start alarming.

Troubleshooting the exceptions, which was the bulk of our challenges, is covered in a section below. For the server certificate, copy /var/opt/chef-server/nginx/ca/ to /var/opt/opscode/nginx/ca/.

Delete or fix old recipes

You don’t actually find out about a lot of these until you attempt the upgrade. One thing you can do is a knife download onto your workstation so that you can grep through and fix cookbooks. Add these two lines to your knife.rb:

repo_mode 'everything'
versioned_cookbooks true

name property in metadata.rb

Surprise! This is a requirement now. The download will check this and freak out if it’s not set. Remember how you downloaded all your cookbooks? Run this:

grep -rL "name" * --include="metadata.rb"

Of course, that doesn’t help you if you were using metadata.json. We mostly weren’t. If you are… good luck.
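If you want one pass that also catches metadata.json, a short script can walk the downloaded repo and flag offenders. This is a sketch, not part of the upgrade tooling; it assumes each cookbook directory contains a metadata.rb or metadata.json:

```python
import json
import os
import re

# Sketch: flag cookbook directories whose metadata lacks a name setting.
# The layout (metadata.rb / metadata.json per cookbook dir) is an assumption.
def cookbooks_missing_name(repo_root):
    missing = []
    for dirpath, _dirs, files in os.walk(repo_root):
        if "metadata.rb" in files:
            with open(os.path.join(dirpath, "metadata.rb")) as f:
                if not re.search(r"^\s*name\b", f.read(), re.M):
                    missing.append(dirpath)
        elif "metadata.json" in files:
            with open(os.path.join(dirpath, "metadata.json")) as f:
                if not json.load(f).get("name"):
                    missing.append(dirpath)
    return sorted(missing)
```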

Includes don’t work during migration

This was a fun one. Since the metadata.rb file is just blindly evaluated, it lets you do things like:

require          File.expand_path('../lib/chef/sugar/version', __FILE__)

During the migration, the cookbooks are parked in /tmp/somethingrandom/cookbooks, and since /tmp/somethingrandom/lib/chef/sugar/version doesn’t exist, the process blows up. Of course, it doesn’t actually tell you that. To be extra helpful, it deletes /tmp/somethingrandom before throwing the stack trace, leaving you utterly unable to diagnose the problem. I worked this out using pry; there is a whole subsection on that below.

Troubleshooting the upgrade is a hassle

You run this sort of magic command, chef-server-ctl upgrade, and it just does “things” for you. If you encounter an error and want more details, you’re kind of out of luck. There is no way to pass -VV through to knife, which it’s calling behind the scenes.

To work around this I installed auditd (‘yum install audit’ or ‘apt-get install auditd’) and ran with these rules:

-a exit,always -F arch=b64 -F euid=0 -S execve
-a exit,always -F arch=b32 -F euid=0 -S execve

This let me actually see what chef-server-ctl upgrade was running so I could step into a specific place. You’ll get some examples of how this helped below.

Downloading the cookbooks times out

The first part of the upgrade process is to download all the cookbooks. During the download you might experience: ERROR: Gateway Time-out

What’s happening behind the scenes is that chef-server-ctl upgrade has started a knife download, and that download times out. Knife supports resuming, but chef-server-ctl creates a random directory for every run, so knife never sees the existing files. It’s a good idea to specify a download directory when starting chef-server-ctl. Ultimately, the command we ran is:

chef-server-ctl upgrade --org-name operations --full-org-name " Operations" \
--user robertlabrie -d /tmp/chef11-data -e /tmp/chef12-data --yes

It won’t create those directories for you; make sure you create them first.

Pry is your friend

The download crashes, the upload crashes, the cookbook update crashes, it doesn’t tell you where or why, it deletes the temp directory, and since the work is done in parallel, the output doesn’t relate to what actually blew up. So what do you do? Well, if you were using auditd, you got some idea of the knife commands being run. Here is the one that does the uploads:

knife ec restore --skip-useracl --with-user-sql --concurrency 1 \
-c /tmp/knife-ec-backup-config.rb /tmp/chef12-data/ -VV

Now at least you can test the restore without having to run the whole pipeline, but it doesn’t help with the meaningless output from knife. This is where pry and pry-rescue come in. First, install the gems for Chef’s embedded Ruby:

/opt/opscode/embedded/bin/gem install pry pry-rescue

Then re-run knife using rescue instead of ruby:

/opt/opscode/embedded/bin/rescue /opt/opscode/embedded/bin/knife \
ec restore --skip-useracl --with-user-sql --concurrency 1 \
-c /tmp/knife-ec-backup-config.rb /tmp/chef12-data/ -VV

Now you get to do useful things like dump @inferred_cookbook_name and cookbook_version (if your metadata specified one) so that you can figure out what’s going on, or at least where it’s going on.

The end?

I’m writing this a few days after completing the project. There may be other gotchas, but the thing is working now and we’re back to pushing roles, nodes and cookbooks.


Why I like SNI

This is the case I made to reduce the complexity of multihomed web servers by running a multi-tenant configuration with SNI. Yes, I glossed over some areas, but the problems I describe and the proposed solution are correct.

When the internet was young, the topology was simple. A browser connected to a host, requested a resource, and the host returned it.
Hosting multiple sites on a single server meant assigning multiple IP addresses to that server. While this solution functions, it quickly becomes unsustainable.
Modern web applications depend on external resources like databases and storage. A server with multiple IP addresses may use any one of them to initiate connections. If even one of these addresses is misconfigured or blocked by an internal firewall, that address will not be able to reach the requested resource. The result can be frustrating, intermittent connectivity issues that are difficult to diagnose. Worse still, there is no set standard for tracking static IP addresses; it’s usually a spreadsheet. That means the IP you think you’re reserving for your website could already be in use by other servers, other hardware, or anything at all. The complexity is compounded by adding more servers and environments. No matter how careful you are with tracking and change management, this risk is real, and managing it is cumbersome and time-consuming.
Fortunately HTTP 1.1 helped resolve that problem. By adding the host to the request header, a server could run multiple sites on a single IP, and let the web server direct traffic to the correct application.
It’s just not always that easy. Unencrypted traffic is inherently insecure. SSL, and now TLS, is used to encrypt traffic between the client and server. The HTTP payload, including the Host header, is encrypted, which means it can’t be used to direct traffic to the correct application. The host doesn’t know where to direct the request. What can be done? Certainly you want to avoid all the risk, complexity, and chaos of managing multihomed servers, but the data still needs to be secure.
The answer is Server Name Indication or SNI. SNI adds the server name field to the TLS handshake so that the destination server knows precisely for which virtual host the traffic is intended. The server must still present a trusted certificate for the requested host, but it can now support multiple hosts. The resulting simplified network topology reduces the risk that a misconfiguration could impact access to internal resources.
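To make this concrete: on the client side, SNI is just a hostname handed to the TLS handshake. A minimal illustration in Python (example.test is a placeholder hostname):

```python
import socket
import ssl

# SNI in practice: the client names the virtual host it wants inside the
# TLS ClientHello, before any encrypted HTTP data is sent, so the server
# can pick the right certificate and virtual host.
def sni_socket(hostname: str) -> ssl.SSLSocket:
    ctx = ssl.create_default_context()
    # server_hostname is what gets sent as the SNI extension
    return ctx.wrap_socket(socket.socket(), server_hostname=hostname)
```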
SNI is a widely adopted proposed standard from the IETF, defined in RFC 3546 and RFC 6066. It is supported by modern operating systems, browsers, web servers, and network appliances.

It is not supported by Internet Explorer running on Windows XP. Microsoft wrote their own implementation of SSL, which is good because Windows was not affected by Heartbleed or other vulnerabilities in OpenSSL, but it’s also a hindrance because the company chose not to backport the feature to a decade-old version of Windows. Windows XP users can use Firefox or Chrome, which support SNI through their own SSL libraries, or upgrade to Windows Vista or higher, which does support SNI. There is simply no SNI support for IE on XP, a 13-year-old, outdated, and unsupported operating system.
Hope is not lost, however. Using a modern load balancer, web applications can present multiple IP addresses to the public internet while using SNI for internal communication. This solution balances the need to support the widest possible user base without burdening developers and operations teams with an unwieldy and complex network topology.

RavenDB operations practices – Part 1

At The Network, we use RavenDB, a document database written in C#. You can think of it as being similar to MongoDB, but with an HTTP REST API. I’m not a developer at The Network, but I am responsible for operations of the GRC Suite, and this includes RavenDB. What follows is a collection of my notes and experiences running RavenDB.

Running under IIS

We run RavenDB as an IIS application. It simplifies management and provides great logs and performance data.

Placement of RavenDB data

Do not keep RavenDB data under the website directory.

The default location of RavenDB data is ~\Databases, which places the data under the website. The problem is that when IIS detects a large number of changed static files (such as when you are updating indexes), it recycles the application pool, which causes all the databases to unload. For this reason, we keep all our data, including the system database, outside the application directory.

Application pool settings

Do not let anything recycle the Application Pool

This means:

  • Disable automatic AppPool recycling (by time and memory usage)
  • Disable the WAS ping
  • Disable the idle timeout

Any of these settings, if left enabled, will recycle the AppPool and cause all your databases to unload.

Do not enable more than one worker process

This is the default setting and must not be changed. If there is more than one worker process, the processes will fight for access to the ESENT database.

Disable overlapping recycle

This is a great feature for websites, since it effectively lets the new process start handling requests while the old process is still completing existing ones. For RavenDB it’s a bad thing, for the same reason as enabling more than one worker process. You want to avoid an AppPool recycle, but if one happens, you don’t want overlapping.

Disable the shutdown time limit

Or at least increase it from the default of 90 seconds. This setting tells IIS to kill a process if it hasn’t responded to a shutdown request before the time limit expires. When shutting down, Raven cleanly stops and unloads its databases. If the process is killed instead, starting a DB will require recovery (done automatically), which just slows down startup.

Backing up RavenDB

RavenDB includes a number of backup options, including an internal backup scheduler and an external tool called smuggler. We have hundreds of databases and needed backups every two hours, so we decided to use our SAN to take snapshots.

RavenDB is backed by ESENT, which has ACID transactions and robust crash recovery. Taking snapshots can leave data in an inconsistent state, and the ESENT utility is used to clean up the DB. Three things must be done, in order:

  1. recover – uses the logs to return the DB to a clean-shutdown state.
  2. repair – cleans up corruption. Running repair before recover will result in data loss.
  3. defrag – compacts the database and also repairs indexes.

C:\Windows\system32\esentutl.exe /r RVN /l logs /s system /i
C:\Windows\system32\esentutl.exe /p /o Data
C:\Windows\system32\esentutl.exe /d Data
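Since the ordering matters, it’s worth scripting those three steps so that a failed step aborts the sequence instead of silently continuing into data loss. A sketch (the executable path is parameterized; this is not a drop-in tool):

```python
import subprocess

# Sketch: run the three esentutl steps in the required order -- recover,
# then repair, then defrag. check=True aborts the sequence if a step fails.
def clean_snapshot(exe=r"C:\Windows\system32\esentutl.exe"):
    steps = [
        ["/r", "RVN", "/l", "logs", "/s", "system", "/i"],  # 1. recover
        ["/p", "/o", "Data"],                               # 2. repair
        ["/d", "Data"],                                     # 3. defrag
    ]
    return [subprocess.run([exe, *args], check=True) for args in steps]
```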

Part 2

I’m currently working on clustering, sharding and authentication for RavenDB. I’ll post a part 2 when those are figured out.

Building dashboards with splunk, twig and bootstrap

There won’t be much code with this one because it was an internal project, but it’s been interesting enough that I wanted to do a post. We’re using a monitoring package called AppManager by Zoho Corp. It uses a hub-and-spoke architecture, with remote “managed servers” rolling data up to a “management server”. Monitors are configured on “managed servers” to monitor systems at their local site. Individual monitors are placed into groups, and those groups can be further grouped, so that alarm states bubble up through the management groups.

My challenge was to create a simple stoplight dashboard showing the ships and the status of various monitor groups. Management wanted it to look like the Twilio service status dashboard. The software has a decent REST API returning JSON or XML. The initial dashboard was a snap: crawl the returned JSON and populate an HTML table. This was so easy that I decided it was time to learn Twig.

Twig is actually pretty slick, and I can easily see the benefits of using a template engine. I re-worked my code to populate an array of data, and passed that data into the Twig render function. Twig lets you nest templates inside other templates and pass data down to the “child”. I think all my future work will be run through it.

I also wanted to play with Bootstrap, since my UIs are usually pretty bad. Bootstrap is super easy, looks great, and is well documented. I’ve officially said goodbye to jQuery UI and hello to Bootstrap.

The problem with AppManager is that the management server doesn’t keep monitor data, only monitor status. The other problem is that their dashboards aren’t very pretty. We already have a Splunk installation, so I figured this was a good time to play with Splunk.

I installed Node.js on every ship, along with the Splunk forwarder. At regular intervals the Node.js script runs against the local AppManager server, gets details about specific monitors (anything in a group called ‘splunkforward’), and writes the JSON data out to the filesystem. The Splunk universal forwarder then picks up those files and sends them to the indexer. Splunk parses JSON data pretty well, and we have a Splunk wizard on site to help carve it up. Splunk also has great graphing features which did a lot of my work for me.
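The export step itself is tiny. Here is a sketch of what the per-ship script does once it has fetched and parsed the monitor JSON; the real script is Node.js, and the drop directory and file naming here are assumptions, not the production layout:

```python
import json
import pathlib
import time

# Sketch: write one JSON event per line into a drop directory that the
# Splunk universal forwarder is watching (paths and names are assumptions).
def write_monitor_events(monitors, out_dir):
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"appmanager-{int(time.time())}.json"
    with path.open("w") as f:
        for m in monitors:
            f.write(json.dumps(m) + "\n")  # newline-delimited JSON events
    return path
```

One event per line keeps the forwarder’s job simple and lets Splunk’s JSON parsing do the rest.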

Finally, I wanted to get the monitor details back into my dashboard. Splunk has a PHP SDK which lets you easily retrieve saved queries and re-execute them. The query returns a job ID, and you can either go into a polling loop checking its status, or just execute it in blocking mode. Since the data I fed into Splunk initially was just JSON, this is what I get back out. Those JSON documents can then be parsed with json_decode.

The Splunk bit was also really exciting to me. The alternative would be to parse the data, write it out to some RDBMS, and query it back with SQL. I’m learning the data as I work through the project: changing it, adding fields on the fly, and dealing with some differences in the JSON layout from AppManager (based on monitor type). Using Splunk has freed me from having to battle with SQL. I just feed it JSON files, query them out later, and re-parse them before feeding them back into Twig. Fun!

This whole thing might seem overly complex, but consider that it’s expanded out for 24 ships, and each ship is connected with a high latency, low bandwidth satellite link which will occasionally fail. Splunk provided guaranteed delivery of data, and a convenient way to store and access it.

The finished product!


The path for getting AppManager data from the ship to HQ and finally a dashboard.


Showing a higher level and two ships sending data. In reality it’s 24.

Git for Windows

Git is the popular source control tool which has become the darling of the internet. Unlike Subversion, git keeps a local copy of the repository, allowing offline commits. This makes a full Windows port a lot harder. Luckily, there is a package called msysgit, which has all the dependencies bundled with it.

The default is to run git in a Bash prompt, but I can’t imagine why you would want to do that. If I’m working in the Windows command prompt for everything else, I don’t want to switch into some hackish Bash prompt just to run git. I just change that option and let the good times roll.

Line endings are a little more complicated. I use the default “check out Windows-style, check in Unix-style”, but I’m thinking of fine-tuning that a bit. Any good Windows text editor can handle Unix line breaks. If I’m doing a module for MediaWiki or Drupal, I might force it to stay in Unix mode. Conversely, if I’m doing something very Windows-centric, like ADSI or WMI scripts in PHP (or even VBS), then I’ll want Windows style on the remote side. The idea is that if someone were to download a ZIP of a repo from GitHub, they should be able to view the file using the native console tool (cat or type) and not see crazy line breaks. More information about handling it is in the GitHub help article “Dealing with line endings”.

One other small thing: on Windows systems behind an NTLM proxy (looking at you, TMG), you need to specify the proxy settings in the environment variables http_proxy and https_proxy. Setting http.proxy didn’t seem to do the job. I found a bunch of posts dealing with this, having something to do with git not passing --proxy-ntlm to cURL and it being a hassle to override. I wimped out and used the environment variables.