Upgrading Chef server 11 to version 12

This is the story of how we upgraded Chef server 11 to Chef server 12, open source, on CentOS 6.


Step 0: Work off a snapshot of your server. I can’t stress this enough. You’re going to have issues. Use a snapshot.

The process we followed breaks down into three basic phases:

  1. Upgrade to the latest Chef Server 11
  2. Delete or fix old recipes
  3. Run the Chef 12 upgrade

Because the old recipes that need fixing aren’t discovered until you attempt the process, I’ll lay out our challenges and solutions at the end.

Upgrade to latest Chef server 11

This was pretty easy. There is a link to the process here. Since they don’t include an example, the command to install the RPM is:

rpm -Uvh --nopostun /path/to/chef-server-<version>.rpm

After this, just follow their instructions. It went smoothly, so I won’t rehash it.

Chef 12 upgrade

Chef 12 added a new feature allowing multiple organizations in a single server. The upgrade will prompt you to create an organization, and all your cookbooks, users, nodes, etc will be re-created in it. We only have one organization, and it looks like knife and chef-client just “do the right thing”.

The upgrade documentation is here. When you run chef-server-ctl upgrade, the following happens:

  1. Chef 11 is started, and Chef 12 stopped
  2. All objects (cookbooks, roles, nodes, users, etc) are downloaded
  3. Cookbooks are checked and updated as necessary. Undocumented and hard to troubleshoot exceptions are thrown.
  4. Chef 11 is stopped, and Chef 12 started
  5. All objects are restored using knife restore. More exceptions probably.
  6. The upgrade is complete, but your server cert is not copied over so all your nodes start alarming.

Troubleshooting the exceptions, which was the bulk of our challenges, is covered in a section below. For the server certificate, copy /var/opt/chef-server/nginx/ca/ to /var/opt/opscode/nginx/ca/.
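
A minimal sketch of that copy, assuming it runs as root on the upgraded server. The function name copy_chef11_certs and its optional arguments are made up for illustration; only the two default paths come from the text above.

```shell
# Sketch only: copy the Chef 11 nginx CA directory into the Chef 12 location.
# copy_chef11_certs and its arguments are illustrative, not part of Chef.
copy_chef11_certs() {
    src="${1:-/var/opt/chef-server/nginx/ca}"
    dst="${2:-/var/opt/opscode/nginx/ca}"
    # Refuse to continue if the Chef 11 directory is not there
    [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    # -a preserves ownership, modes and timestamps; "/." copies the contents
    cp -a "$src/." "$dst/"
}
# Usage: copy_chef11_certs
```

nginx probably has to pick up the copied files afterwards. chef-server-ctl restart nginx should do that, but treat that step as an assumption rather than something the upgrade docs spell out.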

Delete or fix old recipes

You don’t actually find out about a lot of these until you attempt the upgrade. One thing you can do ahead of time is a knife download onto your workstation, so that you can grep through the cookbooks and fix them. Add these two lines to your knife.rb:

repo_mode 'everything'
versioned_cookbooks true
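
With those settings in place, the download itself can be sketched like this. The function name and the default target directory are hypothetical; knife download / and --chef-repo-path are standard knife options, and repo_mode and versioned_cookbooks are read from knife.rb.

```shell
# Sketch only: pull every object from the Chef server into a local directory.
# download_chef_objects and the default directory are illustrative names.
download_chef_objects() {
    repo="${1:-$HOME/chef11-download}"
    mkdir -p "$repo" || return 1
    # "/" means everything at the root of the server
    knife download / --chef-repo-path "$repo"
}
# Usage: download_chef_objects ~/chef11-download
```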

name property in metadata.rb

Surprise! This is a requirement now. The download checks for it and freaks out if it’s not set. Remember how you downloaded all your cookbooks? Run this to list every metadata.rb that has no name line:

grep -rLE "^[[:space:]]*name[[:space:](]" * --include="metadata.rb"

Of course, that doesn’t help you if you were using metadata.json. We weren’t much. If you are … good luck.
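
If you do have metadata.json files, a rough equivalent is sketched below. It is a plain text match, not a JSON parser, so treat a hit as a hint to go look rather than as proof. The function name is hypothetical, and the check assumes name appears as a quoted key with a non-empty quoted value.

```shell
# Sketch only: list metadata.json files with no non-empty "name" key.
# find_json_missing_name is an illustrative name; requires GNU grep.
find_json_missing_name() {
    root="${1:-.}"
    # -L prints files WITHOUT a match; the pattern wants "name": "something"
    grep -rLE --include="metadata.json" \
        '"name"[[:space:]]*:[[:space:]]*"[^"]+"' "$root"
}
# Usage: find_json_missing_name ~/chef11-download/cookbooks
```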

Includes don’t work during migration

This was a fun one. Since the metadata.rb file is just blindly evaluated, it lets you do things like:

require          File.expand_path('../lib/chef/sugar/version', __FILE__)

During the migration, the cookbooks are parked in ‘/tmp/somethingrandom/cookbooks’ and since ‘/tmp/somethingrandom/lib/chef/sugar/version’ doesn’t exist, the process blows up. Of course, it doesn’t actually tell you that. To be extra helpful, it deletes ‘/tmp/somethingrandom’ before throwing the stack trace, leaving you utterly unable to diagnose the problem. I worked this out using pry. There is a whole subsection on that below.
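
To hunt for this ahead of time, something like the following lists metadata.rb lines that load other files. It is a heuristic sketch: the function name is hypothetical, and the pattern only covers require, require_relative, load and __FILE__, which is an assumption about how such includes are usually written.

```shell
# Sketch only: show metadata.rb lines that pull in other files.
# find_metadata_requires is an illustrative name; requires GNU grep.
find_metadata_requires() {
    root="${1:-.}"
    # -n adds line numbers so each hit can be fixed in place
    grep -rnE --include="metadata.rb" \
        '^[[:space:]]*(require|require_relative|load)[[:space:](]|__FILE__' "$root"
}
# Usage: find_metadata_requires ~/chef11-download/cookbooks
```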

Troubleshooting the upgrade is a hassle

You run this sort of magic command ‘chef-server-ctl upgrade’ and it just does “things” for you. If you encounter an error, and you want more details, you’re kind of out of luck. There is no way to pass -VV through into knife, which it’s calling behind the scenes.

To work around this I installed auditd (‘yum install audit’ or ‘apt-get install auditd’) and ran with these rules:

-a exit,always -F arch=b64 -F euid=0 -S execve
-a exit,always -F arch=b32 -F euid=0 -S execve

This let me actually see what chef-server-ctl upgrade was running so I could step into a specific place. You’ll get some examples of how this helped below.
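
One way to load those rules on the fly and pull the knife calls back out is sketched below. Both functions are hypothetical wrappers; they assume root, a running auditd, and the stock auditctl and ausearch tools.

```shell
# Sketch only: load the two execve rules at runtime (must be root).
# audit_root_execs and show_knife_execs are illustrative names.
audit_root_execs() {
    auditctl -a exit,always -F arch=b64 -F euid=0 -S execve &&
    auditctl -a exit,always -F arch=b32 -F euid=0 -S execve
}

# List the recorded execve events that mention knife.
show_knife_execs() {
    # -i makes the records human readable, -sc filters by syscall name
    ausearch -i -sc execve | grep knife
}
```

Run audit_root_execs, start the upgrade, then run show_knife_execs once it has failed.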

Downloading the cookbooks times out

The first part of the upgrade process is to download all the cookbooks. During the download you might experience: ERROR: Gateway Time-out

What’s happening behind the scenes is that chef-server-ctl upgrade has started a knife download, and that download times out. Knife supports resume, but chef-server-ctl creates a random directory for every run, so knife never sees the existing files. It’s a good idea to specify a download directory when starting chef-server-ctl. Ultimately, the command we ran is:

chef-server-ctl upgrade --org-name operations --full-org-name " Operations" \
--user robertlabrie -d /tmp/chef11-data -e /tmp/chef12-data --yes

It won’t create those directories for you, so make sure you create them first.
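
Creating them up front is a one-liner. The paths are simply the ones passed to -d and -e in the command above:

```shell
# Create the Chef 11 export and Chef 12 import directories before upgrading
mkdir -p /tmp/chef11-data /tmp/chef12-data
```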

Pry is your friend

The download crashes, the upload crashes, the cookbook update crashes, it doesn’t tell you where or why, it deletes the temp directory, and since the work is done in parallel, the output doesn’t relate to what actually blew up. So what do you do? Well, if you were using auditd, you have some idea of the knife commands being run. Here is the one that does the uploads:

knife ec restore --skip-useracl --with-user-sql --concurrency 1 \
-c /tmp/knife-ec-backup-config.rb /tmp/chef12-data/ -VV

Now at least you can test the restore without having to run the whole pipeline, but it doesn’t help you with the meaningless output from knife. This is where pry and pry-rescue come in. First install the gems for Chef’s embedded Ruby:

/opt/opscode/embedded/bin/gem install pry pry-rescue

Then re-run knife using rescue instead of ruby:

/opt/opscode/embedded/bin/rescue /opt/opscode/embedded/bin/knife \
ec restore --skip-useracl --with-user-sql --concurrency 1 \
-c /tmp/knife-ec-backup-config.rb /tmp/chef12-data/ -VV

Now you get to do useful things like dump @inferred_cookbook_name and cookbook_version (if your metadata specified one) so that you can figure out what’s going on, or at least where it’s going on.

The end?

I’m writing this a few days after completing the project. There may be other gotchas, but the thing is working now and we’re back to pushing roles, nodes and cookbooks.