Using AWS Lambda with API Gateway

AWS Lambda is a “serverless compute” service from AWS that lets you run Node.js, Python, or other code. They have extensive documentation, but I found a disconnect: I was struggling to get API Gateway to map requests into Lambda functions. I put together a YouTube video, but some of the highlights are below. The source code for the project is on GitHub.

 

Lambda

The Lambda function I demo is very simple, returning a URL based on a given input. When creating it, filter for “hello-world”, and adjust the following:

Name: bsdfinder
Description: BSD Finder
Role: lambda_basic_execution

For the others, just accept the defaults. The IAM Role can be tweaked to allow your lambda function to access S3 as well.

Your handler needs to accept event and context objects. The event object is how properties are fed into your function from API Gateway, and the context object is how execution is stopped and response values are passed out.
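
The actual source is on GitHub; as a rough sketch of the shape, a Node.js handler along these lines would satisfy the test below (the name-to-URL mapping is assumed):

// Hypothetical sketch of the bsdfinder handler; the real code is on GitHub.
exports.handler = function (event, context) {
  // event.bsd arrives from the API Gateway mapping (e.g. "open", "free")
  var sites = {
    open: 'http://www.openbsd.org',
    free: 'http://www.freebsd.org',
    net:  'http://www.netbsd.org'
  };
  var location = sites[event.bsd];
  if (location) {
    // context.succeed() stops execution and hands this object back to the caller
    context.succeed({ code: 302, location: location });
  } else {
    context.fail('Unknown BSD: ' + event.bsd);
  }
};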

Test it

Once it’s saved, click Actions → Configure test event. Use this test data:

{
  "bsd": "open"
}

The execution should succeed, and the result should look like:

{
  "code": 302,
  "location": "http://www.openbsd.org"
}

If you got this far, the lambda function works.

API Gateway

The API Gateway fronts requests from the internet and passes them into the Lambda function, with a workflow:

Method Request → Integration Request → Lambda → Integration Response → Method Response

Method Request – Specify which properties (query string, headers) will be accepted
Integration Request – Map method request properties onto the Lambda function
Lambda – Properties are passed in on the event object and passed out with the context object
Integration Response – Map the Lambda output properties to the response
Method Response – The response finally returned to the user agent
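
For example, if the Method Request accepts a bsd query string parameter, an Integration Request body mapping template (for application/json) along these lines will hand it to the Lambda function as event.bsd:

{
  "bsd": "$input.params('bsd')"
}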

 

I deliberately kept this document thin. Check out the YouTube video to see it all come together.

Upgrading Chef Server 11 to version 12

This is the story of how we upgraded Chef server 11 to Chef server 12, open source, on CentOS 6.

 

Step 0: Work off a snapshot of your server. I can’t stress this enough. You’re going to have issues. Use a snapshot.

The process we followed breaks down into three basic phases:

  1. Upgrade to the latest Chef Server 11
  2. Delete or fix old recipes
  3. Run the chef 12 upgrade

Because the old recipes that need fixing aren’t discovered until you attempt the process, I’ll lay out our challenges and solutions at the end.

Upgrade to latest Chef server 11

This was pretty easy. There is a link to the process here. Since they don’t include an example, the command to install the RPM is:

rpm -Uvh --nopostun /path/to/chef-server-<version>.rpm

After this, just follow their instructions. It went smoothly, without issues, so I won’t re-hash it.

Chef 12 upgrade

Chef 12 added a new feature allowing multiple organizations in a single server. The upgrade will prompt you to create an organization, and all your cookbooks, users, nodes, etc will be re-created in it. We only have one organization, and it looks like knife and chef-client just “do the right thing”.

The upgrade documentation is here. When you run chef-server-ctl upgrade, the following happens:

  1. Chef 11 is started, and Chef 12 stopped
  2. All objects (cookbooks, roles, nodes, users, etc) are downloaded
  3. Cookbooks are checked and updated as necessary. Undocumented and hard to troubleshoot exceptions are thrown.
  4. Chef 11 is stopped, and Chef 12 started
  5. All objects are restored using knife restore. More exceptions probably.
  6. The upgrade is complete, but your server cert is not copied over so all your nodes start alarming.

Troubleshooting the exceptions, which was the bulk of our challenges, is covered in a section below. For the server certificate, copy /var/opt/chef-server/nginx/ca/ to /var/opt/opscode/nginx/ca/.
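
In practice that boils down to something like this (plus a restart of nginx so it picks the old certificate back up):

# Carry the Chef 11 CA directory over to the Chef 12 location, then bounce nginx
cp -a /var/opt/chef-server/nginx/ca/. /var/opt/opscode/nginx/ca/
chef-server-ctl restart nginx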

Delete or fix old recipes

You don’t actually find out about a lot of these until you attempt the upgrade. One thing you can do is a knife download onto your workstation so that you can grep through and fix cookbooks. Add these two lines to your knife.rb:

repo_mode 'everything'
versioned_cookbooks true
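
With those two lines in place, pulling everything down should be as simple as:

knife download /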

name property in metadata.rb

Surprise! This is a requirement now. The download will check this and freak out if it’s not set. Remember how you downloaded all your cookbooks? Run this:

grep -rL "name" * --include="metadata.rb"
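
Each file that turns up needs a name entry; the fix looks something like this (cookbook name is just an example):

# metadata.rb – the name line is what the upgrade insists on
name    'my_cookbook'
version '0.1.0'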

Of course, that doesn’t help you if you were using metadata.json. We mostly weren’t. If you are … good luck.

Includes don’t work during migration

This was a fun one. Since the metadata.rb file is just blindly evaluated, it lets you do things like:

require          File.expand_path('../lib/chef/sugar/version', __FILE__)

During the migration, the cookbooks are parked in ‘/tmp/somethingrandom/cookbooks’, and since ‘/tmp/somethingrandom/lib/chef/sugar/version’ doesn’t exist, the process blows up. Of course, it doesn’t actually tell you that. To be extra helpful, it deletes ‘/tmp/somethingrandom’ before throwing the stack trace, leaving you utterly unable to diagnose. We worked this out using pry; there is a whole subsection on that below.

Troubleshooting the upgrade is a hassle

You run this sort of magic command ‘chef-server-ctl upgrade’ and it just does “things” for you. If you encounter an error and want more details, you’re kind of out of luck. There is no way to pass -VV through to knife, which it’s calling behind the scenes.

To work around this I installed auditd (‘yum install audit’ or ‘apt-get install auditd’) and ran with these rules:

-a exit,always -F arch=b64 -F euid=0 -S execve
-a exit,always -F arch=b32 -F euid=0 -S execve
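
Once auditd has the rules loaded (auditctl -R, or just restart auditd), the exec records can be pulled back out with something along these lines:

# Show interpreted execve records and look for the knife invocations
ausearch -i -sc execve | grep knife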

This let me actually see what chef-server-ctl upgrade was running so I could step into a specific place. You’ll get some examples of how this helped below.

Downloading the cookbooks times out

The first part of the upgrade process is to download all the cookbooks. During the download you might experience: ERROR: Gateway Time-out

What’s happening behind the scenes is that chef-server-ctl upgrade has started a knife download, and that download times out. Knife supports resume, but chef-server-ctl creates a random directory for every run, so knife never sees the existing files. It’s a good idea to specify a download directory when starting chef-server-ctl. Ultimately, the command we ran is:

chef-server-ctl upgrade --org-name operations --full-org-name " Operations" \
--user robertlabrie -d /tmp/chef11-data -e /tmp/chef12-data --yes

It won’t create those directories for you, so make sure you create them first.
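
For example:

mkdir -p /tmp/chef11-data /tmp/chef12-data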

Pry is your friend

The download crashes, the upload crashes, the cookbook update crashes, it doesn’t tell you where or why, it deletes the temp directory, and since the work is done in parallel, the output doesn’t relate to what actually blew up. So what do you do? Well, if you were using auditd, you got some idea of the knife commands being run. Here is the one that does the uploads:

knife ec restore --skip-useracl --with-user-sql --concurrency 1 \
-c /tmp/knife-ec-backup-config.rb /tmp/chef12-data/ -VV

Now at least you can test the restore without having to run the whole pipeline, but it doesn’t help you with the meaningless output from knife. This is where pry and pry-rescue come in. First install the gems for Chef’s embedded Ruby:

/opt/opscode/embedded/bin/gem install pry pry-rescue

Then re-run knife using rescue instead of ruby

/opt/opscode/embedded/bin/rescue /opt/opscode/embedded/bin/knife \
ec restore --skip-useracl --with-user-sql --concurrency 1 \
-c /tmp/knife-ec-backup-config.rb /tmp/chef12-data/ -VV

Now you get to do useful things like dump @inferred_cookbook_name and cookbook_version (if your metadata specified one) so that you can figure out what’s going on, or at least where it’s going on.

The end?

I’m writing this a few days after completing the project. There may be other gotchas, but the thing is working now and we’re back to pushing roles, nodes and cookbooks.

 

Windows CPU monitoring with Splunk

I needed to overcome the Nagios limitation of a slow polling interval for CPU utilization. Splunk provides robust tools for leveraging WMI to monitor Windows performance. So you think to yourself: “Great, let’s look at CPU usage per process on important servers”, and you bump right up into the catch. Windows perf counters come in two flavors, “raw” and “formatted”, and Splunk only reads the formatted ones. This is fine for most things, but for CPU usage, the data isn’t accurate on multi-core systems. This post covers how to use the raw data and “cook” the values to match what you’re used to seeing in Task Manager.

 

CPU utilization on multi-core systems

CPU utilization is measured by % Processor Time over a set period. Multi-threaded applications make calculating utilization on multi-core systems more complicated than a simple 0-to-100 scale. During a time slice, several different threads may have had processor time on a core, or more likely, different threads will have had different utilization on different cores. When calculating CPU usage on a Windows system, the maximum is actually 100% x the number of cores.

 

[Screenshot: per-core CPU usage average]

The formatted CPU usage per process counter doesn’t take this into account, and the results will not be consistent with what you see in Performance Monitor or Task Manager.

 

Cooking the values

So what you actually have to do is take two samples of CPU utilization, determine the elapsed time between them, and calculate the difference. Thanks to a very detailed blog post by Greg Stemp at Microsoft, I worked out that the counter type is PERF_100NSEC_TIMER, making the formula:

100* (CounterValue2 - CounterValue1) / (TimeValue2 - TimeValue1)
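
Both values in the formula are in 100 ns ticks, so the units cancel out. As a quick sanity check with made-up numbers: if a process accumulates 30,000,000 ticks of processor time (3 seconds) while the timestamp advances by 100,000,000 ticks (10 seconds), the result is 100 * 30,000,000 / 100,000,000 = 30, i.e. 30% of one core.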

I’ll get into applying that formula in Splunk in a minute, but first, getting the data in.

 

Get the data in

Like I said, Splunk has great first-class Windows support, including built-in WMI inputs. It also supports monitoring performance counters automatically, but since those inputs don’t handle raw values, we’re going to use WMI. So, here it is:

[WMI:process]
index = test
interval = 10
server = RABBIT
wql = Select IDProcess,Name,PercentProcessorTime,TimeStamp_Sys100NS from Win32_PerfRawData_PerfProc_Process

This stanza must go in wmi.conf; it’ll appear in the GUI as a remote performance monitor.

Querying the data

And at long last, show me the chart! First off, a long Splunk query that is going to run off the end of the page (don’t worry, we’ll unpack it and go over it):

earliest=-3m index=test Name!=_Total Name!=Idle | reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name | eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS) | search cputime > 0 AND cputime < 400 |  timechart span=3 avg(cputime) by Name

The pieces explained:

Name!=_Total Name!=Idle
  • _Total and Idle make the time chart look bad, and can be more easily added with an | addtotals
reverse
  • We need the results from oldest to newest, because the calculations are based on a “previous value”
streamstats current=f
last(PercentProcessorTime) as last_PercentProcessorTime
last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS
by Name
  • Streamstats pulls values from the previous record forward. We don’t need the current value, so we drop it.
  • last(PercentProcessorTime) and last(Timestamp_Sys100NS) pull the previous values forward and alias them
  • by Name groups the events together, so that you’re getting the “right” previous PercentProcessorTime. Without it, you would be getting values from some previous process
eval cputime = 100 *
(PercentProcessorTime - last_PercentProcessorTime) /
(Timestamp_Sys100NS - last_Timestamp_Sys100NS)
  • This is the math. It works out the CPU usage between the slices, and the time between the slices
search cputime > 0 AND cputime < 400
  • Throw out anomalous values. * 400 is 100 x the number of cores on this box. Adjust yours accordingly.
timechart span=3 avg(cputime) by Name
  • Make it pretty!

* PercentProcessorTime increments forever; if you plot it on a chart, it rises over time. When running the math on a single process, that’s fine: the next value is always higher than the last. There are issues around the transitions, though. If the last value of PercentProcessorTime for ProcessA is 40000 and the first value for ProcessB is 10000, then the delta is -30000, and it throws the graphs way out of whack. The easiest thing to do is just drop those values. We’re going to pipe the values through an avg() in timechart anyway, so little gaps like that will get smoothed out.

The results

I used a little app called CpuKiller to throttle the CPUs and make the results more obvious.

 

[Screenshot: Windows Performance Monitor; green – CPU killer, red – VMWare]

 


The same data plotted in Splunk. Note that Splunkd doesn’t appear in the perfmon graph because I didn’t include that process, but the WQL being run by Splunk watches all processes.

 

Some of the fun things you can do with this are looking at processes from a number of servers across an entire farm (for example, your web servers), or plotting application response times against resource utilization. Yay for Splunk!

CoreOS on VMWare using VMWare GuestInfo API

CoreOS bills itself as “Linux for Massive Server Deployments”, but it turns out it’s excellent for smaller deployments as well. CoreOS was intended to run in big cloud providers (Azure, AWS, Digital Ocean, etc.) backed by OpenStack. I’ve been running CoreOS for a while now, on premises in VMWare. It is kind of a pain: clone the template; interrupt the boot to enable autologin; set a password for “core”; reboot again; re-mount the disk rw; paste in a hand-crafted cloud-config.yml. Fortunately for me (and you!), VMWare has an API to inject settings into guests, and CoreOS has added support for using those settings in coreos-cloudinit. Information about the supported properties is here.

My new process is:

  1. Clone the template to a new VM
  2. Inject the config through the guest vmx file
  3. Boot the VM

It’s all automated though, so mostly I get other productive work or coffee drinking done while it’s conjured into existence.

The short version is that you base64 encode and inject your cloud-config, along with any additional parameters. For me it looks like:

hostname = coreos0
  The hostname, obviously 🙂
interface.0.name = ens192
  The name of the interface to assign id #0 for future settings. The kernel names the first NIC ens192 in VMWare.
interface.0.role = private
  This is my internal (and only) NIC.
interface.0.dhcp = no
  Turn off DHCP. I like it on; most people like it off.
interface.0.ip.0.address = 192.168.229.20/24
  CIDR IP address.
interface.0.route.0.gateway = 192.168.229.2
  Default gateway.
interface.0.route.0.destination = 0.0.0.0/0
  Destination for the default route (all traffic).
dns.server.0 = 192.168.229.2
  DNS server.
coreos.config.data = [some ugly string of base64]
  Base64-encoded cloud-config.yml.
coreos.config.data.encoding = base64
  Tells cloudinit how to decode coreos.config.data.

OK, so this is all great, but how do you get it into VMWare? You have two choices: you can manually edit the .VMX file (boring!), or you can use PowerShell. The script I use is on GitHub, but the workflow is basically:

  1. Clone the VM
  2. Connect to the datastore and fetch the vmx file
  3. Add the guestinfo to the VMX file
  4. Upload the VMX file back to the datastore
  5. Boot the VM

 

The script will pick up a cloud-config.yml, base64 encode it for you, and inject it. Check out the source on GitHub to learn more. If you’re looking at the CoreOS documentation on the VMWare backdoor, you need to put “guestinfo.” in front of all the properties. For example, guestinfo.dns.server.0. The VMWare RPC API only passes properties that start with guestinfo.
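
The encoding step itself is tiny; a rough PowerShell equivalent of what the script does (file name assumed) is:

# Read cloud-config.yml, base64-encode it, and emit the guestinfo line for the VMX
$yaml = Get-Content -Raw .\cloud-config.yml
$b64  = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($yaml))
"guestinfo.coreos.config.data = ""$b64"""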

This is how it looks when it’s written out to the VMX:

guestinfo.coreos.config.data.encoding = "base64"
guestinfo.interface.0.dhcp = "no"
guestinfo.coreos.config.data = "I2Nsyb2JAd............WJ1bnR1MA0K"
guestinfo.dns.server.0 = "192.168.229.2"
guestinfo.interface.0.ip.0.address = "192.168.229.20/24"
guestinfo.interface.0.route.0.gateway = "192.168.229.2"
guestinfo.interface.0.role = "private"
guestinfo.interface.0.route.0.destination = "0.0.0.0/0"
guestinfo.interface.0.name = "ens192"
guestinfo.hostname = "coreos0"

Wanna see it in action?

Splunk forwarders on CoreOS with JournalD

The team at TNWDevLabs started a new effort to develop an internal SaaS product. It’s a greenfield project, and since everything is new, it let us pick up some new technology and workflows, including neo4j and nodejs. In my role as DevOps Engineer, the big change was running all the application components in Docker containers hosted on CoreOS.

CoreOS is a minimalist version of Linux, basically life support for Docker containers. It is generally not recommended to install applications on the host, which raises a question: “How am I going to get my application logs into Splunk?”.
With a more traditional Linux system, you would just install a Splunk forwarder on the host, and pick up files from the file system.

[Diagram: a Splunk forwarder installed directly on a traditional Linux host]

With CoreOS, you can’t really install applications on the host, so there is the possibility of putting a forwarder in every container.

[Diagram: a Splunk forwarder inside every container]

This would work, but it seems wasteful to have a number of instances of Splunk running on the same host, and it doesn’t give you any information about the host.
CoreOS leverages SystemD, which has an improved logging facility called JournalD. All system events (updates, etcd, fleet, etc.) from CoreOS are written to JournalD. Apps running inside Docker containers generally log to stdout, and all those events are sent to JournalD as well. This improved logging facility means getting JournalD into Splunk is the obvious solution.

The first step was to get the Splunk Universal Forwarder into a Docker container. There were already some around, but I wanted to take a different approach. The idea is that instead of trying to manage .conf files and get them into the container, I leverage the Deployment Server feature already built into Splunk. The result is a public image called tnwinc/splunkforwarder. It takes two parameters passed in as environment variables: DEPLOYMENTSERVER and CLIENTNAME. These two parameters are fed into $SPLUNK_HOME/etc/system/local/deploymentclient.conf when the container is started. This is the bare minimum to get the container up and talking to the deployment server.
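
The generated deploymentclient.conf ends up looking something like this (host names are examples):

# Fed from CLIENTNAME and DEPLOYMENTSERVER at container start
[deployment-client]
clientName = coreos0

[target-broker:deploymentServer]
targetUri = splunk.example.com:8089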

[Diagram: forwarder containers connecting to the Splunk deployment server]

Setting up CoreOS

Running CoreOS, the container is started as a service. The service definition might look like:

ExecStart=/usr/bin/docker run -h %H -e DEPLOYMENTSERVER=splunk.example.com:8089 \
-e CLIENTNAME=%H -v /var/splunk:/opt/splunk --name splunk tnwinc/splunkforwarder

-h %H
  Sets the hostname inside the container to match the hostname of the CoreOS host. %H gets expanded to the CoreOS hostname when the .service file is processed.
-e DEPLOYMENTSERVER
  The host and port of your deployment server.
-e CLIENTNAME=%H
  This is the friendly client name.
-v /var/splunk:/opt/splunk
  This exposes the real directory /var/splunk as /opt/splunk inside the container. This directory persists on disk when the container is restarted.

Now that Splunk is running and has a place to live, we need to feed data to it. To do this, I set up another service, which uses the journalctl tool to export the journal to a text file. This is required because Splunk can’t read the native JournalD binary format:

ExecStart=/bin/bash -c '/usr/bin/journalctl --no-tail -f -o json > /var/splunk/journald'

This command dumps everything in the journal out to JSON format (more on that later), then tails the journal. The process doesn’t exit, and continues to write out to /var/splunk/journald, which exists as /opt/splunk/journald inside the container.

Also note that as the journald file will continue to grow, I have an ExecStartPre directive that will trim the journal before the export happens:

ExecStartPre=/usr/bin/journalctl --vacuum-size=10M

Since I’m not appending, every time the service starts, the file is replaced. You may want to consider a timer to restart the service on a regular interval, based on usage.
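
Put together, a minimal unit for the export side might look like this (the unit layout is my own sketch):

[Unit]
Description=Export the systemd journal as JSON for the Splunk forwarder

[Service]
# Trim the journal, then dump everything and keep following it into the directory the container mounts
ExecStartPre=/usr/bin/journalctl --vacuum-size=10M
ExecStart=/bin/bash -c '/usr/bin/journalctl --no-tail -f -o json > /var/splunk/journald'
Restart=always

[Install]
WantedBy=multi-user.target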

Get the data into Splunk

This was my first experience with Deployment Server; it’s pretty slick. The clients picked up and reached out to the deployment server. My app lives in $SPLUNK_HOME\etc\deployment-apps\journald\local. I’m not going to re-hash the process of setting up a deployment server; there is great documentation at Splunk.com on how to do it.

My inputs.conf simply monitors the file inside the container:

[monitor:///opt/splunk/journald]
sourcetype = journald

The outputs.conf then feeds it back to the appropriate indexer:

[tcpout]
defaultGroup = default-autolb-group
[tcpout:default-autolb-group]
server = indexer.example.com:9997
[tcpout-server://indexer.example.com:9997]

Do something useful with it

Nothing about getting the data into Splunk is special, but the props.conf is fun, so I’m covering it separately. Running the journal out as JSON structures the data in a very nice format, and the following props.conf helps Splunk understand it:

[journald]
KV_MODE = json
MAX_TIMESTAMP_LOOKAHEAD = 10
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE = false
TIME_FORMAT = %s
TIME_PREFIX = \"__REALTIME_TIMESTAMP\" : \"
pulldown_type = 1
TZ=UTC

KV_MODE=json
  Magically parse JSON data. Thanks, Splunk!
TIME_PREFIX
  This ugly bit of regex pulls the timestamp out of a field called __REALTIME_TIMESTAMP.
TIME_FORMAT
  Standard strptime format for epoch seconds.
MAX_TIMESTAMP_LOOKAHEAD
  JournalD uses GNU time, which is in microseconds (16 characters). This setting tells Splunk to use the first 10.

Once that app was published to all the agents, I could query the data out of Splunk. It looks great!

[Screenshot: journald events queried in Splunk]

This is where the JSON output format from journald really shines. I get PID, UID, the command line, executable, message, and more. For me, coming from a Windows background, this is the kind of logging we’ve had for 20 years, now finally in Linux, and easily analyzed with Splunk.

The journal provides a standard transport, so I don’t have to think about application logging on Linux ever again. The contract with the developers is: You get it into the journal, I’ll get it into Splunk.

etcd 2.0 cluster with proxy on Windows Server 2012

I got interested in etcd while working on CoreOS for a work project. I’m no fan of Linux, but I like what etcd offers and how simple it is to set up and manage. Thankfully, the etcd team produces a Windows build with every release. I created a Chocolatey package and got it published, then used it to set up the cluster. The service is hosted by nssm, which also has a Chocolatey package and the extremely helpful nssm edit tool for making changes to the service config. Etcd has great documentation here.

Etcd is installed by running:

choco install -y etcd --params="<service parameters>"

Where the service parameters are:

etcd --name "%COMPUTERNAME%" ^
--initial-advertise-peer-urls "http://%COMPUTERNAME%:2380" ^
--listen-peer-urls "http://%COMPUTERNAME%:2380" ^
--listen-client-urls "http://%COMPUTERNAME%:2379,http://127.0.0.1:2379" ^
--advertise-client-urls "http://%COMPUTERNAME%:2379" ^
--discovery "https://discovery.etcd.io/<your_token_here>" ^
--debug

Or you can install it in proxy mode by running:

choco install -y etcd --params="--proxy on --listen-client-urls http://127.0.0.1:2379 --discovery https://discovery.etcd.io/<your_token_here>"

Proxy mode is especially useful because it lets applications running on a machine be ignorant of the etcd cluster. Applications connect to localhost on a known port, and etcd in proxy mode manages finding and using the cluster. Even if the cluster is running on CoreOS, running etcd in proxy mode on Windows is a good way to help Windows apps leverage etcd.
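
A quick way to confirm the proxy is doing its job is to write a key through the local endpoint and read it back; from PowerShell, something like this should round-trip (the key name is arbitrary):

# PUT a value through the local proxy, then GET it back
Invoke-RestMethod -Method Put -Uri http://127.0.0.1:2379/v2/keys/message -Body @{ value = 'hello from Windows' }
Invoke-RestMethod -Uri http://127.0.0.1:2379/v2/keys/message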

Watch a demo of the whole thing here:

Unattended Windows install after sysprep

I can’t believe how hard this was to find. You can get an unattend.xml that installs Call of Duty, but you can’t find one that just does as little as possible. This is it, for Windows 2012. I wanted to set up my base machine (which is very thin), sysprep it, and after cloning it, have it not ask any questions. If anyone finds this useful, please enjoy.

The reference for these files is here.

<!-- c:\Windows\System32\Sysprep\sysprep.exe /oobe /generalize /reboot /unattend:C:\unattend.xml -->
<unattend xmlns="urn:schemas-microsoft-com:unattend"> 
<settings pass="windowsPE"> 
      <component name="Microsoft-Windows-International-Core-WinPE" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64">
            <SetupUILanguage>
                  <WillShowUI>OnError</WillShowUI>
                  <UILanguage>en-US</UILanguage>
            </SetupUILanguage>
            <UILanguage>en-US</UILanguage>
      </component>
</settings>
<settings pass="specialize">
      <component name="Microsoft-Windows-Shell-Setup" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64">
            <ComputerName>*</ComputerName>
      </component>
</settings>
<settings pass="oobeSystem">
      <component name="Microsoft-Windows-Shell-Setup" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" processorArchitecture="amd64">
            <OOBE> 
                  <HideEULAPage>true</HideEULAPage> 
                  <NetworkLocation>Work</NetworkLocation> 
                  <ProtectYourPC>3</ProtectYourPC>
                  <SkipMachineOOBE>true</SkipMachineOOBE> 
                  <SkipUserOOBE>true</SkipUserOOBE>
                  <HideLocalAccountScreen>true</HideLocalAccountScreen>
            </OOBE>
            <UserAccounts>
                <AdministratorPassword>
                    <Value><passwordgoeshere></Value>
                    <PlainText>true</PlainText>
                </AdministratorPassword>            
            </UserAccounts>
            
      </component>
</settings>
</unattend>