Splunk user agent string lookups with TA-browscap_express

I was given a requirement to find out what browsers our clients are using. We run a SaaS product, and every client is clientname.ourdomain.com, so I could use the cs_hostname field in the logs. A third-party analytics tool was totally out of the question; all I had to go on were the IIS logs.

We’re already getting the IIS logs into Splunk, so with a bit of Googling I found the TA-browscap app by Dave Shpritz. It’s powered by the browscap project, and it works. The problem is that the browscap file is now 18MB, and searching it has become very slow. What started as a hack to cache matches in a separate file has turned into a total fork and rewrite of most of the app, and has become TA-browscap_express.

There are installation instructions on the application page at Splunk.com and in the GitHub repo, so I won’t rehash them here.

The Browscap file

The Browser Capabilities Project (browscap) is an effort to identify all known User Agent (UA) strings, which are, regrettably, a total mess. The project is active, and the data is accurate. They provide the data in a number of formats, including the legacy INI file still used by PHP and ASP, and a CSV file. The CSV file is 18MB and about 58,000 lines long.

Each entry in the file is a name pattern for a UA string, followed by all of its known properties. My UA string is:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0

And the matching name pattern is:

Mozilla/5.0 (*Windows NT 6.1*WOW64*) Gecko* Firefox/31.0*

The example above matches Firefox 31 on Windows 7 x64. Now here is an interesting challenge:

Mozilla/5.0 (*Windows NT 6.1*) Gecko* Firefox/31.0*

This name pattern matches Firefox 31 on Windows 7 x86, but it also matches x64. If you take the first match you find, you can get the wrong information. To get an accurate lookup, you need to compare against all 58,000 name patterns, and the longest one that matches is the most correct. As you can imagine, this is quite a challenge.

Parsing the browscap.csv file

The TA-browscap app uses pybrowscap, which is a Python library for parsing and managing the browscap.csv file. The library returns an object with properties for all the fields in the browscap file. I didn’t want to check 58,000 name patterns every time, so I also needed the name pattern that matched. pybrowscap doesn’t provide it, and it’s hard to recreate because the library relies on Python’s built-in CSV parser.

The solution was to lift the core logic from pybrowscap and rewrite it myself, parsing strings as CSV data instead of files. The first thing you have to do is convert the name pattern into a regular expression, which is easy, then compare your challenge string against it. As described above, you then loop through every name pattern and keep the longest one that matches.
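
To make that concrete, here is a minimal sketch of the pattern-to-regex conversion and the longest-match loop, using the example UA string and name patterns from above. It illustrates the technique; it is not the app’s actual code.

import re

def pattern_to_regex(pattern):
    # In browscap name patterns, '*' matches any run of characters and '?' matches
    # exactly one character. Escape everything else and anchor the expression.
    escaped = re.escape(pattern)
    regex = escaped.replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile("^" + regex + "$", re.IGNORECASE)

def longest_match(ua, patterns):
    # Compare the UA string against every name pattern and keep the longest match.
    best = None
    for pattern in patterns:
        if pattern_to_regex(pattern).match(ua):
            if best is None or len(pattern) > len(best):
                best = pattern
    return best

ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"
patterns = [
    "Mozilla/5.0 (*Windows NT 6.1*) Gecko* Firefox/31.0*",
    "Mozilla/5.0 (*Windows NT 6.1*WOW64*) Gecko* Firefox/31.0*",
    "*",
]
print(longest_match(ua, patterns))  # the longer WOW64 (x64) pattern wins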

Knowing what to cache

The last entry in the file is “*”, which will match anything. It returns a set of properties called “Default Browser”, where everything is false. The idea is that you’ll always get some response rather than null. I didn’t want to cache these “Generic” or “Default” browsers, because once they’re in the cache they’ll come up for every new UA string, and the data will be junk.
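
In practice that means the cache write is guarded by a check along these lines. This is a simplified sketch; the exact field names in the real data may differ.

def should_cache(properties):
    # 'properties' is assumed to be a dict of browscap fields for the matched pattern.
    # Skip catch-all results so they never shadow real lookups for new UA strings.
    browser = properties.get("Browser", "")
    return browser != "Default Browser" and not browser.startswith("Generic")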

How it works

The app (TA-browscap_express) caches matched UA strings in a file and searches it first. During a query it also keeps matches in memory, using the memory cache before the disk cache. It also supports blacklisting obviously bad UA strings and storing the cache file on a network share to help with distributed search.

When a UA string is passed into the app, it runs through the following checks, in order (a code sketch follows the list):

  1. Is it blacklisted? If yes, return default browser.
  2. Is it in the memory cache? If yes, return the entry.
  3. Is it in the browscap_lite.csv file? If yes, add to memory cache and return.
  4. Is it in browscap.csv? If yes, add it to browscap_lite.csv and the memory cache, and return.
  5. If totally unidentifiable, return the default browser.
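
Roughly, the lookup flow looks like the sketch below. It is a simplified illustration, not the app’s internals: it assumes plain CSV files with PropertyName and Browser columns, and it glosses over the real browscap.csv header layout.

import csv
import fnmatch

DEFAULT = {"PropertyName": "*", "Browser": "Default Browser"}  # stand-in for the catch-all entry

def search_csv(path, ua):
    # Return the row whose name pattern is the longest match for the UA string, or None.
    # Lower-casing both sides keeps the fnmatch comparison case-insensitive.
    best = None
    try:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                pattern = row.get("PropertyName", "")
                if pattern and fnmatch.fnmatch(ua.lower(), pattern.lower()):
                    if best is None or len(pattern) > len(best["PropertyName"]):
                        best = row
    except FileNotFoundError:
        return None
    return best

def lookup(ua, blacklist, memory_cache, lite_csv="browscap_lite.csv", full_csv="browscap.csv"):
    if ua in blacklist:                          # 1. obviously bad UA string
        return DEFAULT
    if ua in memory_cache:                       # 2. in-memory cache for this query
        return memory_cache[ua]
    props = search_csv(lite_csv, ua)             # 3. the browscap_lite.csv cache file
    if props is None:
        props = search_csv(full_csv, ua)         # 4. the full browscap.csv
        if props and props["Browser"] != "Default Browser":
            with open(lite_csv, "a", newline="") as f:   # header handling elided
                csv.writer(f).writerow(props.values())
    if props is None:                            # 5. nothing matched at all
        props = DEFAULT
    memory_cache[ua] = props
    return props

The in-memory cache only lives for the duration of a query, which is why hits from the full file are promoted into browscap_lite.csv as well.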


Browscap_lite.csv

The browscap_lite.csv file is the cache file which is checked before browscap.csv. It’s in the same format, and has the same fields. Matched UA strings are written to it.

The default location for the file is the app’s bin directory. In a Splunk distributed environment, that’s not really a good idea: you never know which search head or indexer the app will run on, and you’ll end up rebuilding the cache over and over. The browscap_lookup.ini file lets you specify a different location for the file, such as a network share.
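
If the app reads that setting with Python’s standard configparser, the override could look something like the sketch below. The [cache] section and lite_path key here are hypothetical; check the app’s documentation for the real setting names.

import configparser
import os

# Hypothetical example of reading an alternate cache location from browscap_lookup.ini.
config = configparser.ConfigParser()
config.read("browscap_lookup.ini")
lite_csv = config.get(
    "cache", "lite_path",
    fallback=os.path.join(os.path.dirname(os.path.abspath(__file__)), "browscap_lite.csv"),
)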

The blacklist

Some UA strings are junk and won’t ever be added to the browscap file. Others could be added, and I suggest you report any new strings to browscap.org, but that can take a while, or you may just not care. The blacklist.txt file is used to weed out garbage UA strings so that you don’t waste time looking them up in browscap.csv, only to get the “*” Default Browser result.
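
Loading and checking the blacklist can be as simple as the following sketch. Whether the real app treats entries as exact strings or as wildcard patterns, and whether it supports comments, are assumptions here.

def load_blacklist(path="blacklist.txt"):
    # One blacklisted UA string per line; blank lines and '#' comments are skipped.
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip() and not line.startswith("#")}
    except FileNotFoundError:
        return set()

def is_blacklisted(ua, blacklist):
    return ua in blacklist  # exact-match entries assumed; the real app may differ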


The fields

The app returns all fields available in the browscap file. Two I find especially interesting are:

ua_comment: a combination of the browser name and version, so that Internet Explorer 11 becomes IE 11.

ua_platform: a combination of the operating system and version, so that Windows 7 becomes Win7.


There is one additional field I added, ua_fromcache, which returns true, false, or blacklist, depending on where the data came from.


The demo

Sorry in advance for my droning voice, but if you want to see the setup and usage in action, check out my YouTube video.

