Splunk user agent string lookups with TA-browscap_express

I got a requirement to find out what browsers our clients are using. We run a SaaS product, and every client is clientname.ourdomain.com, so I could use the cs_hostname field in the log. Using a 3rd party analytic tool was totally out of the question, all I had to go on were the IIS logs.

We’re already getting the IIS logs into Splunk, so with a bit of Googling I found the TA-browscap app by Dave Shpritz. It’s powered by the browscap project and it works. The problem is that the browscap file is now 18MB and searching it has become very slow. What started as an hack to cache matches in a separate file has turned into a total fork and re-write of most of the app, and has become TA-browscap_express.

There are installation instructions on the application page at Splunk.com, also in the GitHub repo, so I won’t rehash them here.

The Browscap file

The Browser Capabilities Project (browscap) is an effort to identify all known User Agent (UA) strings, which regretfully are a total mess. The project is active, and the data is accurate. They provide the data in a number of formats, the legacy INI file still used by PHP and ASP, and a CSV file, among others. The file is 18MB and 58,000 lines long.

The structure of the file is a name pattern for a UA string, followed by all the known properties. My UA string is:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0

And the matching name pattern is:

Mozilla/5.0 (*Windows NT 6.1*WOW64*) Gecko* Firefox/31.0*

The example above matches FireFox 31 on Windows 7 x64. Now here is an interesting challenge:

Mozilla/5.0 (*Windows NT 6.1*) Gecko* Firefox/31.0*

This name pattern matches FireFox 31 ono Windows 7 x86. It also matches x64. If you take the first match, you’ll get the wrong information. To get an accurate lookup, you need to compare all 58,000 name patterns, and the longest one which matches is the most correct. You can imagine, this is quite a challenge.

Parsing the browscap.csv file

The TA-browscap app uses pybrowscap, which is a Python library for parsing and managing the browscap.csv file. The library returns an object with properties for all the fields in the browscap file. I didn’t want to check 58,000 name patterns every time, so I wanted the successful pattern as well. pybrowscap doesn’t provide it, and it’s actually hard to re-create because they’re using python’s internal CSV buster.

The solution was to lift the core logic from pybrowscap and re-write it myself, busting strings as CSV data instead of files. The first thing you have to do is convert the name pattern into regex, which is easy, then compare your challenge string against it. Like I described above, after that  you loop through every name pattern until you find the longest match.

Knowing what to cache

The last entry in the file is “*” which will match anything. It returns a set of properties called “Default Browser” where everything is false. The idea is you’ll always get some response, you won’t get null. I didn’t want to cache these “Generic” or “Default” browsers, because once they’re in the cache, they’ll come up for every new UA string, and the data will be junk.

How it works

The app (TA-browscap_express) caches matched UA strings in a file and searches it first. During a query it also keeps matches in memory, using the memory cache before the disk cache. It also supports blacklisting obviously bad UA strings and storing the cache file on a network share to help with distributed search.

When a UA string is passed into the app, it checks 4 things:

  1. Is it blacklisted? If yes, return default browser.
  2. Is it in the memory cache? If yes, return the entry.
  3. Is it in the browscap_lite.csv file? If yes, add to memory cache and return.
  4. Is it in browscap.csv? If yes, add to browscap_lite.csv, the memory cache and return.
  5. If totally unidentifiable, return the default browser.

 

Browscap_lite.csv

The browscap_lite.csv file is the cache file which is checked before browscap.csv. It’s in the same format, and has the same fields. Matched UA strings are written to it.

The default location for the file is in the application\bin directory for the app. In a Splunk distributed environment, that’s not really a good idea. You never know what search head or indexer the app will run on, and you’ll end up rebuilding the cache over and over. The browscap_lookup.ini file lets you specify a location for the file.

The blacklist

Some UA strings are junk, and won’t ever be added to the browscap file. Others could be added, and I suggest you report any new strings to browscap.org, but it may take a while, or you just don’t care. The blacklist.txt file is used to weed out garbage UA strings so that you don’t waste time trying to look them up in the browscap.csv file, only to get “*” Default Browser.

 

The fields

The app returns all fields available in the browscap file. Two I find especially interesting are:

ua_comment This is a combination of the browser name and version, so that Internet Explorer 11 becomes IE 11.

ua_platform This is a combination of the operating system and version, so that Windows 7 becomes Win7.

 

There is one additional field which I added, ua_fromcache, which returns true, false, or blacklist, depending on where the data came from.

 

The demo

Sorry in advance for my droning voice, but if you want to see the setup and usage in action, check out my Youtube Video.

 

Advertisements

About robertlabrie
DevOps Engineer at The Network Inc in metro Atlanta. Too many interests to list here, check out my posts, or look me up on LinkedIn

4 Responses to Splunk user agent string lookups with TA-browscap_express

  1. sgundeti says:

    Hi,

    Does this support splunk 5.0.5 version. I am getting below error once I search as specified in usage.

    script for lookup table “browscap_lookup_express” returned error code 1, Results may be incorrect.

    I did installed app as specified and double checked with your video as well.

    Kindly help me.

  2. Hi,

    I am using splunk 5.0.5 and installed TA-Browsecap_express as specified and executed below sample search and got error code.

    I cloned the latest revision from github; not sure what is the issue. Wondering if this app is not supported by splunk 5.0.5.

    Please help

    Search :
    index=butter* sourcetype=access* | eval http_user_agent=urldecode(useragent) | lookup browscap_lookup_express http_user_agent | stats count by ua_comment

    Error:

    Script for lookup table ‘browscap_lookup_express’ returned error code 1. Results may be incorrect.

  3. I also tried with splunk 6.1.4, and got same error.

  4. Mark Hill says:

    Hi, It appears that every search I try is re-scanning the browscap.csv and not using the lite cache. This is a fresh install of the app on two different splunk instances.
    I can see the cache increasing in size as its finding matches. BUT, searches after that are taking so long to complete its crazy. I also haven’t see the ua_fromcache field appear at all.
    Advice? Thanks in advance, awesome app, hope I can fix this issue!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: