Thursday, June 29, 2017

Lessons Learned with Logstash - Part II

In Lessons Learned with Logstash (so far) I went into some of the problems and challenges of Logstash. Since then, I've discovered what I refer to as the 'chasm' between an out-of-the-box Elasticsearch solution and a large-scale, full-blown enterprise-level implementation. TL;DR: if you're going to use ES out of the box, don't expect to use it like that at scale for very long.

The Chasm - A field mapping explosion!

The chasm between simple-ES and Hawking-complicated-ES is what is known in the documentation as a 'field mapping explosion'. In a sweet turn of irony, the very thing that draws most newcomers to ES is exactly what will cut them down at a certain scale of logging, absent a major refactoring of their understanding of the stack.

side rant: I suspect this is why most blog articles related to syslogging with ES really only give a shallow treatment of Logstash... and never get to Elasticsearch itself. ES is complicated. Sadly this group doesn't just include personal bloggers like yours truly, but also commercial vendors (traditional log handling application vendors) vying for a piece of the ES pie; I find their lackluster attempts to market their ES chops especially deplorable. If ES is being used by Netflix and eBay and Wikipedia, then it stands to reason that the average toe-dipping blog article is wholly insufficient for, well, just about anything seriously related to ES (this article included!).

Back to business: dynamic field mapping is the feature of ES that brings neophytes in, because it attempts to automatically create fields based on the data being fed to it, without any configuration or interaction by the administrator. Literally feed it a stream or store of information, and ES will get down to work creating all the fields for you and then populating them with data. Sounds fantastic! What could possibly go wrong?
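
To make that concrete, here's a minimal sketch of dynamic mapping doing its thing, using the ES 5.x-era REST API that was current when I wrote this. The index name, document type, and fields below are made up purely for illustration:

curl -XPUT 'localhost:9200/logstash-demo/syslog/1?pretty' -H 'Content-Type: application/json' -d'
{
  "host": "fw01-edge",
  "message": "session closed",
  "bytes": 1420
}'

# ES invented a mapping on the fly; ask it what it decided
curl -XGET 'localhost:9200/logstash-demo/_mapping?pretty'

In this toy case ES will guess text (with a keyword sub-field) for the strings and long for the number, which looks wonderful right up until a real syslog stream starts handing it field names it never should have created.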

The Two Strikes Against ES

Unfortunately the key word in all of this isn't 'dynamic' or 'automatically', but rather 'attempt', because depending on the data being fed to it, ES can either shine or crash and burn. Both endings are bright, but only one is happy. To help explain, it's useful to know the two strikes against Elastic putting this feature into ES:

  1. Those that rely the most on dynamic mapping are those most likely to suffer from it (newbies).
  2. One of the best use-cases of ES creates an embarrassingly hot mess of an ELK stack when using dynamic field mapping (syslog).

I know this because I fell into both categories: a newbie using ES for syslog analysis.

How does it (not) work?

So what happens? At its core, ES suffers from separation anxiety, but in the opposite manner we're used to. Instead of being clingy, it tries to separate lots of information, and when it's syslog data, much of it shouldn't be separated. Specifically, the hyphen signals to ES that whatever is on either side of it should be separated. In the world of syslog, that usually includes hostnames, dates, and times.

If it seems like a glaring issue, that's because it is. Your ES index will be littered with faulty fields, and those fields will contain data, both of which should have landed somewhere else, somewhere you specifically needed them to be. That's bad when you're using ES to find needles in your haystack, and some needles get permanently lost in those bogus fields.

Your system will start to churn on itself. At one point I thought I needed around 500 to 600 fields. I had 906. Then I had 3342, and that was effectively data corruption on a mass scale (the system takes in ~4 million events per day). It made searches against appropriately stored information undeniably slow.
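
If you want to watch that growth happen on your own cluster, a crude way to eyeball the field count is to pull an index's mapping and count the type definitions. The index name below is just an example, and the grep is only a rough approximation, not an exact count:

# dump the mapping for one day's index and roughly count the leaf fields
curl -s 'localhost:9200/logstash-2017.06.29/_mapping?pretty' | grep -c '"type"'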

Simply crafting detailed Logstash rules with plenty of fields guarantees nothing. If ES decides to treat some data as a field, it can wreck your grok-parse rules and leave you wondering why they don't work when they once did. Add to this the variability of character encodings amongst your data sources (Juniper and MS-Windows are UTF-16, Nutanix and Fortinet are UTF-8, for instance) and you can spend a great deal of time hunting down your parsing issues, all while munging your data into oblivion.
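
One way to tame the encoding half of that problem is to declare a charset on each input via the plain codec, so Logstash normalizes everything to UTF-8 before your filters ever see it. The ports, type names, and charsets below are placeholder guesses; verify what your devices actually send:

input {
  udp {
    port  => 5145
    type  => "juniper"
    codec => plain { charset => "UTF-16LE" }
  }
  udp {
    port  => 5146
    type  => "fortinet"
    codec => plain { charset => "UTF-8" }
  }
}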

The Solution

The solution is to prepare ES ahead of time for exactly what it should expect, using manual field mappings. Via the API, you can turn off dynamic mapping and manually map every field you will be using. The added benefit of this work is that while defining the fields, you also set the type of data ES should expect in each one. When tuned properly with appropriate data types, ES can index and search significantly faster than if it's left with the default field mapping behavior. Speaking of which, Logstash will ship fields in as type string. That alone almost necessitates a static mapping exercise, as there are performance and capability penalties for incorrectly typing your data.
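
Here's a minimal sketch of what that looks like as an index template, again using the ES 5.x API of the day. The template name, index pattern, document type, and field names are all examples, not a definitive mapping:

curl -XPUT 'localhost:9200/_template/syslog?pretty' -H 'Content-Type: application/json' -d'
{
  "template": "logstash-*",
  "mappings": {
    "syslog": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "host":       { "type": "keyword" },
        "message":    { "type": "text" },
        "src_ip":     { "type": "ip" },
        "bytes":      { "type": "long" }
      }
    }
  }
}'

Note that "dynamic": "strict" makes ES reject any document carrying a field you didn't map, while "dynamic": false quietly ignores unmapped fields instead; the latter is gentler while you're still tuning your filters.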

Friends with Benefits

For instance, if IP addresses are properly typed as such, then you can do fancy things like search through CIDR subnets and perform IP space calculations. As another example, properly typing numeric data as integers or floats (of various maximum sizes) allows ES to perform math functions on them, whereas leaving them as the default (string) never will. A static field mapping fed to ES ahead of document ingestion ensures the right field typing is applied to the incoming data, avoiding disaster, wasted CPU cycles, and the loss of extended functionality.
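
For example, once a field like the hypothetical src_ip above is mapped as type ip, a plain term query will happily accept CIDR notation (index and field names are, again, just illustrations):

curl -XGET 'localhost:9200/logstash-*/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "src_ip": "10.20.0.0/16" }
  }
}'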

How-to, Next Time

In the next article I'll go over the gory details of how to properly define the fields you intend to use. At a basic level, you use strict LS filter rules along with the dynamic field mapping capability to see which fields come across from LS to ES. That lets you tweak your filter rules until only the fields you want are traversing the stack. From there you feed them back in as a template (along with changes to the field types) and voila! ES and LS suddenly seem to get along with each other.

SPECIAL BONUS TIP: in part one I talked about breaking up your logging devices by port and type. Along with that, do not forget to conditionally filter each set of device types via their assigned type before parsing that section of rules. I forgot to do this and ended up having my Nutanix logs parsed by my Nutanix grokparse filters... and then by my Fortinet grokparse filters, leading to all sorts of chaos. Start your constituent Logstash configuration files (the ones I described previously as being dedicated to a particular device/vendor/application) with a conditional like this, wrapped in the file's filter block (using Nutanix as the example):

filter {
  if [type] == "nutanix" {

    # {nested nutanix-filter_01}
    # ...
    # {nested nutanix-filter_99}

  }
}



1 comment:

Anonymous said...

In the next article I'll go over the gory details of how to properly define the fields you intend to use. At a basic level, you use strict LS filter rules along with the dynamic field mapping capability to see what fields come across from LS to ES. It allows you to tweak your filter rules until only the fields you want are traversing the stack. From there you feed them back in as a template (along with changes to the field types) and voila! ES and LS suddenly seem to get along with each other.


You mean, like this? https://www.elastic.co/blog/logstash_lesson_elasticsearch_mapping