Centralize Your Logs and They'll Tell You Everything

Nov 12 2014

Logs are everywhere, yet they're rarely used as they should be. Almost every software system can write text log files in a configurable manner. They're plentiful, but we usually read them only when we really need to, like when something is broken and we're trying to figure out why. Even then they are painful, if not impossible, to make sense of. Our puny human brains just aren't built to process thousands of lines of dense text made up mostly of timestamps and numerical codes (not in a standardized format, of course). Afterward you step back and say to yourself, "I really don't want to do that ever again." Clearly this data needs to be organized before it can help you solve your problems.

Recently at work I had to implement a logging system. The main goal was to track user activity as we prepared for the beta launch of our web application. However, after an initial discussion among the team, we realized that there were a few distinct and important reasons to store log data in a central place.

The four main uses of logging in our situation are performance, security, activity and error reporting. If you're running a public-facing web application, I imagine they'll be similar for you.

PERFORMANCE

If possible, you should log the time between a request coming in and a response going out in your APIs. In our case, my colleague put together a custom package that logs, from within the code, how many milliseconds each request takes. We then chart these timings in our logging system to see which request types are slowest and where we should optimize our code.
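My colleague's package isn't shown here. To give a rough idea of the technique, here is a minimal Python sketch that wraps a request handler, measures its duration, and writes one JSON event per request. The decorator name `log_duration`, the field names and the example handler are all hypothetical.

```python
import json
import sys
import time
from functools import wraps


def log_duration(request_type, stream=sys.stdout):
    """Decorator that logs how long the wrapped handler took, in milliseconds."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                # Log even if the handler raises, so slow failures show up too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                event = {"type": "request_timing",     # hypothetical schema
                         "request_type": request_type,
                         "duration_ms": round(elapsed_ms, 2)}
                stream.write(json.dumps(event) + "\n")
        return wrapper
    return decorator


@log_duration("get_user")  # hypothetical handler, for illustration only
def get_user(user_id):
    return {"id": user_id}
```

Because every event carries a `request_type` and a numeric `duration_ms`, a dashboard can average or chart the duration per request type without any extra parsing.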

You can take this a step further by logging database query time, since a database query is probably responsible for most of your latency. Databases like MongoDB and MySQL can write queries that take longer than a configurable threshold to a log file.
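On the database side, this is what MySQL's slow query log (the `slow_query_log` and `long_query_time` settings) and MongoDB's `slowms` threshold do. If you also want the same idea in application code, here is a minimal Python sketch. It times a block and emits an event only when the block exceeded a threshold. The names `log_if_slow` and `SLOW_QUERY_THRESHOLD_MS` and the 100 ms default are hypothetical.

```python
import json
import time
from contextlib import contextmanager

SLOW_QUERY_THRESHOLD_MS = 100  # hypothetical default; tune it for your workload


@contextmanager
def log_if_slow(query_name, sink, threshold_ms=SLOW_QUERY_THRESHOLD_MS):
    """Time the enclosed block and append a JSON event to sink only if it was slow."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms >= threshold_ms:
            sink.append(json.dumps({"type": "slow_query",
                                    "query": query_name,
                                    "duration_ms": round(elapsed_ms, 2)}))
```

Usage would look like `with log_if_slow("find_orders", events): run_query()`, where `events` is any list-like sink and `run_query` stands in for your real database call. Logging only the slow cases keeps the volume down, as the databases' own slow logs do.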

SECURITY

If you've got a public-facing application on the open internet like we do, you're likely to encounter bots, scammers and other nefarious people sniffing around your stuff in ways you don't want. This is especially true if you're using an IaaS provider like AWS that has a well-known IP range. Once we started logging the requests coming in to Nginx, we noticed a lot of suspicious activity. One day I noticed an unusual spike in traffic over a five-minute period. Without getting into too many details (because I don't think you should with this kind of stuff), we took some action to plug these holes, making our system more secure and giving the team one less thing to worry about.
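A spike like that is easy to spot once the requests are structured data. The following minimal Python sketch reads lines in Nginx's default "combined" access log format. It flags client IPs that sent more than a given number of requests within any sliding five-minute window. The threshold of 300 requests and the function name are hypothetical, and this is not the action we actually took.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Matches the start of Nginx's default "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" ...
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "')


def find_bursty_ips(lines, window=timedelta(minutes=5), max_requests=300):
    """Return IPs that sent more than max_requests within any `window` span.

    Assumes each IP's lines appear in chronological order, as in an access log.
    """
    recent = defaultdict(deque)  # ip -> timestamps inside the current window
    flagged = set()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines in some other format
        ip, raw_time = m.groups()
        ts = datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z")
        q = recent[ip]
        q.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - q[0] > window:
            q.popleft()
        if len(q) > max_requests:
            flagged.add(ip)
    return flagged
```

In a centralized setup you would more likely run the equivalent aggregation in your log store. The sliding-window idea stays the same either way.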

ACTIVITY

It's obvious to anyone who has used Google Analytics or another hosted solution that there are many benefits to logging user activity. While seeing your app's usage in aggregate is often good enough, it is also useful to be able to dive in at a more fine-grained level (with the caveat that you respect your users' privacy).

Sometimes there's a gap between how you think people are using your system and how they really are. What better way to observe your users than out in the wild? Perhaps you won't earn an additional $300 million from such a discovery, but it could lead to interesting conclusions about your system that you wouldn't otherwise reach without user testing.

ERROR REPORTING

This one is pretty self-explanatory. One of my favorite reasons to have a centralized logging system, debugging, has already saved me once. A few weeks ago one of our partners was having a problem with our application. I didn't have to ask her a bunch of questions about what she was doing before she got the error, or even SSH into a server to view the app's logs. Instead I was able to diagnose the problem and give her a solution within a few minutes. It felt awesome because it saved me a lot of time and hassle.
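What makes this kind of quick diagnosis possible is logging errors with enough context to reconstruct what happened. Here is a minimal Python sketch that turns a caught exception into a JSON-serializable event. The event includes the full traceback and a session ID, so it can be correlated with the user's earlier activity. The function name and field names are hypothetical.

```python
import json
import traceback


def error_event(exc, session_id, context=None):
    """Build a JSON-serializable error event with enough context to debug later."""
    return {
        "level": "error",
        "session_id": session_id,  # hypothetical field; links to the user's other events
        "error_type": type(exc).__name__,
        "message": str(exc),
        # Full traceback as one string, so it is searchable in the log store.
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "context": context or {},
    }


try:
    {}["missing"]
except KeyError as e:
    print(json.dumps(error_event(e, "s-42", {"page": "checkout"})))
```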

Our implementation

I wanted to write about the benefits of centralized logging before explaining how we did it because in the end, like most technology decisions, you should use whatever works best for you. After performing some research (see the links below) and getting a sense of what open source solutions were out there, we decided on the Elasticsearch/Logstash/Kibana (ELK) stack. The main reason was that we were already familiar with a few of the technologies.

Elasticsearch is a schema-free database that stores documents as JSON, and predictably, it is where your log data ends up. The really awesome thing about Elasticsearch, though, is its comprehensive built-in REST API for performing CRUD operations on your data, which comes in handy here.
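As a sketch of what that REST API looks like from code, the Python below builds, but does not send, a search request for recent error-level events. It assumes a node at `http://localhost:9200` and Logstash's default daily indices (`logstash-YYYY.MM.DD`, matched here by `logstash-*`) with their `@timestamp` field. The `level` field is hypothetical and depends on your own log schema. The `bool`/`filter` syntax is from Elasticsearch 2.x and later; 1.x used a `filtered` query instead.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node


def recent_errors_request(minutes=15, index="logstash-*"):
    """Build (but don't send) a search request for recent error-level events."""
    body = {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": "error"}},  # hypothetical field name
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest first
        "size": 50,
    }
    return urllib.request.Request(
        f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Against a live cluster:
# with urllib.request.urlopen(recent_errors_request()) as resp:
#     hits = json.load(resp)["hits"]["hits"]
```

Kibana issues the same kind of query behind the scenes, so it can help to know the raw form when a dashboard doesn't show what you expect.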

Logstash takes the log data your applications write to text files on your servers, parses it into a useful format and stores it in a database (like Elasticsearch). A separate project, Logstash Forwarder, is meant to be installed on the servers that produce the logs, and it uses the Lumberjack protocol to push the data to your Logstash instance. It's written in Go, which means it can be built and run easily as a single executable. No Java required, just give it a config file!

Kibana is a client-side web app written in AngularJS that makes the REST calls to Elasticsearch to visualize your log data in pretty charts. You can create multiple dashboards which store their configuration in the very same Elasticsearch database.

It's a beautiful thing: really easy to set up, and it hasn't let us down yet. It takes a bit to get used to the Kibana interface for creating the custom dashboards you need, but it's incredibly powerful once you get the hang of it. And it has allowed us to keep on top of our logs in the ways I detailed above.

Overall it has been going well and has definitely been worth the time spent to set up and maintain. We still have a way to go before we finish this project, though (getting MongoDB and MySQL query logs imported), and I'm sure we'll make a few more discoveries in the logs along the way.

TIPS FOR LOGGING

In the process of implementing our system I learned a few things that I wish I had known when I started. They may not apply to you depending on what you're running, but here goes:

  • Make it easy to log if you're going to do so in your application code. Ideally it should be a one-liner coming from an isolated module/package so that you have no excuse to not use it everywhere. If possible, make your logging package(s) composable so that they can be reused.
  • Decide on a common schema for your activity logs. Think of fields like "category", "page", "action", etc., so that you can categorize and play with the data in interesting ways later.
  • Include a log level in the events you write to your log files. Most logging packages, no matter the programming language, offer this, and the levels are usually something like "debug", "info", "warn", "error" and "critical", from least to most severe. Having these levels will help you filter and prioritize your events when you visualize them later.
  • If applicable to your system, include a unique ID for each user session. Combined with timestamps, that lets you follow a user through their whole session.
  • Think about security and don't log anything that you wouldn't want a hacker getting hold of. Imagine someone got onto your server: would you want them to be able to read, in a convenient text file, something that would be encrypted in your database? Also, make sure your logging system isn't available to everyone. Put it behind a VPN if you can and password-protect the website. If you're using Kibana, serving it through Nginx with password protection will do the trick.
  • Include metadata about your application like the version number and the environment (for example: production, staging, test) with each event that you log. Trust me, this will come in handy later.
  • If you're going to store your logs as JSON, write them to your files as JSON in the first place. While Logstash's parsing capabilities are pretty nice, parsing is still a pain that should be avoided if possible.
  • Log everything. Hard disk space is plentiful but time to change your logging facilities later is not.
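Several of these tips fit into one small helper. Here is a minimal Python sketch of a one-line logging call. It writes one JSON object per line with a fixed schema (category, page, action), a validated log level, an optional session ID and application metadata. The function name, the metadata values and the `stream` parameter are hypothetical; in practice the version and environment would come from your config.

```python
import json
import sys
from datetime import datetime, timezone

# Hypothetical application metadata; read these from config/env in practice.
APP_METADATA = {"app_version": "1.0.0", "environment": "staging"}
LEVELS = ("debug", "info", "warn", "error", "critical")  # least to most severe


def log_event(level, category, page, action, session_id=None,
              stream=sys.stdout, **extra):
    """Write one JSON event per line, using a common schema plus app metadata."""
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "category": category,
        "page": page,
        "action": action,
        "session_id": session_id,
        **APP_METADATA,
        **extra,  # any additional fields; never pass secrets here
    }
    stream.write(json.dumps(event) + "\n")
    return event
```

A call then stays a one-liner, for example `log_event("info", "account", "settings", "change_password", session_id=sid)`. Note that it records the action, never the password itself. Because each line is already JSON, Logstash can decode it with its `json` codec instead of hand-written grok patterns.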

ADDITIONAL RESOURCES

Dave Walk is a software developer, basketball nerd and wannabe runner living in Philadelphia. He enjoys constantly learning and creating solutions with Go, JavaScript and Python. This is his website.