Infrastructure monitoring: An introduction
Understanding the current state of your infrastructure at a glance is an essential yet often underappreciated part of running it. Regardless of the architecture, from Dockerized microservices to monoliths and physical servers, knowing what’s going on is key to avoiding unexpected downtime.
This introduction is intended for anyone responsible for infrastructure who has suddenly realized that their legacy monitoring system sucks. It explores the need for excellent monitoring, also known as observability, and introduces some of the tools that can provide it.
Strategy
Let’s say you have inherited a large and complex on-premises set of applications and servers. Most likely there is already some sort of monitoring taking place. If it’s older than a few years, don’t be surprised to find unmaintained Nagios / OP5 and Cacti (for graphs) servers, plus some ugly shell scripts running in cron here and there that send hundreds of emails every day, the majority of which go ignored. This is a bad situation to be in, but the good news is that we can implement a new and complete monitoring solution alongside it without having to interfere with the old stuff until it’s no longer required.
The tools
There is a wide array of options available for monitoring servers and services. Here we will introduce three of the most versatile and cost-effective: Prometheus, Grafana and Loki. Together these free software applications form the foundation of a high-quality and reliable monitoring stack.
Prometheus
One of the main pillars of your observability stack is always going to be metric collection and for this we have the excellent Prometheus software.
The monitoring time series database du jour is without a doubt Prometheus, and for good reason. It is an incredibly versatile and efficient system for storing and analyzing massive quantities of metric data, such as CPU usage, invoice totals or the number of requests to your web servers. Using PromQL (the Prometheus query language) you can quickly generate ad-hoc graphs and isolate ongoing problems. If you haven’t seen this kind of system before, it’s mind-blowingly cool to see it up and running.
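To give a feel for PromQL, here is a minimal sketch that runs an instant query through the Prometheus HTTP API (`/api/v1/query`) using only the Python standard library. The server address is an assumption, and the example expression relies on the `node_cpu_seconds_total` metric exposed by node_exporter; adjust both to your own setup.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of your Prometheus server; not taken from the article.
PROMETHEUS_URL = "http://prometheus.example.org:9090"

# Fraction of CPU time spent busy per instance over the last 5 minutes.
CPU_BUSY = '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(body: dict) -> dict:
    """Map each returned series' `instance` label to its value as a float."""
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in body["data"]["result"]
    }


def query(promql: str) -> dict:
    """Run an instant query against PROMETHEUS_URL and return instance -> value."""
    url = build_query_url(PROMETHEUS_URL, promql)
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_vector(json.load(response))
```

Calling `query(CPU_BUSY)` against a live server should return one entry per scraped instance. The same expression can be pasted straight into the Prometheus web UI or a Grafana panel.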
Prometheus is the name of both the project and its main application, but the system as a whole is made up of several individual tools. Exporters, which make metrics from other software available to Prometheus, exist for just about any application you can think of and can be fairly simple to implement yourself where existing support is lacking.
Exporters
Making OS metrics available to Prometheus, for example, is handled by node_exporter. This small application runs on Linux servers and exposes system metrics via an HTTP endpoint, which the Prometheus server then scrapes at a configurable interval.
Getting metrics from an in-house application is something that needs to be implemented during development. Fortunately, official and community-provided client libraries exist for the most popular languages, such as Java, Go and Rust.
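To show how little an exporter needs to do, here is a minimal sketch of one written with only the Python standard library. It serves a single gauge in the Prometheus text exposition format on `/metrics`. The metric name `demo_load1` and port 9200 are made up for illustration, and a real exporter would normally use an official client library instead.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values: dict) -> str:
    """Render {name: (help_text, value)} as gauges in the text exposition format."""
    lines = []
    for name, (help_text, value) in values.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect() -> dict:
    """Gather current values; here only the 1-minute load average (Unix only)."""
    load1, _, _ = os.getloadavg()
    return {"demo_load1": ("1-minute load average.", load1)}  # hypothetical metric name


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9200) -> None:
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` and adding the host and port as a scrape target in the Prometheus configuration is enough for the value to show up as a time series.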
Alertmanager
Access to consistent metrics and logs is already a solid step in the right direction, but being alerted when a condition is met (or not met!) is just as important.
What we need is a tool that can handle alerts sent by client applications, such as the Prometheus server or the Loki log system, and take care of deduplicating, grouping and routing them. Enter the appropriately named Alertmanager, designed specifically to meet the requirements of a modern monitoring platform. While it is still under heavy development and has some quirks, it is a reliable system once properly configured.
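The three jobs named above can be illustrated with a toy model. This is not how Alertmanager is implemented, only a sketch of the ideas: alerts with identical label sets are treated as one, each alert is sent to the first route whose matchers fit its labels, and alerts bound for the same receiver are batched by a chosen set of labels. All function names, receivers and labels here are hypothetical.

```python
from collections import defaultdict


def fingerprint(alert: dict) -> tuple:
    """Identity of an alert: its complete label set, order-independent."""
    return tuple(sorted(alert["labels"].items()))


def deduplicate(alerts: list) -> list:
    """Keep the first alert seen for each distinct label set."""
    unique = {}
    for alert in alerts:
        unique.setdefault(fingerprint(alert), alert)
    return list(unique.values())


def pick_receiver(labels: dict, routes: list, default_receiver: str) -> str:
    """Return the receiver of the first route whose matchers all equal the labels."""
    for matchers, receiver in routes:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return default_receiver


def dispatch(alerts: list, routes: list, default_receiver: str, group_by: list) -> dict:
    """Return {(receiver, group_key): [alerts]}, one entry per notification."""
    batches = defaultdict(list)
    for alert in deduplicate(alerts):
        labels = alert["labels"]
        receiver = pick_receiver(labels, routes, default_receiver)
        group_key = tuple(labels.get(name, "") for name in group_by)
        batches[(receiver, group_key)].append(alert)
    return dict(batches)
```

With a route that sends `severity="page"` to an on-call receiver and grouping by `alertname`, two web servers firing the same alert produce a single notification instead of two.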
Pushgateway
There are times when the application or function you want to monitor is not well suited to producing regular metrics, such as batch jobs that run daily or only intermittently. For example, let’s say that every month you calculate the total amount of transactions for each customer and notify the relevant people. This might be done by a bash script that emails administrators a summary of transactions per customer. Wouldn’t it be nice to have this as a metric that can be easily graphed and monitored for unusual patterns? Thanks to Pushgateway, you can achieve this by modifying the script to include a simple curl call to your Pushgateway server, which Prometheus then scrapes like any other target.
# $cust and $t are set earlier in the script; the metric name is illustrative. Prepend "# HELP" and "# TYPE" lines to give the metric a help text and type.
echo "customer_transactions_amount $t" | curl --data-binary @- "http://pushgateway.example.org:9091/metrics/job/transactions_summary/customer/$cust"
Now you can easily create graphs and alerts based on the output of this script, and it only took a one-line addition!
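If the batch job is written in Python rather than bash, the same kind of push can be sketched with the standard library alone. The Pushgateway address is the one from the example above, while the metric name, help text and the `customer` grouping label are assumptions chosen for illustration. Putting the customer in the URL's grouping key keeps one customer's push from overwriting another's.

```python
import urllib.parse
import urllib.request

PUSHGATEWAY_URL = "http://pushgateway.example.org:9091"


def build_payload(name: str, help_text: str, value: float) -> str:
    """Render one gauge, with help text and type, in the text exposition format."""
    return f"# HELP {name} {help_text}\n# TYPE {name} gauge\n{name} {value}\n"


def build_url(base_url: str, job: str, grouping: dict) -> str:
    """Build /metrics/job/<job>/<label>/<value>... for the grouping key.

    Values are URL-encoded; values containing "/" need Pushgateway's
    base64 form instead, which this sketch does not handle.
    """
    path = f"/metrics/job/{urllib.parse.quote(job, safe='')}"
    for label, value in grouping.items():
        path += f"/{label}/{urllib.parse.quote(value, safe='')}"
    return base_url + path


def push(job: str, grouping: dict, name: str, help_text: str, value: float) -> int:
    """PUT the metric, replacing everything previously pushed for this grouping key."""
    request = urllib.request.Request(
        build_url(PUSHGATEWAY_URL, job, grouping),
        data=build_payload(name, help_text, value).encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A call such as `push("transactions_summary", {"customer": "acme"}, "customer_transactions_amount", "Monthly transaction total.", 1250.5)` would then run once per customer at the end of the monthly job.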
Grafana
Having metrics and logs is all well and good, but the real icing on the cake is seeing your data presented as beautiful graphs, charts and status panels that can be created with ease thanks to Grafana.
With support for a wide range of data sources and plugins for popular services, coupled with ease of use, Grafana deserves its place as one of the most popular solutions currently available.
Loki
Log management has seen a lot of activity over the past few years with lots of options available to those engineers whose job it is to ensure that their servers and applications are both running smoothly and are easy to troubleshoot. The relatively recent addition of Loki, from Grafana Labs, adds a powerful new tool to the list.
Loki does things a little differently from other well-known applications such as Elasticsearch and Kibana. For example, because Loki indexes only labels rather than entire log lines, its storage and compute requirements are tiny in comparison. Loki is under heavy development but is already quite mature and easy to get up and running. It also integrates extremely well with Prometheus and Grafana.
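The idea of indexing labels only can be shown with a toy in-memory store. This is purely conceptual and is not Loki's storage engine; the class and method names are invented. The index holds one entry per label pair, however many lines arrive, and a query first narrows down to matching streams by label and only then scans their raw lines.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store that indexes label pairs, never the log text itself."""

    def __init__(self):
        self.streams = defaultdict(list)  # label set -> raw, unindexed lines
        self.index = defaultdict(set)     # (label, value) -> label sets using it

    def push(self, labels: dict, line: str) -> None:
        """Append a line to the stream identified by its label set."""
        key = tuple(sorted(labels.items()))
        for pair in key:
            self.index[pair].add(key)
        self.streams[key].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Select streams via the label index, then filter lines by substring."""
        candidates = None
        for pair in selector.items():
            matches = self.index.get(pair, set())
            candidates = matches if candidates is None else candidates & matches
        if candidates is None:  # empty selector: consider every stream
            candidates = set(self.streams)
        return [
            line
            for key in sorted(candidates)
            for line in self.streams[key]
            if contains in line
        ]
```

The trade-off is that the text filter is a brute-force scan, which is why choosing a small set of meaningful labels matters more here than in a full-text index.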
Promtail
Similar to Prometheus with node_exporter, Loki relies on other applications to get logs from your servers. Promtail is, like node_exporter, an agent: it tails the logs you specify in its configuration and pushes them to Loki.
Promtail can also parse those logs and even extract metric data, which can then be scraped by Prometheus. For example, perhaps you want to measure the rate of failed logins during a brute-force attack:
- regex:
    expression: "^.*(?P<failed_logins>login failed).*$"
- metrics:
    failed_logins_total:
      type: Counter
      description: "total number of failed logins"
      source: failed_logins
      config:
        action: inc
With its built-in support for pipelines, Promtail can change the format, extract labels and create metrics from the structured or unstructured lines contained in your log files.
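To make the two stages in the failed-login example concrete, here is a rough Python equivalent of what they do to each log line. It is an illustration of the logic only, not Promtail code: the regex stage puts any named capture groups into an "extracted" map, and the counter is incremented whenever its source key is present in that map. The function names are hypothetical; the regular expression is the one from the configuration above.

```python
import re

# Same expression as in the pipeline configuration above.
FAILED_LOGIN = re.compile(r"^.*(?P<failed_logins>login failed).*$")


def regex_stage(line: str) -> dict:
    """Return the named capture groups for a line, or an empty map if no match."""
    match = FAILED_LOGIN.match(line)
    return match.groupdict() if match else {}


def count_failed_logins(lines) -> int:
    """Mimic a Counter with `action: inc`: add 1 per line that has the source key."""
    total = 0
    for line in lines:
        extracted = regex_stage(line)
        if "failed_logins" in extracted:
            total += 1
    return total
```

Lines that do not match simply pass through untouched, so the counter only reflects failed logins.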
Summary
Monitoring is a huge subject and we have barely scratched the surface in this introduction to some of the available tools. Putting these components together into a platform for observability that fits your infrastructure is where the real fun begins.
Useful links for further reading
- Grafana installation guide
- Prometheus documentation
- Loki introduction
- Promtail documentation