Loki is a log aggregation system commonly used in Kubernetes clusters. Grafana has billed it as the “Prometheus for logging,” since it lets us view and store logs from all of the applications running in an infrastructure in one place. Just like Prometheus, Loki can be configured to send alerts based on logs or log metrics defined in rule definitions. However, most instructions out there are geared towards configuring this alerting with a Docker setup. When I did this for the first time, I was using Kubernetes and noticed that the instructions for doing it with the Loki Helm chart were a bit lacking.
Below are the steps that this post will follow to achieve the goal of sending log-based alerts from Loki to Alertmanager:
- Defining custom Loki alert rules in a Kubernetes ConfigMap
- Mounting the ConfigMap to the loki-read pods
- Configuring the Loki Ruler with the correct Alertmanager location and storage settings
- Verifying that rules are being applied correctly and triggering
- Troubleshooting
Prerequisites
Before starting, you must have Loki and Promtail up and running, along with working instances of Grafana and Alertmanager. In my example, I have put them all into the same monitoring namespace, but they do not necessarily have to be. I will try to highlight the areas where the namespace matters if things are not in the same namespace.
- I use the kube-prometheus-stack Helm chart to deploy Prometheus, Grafana, and Alertmanager. You can do this differently if you choose, since this chart comes with other components that may not be necessary, but it is the easiest way to get everything up and running quickly. I am using version 35.2.0-1.
- Loki deployed via the Loki Helm chart. I am using version 3.8.0.
- Promtail must be set up and working. I used the Promtail Helm chart and I am on version 6.8.0. The repository commands for these charts are shown below.
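If you have not already added them, these Helm repositories provide the charts above (the repository aliases here are just the common defaults; adjust them to your own setup):
# add the Helm repositories for kube-prometheus-stack, Loki, and Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update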
Other General Knowledge
Familiarity with LogQL and with how Loki rules and the Ruler work is assumed.
Defining Alerting Rules
The first step in setting up alerting is defining what we want Loki to alert on. It is probably helpful to first spend some time in Grafana looking through the logs that Loki is collecting and to come up with a log or metric query that will trigger an alert.
For this post, my example query is the one below. Feel free to use it to test the integration, since you will also have the loki app running in your environment:
sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
What this query does:
- gets all logs for the loki app
- uses logfmt to parse the log messages and apply key-value pairs to the log data
- filters for entries where level is info, which should generate a lot of messages
- gets the per-minute rate of these messages by container where the rate is greater than 0
Below is one way to quickly test this query before putting it into a rule.
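One option is logcli pointed at Loki through a port-forward. This is only a sketch: the Service name loki-read and port 3100 are assumptions, so check kubectl get svc for the right name in your cluster. You should see non-zero per-container rates in the output.
# in one terminal, forward a local port to Loki
kubectl port-forward --namespace monitoring svc/loki-read 3100:3100
# in another terminal, run the metric query
logcli query --addr=http://localhost:3100 'sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container)'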
This will need to be put into a Kubernetes ConfigMap, and that ConfigMap will be applied to the Loki read pods.
# /templates/configmap-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-alerting-rules
  labels:
    release: monitoring-metrics
data:
  rules.yaml: |
    groups:
      - name: test_alert
        rules:
          - alert: TestAlert
            expr: sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: Loki info warning per minute rate > 0
              message: 'Loki warning per minute rate > 0 container:"{{`{{`}} $labels.container {{`}}`}}"'
The configuration for this ConfigMap is pretty standard, but it is important for the Loki Ruler that the rules live in a file with a .yaml extension. So in the data section of the ConfigMap, make sure the first key ends in .yaml, since this will be the name of the file that is created on the pods. The value of this key is a string containing the contents of the rule file.
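One thing worth noting: the {{`{{`}} and {{`}}`}} sequences in the message annotation escape the curly braces so that Helm does not try to render {{ $labels.container }} itself; the file that lands on the pods will contain a plain {{ $labels.container }}. If you keep this ConfigMap in your own chart, you can preview the rendered output before deploying. This is just a sketch and assumes the template lives at templates/configmap-rules.yaml in the current chart directory:
# render only this template to check the rule file contents
helm template . --show-only templates/configmap-rules.yaml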
Apply rules to the Pods
The rules defined in the ConfigMap above should only be applied to the loki-read pods created by the Loki Helm chart. This configuration is applied in the values.yaml file and can be done with the following:
# values.yaml
loki:
  read:
    extraVolumeMounts:
      - name: rules
        mountPath: "/var/loki/rulestorage/fake"
    extraVolumes:
      - name: rules
        configMap:
          name: loki-alerting-rules
This mounts the data from the ConfigMap we created into the pods at /var/loki/rulestorage/fake in a file named rules.yaml. The reason fake is in the path is that this is the tenant ID used when running in single-tenancy mode. The Loki docs do not explain this at all, but it will not work without it in the path.
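After rolling the change out, one way to confirm the volume actually made it into the pod spec is a jsonpath query against one of the read pods (the pod name loki-read-0 assumes the default naming from the chart):
kubectl get pod --namespace monitoring loki-read-0 -o jsonpath='{.spec.volumes[?(@.name=="rules")]}'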
Setup the Loki Ruler
The Loki Ruler is the component that evaluates the rules on a set time interval and triggers alerts. The Ruler setup can all be done from the values.yaml file of the Loki Helm chart.
# values.yaml
loki:
  loki:
    auth_enabled: false
    rulerConfig:
      storage:
        type: local
        local:
          directory: /var/loki/rulestorage
      rule_path: "/var/loki/rules-temp"
      ring:
        kvstore:
          store: inmemory
      alertmanager_url: http://alertmanager-operated.monitoring:9093
      enable_alertmanager_v2: true
In the above config, the rulerConfig section contains values that can be found in the Ruler section of Loki's configuration docs. Here is a little of what is going on with each setting:
- storage is set to local, even if you are using cloud storage for chunks. The reason for this is that we do not want the Ruler to read rules stored in our cloud storage bucket; we want it to read the rules stored locally in files on the pods.
- rule_path is where the Ruler writes the temporary rule file it generates when it evaluates the rules file that was mounted to /var/loki/rulestorage. This location is very important because the loki user on the pod must have write access to that directory. See the troubleshooting section for more on this.
- alertmanager_url should be set to the DNS name of the Alertmanager Service running in the Kubernetes cluster. It should follow typical Kubernetes Service DNS rules based on the Service name and the namespace that Alertmanager is in, with port 9093 appended to the end.
- enable_alertmanager_v2 must be set to true to use the Alertmanager v2 API.
Below is a quick way to confirm the Ruler loaded the rules once these settings are deployed.
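This sketch assumes a port-forward to one of the read pods on port 3100 and that auth_enabled is false, as above; the /loki/api/v1/rules endpoint lists the rule groups the Ruler has loaded:
# in one terminal
kubectl port-forward --namespace monitoring loki-read-0 3100:3100
# in another terminal, list the loaded rule groups
curl -s http://localhost:3100/loki/api/v1/rules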
Verify Alerts Appear in Alertmanager
Apply the changes to the Loki Helm chart however you typically apply chart changes in your environment, then verify that all of the Loki pods have started without any errors.
Once they have started, go to the Alertmanager dashboard (this may require a port-forward; if so, set one up in Kubernetes so you can access the Alertmanager UI). There should be a new alert firing in the “Not Grouped” section. Expand it and verify that it is the alert set up in the ConfigMap.
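For reference, a typical rollout and port-forward might look something like the following. The release name and chart reference are assumptions (if your Loki values are nested under a parent chart, point helm at that chart instead), and alertmanager-operated is the Service referenced earlier in the Ruler config:
helm upgrade --namespace monitoring loki grafana/loki -f values.yaml
kubectl port-forward --namespace monitoring svc/alertmanager-operated 9093:9093
With the port-forward running, the Alertmanager UI is available at http://localhost:9093.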
Troubleshooting
Below are the issues I encountered when setting this up and the steps I took to debug and resolve them.
Verify loki-alerting-rules and loki ConfigMaps Are Present and Correct
First, let's verify that our loki-alerting-rules ConfigMap has been applied:
kubectl get configmap --namespace monitoring
In the resulting list, we should see this entry (it may not be at the top):
NAME                  DATA   AGE
loki-alerting-rules   1      47h
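To double-check that the rule contents made it in under a single rules.yaml key, you can also print just that key (the backslash escapes the dot in the key name for jsonpath):
kubectl get configmap --namespace monitoring loki-alerting-rules -o jsonpath="{.data['rules\.yaml']}"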
Most of the configuration for Loki lives in a ConfigMap named loki, and we will use this command to view our loki configuration and verify that our rulerConfig was applied:
kubectl describe configmap --namespace monitoring loki
The output should contain the following:
ruler:
  alertmanager_url: http://alertmanager-operated.monitoring:9093
  enable_alertmanager_v2: true
  ring:
    kvstore:
      store: inmemory
  rule_path: /var/loki/rules-temp
  storage:
    local:
      directory: /var/loki/rulestorage
    type: local
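If you want just the ruler block rather than the whole config, you can also pull the rendered config key out with jsonpath and grep it. This assumes the ConfigMap stores the config under a config.yaml key, which may differ between chart versions:
kubectl get configmap --namespace monitoring loki -o jsonpath="{.data['config\.yaml']}" | grep -A 10 'ruler:'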
Verify rules.yaml was mounted on pods
We will need to exec into the pods to verify that our rules file exists in the location where we mounted our ConfigMap, and that the Ruler was able to evaluate the rules and output the evaluated rules to the rules-temp directory (which will look very similar to the rules file).
Access one of the loki-read pods with the following command:
kubectl exec -it --namespace monitoring loki-read-0 -- sh
Once inside the pod, navigate to the /var/loki directory and verify that the rules.yaml file exists in the rulestorage/fake directory. The file should contain the string contents of the rules.yaml key in the loki-alerting-rules ConfigMap.
/var/loki $ cat rulestorage/fake/rules.yaml
groups:
  - name: test_alert
    rules:
      - alert: TestAlert
        expr: sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Loki info warning per minute rate > 0
          message: 'Loki warning per minute rate > 0 container:"{{ $labels.container }}"'
Verify Ruler evaluated rules.yaml properly
Verify that the Ruler was able to evaluate the rules and place them into the rules-temp/fake directory.
/var/loki $ cat rules-temp/fake/rules.yaml
groups:
  - name: test_alert
    rules:
      - alert: TestAlert
        expr: sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          message: 'Loki warning per minute rate > 0 container:"{{ $labels.container }}"'
          summary: Loki info warning per minute rate > 0
If the above cat command returns an error, go to Grafana and search Loki using the following LogQL expression. Note: you may need to increase the time window on the query.
{app="loki"} | logfmt | level="error" |= "rule"
Look for a log that includes a similar message to this:
level=error ts=2023-01-03T21:01:40.137688115Z caller=manager.go:127 msg="unable to map rule files" user=fake err="open /tmp/loki/rules-temp/fake/rules.yaml: read-only file system"
Notice that the file path here is different from what was configured above. This indicates that the rule_path in values.yaml is misconfigured: the Ruler is trying to write the evaluated rules to rules.yaml as the loki user on the pod, but the filesystem is read-only by default everywhere except /var/loki because of how it is mounted in the Helm chart. So double-check your rule_path and verify that it is set to /var/loki/rules-temp.
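A quick way to confirm that the loki user can actually write to the configured rule_path is to try creating a file there from outside the pod. The test file name is arbitrary, and this assumes the rules-temp directory already exists:
kubectl exec --namespace monitoring loki-read-0 -- sh -c 'touch /var/loki/rules-temp/.writetest && echo writable && rm /var/loki/rules-temp/.writetest'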
Further Troubleshooting
Depending on other settings in your environment, you could run into issues that I did not. My advice would be to use Loki to dig through the logs of the read pods; there will typically be some indication of what is preventing rules from firing or from appearing in Alertmanager.
You will also want to verify that there are no typos in your rule query expression. In Loki, use this query to check whether there are any issues with your query:
{app="loki"} | logfmt | level="warn" |~ "pipeline error: 'LogfmtParserErr'"
Log messages should appear indicating the issue with the query syntax.
Thank You!
I hope this post was useful and informative! If you would like, feel free to check out my LinkedIn or my GitHub.