Loki is a log aggregation system commonly used in Kubernetes clusters. Grafana has billed it as the “Prometheus for logging,” since it lets us view and store logs from all of the applications running in an infrastructure in one place. Just like Prometheus, Loki can be configured to send alerts based on logs or log metrics defined in rule definitions. However, most instructions out there are geared towards configuring this alerting with a Docker setup. When I did this for the first time, I was using Kubernetes and noticed that the instructions for doing it with the Loki Helm chart were a bit lacking.
Below are the steps that this post will follow to achieve the goal of sending log-based alerts from Loki to Alertmanager:
- Defining custom Loki alert rules in a Kubernetes ConfigMap
- Mounting the ConfigMap to the loki-read pods
- Configuring the Loki Ruler with the correct Alertmanager location and storage settings
- Verifying that rules are being applied correctly and triggering
- Troubleshooting
Prerequisites
Before starting, you must have Loki and Promtail up and running, along with working instances of Grafana and Alertmanager. In my example, I have put them all into the same monitoring namespace, but they do not necessarily have to be. I will try to highlight the areas where the namespace matters if things are not in the same namespace.
- I use the kube-prometheus-stack Helm chart to deploy Prometheus, Grafana, and Alertmanager. You can do this differently if you choose, since this chart comes with other components that may not be necessary, but it is the easiest way to get everything up and running quickly. I am using version 35.2.0-1.
- Loki deployed via the Loki Helm chart. I am using version 3.8.0.
- Promtail must be set up and working. I used the Promtail Helm chart and I am on version 6.8.0. The repository commands for these charts are shown below.
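If you have not already added them, these Helm repositories provide the charts above (the repository aliases here are just the common defaults; adjust them to your own setup):
# add the Helm repositories for kube-prometheus-stack, Loki, and Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update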
Other General Knowledge
Familiarity with LogQL and with how Loki rules and the Ruler work is assumed.
Defining Alerting Rules
The first step in setting up alerting is defining what we want Loki to alert on. It is probably helpful to first spend some time in Grafana looking through the logs that Loki is collecting and to come up with a log or metric query that will trigger an alert.
For this post, my example query is the one below. Feel free to use it to test the integration, since you will also have the loki app running in your environment:
sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
What this query does:
- gets all logs for the loki app
- uses logfmt to parse the log messages and apply key-value pairs to the log data
- filters for entries where level is info, which should generate a lot of messages
- gets the per-minute rate of these messages by container where the rate is greater than 0
Below is one way to quickly test this query before putting it into a rule.
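One option is logcli pointed at Loki through a port-forward. This is only a sketch: the Service name loki-read and port 3100 are assumptions, so check kubectl get svc for the right name in your cluster. You should see non-zero per-container rates in the output.
# in one terminal, forward a local port to Loki
kubectl port-forward --namespace monitoring svc/loki-read 3100:3100
# in another terminal, run the metric query
logcli query --addr=http://localhost:3100 'sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container)'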
This will need to be put into a Kubernetes ConfigMap, and that ConfigMap will be applied to the Loki read pods.
# /templates/configmap-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-alerting-rules
  labels:
    release: monitoring-metrics
data:
  rules.yaml: |
    groups:
      - name: test_alert
        rules:
          - alert: TestAlert
            expr: sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: Loki info warning per minute rate > 0
              message: 'Loki warning per minute rate > 0 container:"{{`{{`}} $labels.container {{`}}`}}"'
The configuration for this ConfigMap is pretty standard, but it is important for the Loki Ruler that the rules live in a file with a .yaml extension. So in the data section of the ConfigMap, make sure the first key ends in .yaml, since this will be the name of the file that is created on the pods. The value of this key is a string containing the contents of the rule file.
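One thing worth noting: the {{`{{`}} and {{`}}`}} sequences in the message annotation escape the curly braces so that Helm does not try to render {{ $labels.container }} itself; the file that lands on the pods will contain a plain {{ $labels.container }}. If you keep this ConfigMap in your own chart, you can preview the rendered output before deploying. This is just a sketch and assumes the template lives at templates/configmap-rules.yaml in the current chart directory:
# render only this template to check the rule file contents
helm template . --show-only templates/configmap-rules.yaml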
Apply rules to the Pods
The rules defined in the ConfigMap above should only be applied to the loki-read pods created by the Loki Helm chart. This configuration is applied in the values.yaml file and can be done with the following:
# values.yaml
loki:
  read:
    extraVolumeMounts:
      - name: rules
        mountPath: "/var/loki/rulestorage/fake"
    extraVolumes:
      - name: rules
        configMap:
          name: loki-alerting-rules
This mounts the data from the ConfigMap we created into the pods at /var/loki/rulestorage/fake in a file named rules.yaml. The reason fake is in the path is that this is the tenant ID used when running in single-tenancy mode. The Loki docs do not explain this at all, but it will not work without it in the path.
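After rolling the change out, one way to confirm the volume actually made it into the pod spec is a jsonpath query against one of the read pods (the pod name loki-read-0 assumes the default naming from the chart):
kubectl get pod --namespace monitoring loki-read-0 -o jsonpath='{.spec.volumes[?(@.name=="rules")]}'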
Setup the Loki Ruler
The Loki Ruler is the component that evaluates the rules on a set time interval and triggers alerts. The Ruler setup can all be done from the values.yaml file of the Loki Helm chart.
# values.yaml
loki:
  loki:
    auth_enabled: false
    rulerConfig:
      storage:
        type: local
        local:
          directory: /var/loki/rulestorage
      rule_path: "/var/loki/rules-temp"
      ring:
        kvstore:
          store: inmemory
      alertmanager_url: http://alertmanager-operated.monitoring:9093
      enable_alertmanager_v2: true
In the above config, the rulerConfig section contains values that can be found in the Ruler section of Loki's configuration docs. Here is a little of what is going on with each setting:
- storage is set to local, even if you are using cloud storage for chunks. The reason for this is that we do not want the Ruler to read rules stored in our cloud storage bucket; we want it to read the rules stored locally in files on the pods.
- rule_path is where the Ruler writes the temporary rule file it generates when it evaluates the rules file that was mounted to /var/loki/rulestorage. This location is very important because the loki user on the pod must have write access to that directory. See the troubleshooting section for more on this.
- alertmanager_url should be set to the DNS name of the Alertmanager Service running in the Kubernetes cluster. It should follow typical Kubernetes Service DNS rules based on the Service name and the namespace that Alertmanager is in, with port 9093 appended to the end.
- enable_alertmanager_v2 must be set to true to use the Alertmanager v2 API.
Below is a quick way to confirm the Ruler loaded the rules once these settings are deployed.
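This sketch assumes a port-forward to one of the read pods on port 3100 and that auth_enabled is false, as above; the /loki/api/v1/rules endpoint lists the rule groups the Ruler has loaded:
# in one terminal
kubectl port-forward --namespace monitoring loki-read-0 3100:3100
# in another terminal, list the loaded rule groups
curl -s http://localhost:3100/loki/api/v1/rules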
Verify Alerts Appear in Alertmanager
Apply the changes to the Loki Helm chart however you typically apply chart changes in your environment, then verify that all of the Loki pods have started without any errors.
Once they have started, go to the Alertmanager dashboard (this may require a port-forward; if so, set one up in Kubernetes so you can access the Alertmanager UI). There should be a new alert firing in the “Not Grouped” section. Expand it and verify that it is the alert set up in the ConfigMap.
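For reference, a typical rollout and port-forward might look something like the following. The release name and chart reference are assumptions (if your Loki values are nested under a parent chart, point helm at that chart instead), and alertmanager-operated is the Service referenced earlier in the Ruler config:
helm upgrade --namespace monitoring loki grafana/loki -f values.yaml
kubectl port-forward --namespace monitoring svc/alertmanager-operated 9093:9093
With the port-forward running, the Alertmanager UI is available at http://localhost:9093.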
Troubleshooting
Below are the issues I encountered when setting this up and the steps I took to debug and resolve them.
Verify loki-alerting-rules and loki ConfigMaps Are Present and Correct
First, let's verify that our loki-alerting-rules ConfigMap has been applied:
kubectl get configmap --namespace monitoring
In the resulting list, we should see this entry (it may not be at the top):
NAME                  DATA   AGE
loki-alerting-rules   1      47h
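To double-check that the rule contents made it in under a single rules.yaml key, you can also print just that key (the backslash escapes the dot in the key name for jsonpath):
kubectl get configmap --namespace monitoring loki-alerting-rules -o jsonpath="{.data['rules\.yaml']}"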
Most of the configuration for Loki lives in a ConfigMap named loki, and we will use this command to view our loki configuration and verify that our rulerConfig was applied:
kubectl describe configmap --namespace monitoring loki
The output should contain the following:
ruler:
  alertmanager_url: http://alertmanager-operated.monitoring:9093
  enable_alertmanager_v2: true
  ring:
    kvstore:
      store: inmemory
  rule_path: /var/loki/rules-temp
  storage:
    local:
      directory: /var/loki/rulestorage
    type: local
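If you want just the ruler block rather than the whole config, you can also pull the rendered config key out with jsonpath and grep it. This assumes the ConfigMap stores the config under a config.yaml key, which may differ between chart versions:
kubectl get configmap --namespace monitoring loki -o jsonpath="{.data['config\.yaml']}" | grep -A 10 'ruler:'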
Verify rules.yaml was mounted on pods
We will need to exec into the pods to verify that our rules file exists in the location where we mounted our ConfigMap, and that the Ruler was able to evaluate the rules and output the evaluated rules to the rules-temp directory (which will look very similar to the rules file).
Access one of the loki-read pods with the following command:
kubectl exec -it --namespace monitoring loki-read-0 -- sh
Once inside the pod, navigate to the /var/loki directory and verify that the rules.yaml file exists in the rulestorage/fake directory. The file should contain the string contents of the rules.yaml key in the loki-alerting-rules ConfigMap.
/var/loki $ cat rulestorage/fake/rules.yaml
groups:
  - name: test_alert
    rules:
      - alert: TestAlert
        expr: sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Loki info warning per minute rate > 0
          message: 'Loki warning per minute rate > 0 container:"{{ $labels.container }}"'
Verify Ruler evaluated rules.yaml properly
Verify that the Ruler was able to evaluate the rules and place them into the rules-temp/fake directory.
/var/loki $ cat rules-temp/fake/rules.yaml
groups:
  - name: test_alert
    rules:
      - alert: TestAlert
        expr: sum(rate({app="loki"} | logfmt | level="info"[1m])) by (container) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          message: 'Loki warning per minute rate > 0 container:"{{ $labels.container }}"'
          summary: Loki info warning per minute rate > 0
If the above cat command returns an error, go to Grafana and search Loki using the following LogQL expression. Note: you may need to increase the time window on the query.
{app="loki"} | logfmt | level="error" |= "rule"
Look for a log that includes a similar message to this:
level=error ts=2023-01-03T21:01:40.137688115Z caller=manager.go:127 msg="unable to map rule files" user=fake err="open /tmp/loki/rules-temp/fake/rules.yaml: read-only file system"
Notice that the file path here is different from what was configured above. This indicates that the rule_path in values.yaml is misconfigured: the Ruler is trying to write the evaluated rules to rules.yaml as the loki user on the pod, but the filesystem is read-only by default everywhere except /var/loki because of how it is mounted in the Helm chart. So double-check your rule_path and verify that it is set to /var/loki/rules-temp.
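A quick way to confirm that the loki user can actually write to the configured rule_path is to try creating a file there from outside the pod. The test file name is arbitrary, and this assumes the rules-temp directory already exists:
kubectl exec --namespace monitoring loki-read-0 -- sh -c 'touch /var/loki/rules-temp/.writetest && echo writable && rm /var/loki/rules-temp/.writetest'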
Further Troubleshooting
Depending on other settings in your environment, you could run into issues that I did not. My advice would be to use Loki to dig through the logs of the read pods; there will typically be some indication of what is preventing rules from firing or from appearing in Alertmanager.
You will also want to verify that there are no typos in your rule query expression. In Loki, use this query to check whether there are any issues with your query:
{app="loki"} | logfmt | level="warn" |~ "pipeline error: 'LogfmtParserErr'"
Log messages should appear indicating the issue with the query syntax.
Thank You!
I hope this post was useful and informative! If you would like, feel free to check out my LinkedIn or my GitHub.