Friday, August 30, 2019

Graphing Prometheus Data With Grafana

Monitoring data isn't much use without the ability to display it. If you want to learn how to set up Prometheus on your Kubernetes cluster, have a look at this previous blog post, Monitoring Kubernetes With Prometheus. There are also lots of quasi-pre-built dashboards over at Grafana, but as I've found, you're invariably going to need to build your own. This post looks at how to put together some of the more problematic queries and results without losing your sanity.

Problem Statement

I want the ability to drill into my cluster: starting at the cluster level, down to the nodes, then to the pods on each node, and finally to information about those pods. In this case I wanted to end up with a table that looks something like this:
We'll focus on the persistent volume information for now, as this is where things get complicated. For all the other metrics you can key off the pod name; every query has that in the result, and Grafana will find the common names and match them up. With a persistent volume, we need to make multiple queries, following a path to get to the final capacity numbers. There are a couple of ways to do this, but I followed the query logic below. Wrapping each query in max() ... by (...) just tells Prometheus to return only the labels we ask for rather than every label.
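As a rough illustration (the namespace, node, and IP values here are made up, and kube_pod_info carries a few more labels than I'm showing), a raw kube_pod_info series comes back with a whole set of labels, while the max() ... by (pod) wrapper collapses each series down to just the pod label:

kube_pod_info{namespace="kube-system",pod="etcd-dns-0",node="node-1",host_ip="10.0.0.1"} 1
max(kube_pod_info{node="$node"}) by (pod) => {pod="etcd-dns-0"} 1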

Query #1 - max(kube_pod_info{node="$node"}) by (pod)
-- This will return a list of pods for a given node
Query #2 - max(kube_pod_spec_volumes_persistentvolumeclaims_info) by (persistentvolumeclaim, pod, volume)
-- Will return a list of persistent volume claims for each pod
Query #3 - max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim)
-- Will return a list of persistent volume claims and their size as the value
I end up with a query path that looks like this:
The problem is, if you run all three of these queries individually, Grafana won't know how to assemble your table because there is no linkage between all three. If you key off the pod name then the PVC capacity doesn't have a match; if you key off the persistent volume claim then the list of pods doesn't have a match.
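To make that concrete, here's roughly what comes back from query #1 and query #3 (the pod and claim names are from my cluster; yours will differ):

Query #1 => {pod="etcd-dns-0"} 1
Query #3 => {persistentvolumeclaim="datadir-etcd-dns-0"} 1073741824

There isn't a single label the two results share, so there's nothing for Grafana to join on.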

Solution

Query #1 is fine. It's pulling a full list of pods, but I somehow need a combination of query #2 and query #3 where we take the labels of query #2 and merge them with the result of query #3. Without that combination there's no way to match the capacity all the way back up to the node, and you get a very ugly table. The closest explanation I found was on Stack Overflow, but it still took a bit to translate that to my requirements, so I'm going to try to show this with the results from each query.
  • the value_metric - max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim)
    • {persistentvolumeclaim="datadir-etcd-dns-0"} 1073741824
  • the info_metric - max(kube_pod_spec_volumes_persistentvolumeclaims_info) by (persistentvolumeclaim, pod, volume)
    • {persistentvolumeclaim="datadir-etcd-dns-0",pod="etcd-dns-0",volume="datadir"} 1
We'll use two pieces of Prometheus vector matching syntax: on(), which specifies how to match the queries up, and group_left(), which specifies the labels to pull from the info metric. Then we use the following format:
<value_metric> * on (<match_label>) group_left(<info_labels>) <info_metric>
And we end up with the following query:
max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim) * on (persistentvolumeclaim) group_left(pod,volume) max(kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"$pod"}) by (persistentvolumeclaim, pod, volume)
Now when the query inspector is used, we get an object containing three labels (pod, volume, and persistentvolumeclaim), and the value has a timestamp and our capacity information. This can now be paired up with the other queries containing a pod name because there's a common element.
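For the sample series shown earlier, the merged result looks something like this (the info metric's value is 1, so the multiplication leaves the capacity untouched and only the labels change):

{persistentvolumeclaim="datadir-etcd-dns-0",pod="etcd-dns-0",volume="datadir"} 1073741824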