Logical Shift: Monitoring Kubernetes With Prometheus

There are several ways to monitor a Kubernetes cluster: some free, some paid, and some specific to a vendor's cluster implementation. If you're looking for the easy button, look at purchasing a solution; there are many out there. For this article, however, we'll deploy something slightly more confusing but free, using Prometheus as the monitoring solution.

Let's start with a few concepts:

Prometheus pulls data through a process called a scrape. Scraping is a handy approach because you don't need agents pushing data everywhere, but it can limit your scalability.
Prometheus scrapes metric endpoints, which are configured using jobs. We'll be installing a couple of endpoints to look at specific pieces of the infrastructure, but if you want your applications to be included, they'll need to expose metrics in a format Prometheus understands and have specific annotations set up in their deployment file.
There's lots of documentation about Prometheus' ability to self-monitor and automatically pick up new endpoints; this is awesome.
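
As a rough illustration of the second point, here is a minimal sketch of a deployment carrying the annotations that the kubernetes-pods scrape job later in this article keys on. The application name, image, and port are hypothetical placeholders; only the three prometheus.io annotations matter.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical application name
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Annotation values must be strings, hence the quotes
        prometheus.io/scrape: "true"    # matched by the job's keep rule
        prometheus.io/port: "8080"      # port Prometheus should scrape
        prometheus.io/path: "/metrics"  # optional; /metrics is the default path
    spec:
      containers:
      - name: my-app
        image: example.com/my-app:1.0   # placeholder image
        ports:
        - containerPort: 8080
```

The annotations go on the pod template, not on the deployment's own metadata, because the scrape job discovers pods rather than deployments.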

There are many different endpoints you can choose and things can become confusing very quickly, so hopefully this guide will give you a base implementation to expand on as you see fit for your environment. To do this, we'll be setting up three base components:

prometheus - the collector and repository of metric data, obviously.
kube-state-metrics - detailed metrics about the state of Kubernetes objects such as deployments, pods, and nodes. It complements metrics-server, which reports resource usage and is itself a replacement for Heapster. It's confusing, I know.
node_exporter - collects detailed node metrics. There will be one endpoint per node in the cluster.

Namespace And Role Setup

I like the idea of keeping my monitoring pieces in their own namespace, so we'll be creating a monitoring namespace and some roles for the different components to use. I've included the namespace creation in the Prometheus cluster role setup; if you're doing things differently, make sure to take that into account. The files are long, sorry, and better suited to GitHub, but I wanted to make this as easy to follow as possible:

$ cat clusterRole-prometheus.yaml 
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

And our cluster role and service account setup for kube-state-metrics.

$ cat clusterRole-kube-state.yaml 
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
- apiGroups: ["policy"]
  resources:
  - poddisruptionbudgets
  verbs: ["list", "watch"]
- apiGroups: ["certificates.k8s.io"]
  resources:
  - certificatesigningrequests
  verbs: ["list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources:
  - storageclasses
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring

Prometheus Configuration

The next thing we'll need is a configuration map for Prometheus itself. I've elected to collect data every minute; there are examples where people collect every 5 seconds, so pick what makes sense for you. The scrape jobs are largely specific to the type of endpoint in use. If Prometheus runs within the same cluster, the certificates and bearer tokens are automatically mounted into its containers; if not, you'll need to reference them directly, which is out of scope for this article.

$ cat prometheus-config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 1m
      evaluation_interval: 1m
      scrape_timeout: 10s
    rule_files:
      - /etc/prometheus/prometheus.rules
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-nodes-cadvisor'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
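
The one-minute interval above is a global default; Prometheus also accepts scrape_interval and scrape_timeout on an individual job. A minimal sketch, assuming you want annotated pods sampled more often than everything else. The job name and the 15s value are illustrative only:

```yaml
# Sketch of a prometheus.yml with a per-job override (values are illustrative)
global:
  scrape_interval: 1m       # default for every job without its own setting
  scrape_timeout: 10s
scrape_configs:
  - job_name: 'kubernetes-pods-fast'   # hypothetical job name
    scrape_interval: 15s               # overrides the global 1m for this job only
    scrape_timeout: 10s                # must not exceed the job's scrape_interval
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
```

Shorter intervals mean more samples to store, so it is usually worth tightening only the jobs that need it and leaving the global value alone.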

Prometheus Deployment

Once the configuration file has been set up, it's time to deploy Prometheus itself. This configuration uses a deployment; you could easily convert it to a stateful set if you like. Please note the persistent volume claim created here belongs to the monitoring namespace, so if you clean that namespace, for example to reload the environment, you will wipe the old data. Also keep in mind the storage class in use: I'm using vsphere-ssd from my original blog post, which might not be suitable for your environment. I'm also using a NodePort as it's always available and doesn't require additional network components, but the inbound port is assigned automatically and will change every time you deploy, so that might not be ideal long term.

$ cat prometheus-server.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
  labels:
    app: prometheus
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: "vsphere-ssd"
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      securityContext:
        fsGroup: 65534
      containers:
      - name: prometheus
        image: prom/prometheus:v2.10.0
        volumeMounts:
          - name: prometheus-config-volume
            mountPath: /etc/prometheus/prometheus.yml
            subPath: prometheus.yml
          - name: data
            mountPath: /prometheus
        ports:
        - containerPort: 9090
      volumes:
        - name: prometheus-config-volume
          configMap:
            name: prometheus-server-conf
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
      serviceAccountName: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
spec:
  selector:
    app: prometheus-server
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
  type: NodePort
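
If the changing port becomes a nuisance, a NodePort service can request a fixed port through the nodePort field. Here is a sketch of the same service with the port pinned; 30090 is an arbitrary value assumed to be free and inside the cluster's node port range (30000-32767 by default):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
spec:
  selector:
    app: prometheus-server
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
    nodePort: 30090     # assumed free; pick any unused port in the allowed range
  type: NodePort
```

The trade-off is that creating the service fails if another service already holds that port, so the number has to be managed by hand.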

End Point Deployment

Our last two configuration files set up a deployment and internal service for kube-state-metrics, and a daemon set and internal service for node-exporter. Because node-exporter is a daemon set, every node gets a copy automatically when it is added. Self-managed monitoring!

$ cat prometheus-kube-state.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.6.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics

$ cat prometheus-node-exporter.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    k8s-app: node-exporter
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.18.1
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
      version: v0.18.1
  updateStrategy:
    type: OnDelete
  template:
    metadata:
      labels:
        k8s-app: node-exporter
        version: v0.18.1
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      containers:
        - name: prometheus-node-exporter
          image: prom/node-exporter:v0.18.1
          imagePullPolicy: "IfNotPresent"
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - name: metrics
              containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
      hostNetwork: true
      hostPID: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "NodeExporter"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
      targetPort: 9100
  selector:
    k8s-app: node-exporter
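
One caveat on "every node": a daemon set only lands on nodes whose taints its pods tolerate, so tainted control plane nodes are typically skipped. If you want node-exporter there too, a toleration is one way to do it. The following is a sketch of a patch, not part of the files above, and it assumes the nodes in question carry NoSchedule taints:

```yaml
# Strategic merge patch for the node-exporter daemon set (sketch)
spec:
  template:
    spec:
      tolerations:
      - operator: Exists      # with no key given, matches any taint key
        effect: NoSchedule    # tolerate NoSchedule taints only
```

Saved under a hypothetical name such as tolerations-patch.yaml, it could be applied with kubectl patch daemonset node-exporter -n monitoring --patch "$(cat tolerations-patch.yaml)", or the tolerations block could simply be added to the pod spec in prometheus-node-exporter.yaml.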

Deployment and Cleanup Scripts

There are a lot of YAML files here, and things get annoying when you want to deploy, test, clean, and repeat, so here are two simple scripts that will deploy all of these files and then clean everything up again if you need them:

$ cat deploy.sh 
kubectl create -f clusterRole-prometheus.yaml
kubectl create -f prometheus-config-map.yaml
kubectl create -f prometheus-server.yaml
kubectl create -f prometheus-node-exporter.yaml
kubectl create -f clusterRole-kube-state.yaml
kubectl create -f prometheus-kube-state.yaml

The bulk of the cleanup happens when you delete the namespace, but remember, this will also delete the persistent volume claim hosting the Prometheus data.

$ cat cleanup.sh 
kubectl delete namespace monitoring
kubectl delete clusterrolebinding prometheus
kubectl delete clusterrole prometheus
kubectl delete clusterrolebinding kube-state-metrics
kubectl delete clusterrole kube-state-metrics
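
As an alternative to the two scripts, kubectl 1.14 and later can apply a whole directory described by a kustomization.yaml. This is a sketch of a different approach, not the method used above, and it assumes all six files sit in the same directory as the kustomization file:

```yaml
# kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- clusterRole-prometheus.yaml
- prometheus-config-map.yaml
- prometheus-server.yaml
- prometheus-node-exporter.yaml
- clusterRole-kube-state.yaml
- prometheus-kube-state.yaml
```

Running kubectl apply -k . from that directory deploys everything, and kubectl delete -k . removes it again, including the namespace and its persistent volume claim.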

Accessing Prometheus

You can check the pods and get your access point using these commands. In my cluster, I've got two worker nodes so I've got two node-exporters.

$ kubectl get pods -n monitoring
NAME                                  READY   STATUS    RESTARTS   AGE
kube-state-metrics-699fdf75f8-cqq5t   1/1     Running   0          18h
node-exporter-88297                   1/1     Running   0          18h
node-exporter-wb2lk                   1/1     Running   0          18h
prometheus-server-6f9d9d86d4-m8x4f    1/1     Running   0          18h

And the all-important services are here.

$ kubectl get svc -n monitoring
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
kube-state-metrics   ClusterIP   10.110.207.159   <none>        8080/TCP,8081/TCP   18h
node-exporter        ClusterIP   None             <none>        9100/TCP            18h
prometheus-service   NodePort    10.101.114.89    <none>        9090:30044/TCP      18h

You can see our NodePort service, which means I can hit either node on port 30044 and be redirected to my pod on port 9090. Let's try that in a browser: http://k8s-n2.itlab.domain.com:30044/graph, where you should be presented with the Prometheus dashboard.

If you click on Status > Targets you should see everything Prometheus is scraping, and if you hover over the labels you will see a lot of information, including job="job_name". This can be particularly useful to tie back to your config map, especially when some things show as down.

If you have been following my logging article, there are annotations in there for Prometheus, which are now helpful, as any metrics fluentbit provides are automatically added. The targets page should show all endpoints in a state of up.

Grafana Integration

The last step in this guide is where the real work begins. If you don't have a Grafana instance up and running, I'd suggest setting one up on a separate Linux box. There are some excellent guides with essentially one command to run: https://grafana.com/docs/installation/rpm/.

After that's done, add a data source of type Prometheus with the web URL you used to access the dashboard above (in my case, http://k8s-n2.itlab.domain.com:30044) and you can start creating dashboards. There are also some prebuilt ones that have been publicly hosted, which you can find on Grafana's website, https://grafana.com/dashboards, and add by simply placing the dashboard ID into the Grafana import as documented by Grafana: https://grafana.com/docs/reference/export_import/.
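
If you'd rather not click through the UI, Grafana (5.0 and later) can also load data sources from a provisioning file at startup. A minimal sketch, assuming the default provisioning directory of a package install, /etc/grafana/provisioning/datasources/, and the NodePort URL from above; the file name is hypothetical:

```yaml
# e.g. /etc/grafana/provisioning/datasources/prometheus.yaml (hypothetical file name)
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy        # Grafana's backend makes the requests to Prometheus
  url: http://k8s-n2.itlab.domain.com:30044
  isDefault: true
```

Grafana reads this directory when it starts, so restart grafana-server after adding the file. Remember that the URL changes whenever the NodePort does.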

Notes

This is just scratching the surface of monitoring, but hopefully it gives you enough of a framework to build from. Alertmanager is a key piece yet to be discussed, as are long-term storage and, of course, dashboards, lots and lots of dashboards. Some articles I found particularly helpful:

Wednesday, June 19, 2019
