Friday, August 30, 2019

Graphing Prometheus Data With Grafana

Monitoring data isn't much use without the ability to display it. If you want to learn how to set up Prometheus on your Kubernetes cluster, have a look at this previous blog post, Monitoring Kubernetes With Prometheus. There are also lots of quasi pre-built dashboards over at Grafana, but as I've found, you'll invariably need to build your own. This post looks at how to put together some of the more problematic queries and results without losing your sanity.

Problem Statement

I want the ability to drill into my cluster: from the cluster, to the nodes, to the pods on each node, and finally to information about those pods. In this case I wanted to end up with a table that looks something like this:
We'll focus on the persistent volume information for now as this is where things get complicated. For all the other metrics you can key off the pod name; every query has it in the result, and Grafana will find the common names and match them up. With a persistent volume, we need to make multiple queries, following a path, to get the final capacity numbers. There are a couple of ways to do this, but I followed this query logic. The max() aggregation with a by () clause just tells Prometheus to collapse the result down to the listed labels rather than returning every label.

Query #1 - max(kube_pod_info{node="$node"}) by (pod)
-- This will return a list of pods for a given node
Query #2 - max(kube_pod_spec_volumes_persistentvolumeclaims_info) by (persistentvolumeclaim, pod, volume)
-- Will return a list of persistent volume claims for each pod
Query #3 - max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim)
-- Will return a list of persistent volume claims and their size as the value
I end up with a query path that looks like this:
The problem is, if you query all three of these individually, Grafana won't know how to assemble your table as there is no linkage between all three. If you use the pod name then the PVC capacity doesn't have a match. If you use the persistent volume claim then the list of pods doesn't have a match.

Solution

Query #1 is fine; it's pulling a full list of pods. But I need a combination of query #2 and query #3 where we take the labels of query #2 and merge them with the result of query #3. Without that combination there's no way to match the capacity all the way back up to the node and you get a very ugly table. The closest explanation I found was on Stack Overflow, but it still took a bit to translate that to my requirements, so I'm going to try and show this with the results from each query.
  • the value_metric - max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim)
    • {persistentvolumeclaim="datadir-etcd-dns-0"} 1073741824
  • the info_metric - max(kube_pod_spec_volumes_persistentvolumeclaims_info) by (persistentvolumeclaim, pod, volume)
    • {persistentvolumeclaim="datadir-etcd-dns-0",pod="etcd-dns-0",volume="datadir"} 1
We'll use two Prometheus operators: on(), which specifies how to match the queries up, and group_left(), which specifies the labels to pull from the info metric. Then we use the following format:
<value_metric> * on (<match_label>) group_left(<info_labels>) <info_metric>
And we end up with the following query:
max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim) * on (persistentvolumeclaim) group_left(pod,volume) max(kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"$pod"}) by (persistentvolumeclaim, pod, volume)
Now when the query inspector is used we get an object containing 3 labels (pod, volume, and persistentvolumeclaim), and the value has a timestamp and our capacity information. This can now be paired up with the other queries containing a pod name because there's a common element.
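If you want to sanity-check the joined query outside of Grafana, you can hit the Prometheus HTTP API directly. A quick sketch (drop the $pod template variable since that only exists inside Grafana, and substitute your own prometheus server address):
$ curl -G 'http://<prometheus_server>:9090/api/v1/query' --data-urlencode 'query=max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim) * on (persistentvolumeclaim) group_left(pod,volume) max(kube_pod_spec_volumes_persistentvolumeclaims_info) by (persistentvolumeclaim, pod, volume)'
Each result in the JSON response should carry the pod, volume, and persistentvolumeclaim labels along with the capacity value, the same thing the query inspector shows.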

Wednesday, July 24, 2019

Windows Kubernetes Nodes

It's happened: someone has asked you for a windows container. The first piece of advice to give: find a way, any way possible, to run the service within a linux container. Not being knowledgeable with linux shouldn't be an excuse, as most containers require very little linux expertise anyway. Help the windows developers migrate to linux; everyone will be happier.

That being said, sometimes re-coding a service for linux just isn't practical. And while things are pretty bad now, Microsoft is a really big company that seems to want into this market; they'll put effort into making things better over time. As an example, their documentation is quite good, way better than most of the linux documentation in my opinion. Have a look at it, as portions of this guide are lifted directly from it [https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/getting-started-kubernetes-windows]

Cluster Setup

You'll need at least one linux box to serve as a master, although we're going to use a few more to host other infrastructure services. You can follow my previous blog post with one exception: you can't run Calico as the Container Network Interface (CNI). Well, technically you can, but Calico for windows is provided only as a subscription service, and Microsoft only documents networking with Flannel as the CNI, so that's what we'll use here.

**When you initialize the cluster, be sure to use Flannel's default pod CIDR of 10.244.0.0/16 or you'll have problems when setting up Microsoft nodes**
[root@kube-master ~]# kubeadm init --pod-network-cidr=10.244.0.0/16
Setting up Flannel is pretty easy: you'll download the Flannel yaml file and make some Microsoft-specific changes, notably the VNI and port number as documented on github and from Microsoft.
[root@kube-master ~]# wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Within the ConfigMap section you'll have the following under net-conf.json:
net-conf.json: |
{
  "Network": "10.244.0.0/16",
  "Backend": {
   "Type": "vxlan"
  }
}
We need to add the required VNI and Port information like this:
net-conf.json: |
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan",
    "VNI": 4096,
    "Port": 4789
  }
}
And install Flannel like this:
[root@kube-master ~]# kubectl create -f kube-flannel.yml
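Before moving on, it's worth confirming the flannel pods rolled out cleanly on the linux nodes (this assumes the manifest's default app=flannel label):
[root@kube-master ~]# kubectl get pods -n kube-system -l app=flannel -o wide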

CNI Notes

What's wrong with running Flannel for all clusters? Flannel allocates a separate private network to each node, which is then encapsulated within UDP and passed to other nodes within the cluster. Microsoft supports two Flannel modes: vxlan mode (documented here), which creates a virtual overlay network to handle routes between nodes automatically, and host-gateway mode, which seems insane to me as you need a static route on each node to every other node's pod subnet; so I don't recommend that.

Calico, on the other hand, uses simple L3 routing within the cluster so it's much easier to see where traffic is going and where it came from. I like the idea of Calico better, but it isn't a real option without a subscription so I'll stick with Flannel on my windows cluster. There are a few decent articles on the differences between the two:

Windows Nodes

You'll need to install Windows 2019. I use 2019 Standard with Desktop Experience as I like to RDP to the box, but maybe you're an extreme windows guru and can do all of this without it. I've disabled the firewall and installed VMware tools. Joining a domain is entirely optional as we aren't going to use any of the domain services. If you do join, make sure you treat this as a high performance server, so take care with patch schedules and extra windows features like virus scanning. You'll also need to ensure your patch level is high enough. I recommend running Microsoft Update again and again, as you'll get new patches after each reboot. The version you're running should be at least 17763.379 as provided by KB4489899. You can find this by running winver.

As mentioned before, Microsoft has done a good job documenting the steps so feel free to follow along there too. Everything should be done with an elevated powershell prompt (run as administrator). These first steps will add the repository and install docker.
PS C:\Users\Administrator> Install-Module -Name DockerMsftProvider -Repository PSGallery
PS C:\Users\Administrator> Install-Package -Name Docker -ProviderName DockerMsftProvider
Reboot the machine and check docker is running properly
PS C:\Users\Administrator> Restart-Computer
PS C:\Users\Administrator> docker version
Client: Docker Engine - Enterprise
 Version:           19.03.0
 API version:       1.40
 Go version:        go1.12.5
 Git commit:        87b1f470ad
 Built:             07/16/2019 23:41:30
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.12.5
  Git commit:       87b1f470ad
  Built:            07/16/2019 23:39:21
  OS/Arch:          windows/amd64
  Experimental:     false
If you get an error here that looks like this:
error during connect: Get http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.39/version: open //./pipe/docker_engine: The system cannot find the file specified. In the default daemon configuration on Windows, the docker client must be run elevated to connect. This error may also indicate that the docker daemon is not running.
It just means the docker service didn't start on boot. Start it from Services or from powershell using Start-Service docker

Create Pause Image

A pause image also runs on your linux nodes, but automatically; here we need to do it manually, including downloading it, tagging it, and checking that it runs correctly.
PS C:\Users\Administrator> docker pull mcr.microsoft.com/windows/nanoserver:1809
PS C:\Users\Administrator> docker tag mcr.microsoft.com/windows/nanoserver:1809 microsoft/nanoserver:latest
PS C:\Users\Administrator> docker run microsoft/nanoserver:latest
Microsoft Windows [Version 10.0.17763.615]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\>

Download Node Binaries

You'll need several binaries available from Kubernetes' github page. The version should match the server as closely as possible. The official skew policy can be found at kubernetes.io, and if you want to see your client and server versions you can use this command.
[root@kube-master ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
This is saying my client and server are version 1.15.1. To download the corresponding version, you can use this link [https://github.com/kubernetes/kubernetes/releases/], select the CHANGELOG-<version>.md link and download the node binaries for windows. In this case the latest is 1.15.1, so that works out well.

I've used unix to expand the node binaries; either a mac or your master node would work fine using tar zxvf kubernetes-node-windows-amd64.tar.gz, but you can also use windows with expand-archive. Once that's done you'll need to copy all the executables under the expanded kubernetes/node/bin/* to c:\k. I know lots of people will want to change that c:\k folder, but don't. Microsoft has hard coded it into many scripts we'll be using, so save yourself the headache and just go with it.
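Pulled together, the unpack-and-stage steps on the master look roughly like this (the copy over to the windows node is whatever transfer method you prefer):
[root@kube-master ~]# tar zxvf kubernetes-node-windows-amd64.tar.gz
[root@kube-master ~]# ls kubernetes/node/bin/
kube-proxy.exe  kubeadm.exe  kubectl.exe  kubelet.exe
<copy kubernetes/node/bin/* to c:\k on the windows node>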

You'll also need to grab /etc/kubernetes/admin.conf from the master node and place that in c:\k too (saved as c:\k\config, which is what the scripts expect), and download Microsoft's start script. For all of these, I used a shared folder within my RDP session, but winSCP is also a good tool if you don't mind installing more software on your worker nodes. It should look like this when you're done.
PS C:\Users\Administrator> mkdir c:\k
PS C:\Users\Administrator> wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/start.ps1 -o c:\k\start.ps1
<download and transfer kubernetes node binaries and config file>
PS C:\k> dir
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        7/23/2019   2:12 PM           5447 config
-a----        7/18/2019   2:55 AM       40072704 kube-proxy.exe
-a----        7/18/2019   2:55 AM       40113152 kubeadm.exe
-a----        7/18/2019   2:55 AM       43471360 kubectl.exe
-a----        7/18/2019   2:55 AM      116192256 kubelet.exe
-a----        7/23/2019   2:01 PM           2447 start.ps1

Joining A Windows Node

You're finally ready to join a windows node! Again you can have a look at the documentation but if you've been following along, you'll only need two options.
  • ManagementIP - this is unfortunate as it'll require more scripting when you're ready to automate. It's the IP address of this worker node which you can get from ipconfig on your windows node
  • NetworkMode - we're using vxlan and the default is l2bridge so this will need to be set to overlay
Other fields should be fine with defaults, but you can check them with these commands:
  • ServiceCIDR - verify with kubectl cluster-info dump | grep -i service-cluster
  • ClusterCIDR - check with kubectl cluster-info dump | grep -i cluster-cidr
  • KubeDnsServiceIP - verify the default (10.96.0.10) with kubectl get svc -n kube-system. Cluster-IP is the field you're interested in.
When you run the start.ps1 script it'll download a lot of additional scripts and binaries, eventually spawning a few new powershell windows and leaving the logging one open, which can be very helpful at this stage. Run the following, replacing the IP with your local windows server IP address (from ipconfig)
PS C:\k> .\start.ps1 -ManagementIP 10.9.176.94 -NetworkMode overlay

Initial Problems

I had trouble getting the kubelet process to start. You'll notice the node doesn't go ready, and if you look at the processes it will have flannel and kube-proxy but no kubelet. It seems the start-kubelet.ps1 script that's downloaded is using outdated flags, so to fix that, remove the --allow-privileged=true line from start-kubelet.ps1.
$kubeletArgs = @(
    "--hostname-override=$(hostname)"
    '--v=6'
    '--pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0'
    '--resolv-conf=""'
    '--allow-privileged=true'
    '--enable-debugging-handlers'
    "--cluster-dns=$KubeDnsServiceIp"
    '--cluster-domain=cluster.local'
    '--kubeconfig=c:\k\config'
    '--hairpin-mode=promiscuous-bridge'
    '--image-pull-progress-deadline=20m'
    '--cgroups-per-qos=false'
    "--log-dir=$LogDir"
    '--logtostderr=false'
    '--enforce-node-allocatable=""'
    '--network-plugin=cni'
    '--cni-bin-dir="c:\k\cni"'
    '--cni-conf-dir="c:\k\cni\config"'
    "--node-ip=$(Get-MgmtIpAddress)"
)

I also had a problem when provisioning persistent volumes even though they weren't for the windows node. If kubernetes can't identify all nodes in the cluster, it won't do anything. The error looks like this
I0723 23:57:11.379621       1 event.go:258] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"715df13c-8eeb-4ba4-9be1-44c8a5f03071", APIVersion:"v1", ResourceVersion:"480073", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vsphere-ssd": No VM found
E0723 23:59:26.375664       1 datacenter.go:78] Unable to find VM by UUID. VM UUID: 
E0723 23:59:26.375705       1 nodemanager.go:431] Error "No VM found" node info for node "kube-w2" not found
E0723 23:59:26.375718       1 vsphere_util.go:130] Error while obtaining Kubernetes node nodeVmDetail details. error : No VM found
E0723 23:59:26.375727       1 vsphere.go:1291] Failed to get shared datastore: No VM found
E0723 23:59:26.375787       1 goroutinemap.go:150] Operation for "provision-default/test[715df13c-8eeb-4ba4-9be1-44c8a5f03071]" failed. No retries permitted until 2019-07-24 00:01:28.375767669 +0000 UTC m=+355638.918509528 (durationBeforeRetry 2m2s). Error: "No VM found"
I0723 23:59:26.376127       1 event.go:258] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"715df13c-8eeb-4ba4-9be1-44c8a5f03071", APIVersion:"v1", ResourceVersion:"480073", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vsphere-ssd": No VM found
And my eventual solution was to reboot the master node. Sad, yes.

Updating A Node UUID

Like our linux nodes, you'll need to patch the node spec with the UUID of the node. Under windows you can retrieve that UUID with the following command but you'll need to reformat it.
PS C:\k> wmic bios get serialnumber
SerialNumber
VMware-42 3c fe 01 af 23 a9 a5-65 45 50 a3 db db 9d 69
And back on our kubernetes master node we'd patch the node like this
[root@k8s-master ~]# kubectl patch node <node_name> -p '{"spec":{"providerID":"vsphere://423CFE01-AF23-A9A5-6545-50A3DBDB9D69"}}'
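If you'd rather script the reformatting, here's a rough sketch using the serial number from wmic above; it strips the VMware- prefix, spaces, and dash, upper-cases the hex, and re-inserts the dashes in the standard UUID positions:
[root@k8s-master ~]# SERIAL="VMware-42 3c fe 01 af 23 a9 a5-65 45 50 a3 db db 9d 69"
[root@k8s-master ~]# UUID=$(echo "$SERIAL" | sed 's/^VMware-//' | tr -d ' -' | tr 'a-f' 'A-F' | sed 's/\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)/\1-\2-\3-\4-\5/')
[root@k8s-master ~]# echo $UUID
423CFE01-AF23-A9A5-6545-50A3DBDB9D69
[root@k8s-master ~]# kubectl patch node <node_name> -p "{\"spec\":{\"providerID\":\"vsphere://$UUID\"}}"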

Patching DaemonSets

A DaemonSet gets a pod pushed to every node in the cluster. This is generally bad because most things don't run on windows, so to prevent that you'll need to patch existing sets and use the node selector for applications you deploy. You can download the patch from Microsoft or create your own file; it's pretty basic. If you've been following along with the code provided on github, those files already have the node selector set.
[root@kube-master ~]# wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/l2bridge/manifests/node-selector-patch.yml
[root@kube-master t]# cat node-selector-patch.yml 
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
We'll need to apply it to existing DaemonSets, notably kube-proxy and kube-flannel-ds-amd64.
[root@kube-master ~]# kubectl patch ds/kube-flannel-ds-amd64 --patch "$(cat node-selector-patch.yml)" -n=kube-system
[root@kube-master ~]# kubectl patch ds/kube-proxy --patch "$(cat node-selector-patch.yml)" -n=kube-system
If you've been getting errors on your windows node from flannel saying things like Error response from daemon: network host not found and Error: no such container, those should now stop.
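A quick way to confirm the patches took is to check the NODE SELECTOR column on the daemonsets:
[root@kube-master ~]# kubectl get ds -n kube-system -o wide
<kube-flannel-ds-amd64 and kube-proxy should now show beta.kubernetes.io/os=linux under NODE SELECTOR>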

Deploying A Test Pod

I'd suggest using the Microsoft provided yaml file although I reduced the number of replicas to 1 to simplify any troubleshooting.
[root@kube-master ~]# wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/l2bridge/manifests/simpleweb.yml -O win-webserver.yaml
[root@kube-master ~]# kubectl apply -f win-webserver.yaml
[root@kube-master ~]# kubectl get pods -o wide
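Assuming the pod shows Running on the windows node, it should be reachable over the overlay network from any linux node; a quick check using the pod IP from the output above (the sample web server listens on port 80):
[root@kube-master ~]# curl http://<pod_ip>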

Registering A Service

Every time you reboot you'll need to run the start command manually, which isn't all that useful. Microsoft has created some excellent instructions and a script to register the required services using the Non-Sucking Service Manager (nssm). Follow the instructions provided by Microsoft, which basically means placing both the sample script, called register-svc.ps1, and the nssm.exe binary into c:\k.
PS C:\k> wget https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/flannel/register-svc.ps1 -o c:\k\register-svc.ps1
I did have problems with the default script as it references an incorrect pause image and has the same problem with the allow-privileged flag as indicated above. To fix that, edit register-svc.ps1 and, under the kubelet registration, replace --pod-infra-container-image=kubeletwin/pause with --pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0 and remove --allow-privileged=true. It should be line 25 and will look like this when you're done:
.\nssm.exe set $KubeletSvc AppParameters --hostname-override=$Hostname --v=6 --pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0 --resolv-conf="" --enable-debugging-handlers --cluster-dns=$KubeDnsServiceIP --cluster-domain=cluster.local --kubeconfig=c:\k\config --hairpin-mode=promiscuous-bridge --image-pull-progress-deadline=20m --cgroups-per-qos=false  --log-dir=$LogDir --logtostderr=false --enforce-node-allocatable="" --network-plugin=cni --cni-bin-dir=c:\k\cni --cni-conf-dir=c:\k\cni\config
Once that's fixed, you can register your services with this command where ManagementIP is the windows node IP.
PS C:\k> .\register-svc.ps1 -ManagementIP <windows_node_ip> -NetworkMode overlay
You should see the services registered and running. If you get errors like these, it's probably because register-svc.ps1 wasn't edited correctly.
Service "flanneld" installed successfully!
Set parameter "AppParameters" for service "flanneld".
Set parameter "AppEnvironmentExtra" for service "flanneld".
Set parameter "AppDirectory" for service "flanneld".
flanneld: START: The operation completed successfully.
Service "kubelet" installed successfully!
Set parameter "AppParameters" for service "kubelet".
Set parameter "AppDirectory" for service "kubelet".
kubelet: Unexpected status SERVICE_PAUSED in response to START control.
Service "kube-proxy" installed successfully!
Set parameter "AppDirectory" for service "kube-proxy".
Set parameter "AppParameters" for service "kube-proxy".
Set parameter "DependOnService" for service "kube-proxy".
kube-proxy: START: The operation completed successfully.
If you've already added the services and need to make changes, you can do that by either editing the service or removing them and re-registering with the commands listed below.
PS C:\k> .\nssm.exe edit kubelet
PS C:\k> .\nssm.exe edit kube-proxy
PS C:\k> .\nssm.exe edit flanneld
PS C:\k> .\nssm.exe remove kubelet confirm
PS C:\k> .\nssm.exe remove kube-proxy confirm
PS C:\k> .\nssm.exe remove flanneld confirm
Reboot to verify your node re-registers with kubernetes correctly and that you can deploy a pod using the test above.

Deleting/Re-adding A Windows Node

If you delete a windows node, such as with kubectl delete node <node_name>, re-adding it is pretty easy. Because the windows nodes have the kubernetes config file, they re-register automatically on every service start. You might need to remove the existing flannel configuration files and then reboot.
PS C:\k> Remove-Item C:\k\SourceVip.json
PS C:\k> Remove-Item C:\k\SourceVipRequest.json
PS C:\k> Restart-Computer

Broken Kubernetes Things With Windows Nodes

Pretty much everything is broken. You'll be able to deploy a windows container to a windows node using the node selector spec entry like we did when patching the daemonsets above; just use windows as the OS type instead of linux. Here's a list of things that are broken, which I'll update when possible:
  • Persistent Volumes - you need to ensure the node is registered properly with vsphere or nothing will be able to use a persistent volume. This is because vsphere ensures all nodes can see a datastore without making a distinction between windows and linux. I can get a PV to appear on a windows node but I can't get it to initialize properly
  • Node Ports - this is a documented limitation, you can't access a node port service from the node hosting the pod. Strange, yes, but you should be able to use any linux nodes as an entry for any windows pods
  • Load Balancer - Version 0.8.0 includes the os selector patch, and it should work by forwarding connections through available linux nodes, but I haven't had any success yet
  • DNS - untested as of yet because of load balancer problems
  • Logging - should be possible as fluent bit has beta support for windows but untested yet
    • Fluent bit does have some documentation to install under windows, maybe under the node itself as there isn't a docker container readily available, but none of the links work. Perhaps not yet (Aug 2019)
  • Monitoring - should also be possible using WMI exporter rather than a node exporter, again, untested at this time

Friday, July 12, 2019

Kubernetes Infrastructure Overview

I've posted several blog entries to setup various parts of an on-premise Kubernetes installation. This is meant as a summary referencing code posted to github for easy access. You can clone the entire repository, edit the required files and use deploy.sh/cleanup.sh scripts, or run the deployment directly from github as documented below. Each of the headers below is a link to the corresponding blog describing the process in detail.

If you'd like to clone the code run this command.
[root@kube-master ~]# git clone https://github.com/mike-england/kubernetes-infra.git

Cluster Install

While this can be automated through templates or tools like terraform, for now, I recommend following the post specifically for this.








Logging

This setup can be almost entirely automated, but unfortunately you'll need to modify the elasticsearch output in the config file
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-role.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-configmap.yaml
<modify the output server entry (elasticsearch.prod.int.com) and the index to match your kubernetes cluster name>
[root@kube-master ~]# kubectl create -f fluent-bit-configmap.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-daemon-set.yaml

Load Balancing

Installation of metallb is straightforward. As with logging, you'll need to modify the config map, this time changing the IP range. If you're running a cluster with windows nodes, be sure to patch the metallb daemonset so it doesn't get deployed to any of those nodes.
[root@kube-master ~]# kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/metal-config.yaml
<modify ip address range>
[root@kube-master ~]# kubectl create -f metal-config.yaml
If you're running a mixed cluster with windows nodes:
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/node-selector-patch.yaml
[root@kube-master ~]# kubectl patch ds/speaker --patch "$(cat node-selector-patch.yaml)" -n=metallb-system

Monitoring

Assuming you have the load balancer installed above, you should be able to deploy monitoring without any changes.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-prometheus.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-config-map.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-server.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-node-exporter.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-kube-state.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-kube-state.yaml

DNS Services

Again, with the load balancer in place, this should be deployable as is.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/dns-namespace.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/etcd.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/external-dns.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/coredns.yaml

Tuesday, July 2, 2019

External DNS For Kubernetes Services

A service isn't useful if you can't access it, and while IP addresses are nice, they don't really help deliver user-facing services. Really we want DNS, but given the dynamic nature of kubernetes it's impractical to implement the static configurations of the past. To solve that, we're going to implement ExternalDNS for kubernetes, which will scan services and ingress points to automatically create and destroy DNS records for the cluster. Of course, nothing is completely simple in kubernetes, so we'll need a few pieces in place:
  • ExternalDNS - the scanning engine to create and destroy DNS records
  • CoreDNS - a lightweight kubernetes based DNS server to respond to client requests
  • Etcd - a key/value store to hold DNS records

Namespace

The first thing we're going to need is a namespace to put things in. I normally keep this with one of the key pieces but felt it was better as a separate file in this case.
$ cat dns-namespace.yaml 
apiVersion: v1
kind: Namespace
metadata:
  name: dns

Etcd Cluster Setup

Technically we only need one etcd node as we don't really need the data to persist; it'd just be regenerated on the next scan. But losing it would halt all non-cached DNS queries, so I opted to create 3 instances. I didn't want to use an external etcd discovery service, so I needed predictable pod names, and in order to do that we need a stateful set rather than a deployment. If we lose a pod in the stateful set, the pod won't rejoin the cluster without a persistent volume containing the configuration information, which is why we have a small pv for each.

If you're going to change any of the names, make sure the service name "etcd-dns" exactly matches the stateful set name. If it doesn't, kubernetes won't create an internal DNS record and the nodes won't be able to find each other; speaking from experience.
$ cat etcd.yaml 
apiVersion: v1
kind: Service
metadata:
  name: etcd-dns
  namespace: dns
spec:
  ports:
  - name: etcd-client
    port: 2379
    protocol: TCP
  - name: etcd-peer
    port: 2380
    protocol: TCP
  selector:
    app: etcd-dns
  publishNotReadyAddresses: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-dns
  namespace: dns
  labels:
    app: etcd-dns
spec:
  serviceName: "etcd-dns"
  replicas: 3
  selector:
    matchLabels:
      app: etcd-dns
  template:
    metadata:
      labels:
        app: etcd-dns
    spec:
      containers:
      - name: etcd-dns
        image: quay.io/coreos/etcd:latest
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        env:
        - name: CLUSTER_SIZE
          value: "3"
        - name: SET_NAME
          value: "etcd-dns"
        volumeMounts:
        - name: datadir
          mountPath: /var/run/etcd
        command:
          - /bin/sh
          - -c
          - |
            IP=$(hostname -i)
            PEERS=""
            for i in $(seq 0 $((${CLUSTER_SIZE} - 1))); do
                PEERS="${PEERS}${PEERS:+,}${SET_NAME}-${i}=http://${SET_NAME}-${i}.${SET_NAME}:2380"
            done

            exec /usr/local/bin/etcd --name ${HOSTNAME} \
              --listen-peer-urls http://${IP}:2380 \
              --listen-client-urls http://${IP}:2379,http://127.0.0.1:2379 \
              --advertise-client-urls http://${HOSTNAME}.${SET_NAME}:2379 \
              --initial-advertise-peer-urls http://${HOSTNAME}.${SET_NAME}:2380 \
              --initial-cluster-token etcd-cluster-1 \
              --initial-cluster ${PEERS} \
              --initial-cluster-state new \
              --data-dir /var/run/etcd/default.etcd
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
Cluster initialization is the more complicated part in this set. We're running some shell commands within the newly booted pod to fill in the required values, with the PEERS variable looking like this when it's done. Could you hard code it? Sure, but that would complicate things if you change the set name or number of replicas. You can also do lots and lots of fancy stuff to remove, add, or rejoin nodes, but we don't really need more than an initial static value (three in this case) so I'll leave things simple. You can check out the links in the notes section for more complicated examples.
etcd-dns-0=http://etcd-dns-0.etcd-dns:2380,etcd-dns-1=http://etcd-dns-1.etcd-dns:2380,etcd-dns-2=http://etcd-dns-2.etcd-dns:2380
If you'd like to enable https on your etcd cluster, you can easily do so by adding --auto-tls and --peer-auto-tls but this will create problems getting coredns and external-dns to connect without adding the certs there too.

CoreDNS Setup

As the endpoint that actually serves client requests, this is also an important piece to keep running; however, we don't really care about the data as it's backed by etcd. So, to handle this, we'll use a 3 pod deployment with a front end service. This uses a service type of LoadBalancer, making it easily available to clients, so make sure you have that available. If you don't, see a previous post to install and configure MetalLB.

You might also notice that we're opening up both TCP and UDP DNS ports but only exposing UDP from the load balancer. This is largely because a load balancer service can't implement both UDP and TCP at the same time, so feel free to remove TCP if you like. At some point I hope multi-protocol load balancers will be easier to manage, so for now I'm leaving it in.
$ cat coredns.yaml 
apiVersion: v1
kind: Service
metadata:
  name: coredns
  namespace: dns
spec:
  ports:
  - name: coredns
    port: 53
    protocol: UDP
    targetPort: 53
  selector:
    app: coredns
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: dns
data:
  Corefile: |
    . {
        errors
        health
        log
        etcd {
           endpoint http://etcd-dns:2379
        }
        cache 30
        prometheus 0.0.0.0:9153
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: dns
  labels:
    app: coredns
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coredns
  template:
    metadata:
      labels:
        app: coredns
        k8s_app: kube-dns
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9153"
        prometheus.io/path: /metrics
    spec:
      containers:
      - name: coredns
        image: coredns/coredns:latest
        imagePullPolicy: IfNotPresent
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
There are quite a few plugins [https://coredns.io/plugins/] you can apply to your coredns implementation, some of which you might want to play with. The documentation for these is quite good and they're easy to implement; they'd go in the ConfigMap alongside the errors and health entries. Just add the plugin name and any parameters it takes on a line and you're good to go. You may want to remove the log entry if your dns server is really busy or you don't want to see the continual stream of dns updates.

I'll also make special mention of the . { } block in the config map. This tells coredns to accept an entry for any domain which might not be to your liking. In my opinion, this provides the most flexibility as this shouldn't be your site's primary DNS server. Requests for a specific domain or subdomain should be forwarded here from your primary DNS, however, if you want to change this you'd simply enter one or more blocks such as example.org { } instead of . { }.

External DNS

Finally, the reason we're here: deploying external-dns to our cluster. A couple of notes: I've elected to scan the cluster for new or missing services every 15 seconds. This makes the DNS system feel very snappy when creating a service but might be too much or too little for your environment. I found the documentation particularly frustrating here. The closest example I found using coredns leverages minikube with confusing options and commands to diff a helm chart, which doesn't feel very complete or intuitive to me.
$ cat external-dns.yaml 
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get","watch","list"]
- apiGroups: ["extensions"]
  resources: ["ingresses"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: registry.opensource.zalan.do/teapot/external-dns:latest
        args:
        - --source=service
        - --source=ingress
        - --provider=coredns
        - --registry=txt
        - --log-level=info
        - --interval=15s
        env:
          - name: ETCD_URLS 
            value: http://etcd-dns:2379
I've left the log-level entry in, although the default is info anyway, as it's a helpful placeholder when you want or need to change it. The log options, which I couldn't find any documentation for and had to dig out of the code, are: panic, debug, info, warning, error, fatal. You'll also notice a reference to our etcd cluster service here, so if you've changed that name make sure you change it here too.

Deployment and Cleanup Scripts

As I like to do, here are some quick deployment and cleanup scripts which can be helpful when testing over and over again:
$ cat deploy.sh 
kubectl create -f dns-namespace.yaml
kubectl create -f etcd.yaml
kubectl create -f external-dns.yaml
kubectl create -f coredns.yaml
As a reminder, deleting the namespace will clean up all the persistent volumes too. All of the data will be recreated on the fly, but it means a few extra seconds for the system to reclaim and recreate them when you redeploy.
$ cat cleanup.sh 
kubectl delete namespace dns
kubectl delete clusterrole external-dns
kubectl delete clusterrolebinding external-dns-viewer

Success State

I also had trouble finding out what good looked like so here's what you're looking for in the logs:
$ kubectl logs -n dns external-dns-57959dcfd8-fgqpn
time="2019-06-27T01:45:21Z" level=error msg="context deadline exceeded"
time="2019-06-27T01:45:31Z" level=info msg="Add/set key /skydns/org/example/nginx/66eeb21d to Host=10.9.176.196, Text=\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/default/nginx-frontend\", TTL=0"
The actual pod name will be different for you as we used a deployment. You can get the exact name using kubectl get pods -n dns. In this example, the "context deadline exceeded" is bad. It means external dns wasn't able to register the entry with etcd, in this case because that cluster was still booting. The last line shows a successful update into etcd.
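You can also peek directly into etcd to confirm what external-dns wrote. A quick check, assuming the etcd v3 API that coredns and external-dns use, with the key layout following the skydns convention from the log line above:
$ kubectl exec -n dns etcd-dns-0 -- sh -c 'ETCDCTL_API=3 etcdctl get --prefix --keys-only /skydns'
<you should see keys like /skydns/org/example/nginx listed>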

Etcd logs too much to post here, but you'll see entries indicating it can't resolve a host as the pods boot up, and potentially several MsgVote requests as the services start on all pods. In the end it should establish a peer connection with all of the nodes and indicate the api is enabled.
$ kubectl logs -n dns etcd-dns-0
2019-06-27 01:45:15.124897 W | rafthttp: health check for peer c77fa62c6a3a8c7e could not connect: dial tcp: lookup etcd-dns-1.etcd-dns on 10.96.0.10:53: no such host
2019-06-27 01:45:15.128194 W | rafthttp: health check for peer dcb7067c28407ab9 could not connect: dial tcp: lookup etcd-dns-2.etcd-dns on 10.96.0.10:53: no such host

2019-06-27 01:45:15.272084 I | raft: 7300ad5a4b7e21a6 received MsgVoteResp from 7300ad5a4b7e21a6 at term 4
2019-06-27 01:45:15.272096 I | raft: 7300ad5a4b7e21a6 [logterm: 1, index: 3] sent MsgVote request to c77fa62c6a3a8c7e at term 4
2019-06-27 01:45:15.272105 I | raft: 7300ad5a4b7e21a6 [logterm: 1, index: 3] sent MsgVote request to dcb7067c28407ab9 at term 4
2019-06-27 01:45:17.127836 E | etcdserver: publish error: etcdserver: request timed out

2019-06-27 01:45:41.087147 I | rafthttp: peer dcb7067c28407ab9 became active
2019-06-27 01:45:41.087174 I | rafthttp: established a TCP streaming connection with peer dcb7067c28407ab9 (stream Message writer)
2019-06-27 01:45:41.098636 I | rafthttp: established a TCP streaming connection with peer dcb7067c28407ab9 (stream MsgApp v2 writer)
2019-06-27 01:45:42.350041 N | etcdserver/membership: updated the cluster version from 3.0 to 3.3
2019-06-27 01:45:42.350158 I | etcdserver/api: enabled capabilities for version 3.3
If your cluster won't start or ends up in a CrashLoopBackOff, most of the time I found the problem to be host resolution (dns). You can try changing the PEER entry from ${SET_NAME}-${i}.${SET_NAME} to just ${SET_NAME}. This won't let the cluster work, but should let you get far enough to see what's going on inside the pod. I'd also recommend setting the replicas to 1 when troubleshooting.
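If the pods are at least running, you can also check membership directly with etcdctl, which ships in the coreos/etcd image; a healthy cluster lists all three members:
$ kubectl exec -n dns etcd-dns-0 -- etcdctl member list
$ kubectl exec -n dns etcd-dns-0 -- etcdctl cluster-health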

CoreDNS is pretty straightforward. It'll just log a startup and then client queries, which look like these examples, where the first response, nginx.example.org, returns NOERROR (this is good) and the second, nginx2.example.org, returns NXDOMAIN, meaning the record doesn't exist. Again, if you want to cut down on these messages remove the log line from the config file as stated above.
$ kubectl logs -n dns coredns-6c8d7c7d79-6jm5l
.:53
2019-06-27T01:44:44.570Z [INFO] CoreDNS-1.5.0
2019-06-27T01:44:44.570Z [INFO] linux/amd64, go1.12.2, e3f9a80
CoreDNS-1.5.0
linux/amd64, go1.12.2, e3f9a80
2019-06-27T02:11:43.552Z [INFO] 192.168.215.64:58369 - 10884 "A IN nginx.example.org. udp 35 false 512" NOERROR qr,aa,rd 68 0.002999881s
2019-06-27T02:13:08.448Z [INFO] 192.168.215.64:64219 - 40406 "A IN nginx2.example.org. udp 36 false 512" NXDOMAIN qr,aa,rd 87 0.007469218s

Using External DNS

To actually have a DNS name registered with external DNS, you need to add an annotation to your service. Here's one for nginx that would register an external load balancer and that IP with the name nginx.example.org:
$ cat nginx-service.yaml 
apiVersion: v1
kind: Service
metadata:
  name: nginx-frontend
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "nginx.example.org"
spec:
  ports:
  - name: "web"
    port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
From a linux or mac host, you can use nslookup to verify the entry where 10.9.176.212 is the IP of my coredns service.
$ kubectl get svc -n dns
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)             AGE
coredns    LoadBalancer   10.100.208.145   10.9.176.212   53:31985/UDP        20h
etcd-dns   ClusterIP      10.100.83.154    <none>         2379/TCP,2380/TCP   20h
$ nslookup nginx.example.org 10.9.176.212
Server:  10.9.176.212
Address: 10.9.176.212#53

Name: nginx.example.org
Address: 10.9.176.213

Notes

Kubernetes already comes with an etcd and, for newer releases, coredns, so why not use those? Well, you probably can, but in my opinion these are meant for core cluster functions and we shouldn't be messing around with them; also, they're secured with https, so you'd need to go through the process of getting certificates set up. While I didn't find any links that really suited my needs, here are some that helped me along, maybe they'll help you too.

Thursday, June 27, 2019

LoadBalanced Kubernetes

Up to now we've been using a NodePort as the access to services. This can have a few significant drawbacks:
  • If you have multiple pods providing a service it can be difficult or impossible for clients to use them all effectively
  • You cannot predict the port hosting your application and that port will change every time you deploy. For example, instead of getting port 443 for each application you'd get a random port assigned between 30,000 and 32,767
Public cloud providers have their own load balancer solutions, which are generally efficient and transparent but when using on-premise or "bare metal" we need more software or hardware to do this. MetalLB is a great solution for this; it's software only, free, easy to install and configure, and while not perfect, does a good job for most use cases.

I've documented my steps for reference but I encourage you to review the official documentation [https://metallb.universe.tf/installation/]. It's well written and about as straight forward as you can get.

MetalLB Install

Not much to say here. Run the official install, be happy.
kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.8.0/manifests/metallb.yaml

MetalLB Configuration

I've opted for the simpler and universal L2 load balancing mechanism. It might not be perfect but I don't need to get the network team engaged and it works well enough for my use case. Again, the documentation [https://metallb.universe.tf/configuration/] is well written and straight forward. Here's my setup in case you want to see it.
$ cat metal-config.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.9.176.10-10.9.176.250
You can also set up multiple pools, each with a different name, but this is more complicated than I need.

Using LoadBalancer

If you have just the one address pool then it's as simple as specifying LoadBalancer as your service type. If you have multiple pools then you'll need to annotate your service. Again, the documentation is clear and helpful [https://metallb.universe.tf/usage/]. Once the service is deployed you should have an external-ip assigned, and from there you can dynamically assign a DNS address as we'll be talking about in the next article.

Wednesday, June 19, 2019

Monitoring Kubernetes With Prometheus

There are several ways to monitor a kubernetes cluster, some free, some paid, and some specific to a vendor's cluster implementation. If you're looking for the easy button, look at purchasing a solution; there are many out there. For this article, however, we'll be deploying something slightly more confusing but free, using Prometheus as a monitoring solution.
Let's start with a few concepts:
  • Prometheus pulls data through a process called a scrape. Scraping is a handy approach as you don't need agents pushing data everywhere, but it can limit your scalability
  • Prometheus uses metric endpoints which are configured using jobs. We'll be installing a couple of endpoints to look at specific pieces of the infrastructure, but if you want your applications to be included, they'll need to support prometheus and have the prometheus annotations set up in their deployment file
  • There's lots of documentation about Prometheus' ability to self monitor and automatically pick up new end points; this is awesome
There are many different endpoints you can choose and things can become confusing very quickly, so hopefully this guide will give you a base implementation to expand on as you see fit for your environment. To do this there will be three base components we'll be setting up:
  • prometheus - the collector and repository of metric data; obviously
  • kube-state-metrics - deep kubernetes metrics such as pod and node utilization. This is kind of like a more detailed version of metrics-server which is itself a replacement for Heapster. It's confusing, I know.
  • node_exporter - collects detailed node metrics. There will be one endpoint per node in the cluster

Namespace And Role Setup

I like the idea of keeping my monitoring pieces in their own namespace, so we'll be creating a monitoring namespace and some roles for the different components to use. I've included the namespace creation in the prometheus cluster role setup; if you're doing things differently, make sure to take that into account. They're long, sorry, and better suited to github, but I wanted to make this as easy to follow as possible:
$ cat clusterRole-prometheus.yaml 
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
And our cluster role and service account setup for kube-state-metrics.
$ cat clusterRole-kube-state.yaml 
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
- apiGroups: ["policy"]
  resources:
  - poddisruptionbudgets
  verbs: ["list", "watch"]
- apiGroups: ["certificates.k8s.io"]
  resources:
  - certificatesigningrequests
  verbs: ["list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources:
  - storageclasses
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring

Prometheus Configuration

The next thing we'll need is a configuration map for Prometheus itself. I've elected to collect data every minute; there are examples where people collect every 5 seconds, so pick what makes sense to you. The scrape jobs are largely specific to the type of end point in use. If you're within the same cluster, the certificates and bearer tokens will automatically be pulled into the appropriate containers, but if not you'll need to reference them directly, which is out of scope for this article.
$ cat prometheus-config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 1m
      evaluation_interval: 1m
      scrape_timeout: 10s
    rule_files:
      - /etc/prometheus/prometheus.rules
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-nodes-cadvisor'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
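Note that the kubernetes-pods and kubernetes-service-endpoints jobs above only keep targets carrying the prometheus.io annotations, so anything else you want scraped needs them in its manifest. As a quick sketch, annotating a hypothetical service named my-app that exposes metrics on port 8080:
[root@kube-master ~]# kubectl annotate service my-app prometheus.io/scrape=true prometheus.io/port=8080 prometheus.io/path=/metrics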

Prometheus Deployment

Once the configuration file has been set up it's time to deploy Prometheus itself. This configuration uses a deployment; you could easily convert it to a stateful set if you like. Please note the persistent volume created here belongs to the monitoring namespace, so if you clean that namespace, for example to reload the environment, you will wipe the old data. Also, keep in mind the storage class in use; I'm using vsphere-ssd from my original blog post, which might not be suitable for your environment. I'm also using a NodePort as it's always available and doesn't require additional network components, but the inbound port will change every time you deploy so that might not be ideal long term.
$ cat prometheus-server.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
  labels:
    app: prometheus
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: "vsphere-ssd"
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      securityContext:
        fsGroup: 65534
      containers:
      - name: prometheus
        image: prom/prometheus:v2.10.0
        volumeMounts:
          - name: prometheus-config-volume
            mountPath: /etc/prometheus/prometheus.yml
            subPath: prometheus.yml
          - name: data
            mountPath: /prometheus
        ports:
        - containerPort: 9090
      volumes:
        - name: prometheus-config-volume
          configMap:
            name: prometheus-server-conf
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
      serviceAccountName: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
spec:
  selector:
    app: prometheus-server
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
  type: NodePort
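If the shifting port becomes a nuisance, a NodePort service will also accept a fixed port, as long as it falls inside the cluster's NodePort range (30000-32767 by default). A minimal sketch of the ports section, with 30090 chosen arbitrarily; the rest of the service stays the same:
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
    nodePort: 30090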

End Point Deployment

Our last two configuration files set up a deployment and internal service for kube-state-metrics, and a daemon set and internal service for node-exporter. Because node-exporter runs as a daemon set, a copy will automatically land on every node as it's added. Self-managed monitoring!
$ cat prometheus-kube-state.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.6.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
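To confirm kube-state-metrics is actually serving data before wiring it into any dashboards, a quick port-forward and curl should return a wall of kube_* metrics (run from a machine with kubectl access to the cluster):
$ kubectl port-forward -n monitoring svc/kube-state-metrics 8080:8080 &
$ curl -s http://localhost:8080/metrics | head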
$ cat prometheus-node-exporter.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    k8s-app: node-exporter
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.18.1
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
      version: v0.18.1
  updateStrategy:
    type: OnDelete
  template:
    metadata:
      labels:
        k8s-app: node-exporter
        version: v0.18.1
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      containers:
        - name: prometheus-node-exporter
          image: prom/node-exporter:v0.18.1
          imagePullPolicy: "IfNotPresent"
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - name: metrics
              containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly:  true
            - name: sys
              mountPath: /host/sys
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
      hostNetwork: true
      hostPID: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "NodeExporter"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
      targetPort: 9100
  selector:
    k8s-app: node-exporter
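Because the daemon set uses hostNetwork and hostPort 9100, each node should answer directly on that port once the pods are up. A quick sanity check against one of the nodes (the hostname here is from my lab, substitute your own) should return node_* metrics:
$ curl -s http://k8s-n2.itlab.domain.com:9100/metrics | head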

Deployment and Cleanup Scripts

There are a lot of yaml files here, and things get annoying when you want to deploy, test, clean, and repeat, so here are two simple scripts that deploy all of these files and then clean everything up again if you need them:
$ cat deploy.sh 
kubectl create -f clusterRole-prometheus.yaml
kubectl create -f prometheus-config-map.yaml
kubectl create -f prometheus-server.yaml
kubectl create -f prometheus-node-exporter.yaml
kubectl create -f clusterRole-kube-state.yaml
kubectl create -f prometheus-kube-state.yaml
The bulk of the cleanup happens when you delete the namespace, but remember, this will also delete the persistent volume hosting the prometheus data.
$ cat cleanup.sh 
kubectl delete namespace monitoring
kubectl delete clusterrolebinding prometheus
kubectl delete clusterrole prometheus
kubectl delete clusterrolebinding kube-state-metrics
kubectl delete clusterrole kube-state-metrics

Accessing Prometheus

You can check the pods and get your access point using these commands. In my cluster, I've got two worker nodes so I've got two node-exporters.
$ kubectl get pods -n monitoring
NAME                                  READY   STATUS    RESTARTS   AGE
kube-state-metrics-699fdf75f8-cqq5t   1/1     Running   0          18h
node-exporter-88297                   1/1     Running   0          18h
node-exporter-wb2lk                   1/1     Running   0          18h
prometheus-server-6f9d9d86d4-m8x4f    1/1     Running   0          18h
And the all-important services are here.
$ kubectl get svc -n monitoring
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
kube-state-metrics   ClusterIP   10.110.207.159   <none>        8080/TCP,8081/TCP   18h
node-exporter        ClusterIP   None             <none>        9100/TCP            18h
prometheus-service   NodePort    10.101.114.89    <none>        9090:30044/TCP      18h
You can see our NodePort service, which means I can hit either node on port 30044 and be redirected to my pod on port 9090. Let's try that in a browser: http://k8s-n2.itlab.domain.com:30044/graph, where you should be presented with the Prometheus dashboard.

If you click on Status > Targets you should see everything Prometheus is scraping, and if you hover over the Labels you'll see a lot of information, including job="job_name". This can be particularly useful for tying results back to your config map, especially when some things show as down.
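The same information is available from Prometheus' HTTP API if you'd rather check from a terminal; with jq installed, something along these lines lists each target's job and health:
$ curl -s http://k8s-n2.itlab.domain.com:30044/api/v1/targets | \
    jq '.data.activeTargets[] | {job: .labels.job, health: .health}'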

If you have been following my logging article, there are annotations in there for Prometheus which now become useful, as any metrics Fluent Bit provides are automatically added. Targets should look something like this, with all endpoints in a state of up.

Grafana Integration

The last step in this guide is where the real work begins. If you don't have a Grafana instance up and running, I'd suggest setting one up on a separate Linux box. There are some excellent guides with essentially one command to run: https://grafana.com/docs/installation/rpm/.

After that's done, add a data source of type Prometheus using the web URL you used to access the dashboard above, in my case http://k8s-n2.itlab.domain.com:30044, and you can start creating dashboards. There are also some prebuilt ones publicly hosted on Grafana's website, https://grafana.com/dashboards, which you can add simply by pasting the dashboard ID into the Grafana import dialog as documented by Grafana: https://grafana.com/docs/reference/export_import/.
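If you'd rather not click through the UI, Grafana can also pick the data source up from a provisioning file. A minimal sketch, assuming a default RPM install where provisioning files live under /etc/grafana/provisioning/datasources/ (the filename is arbitrary):
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://k8s-n2.itlab.domain.com:30044
    isDefault: true
Restart grafana-server after adding it and the data source should appear.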

Notes

This is just scratching the surface of monitoring, but hopefully it gives you enough of a framework to build from. Alertmanager is a key piece yet to be discussed, as is long-term storage, and of course dashboards; lots and lots of dashboards. Some articles I found particularly helpful:

Friday, May 24, 2019

Kubernetes Logging With Fluent Bit

Centralized logging is a requirement in any kubernetes installation. With the defaults in place, logs are kept on each node and won't persist as pods come and go. This is a big problem, as you really need your logs when things go wrong, not when they're working well, so I set out to establish a central log. All of the examples I could find referenced keeping logs in /var/log/containers/*.log, which is great, but not in use on a systemd system. Since pretty much every Linux distribution uses systemd these days, this is my attempt to provide a logging configuration to support my base install.

There are several ways to log, which can seem confusing; to me, most of these methods are way too complicated and error prone. Kubernetes.io has this to say:
"Because the logging agent must run on every node, it’s common to implement it as either a DaemonSet replica, a manifest pod, or a dedicated native process on the node. However the latter two approaches are deprecated and highly discouraged."
So, using a DaemonSet is the way to go. With what? Well, Fluent Bit is the easiest way I've found, once you have a working configuration file for your setup. Originally I wanted to use Graylog as my collection and presentation layer, but I found that the fluent pieces just aren't mature enough to deal with GELF correctly, so I eventually settled on an ELK stack; much better.

There are some excellent tutorials on the web covering how to install Elasticsearch, Logstash, and Kibana. If you're using the repositories, it can't get much easier, so please follow one of those; https://computingforgeeks.com/how-to-install-elk-stack-on-centos-fedora/ is a good example.

Namespace And Role Setup

I like namespaces, I like roles, so the first thing to do is set those up. Because we're going to be running a daemon set that catches all pod and node activity, we need to set up a cluster role and a corresponding cluster role binding like this:
$ cat fluent-bit-role.yaml 
apiVersion: v1
kind: Namespace
metadata:
  name: logging
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging
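Once this file has been applied (see the Deploying section below), you can sanity-check the permissions by impersonating the service account; it should answer yes for pods and namespaces:
$ kubectl auth can-i list pods --as=system:serviceaccount:logging:fluent-bit
yes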

Fluent Bit Configuration

A ConfigMap seems to be the most popular way to manage Fluent Bit's configuration. This is by no means exhaustive but should provide a pretty good template. A couple of general notes:
  • The name of the input or filter determines its type. It feels like just a string, but it isn't; you're actually defining the type with this field
  • Documentation is pretty good once you figure out which plugin you're dealing with, for example, the systemd input documentation
  • There's no magical way to get the logs; the daemon set needs access to the node's logs, including the systemd journal. On CentOS 7 this is kept under /run/log/journal/<UUID>/*.journal. We'll talk about this again when building the daemon set itself
I wanted to "enrich" my node logs with kubernetes information, as without it you'll be missing key details like the namespace, container name, and other app-specific labels. To do this you need a kubernetes filter, with the name kubernetes (remember, this is a type, not just a name), and a Match entry that aligns with the input you'd like to pair it with. In this example there's only one input and one filter, so hopefully it makes sense. The Kube_URL entry is whatever URL you can reach the kubernetes management API on; it will be queried to fill in the missing pieces, merging that data with the log entry. It should be visible with 'kubectl get svc'.
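For example, the default kubernetes service in the default namespace is what backs https://kubernetes.default.svc:443; the ClusterIP shown will differ per cluster:
$ kubectl get svc kubernetes -n default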
$ cat fluent-bit-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
  labels:
    k8s-app: fluent-bit
data:
  # Configuration files: server, input, filters and output
  # ======================================================
  fluent-bit.conf: |
    [SERVICE]
        Flush              1
        Log_Level          info
        Daemon             off
        HTTP_Server        On
        HTTP_Listen        0.0.0.0
        HTTP_Port          2020

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-elasticsearch.conf

  input-kubernetes.conf: |
    [INPUT]
        Name                systemd
        Tag                 kube_systemd.*
        Path                /run/log/journal
        DB                  /var/log/flb_kube_systemd.db
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Read_From_Tail      On
        Strip_Underscores   On

  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube_systemd.*
        Kube_URL            https://kubernetes.default.svc:443
        Annotations         On
        Labels              On
        Merge_Log           On
        K8S-Logging.Parser  On
        Use_Journal         On
    
  output-elasticsearch.conf: |
    [OUTPUT]
        Name                es
        Match               *
        Host                elasticsearch.prod.int.com
        Port                9200
        Index               k8s-lab

Fluent Bit DaemonSet

At last we'll configure fluent bit as a daemon set. Some general notes here:
  • I found it's possible to debug quickly by running a -debug variant, but anything newer than 1.0.4-debug lacked the ability to run /bin/sh, replacing it instead with busybox, which seemed more complicated for my needs.
    • To run a debug version you'd use an image like fluent/fluent-bit:1.0.4-debug
  • As mentioned in the config map section, you need to mount the node's systemd location; in my case, /run/log
$ cat fluent-bit-daemon-set.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.1.1
        imagePullPolicy: Always
        ports:
          - containerPort: 2020
        volumeMounts:
        - name: systemdlog
          mountPath: /run/log
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
      - name: systemdlog
        hostPath:
          path: /run/log
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule

Deploying

Now comes the easy part, deploying all the yaml files:
$ kubectl create -f fluent-bit-role.yaml
$ kubectl create -f fluent-bit-configmap.yaml
$ kubectl create -f fluent-bit-daemon-set.yaml
And that's it. You should start to see logs flowing into your Elasticsearch instance, and consequently Kibana, where you can start to visualize them and create dashboards.
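A couple of quick checks I find useful before digging into Kibana: confirm there's a fluent-bit pod on every node, and tail the logs of the pods to make sure they aren't complaining about the journal path or the Elasticsearch host:
$ kubectl get pods -n logging -o wide
$ kubectl logs -n logging -l k8s-app=fluent-bit-logging --tail=20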

Clean Up

If you need to remove everything for testing or other purposes, namespaces make this really easy. The only pieces remaining are the cluster role and binding, which can be removed like this:
$ kubectl delete namespace logging
$ kubectl delete clusterrolebinding fluent-bit-read
$ kubectl delete clusterrole fluent-bit-read