Wednesday, July 24, 2019

Windows Kubernetes Nodes

It's happened: someone has asked you for a windows container. The first piece of advice to give: find a way, any way possible, to run the service within a linux container. Not being knowledgeable with linux shouldn't be an excuse, as most containers require very little linux expertise anyway. Help the windows developers migrate to linux; everyone will be happier.

That being said, sometimes re-coding a service for linux just isn't practical. And while things are pretty bad now, Microsoft is a really big company that seems to want into this market; they'll put effort into making things better over time. As an example of this, their documentation is quite good, way better than most of the linux documentation in my opinion. Have a look at it, as portions of this guide are lifted directly from it [https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/getting-started-kubernetes-windows]

Cluster Setup

You'll need at least one linux box to serve as a master, although we're going to use a few more to host other infrastructure services. You can follow my previous blog post with one exception: you can't run Calico as the Container Network Interface (CNI). Well, technically you can, but Calico for windows is provided only as a subscription service, and Microsoft only documents networking with Flannel as the CNI, so that's what we'll use here.

**When you initialize the cluster, be sure to use Flannel's default pod CIDR of 10.244.0.0/16 or you'll have problems when setting up the windows nodes**
[root@kube-master ~]# kubeadm init --pod-network-cidr=10.244.0.0/16
Setting up Flannel is pretty easy: you'll download the flannel yaml file and make some Microsoft-specific changes, notably the VNI and port number, as documented on github and by Microsoft.
[root@kube-master ~]# wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Within the ConfigMap section you'll have the following under net-conf.json:
net-conf.json: |
{
  "Network": "10.244.0.0/16",
  "Backend": {
   "Type": "vxlan"
  }
}
We need to add the required VNI and Port information like this:
net-conf.json: |
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan",
    "VNI": 4096,
    "Port": 4789
  }
}
And install Flannel like this:
[root@kube-master ~]# kubectl create -f kube-flannel.yml

CNI Notes

What's wrong with running Flannel for all clusters? Flannel allocates a separate private network to each node, which is then encapsulated within UDP and passed to other nodes in the cluster. Microsoft supports two Flannel modes: vxlan mode (documented here), which creates a virtual overlay network to handle routes between nodes automatically, and host-gateway mode, which seems insane to me as it requires a static route on each node to every other node's pod subnet; I don't recommend it.

Calico, on the other hand, uses simple L3 routing within the cluster, so it's much easier to see where traffic is going and where it came from. I like the idea of Calico better, but it isn't a real option without a subscription, so I'll stick with Flannel on my windows cluster. There are a few decent articles on the differences between the two.

Windows Nodes

You'll need to install Windows Server 2019. I use 2019 Standard with Desktop Experience as I like to RDP to the box, but maybe you're an extreme windows guru and can do all of this without it. I've disabled the firewall and installed VMware tools. Joining a domain is entirely optional as we aren't going to use any of the domain services. If you do join, make sure you treat this as a high-performance server, so take care with patch schedules and extra windows features like virus scanning. You'll also need to ensure your patch level is high enough; I recommend running Microsoft Update again and again, as you'll get new patches after a reboot. The version you're running should be at least 17763.379, as provided by KB4489899. You can find this by running winver.
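If you'd rather check from powershell than with winver, a quick sketch like this does the same job by reading the build and update revision from the registry; the output below is just an example, anything at or above 17763.379 is fine.
PS C:\Users\Administrator> $v = Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion'
PS C:\Users\Administrator> "$($v.CurrentBuild).$($v.UBR)"
17763.615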

As mentioned before, Microsoft has done a good job documenting the steps so feel free to follow along there too. Everything should be done with an elevated powershell prompt (run as administrator). These first steps will add the repository and install docker.
PS C:\Users\Administrator> Install-Module -Name DockerMsftProvider -Repository PSGallery
PS C:\Users\Administrator> Install-Package -Name Docker -ProviderName DockerMsftProvider
Reboot the machine and check docker is running properly
PS C:\Users\Administrator> Restart-Computer
PS C:\Users\Administrator> docker version
Client: Docker Engine - Enterprise
 Version:           19.03.0
 API version:       1.40
 Go version:        go1.12.5
 Git commit:        87b1f470ad
 Built:             07/16/2019 23:41:30
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.12.5
  Git commit:       87b1f470ad
  Built:            07/16/2019 23:39:21
  OS/Arch:          windows/amd64
  Experimental:     false
If you get an error here that looks like this:
error during connect: Get http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.39/version: open //./pipe/docker_engine: The system cannot find the file specified. In the default daemon configuration on Windows, the docker client must be run elevated to connect. This error may also indicate that the docker daemon is not running.
It just means the docker service didn't start on boot. Start it from services or from powershell using Start-Service docker.

Create Pause Image

A pause image also runs on your linux nodes, but automatically; here we need to handle it manually, including downloading it, tagging it, and checking that it runs correctly.
PS C:\Users\Administrator> docker pull mcr.microsoft.com/windows/nanoserver:1809
PS C:\Users\Administrator> docker tag mcr.microsoft.com/windows/nanoserver:1809 microsoft/nanoserver:latest
PS C:\Users\Administrator> docker run microsoft/nanoserver:latest
Microsoft Windows [Version 10.0.17763.615]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\>

Download Node Binaries

You'll need several binaries available from Kubernetes' github page. The version should match the server as closely as possible. The official skew policy can be found at kubernetes.io, and if you want to see your client and server versions you can use this command:
[root@kube-master ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
This is saying my client and server are both version 1.15.1. To download the corresponding node binaries, you can use this link [https://github.com/kubernetes/kubernetes/releases/], select the CHANGELOG-<version>.md link and download the node binaries for windows. In this case the latest is 1.15.1, so that works out well.

I used unix to expand the node binaries; either a mac or your master node will work fine using tar zxvf kubernetes-node-windows-amd64.tar.gz, but you can also expand it on windows with expand-archive. Once that's done, copy all the executables under the expanded kubernetes/node/bin/* to c:\k. I know lots of people will want to change that \k folder, but don't: Microsoft has hard coded it into many of the scripts we'll be using, so save yourself the headache and just go with it.
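For reference, here's roughly what that download and extraction looks like on the master node, assuming version 1.15.1 and the kubernetes-node-windows-amd64.tar.gz link listed in the CHANGELOG; the binaries end up under kubernetes/node/bin/.
[root@kube-master ~]# wget https://dl.k8s.io/v1.15.1/kubernetes-node-windows-amd64.tar.gz
[root@kube-master ~]# tar zxvf kubernetes-node-windows-amd64.tar.gz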

You'll also need to grab /etc/kubernetes/admin.conf from the master node and place it in c:\k as a file named config, and download Microsoft's start script. For all of these, I used a shared folder within my RDP session, but WinSCP is also a good tool if you don't mind installing more software on your worker nodes. It should look like this when you're done.
PS C:\Users\Administrator> mkdir c:\k
PS C:\Users\Administrator> wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/start.ps1 -o c:\k\start.ps1
<download and transfer kubernetes node binaries and config file>
PS C:\k> dir
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        7/23/2019   2:12 PM           5447 config
-a----        7/18/2019   2:55 AM       40072704 kube-proxy.exe
-a----        7/18/2019   2:55 AM       40113152 kubeadm.exe
-a----        7/18/2019   2:55 AM       43471360 kubectl.exe
-a----        7/18/2019   2:55 AM      116192256 kubelet.exe
-a----        7/23/2019   2:01 PM           2447 start.ps1

Joining A Windows Node

You're finally ready to join a windows node! Again, you can have a look at the documentation, but if you've been following along, you'll only need two options.
  • ManagementIP - this is unfortunate as it'll require more scripting when you're ready to automate. It's the IP address of this worker node which you can get from ipconfig on your windows node
  • NetworkMode - we're using vxlan and the default is l2bridge so this will need to be set to overlay
Other fields should be fine with their defaults, but you can check them with these commands:
  • ServiceCIDR - verify with kubectl cluster-info dump | grep -i service-cluster
  • ClusterCIDR - check with kubectl cluster-info dump | grep -i cluster-cidr
  • KubeDnsServiceIP - verify the default (10.96.0.10) with kubectl get svc -n kube-system. Cluster-IP is the field you're interested in.
When you run the start.ps1 script it'll download a lot of additional scripts and binaries, eventually spawning a few new powershell windows and leaving the logging one open, which can be very helpful at this stage. Run the following, replacing the IP with your local windows server IP address (from ipconfig):
PS C:\k> .\start.ps1 -ManagementIP 10.9.176.94 -NetworkMode overlay

Initial Problems

I had trouble getting the kubelet process to start. You'll notice the node doesn't go ready, and if you look at the processes it will have flannel and kube-proxy but no kubelet. It seems the start-kubelet.ps1 script that's downloaded uses outdated flags, so to fix that, remove the --allow-privileged=true line (marked below) from start-kubelet.ps1.
$kubeletArgs = @(
    "--hostname-override=$(hostname)"
    '--v=6'
    '--pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0'
    '--resolv-conf=""'
    '--allow-privileged=true'    # remove this line (outdated flag)
    '--enable-debugging-handlers'
    "--cluster-dns=$KubeDnsServiceIp"
    '--cluster-domain=cluster.local'
    '--kubeconfig=c:\k\config'
    '--hairpin-mode=promiscuous-bridge'
    '--image-pull-progress-deadline=20m'
    '--cgroups-per-qos=false'
    "--log-dir=$LogDir"
    '--logtostderr=false'
    '--enforce-node-allocatable=""'
    '--network-plugin=cni'
    '--cni-bin-dir="c:\k\cni"'
    '--cni-conf-dir="c:\k\cni\config"'
    "--node-ip=$(Get-MgmtIpAddress)"
)

I also had a problem provisioning persistent volumes, even though they weren't for the windows node. If kubernetes can't identify all nodes in the cluster, it won't do anything. The error looks like this:
I0723 23:57:11.379621       1 event.go:258] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"715df13c-8eeb-4ba4-9be1-44c8a5f03071", APIVersion:"v1", ResourceVersion:"480073", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vsphere-ssd": No VM found
E0723 23:59:26.375664       1 datacenter.go:78] Unable to find VM by UUID. VM UUID: 
E0723 23:59:26.375705       1 nodemanager.go:431] Error "No VM found" node info for node "kube-w2" not found
E0723 23:59:26.375718       1 vsphere_util.go:130] Error while obtaining Kubernetes node nodeVmDetail details. error : No VM found
E0723 23:59:26.375727       1 vsphere.go:1291] Failed to get shared datastore: No VM found
E0723 23:59:26.375787       1 goroutinemap.go:150] Operation for "provision-default/test[715df13c-8eeb-4ba4-9be1-44c8a5f03071]" failed. No retries permitted until 2019-07-24 00:01:28.375767669 +0000 UTC m=+355638.918509528 (durationBeforeRetry 2m2s). Error: "No VM found"
I0723 23:59:26.376127       1 event.go:258] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"715df13c-8eeb-4ba4-9be1-44c8a5f03071", APIVersion:"v1", ResourceVersion:"480073", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vsphere-ssd": No VM found
And my eventual solution was to reboot the master node. Sad, yes.

Updating A Node UUID

As with our linux nodes, you'll need to patch the node spec with the UUID of the node. Under windows you can retrieve that UUID with the following command, but you'll need to reformat it.
PS C:\k> wmic bios get serialnumber
SerialNumber
VMware-42 3c fe 01 af 23 a9 a5-65 45 50 a3 db db 9d 69
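If you don't feel like reformatting that by hand, a short powershell sketch along these lines (assuming the VMware- prefixed serial number shown above) will turn it into the UUID format the patch expects.
PS C:\k> $serial = (Get-WmiObject Win32_BIOS).SerialNumber
PS C:\k> $hex = ($serial -replace 'VMware-','' -replace '[ -]','').ToUpper()
PS C:\k> '{0}-{1}-{2}-{3}-{4}' -f $hex.Substring(0,8),$hex.Substring(8,4),$hex.Substring(12,4),$hex.Substring(16,4),$hex.Substring(20,12)
423CFE01-AF23-A9A5-6545-50A3DBDB9D69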
And back on our kubernetes master node we'd patch the node like this
[root@k8s-master ~]# kubectl patch node <node_name> -p '{"spec":{"providerID":"vsphere://423CFE01-AF23-A9A5-6545-50A3DBDB9D69"}}'

Patching DaemonSets

A DaemonSet gets a pod pushed to every node in the cluster. That's generally bad here because most things don't run on windows, so to prevent it you'll need to patch the existing DaemonSets and use a node selector for applications you produce. You can download the patch from Microsoft or create your own file; it's pretty basic. If you've been following along with the code provided on github, those files already have the node selector set.
[root@kube-master ~]# wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/l2bridge/manifests/node-selector-patch.yml
[root@kube-master t]# cat node-selector-patch.yml 
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
We'll need to apply it to existing DaemonSets, notably kube-proxy and kube-flannel-ds-amd64.
[root@kube-master ~]# kubectl patch ds/kube-flannel-ds-amd64 --patch "$(cat node-selector-patch.yml)" -n=kube-system
[root@kube-master ~]# kubectl patch ds/kube-proxy --patch "$(cat node-selector-patch.yml)" -n=kube-system
If you've been getting errors on your windows node from flannel saying things like Error response from daemon: network host not found and Error: no such container, those should now stop.

Deploying A Test Pod

I'd suggest using the Microsoft-provided yaml file, although I reduced the number of replicas to 1 to simplify any troubleshooting.
[root@kube-master ~]# wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/l2bridge/manifests/simpleweb.yml -O win-webserver.yaml
[root@kube-master ~]# kubectl apply -f win-webserver.yaml
[root@kube-master ~]# kubectl get pods -o wide

Registering A Service

Every time you reboot you'll need to run the start command manually, which isn't all that useful. Microsoft has created some excellent instructions and a script to register the required services using the Non-Sucking Service Manager (nssm). Follow the instructions provided by Microsoft, which basically amount to placing both the sample script, called register-svc.ps1, and the nssm.exe binary into c:\k.
PS C:\k> wget https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/flannel/register-svc.ps1 -o c:\k\register-svc.ps1
I did have problems with the default script: it references an incorrect pause image and has the same allow-privileged problem described above. To fix that, edit register-svc.ps1 and, in the kubelet registration, replace --pod-infra-container-image=kubeletwin/pause with --pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0 and remove --allow-privileged=true. It should be line 25 and will look like this when you're done:
.\nssm.exe set $KubeletSvc AppParameters --hostname-override=$Hostname --v=6 --pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0 --resolv-conf="" --enable-debugging-handlers --cluster-dns=$KubeDnsServiceIP --cluster-domain=cluster.local --kubeconfig=c:\k\config --hairpin-mode=promiscuous-bridge --image-pull-progress-deadline=20m --cgroups-per-qos=false  --log-dir=$LogDir --logtostderr=false --enforce-node-allocatable="" --network-plugin=cni --cni-bin-dir=c:\k\cni --cni-conf-dir=c:\k\cni\config
Once that's fixed, you can register your services with this command where ManagementIP is the windows node IP.
PS C:\k> .\register-svc.ps1 -ManagementIP <windows_node_ip> -NetworkMode overlay
You should see the services registered and running. If you get errors like these, it's probably because register-svc.ps1 wasn't edited correctly.
Service "flanneld" installed successfully!
Set parameter "AppParameters" for service "flanneld".
Set parameter "AppEnvironmentExtra" for service "flanneld".
Set parameter "AppDirectory" for service "flanneld".
flanneld: START: The operation completed successfully.
Service "kubelet" installed successfully!
Set parameter "AppParameters" for service "kubelet".
Set parameter "AppDirectory" for service "kubelet".
kubelet: Unexpected status SERVICE_PAUSED in response to START control.
Service "kube-proxy" installed successfully!
Set parameter "AppDirectory" for service "kube-proxy".
Set parameter "AppParameters" for service "kube-proxy".
Set parameter "DependOnService" for service "kube-proxy".
kube-proxy: START: The operation completed successfully.
If you've already added the services and need to make changes, you can do that by either editing the services or removing them and re-registering, using the commands listed below.
PS C:\k> .\nssm.exe edit kubelet
PS C:\k> .\nssm.exe edit kube-proxy
PS C:\k> .\nssm.exe edit flanneld
PS C:\k> .\nssm.exe remove kubelet confirm
PS C:\k> .\nssm.exe remove kube-proxy confirm
PS C:\k> .\nssm.exe remove flanneld confirm
Reboot to verify your node re-registers with kubernetes correctly and that you can deploy a pod using the test above.

Deleting/Re-adding A Windows Node

If you delete a windows node, such as with kubectl delete node <node_name>, adding it back is pretty easy. Because the windows nodes have the kubernetes config file, they re-register automatically on every service start. You might need to remove the existing flannel configuration files and then reboot.
PS C:\k> Remove-Item C:\k\SourceVip.json
PS C:\k> Remove-Item C:\k\SourceVipRequest.json
PS C:\k> Restart-Computer

Broken Kubernetes Things With Windows Nodes

Pretty much everything is broken. You'll be able to deploy a windows container to a windows node using the node selector spec entry like we did when patching the daemonsets above; just put windows as the OS type instead of linux (see the snippet after this list). Here's a list of things that are broken, which I'll update when possible:
  • Persistent Volumes - you need to ensure the node is registered properly with vsphere or nothing will be able to use a persistent volume. This is because vsphere ensures all nodes can see a datastore without making a distinction between windows and linux. I can get a PV to appear on a windows node but I can't get it to initialize properly
  • Node Ports - this is a documented limitation: you can't access a NodePort service from the node hosting the pod. Strange, yes, but you should be able to use any linux node as an entry point for windows pods
  • Load Balancer - version 0.8.0 of MetalLB includes the os selector patch, and it should work by forwarding connections through the available linux nodes, but I haven't had any success yet
  • DNS - untested as of yet because of load balancer problems
  • Logging - should be possible as fluent bit has beta support for windows, but it's untested so far
    • Fluent bit does have some documentation for installing under windows, possibly on the node itself as there isn't a docker container readily available, but none of the links work. Perhaps not yet (Aug 2019)
  • Monitoring - should also be possible using the WMI exporter rather than the node exporter; again, untested at this time
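To target a windows node from your own deployments, use the same selector as the daemonset patch above with the OS flipped; a minimal spec fragment using the 1.15-era beta label looks something like this.
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: windows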

Friday, July 12, 2019

Kubernetes Infrastructure Overview

I've posted several blog entries to set up various parts of an on-premise Kubernetes installation. This is meant as a summary referencing code posted to github for easy access. You can clone the entire repository, edit the required files and use the deploy.sh/cleanup.sh scripts, or run the deployment directly from github as documented below. Each of the headers below is a link to the corresponding blog post describing the process in detail.

If you'd like to clone the code run this command.
[root@kube-master ~]# git clone https://github.com/mike-england/kubernetes-infra.git

Cluster Install

While this can be automated through templates or tools like terraform, for now I recommend following the post specifically for this.

Logging

This setup can be almost entirely automated, but unfortunately you'll need to modify the elasticsearch output in the config file
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-role.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-configmap.yaml
<modify the output server entry elasticsearch.prod.int.com and the index to match your kubernetes cluster name>
[root@kube-master ~]# kubectl create -f fluent-bit-configmap.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-daemon-set.yaml

Load Balancing

Installation of metallb is straightforward. As with logging, you'll need to modify the config map, this time changing the IP range. If you're running a cluster with windows nodes, be sure to patch the metallb daemonset so it doesn't get deployed to any of those nodes.
[root@kube-master ~]# kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/metal-config.yaml
<modify ip address range>
[root@kube-master ~]# kubectl create -f metal-config.yaml
If you're running a mixed cluster with windows nodes:
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/node-selector-patch.yaml
[root@kube-master ~]# kubectl patch ds/speaker --patch "$(cat node-selector-patch.yaml)" -n=metallb-system

Monitoring

Assuming you have the load balancer installed above, you should be able to deploy monitoring without any changes.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-prometheus.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-config-map.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-server.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-node-exporter.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-kube-state.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-kube-state.yaml

DNS Services

Again, with the load balancer in place, this should be deployable as is.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/dns-namespace.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/etcd.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/external-dns.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/coredns.yaml

Tuesday, July 2, 2019

External DNS For Kubernetes Services

A service isn't useful if you can't access it, and while IP addresses are nice, they don't really help deliver user-facing services. Really we want DNS, but given the dynamic nature of kubernetes, it's impractical to implement the static configurations of the past. To solve that, we're going to implement ExternalDNS for kubernetes, which will scan services and ingress points to automatically create and destroy DNS records for the cluster. Of course, nothing is completely simple in kubernetes, so we'll need a few pieces in place:
  • ExternalDNS - the scanning engine to create and destroy DNS records
  • CoreDNS - a lightweight kubernetes based DNS server to respond to client requests
  • Etcd - a key/value store to hold DNS records

Namespace

The first thing we're going to need is a namespace to put things in. I normally keep this with one of the key pieces, but felt it was better as a separate file in this case.
$ cat dns-namespace.yaml 
apiVersion: v1
kind: Namespace
metadata:
  name: dns

Etcd Cluster Setup

Technically we only need one etcd node as we don't really need the data to persist; it would just be regenerated on the next scan. Losing it would still halt all non-cached dns queries, though, so I opted to create 3 instances. I didn't want to use an external etcd discovery service, so I needed predictable pod names, and to get those we need a stateful set rather than a deployment. A pod that's lost from the stateful set won't rejoin the cluster without a persistent volume containing its configuration information, which is why each instance gets a small pv.

If you're going to change any of the names, make sure the service name "etcd-dns" exactly matches the stateful set name. If it doesn't, kubernetes won't create an internal DNS record and the nodes won't be able to find each other; speaking from experience.
$ cat etcd.yaml 
apiVersion: v1
kind: Service
metadata:
  name: etcd-dns
  namespace: dns
spec:
  ports:
  - name: etcd-client
    port: 2379
    protocol: TCP
  - name: etcd-peer
    port: 2380
    protocol: TCP
  selector:
    app: etcd-dns
  publishNotReadyAddresses: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-dns
  namespace: dns
  labels:
    app: etcd-dns
spec:
  serviceName: "etcd-dns"
  replicas: 3
  selector:
    matchLabels:
      app: etcd-dns
  template:
    metadata:
      labels:
        app: etcd-dns
    spec:
      containers:
      - name: etcd-dns
        image: quay.io/coreos/etcd:latest
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        env:
        - name: CLUSTER_SIZE
          value: "3"
        - name: SET_NAME
          value: "etcd-dns"
        volumeMounts:
        - name: datadir
          mountPath: /var/run/etcd
        command:
          - /bin/sh
          - -c
          - |
            IP=$(hostname -i)
            PEERS=""
            for i in $(seq 0 $((${CLUSTER_SIZE} - 1))); do
                PEERS="${PEERS}${PEERS:+,}${SET_NAME}-${i}=http://${SET_NAME}-${i}.${SET_NAME}:2380"
            done

            exec /usr/local/bin/etcd --name ${HOSTNAME} \
              --listen-peer-urls http://${IP}:2380 \
              --listen-client-urls http://${IP}:2379,http://127.0.0.1:2379 \
              --advertise-client-urls http://${HOSTNAME}.${SET_NAME}:2379 \
              --initial-advertise-peer-urls http://${HOSTNAME}.${SET_NAME}:2380 \
              --initial-cluster-token etcd-cluster-1 \
              --initial-cluster ${PEERS} \
              --initial-cluster-state new \
              --data-dir /var/run/etcd/default.etcd
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
Cluster initialization is the most complicated part of this set. We're running some shell commands within the newly booted pod to fill in the required values, with the PEERS variable looking like this when it's done. Could you hard code it? Sure, but that would complicate things if you change the set name or number of replicas. You can also do lots of fancy things to remove, add, or rejoin nodes, but we don't really need more than an initial static value (three in this case), so I'll keep things simple. You can check out the links in the notes section for more complicated examples.
etcd-dns-0=http://etcd-dns-0.etcd-dns:2380,etcd-dns-1=http://etcd-dns-1.etcd-dns:2380,etcd-dns-2=http://etcd-dns-2.etcd-dns:2380
If you'd like to enable https on your etcd cluster, you can easily do so by adding --auto-tls and --peer-auto-tls but this will create problems getting coredns and external-dns to connect without adding the certs there too.

CoreDNS Setup

As the endpoint that actually serves client requests, this is also an important piece to keep running; however, we don't really care about the data as it's backed by etcd. To handle this, we'll use a 3-pod deployment with a front-end service. The service type is LoadBalancer, making it easily available to clients, so make sure you have one available. If you don't, see my previous post to install and configure MetalLB.

You might also notice that we're opening up both the TCP and UDP DNS ports but only exposing UDP from the load balancer. This is largely because a load balancer service can't expose both UDP and TCP at the same time, so feel free to remove TCP if you like. At some point I hope multi-protocol load balancers will be easier to manage, so for now I'm leaving it in.
$ cat coredns.yaml 
apiVersion: v1
kind: Service
metadata:
  name: coredns
  namespace: dns
spec:
  ports:
  - name: coredns
    port: 53
    protocol: UDP
    targetPort: 53
  selector:
    app: coredns
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: dns
data:
  Corefile: |
    . {
        errors
        health
        log
        etcd {
           endpoint http://etcd-dns:2379
        }
        cache 30
        prometheus 0.0.0.0:9153
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: dns
  labels:
    app: coredns
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coredns
  template:
    metadata:
      labels:
        app: coredns
        k8s_app: kube-dns
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9153"
        prometheus.io/path: /metrics
    spec:
      containers:
      - name: coredns
        image: coredns/coredns:latest
        imagePullPolicy: IfNotPresent
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
There are quite a few plugins [https://coredns.io/plugins/] you can apply to your coredns implementation, some of which you might want to play with. The documentation for these is quite good and they're easy to use: they go in the ConfigMap alongside the errors and health entries. Just add the plugin name and any parameters it takes on a line and you're good to go. You may want to remove the log entry if your dns server is really busy or you don't want to see the continual stream of dns updates.

I'll also make special mention of the . { } block in the config map. This tells coredns to accept queries for any domain, which might not be to your liking. In my opinion this provides the most flexibility, as this shouldn't be your site's primary DNS server; requests for a specific domain or subdomain should be forwarded here from your primary DNS. However, if you want to change this, you'd simply enter one or more blocks such as example.org { } instead of . { }.
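As a sketch, a Corefile scoped to a single domain, with one extra plugin line added (reload, which picks up Corefile changes without restarting the pods), might look something like this inside the same ConfigMap:
    example.org {
        errors
        health
        log
        reload
        etcd {
           endpoint http://etcd-dns:2379
        }
        cache 30
        prometheus 0.0.0.0:9153
    }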

External DNS

Finally, the reason we're here: deploying external-dns to our cluster. A couple of notes here: I've chosen to scan the cluster for new or missing services every 15 seconds. This makes the DNS system feel very snappy when creating a service, but might be too much or too little for your environment. I found the documentation particularly frustrating here; the closest example I found using coredns leverages minikube, with confusing options and commands to diff a helm chart, which doesn't feel very complete or intuitive to me.
$ cat external-dns.yaml 
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get","watch","list"]
- apiGroups: ["extensions"]
  resources: ["ingresses"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: registry.opensource.zalan.do/teapot/external-dns:latest
        args:
        - --source=service
        - --source=ingress
        - --provider=coredns
        - --registry=txt
        - --log-level=info
        - --interval=15s
        env:
          - name: ETCD_URLS 
            value: http://etcd-dns:2379
I've left the log-level entry in, although the default is info anyway, as it's a helpful placeholder when you want or need to change it. The log options, which I couldn't find documented anywhere and had to dig out of the code, are: panic, debug, info, warning, error, fatal. You'll also notice a reference to our etcd cluster service here, so if you've changed that name make sure you change it here too.

Deployment and Cleanup Scripts

As I like to do, here are some quick deployment and cleanup scripts which can be helpful when testing over and over again:
$ cat deploy.sh 
kubectl create -f dns-namespace.yaml
kubectl create -f etcd.yaml
kubectl create -f external-dns.yaml
kubectl create -f coredns.yaml
As a reminder, deleting the namespace will clean up all the persistent volumes too. All of the data will be recreated on the fly, but it means a few extra seconds for the system to reclaim them and recreate them when you redeploy.
$ cat cleanup.sh 
kubectl delete namespace dns
kubectl delete clusterrole external-dns
kubectl delete clusterrolebinding external-dns-viewer

Success State

I also had trouble finding out what good looked like so here's what you're looking for in the logs:
$ kubectl logs -n dns external-dns-57959dcfd8-fgqpn
time="2019-06-27T01:45:21Z" level=error msg="context deadline exceeded"
time="2019-06-27T01:45:31Z" level=info msg="Add/set key /skydns/org/example/nginx/66eeb21d to Host=10.9.176.196, Text=\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/default/nginx-frontend\", TTL=0"
The actual pod name will be different for you as we used a deployment; you can get the exact name using kubectl get pods -n dns. In this example, the "context deadline exceeded" is bad: it means external-dns wasn't able to register the entry with etcd, in this case because that cluster was still booting. The last line shows a successful update into etcd.

Etcd logs too much to post in full here, but you'll see entries indicating it can't resolve a host as the pods boot up, and potentially several MsgVote requests as the services start on all of the pods. In the end it should establish a peer connection with all of the nodes and indicate the api is enabled.
$ kubectl logs -n dns etcd-dns-0
2019-06-27 01:45:15.124897 W | rafthttp: health check for peer c77fa62c6a3a8c7e could not connect: dial tcp: lookup etcd-dns-1.etcd-dns on 10.96.0.10:53: no such host
2019-06-27 01:45:15.128194 W | rafthttp: health check for peer dcb7067c28407ab9 could not connect: dial tcp: lookup etcd-dns-2.etcd-dns on 10.96.0.10:53: no such host

2019-06-27 01:45:15.272084 I | raft: 7300ad5a4b7e21a6 received MsgVoteResp from 7300ad5a4b7e21a6 at term 4
2019-06-27 01:45:15.272096 I | raft: 7300ad5a4b7e21a6 [logterm: 1, index: 3] sent MsgVote request to c77fa62c6a3a8c7e at term 4
2019-06-27 01:45:15.272105 I | raft: 7300ad5a4b7e21a6 [logterm: 1, index: 3] sent MsgVote request to dcb7067c28407ab9 at term 4
2019-06-27 01:45:17.127836 E | etcdserver: publish error: etcdserver: request timed out

2019-06-27 01:45:41.087147 I | rafthttp: peer dcb7067c28407ab9 became active
2019-06-27 01:45:41.087174 I | rafthttp: established a TCP streaming connection with peer dcb7067c28407ab9 (stream Message writer)
2019-06-27 01:45:41.098636 I | rafthttp: established a TCP streaming connection with peer dcb7067c28407ab9 (stream MsgApp v2 writer)
2019-06-27 01:45:42.350041 N | etcdserver/membership: updated the cluster version from 3.0 to 3.3
2019-06-27 01:45:42.350158 I | etcdserver/api: enabled capabilities for version 3.3
If your cluster won't start or ends up in a CrashLoopBackOff, most of the time I found the problem to be host resolution (dns). You can try changing the PEERS entry from ${SET_NAME}-${i}.${SET_NAME} to just ${SET_NAME}. This won't let the cluster work, but it should let you get far enough to see what's going on inside the pod. I'd also recommend setting the replicas to 1 when troubleshooting.
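In practice that means temporarily changing the PEERS line in etcd.yaml to something like the following; this is purely for debugging, and the cluster won't actually form this way.
PEERS="${PEERS}${PEERS:+,}${SET_NAME}-${i}=http://${SET_NAME}:2380"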

CoreDNS is pretty straightforward. It'll log a startup banner and then client queries, which look like the examples below, where the first query, nginx.example.org, returns NOERROR (this is good) and the second, nginx2.example.org, returns NXDOMAIN, meaning the record doesn't exist. Again, if you want to cut down on these messages, remove the log line from the config file as mentioned above.
$ kubectl logs -n dns coredns-6c8d7c7d79-6jm5l
.:53
2019-06-27T01:44:44.570Z [INFO] CoreDNS-1.5.0
2019-06-27T01:44:44.570Z [INFO] linux/amd64, go1.12.2, e3f9a80
CoreDNS-1.5.0
linux/amd64, go1.12.2, e3f9a80
2019-06-27T02:11:43.552Z [INFO] 192.168.215.64:58369 - 10884 "A IN nginx.example.org. udp 35 false 512" NOERROR qr,aa,rd 68 0.002999881s
2019-06-27T02:13:08.448Z [INFO] 192.168.215.64:64219 - 40406 "A IN nginx2.example.org. udp 36 false 512" NXDOMAIN qr,aa,rd 87 0.007469218s

Using External DNS

To actually have a DNS name registered with external DNS, you need to add an annotation to your service. Here's one for nginx that registers the service's external load balancer IP with the name nginx.example.org.
$ cat nginx-service.yaml 
apiVersion: v1
kind: Service
metadata:
  name: nginx-frontend
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "nginx.example.org"
spec:
  ports:
  - name: "web"
    port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
From a linux or mac host, you can use nslookup to verify the entry where 10.9.176.212 is the IP of my coredns service.
$ kubectl get svc -n dns
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)             AGE
coredns    LoadBalancer   10.100.208.145   10.9.176.212   53:31985/UDP        20h
etcd-dns   ClusterIP      10.100.83.154    <none>         2379/TCP,2380/TCP   20h
$ nslookup nginx.example.org 10.9.176.212
Server:  10.9.176.212
Address: 10.9.176.212#53

Name: nginx.example.org
Address: 10.9.176.213

Notes

Kubernetes already comes with an etcd and, in newer releases, coredns, so why not use those? Well, you probably can, but in my opinion those are meant for core cluster functions and we shouldn't be messing around with them. They're also secured with https, so you'd need to go through the process of getting certificates set up. While I didn't find any links that really suited my needs, here are some that helped me along; maybe they'll help you too.