Wednesday, February 5, 2020

Elasticsearch ECK Snapshot To S3

Elastic has published an operator, Elastic Cloud on Kubernetes (ECK), which I think is really, really great. It simplifies a lot, especially for those of us looking to set up relatively simple clusters. I did, however, have quite a bit of trouble following the documentation to get an S3 bucket connected for snapshot purposes, so I thought I'd document my solution in a quick write-up.

The guide is written using GCS (Google Cloud Storage), and while there is information on the S3 plugin, I just couldn't get things talking correctly. I wanted to use a specific access key ID and secret key as that was easier for me to control in our environment. To do that, the documentation has you inject both the plugin and a secrets file which looks something like this:
ubuntu@k8s-master:~$ cat s3.client.default.credentials_file 
{
  "s3.client.default.access_key": "HLQA2HMA2FG3ABK4L2FV",
  "s3.client.default.secret_key": "3zmdKM2KEy/oPOGZfZpWJR3T46TxwtyMxZRpQQgF"
}
ubuntu@k8s-master:~$ kubectl create secret generic s3-credentials --from-file=s3.client.default.credentials_file
I kept getting errors like this:
unknown secure setting s3.client.default.credentials_file
If you look at what the above command created, it set up a secret where the key is s3.client.default.credentials_file and the value is the JSON from the file. What I think it's supposed to look like is a key of s3.client.default.access_key with the value being the actual key. So, I came up with this; please pay attention to the namespace:
ubuntu@k8s-master:~$ cat s3-credentials.yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  namespace: elastic-dev
type: Opaque
data:
  s3.client.default.access_key: SExRQTJITUEyRkczQUJLNEwyRlYK
  s3.client.default.secret_key: M3ptZEtNMktFeS9vUE9HWmZacFdKUjNUNDZUeHd0eU14WlJwUVFnRgo=
ubuntu@k8s-master:~$ kubectl apply -f s3-credentials.yaml
You'll also notice that the access key and secret don't match the previous entry. That's because a kubernetes secret requires them to be base64 encoded, which is awesome as it gets around all the special character problems. That's simple enough (note that plain echo appends a newline, which also ends up in the encoded value; use echo -n if you'd rather avoid that):
ubuntu@k8s-master:~$ echo "HLQA2HMA2FG3ABK4L2FV" | base64
ubuntu@k8s-master:~$ echo "3zmdKM2KEy/oPOGZfZpWJR3T46TxwtyMxZRpQQgF" | base64
The last piece is the elasticsearch configuration itself. There are two notable parts: the secureSettings entry and the init container that installs the S3 plugin. I've also elected to disable TLS as I'm terminating my encryption at an ingest point. This is a very basic configuration, but hopefully it gets the basics going for further customization.
ubuntu@k8s-master:~$ cat elasticsearch.yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elastic-dev
  namespace: elastic-dev
spec:
  version: 7.5.2
  secureSettings:
  - secretName: s3-credentials
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 3Gi
    podTemplate:
      spec:
        initContainers:
        - name: install-plugins
          command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch repository-s3
        containers:
        - name: elasticsearch
          # specify resource limits and requests
          resources:
            limits:
              memory: 1Gi
              cpu: 0.5
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms512m -Xmx512m"
  http:
    tls:
      selfSignedCertificate:
        disabled: true
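Once the cluster comes up, it's worth confirming the keystore picked up the two s3.client settings and then registering the repository. This is a rough sketch rather than something from the official guide: it assumes ECK's default naming (a service called elastic-dev-es-http, the elastic user's password in the elastic-dev-es-elastic-user secret, and pods named elastic-dev-es-default-0 and so on), and my-snapshot-bucket is a made-up bucket name you'd swap for your own.
ubuntu@k8s-master:~$ kubectl exec -n elastic-dev elastic-dev-es-default-0 -- bin/elasticsearch-keystore list
ubuntu@k8s-master:~$ PASSWORD=$(kubectl get secret -n elastic-dev elastic-dev-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d)
ubuntu@k8s-master:~$ kubectl port-forward -n elastic-dev service/elastic-dev-es-http 9200 &
ubuntu@k8s-master:~$ curl -u "elastic:$PASSWORD" -H 'Content-Type: application/json' \
    -X PUT http://localhost:9200/_snapshot/s3-backups \
    -d '{"type": "s3", "settings": {"bucket": "my-snapshot-bucket"}}'
The keystore listing should include s3.client.default.access_key and s3.client.default.secret_key, and if the PUT comes back acknowledged you can take a test snapshot with PUT _snapshot/s3-backups/snapshot-1?wait_for_completion=true.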

Wednesday, January 8, 2020

Kubernetes with vSphere CSI and CPI Part 2

In the last post we covered how to set up the initial cluster and join worker nodes with a vSphere external provider. In this post we'll cover actually installing the CNI (Cluster Network Interface), CPI (Cloud Provider Interface), and CSI (Container Storage Interface).

Calico Networking

We'll be using Calico for our CNI. You might want to check for the latest version, but installation is as simple as this on your master node:
ubuntu@k8s-master:~$ kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
While this will complete, it won't actually start properly because the nodes are still tainted. They need the CPI to be running first.

Configuration Files

I'm going to set up all of the required configurations at once. There are a few pieces required:

CPI Config Map
Hopefully this makes sense. We reference a secret called vsphere-credentials, which we'll create in a moment, and for the vCenter with the IP listed, kubernetes is deployed within the data center Vancouver.
ubuntu@k8s-master:~$ cat vsphere.conf 
[Global]
port = "443"
insecure-flag = "true"
secret-name = "vsphere-credentials"
secret-namespace = "kube-system"

[VirtualCenter "10.9.178.236"]
datacenters = "Vancouver"
ubuntu@k8s-master:~$ kubectl create configmap cloud-config --from-file=vsphere.conf --namespace=kube-system
CPI Secret
Again, this should be pretty simple. I was hoping to use base64 encoding for the username and password, as it gets around any special character problems, but I couldn't get that working. The best I can come up with is wrapping the values in single quotes, so if one of those is in your password or username you'll need to figure that out (or change it). This becomes a bigger problem with the CSI configuration below, so read ahead if you've got lots of special characters in your password.
ubuntu@k8s-master:~$ cat vsphere-credentials.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: vsphere-credentials
  namespace: kube-system
stringData:
  10.9.178.236.username: 'domain\mengland'
  10.9.178.236.password: 'lJSIuej5IU$'
ubuntu@k8s-master:~$ kubectl create -f vsphere-credentials.yaml
CSI Secret
ubuntu@k8s-master:~$ cat csi-vsphere.conf 
[Global]
cluster-id = "k8s-cluster1"
[VirtualCenter "10.9.178.236"]
insecure-flag = "true"
user = "mengland@itlab.domain.com"
password = "lJSIuej5IU$"
port = "443"
datacenters = "Vancouver"
ubuntu@k8s-master:~$ kubectl create secret generic vsphere-config-secret --from-file=csi-vsphere.conf --namespace=kube-system

Deploy CPI and CSI

Once those are completed, you need to create the roles and deploy the CPI and CSI as follows:
ubuntu@k8s-master:~$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-vsphere/master/manifests/controller-manager/cloud-controller-manager-roles.yaml
ubuntu@k8s-master:~$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-vsphere/master/manifests/controller-manager/cloud-controller-manager-role-bindings.yaml
ubuntu@k8s-master:~$ kubectl apply -f https://github.com/kubernetes/cloud-provider-vsphere/raw/master/manifests/controller-manager/vsphere-cloud-controller-manager-ds.yaml
The nodes should now no longer be tainted and Calico should start up. If they don't, have a look at the logs for your vsphere-cloud-controller-manager pod; it likely has communication or account problems with vSphere.
ubuntu@k8s-master:~$ kubectl describe nodes | egrep "Taints:|Name:"
Name:               k8s-master
Taints:             node-role.kubernetes.io/master:NoSchedule
Name:               k8s-worker
Taints:             
ubuntu@k8s-master:~$ kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-648f4868b8-lbfjk   1/1     Running   0          3m57s
calico-node-d7gkh                          1/1     Running   0          3m57s
calico-node-r8b72                          1/1     Running   0          3m57s
coredns-6955765f44-rsvhh                   1/1     Running   0          20m
coredns-6955765f44-sp7xv                   1/1     Running   0          20m
etcd-k8s-master7                           1/1     Running   0          20m
kube-apiserver-k8s-master7                 1/1     Running   0          20m
kube-controller-manager-k8s-master7        1/1     Running   0          20m
kube-proxy-67cgt                           1/1     Running   0          20m
kube-proxy-nctns                           1/1     Running   0          4m30s
kube-scheduler-k8s-master7                 1/1     Running   0          20m
vsphere-cloud-controller-manager-7b44g     1/1     Running   0          71s
ubuntu@k8s-master:~$ kubectl logs -n kube-system vsphere-cloud-controller-manager-7b44g
And then deploy the CSI driver with these commands
ubuntu@k8s-master:~$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/master/manifests/v2.0.0/vsphere-67u3/vanilla/rbac/vsphere-csi-controller-rbac.yaml
ubuntu@k8s-master:~$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/master/manifests/v2.0.0/vsphere-67u3/vanilla/deploy/vsphere-csi-controller-deployment.yaml
ubuntu@k8s-master:~$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/master/manifests/v2.0.0/vsphere-67u3/vanilla/deploy/vsphere-csi-node-ds.yaml

Check Status

At this point you should have a running CPI and CSI implementation. Here are some checks to verify:
ubuntu@k8s-master:~$ kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-648f4868b8-lbfjk   1/1     Running   0          3d2h
calico-node-d7gkh                          1/1     Running   0          3d2h
calico-node-r8b72                          1/1     Running   0          3d2h
coredns-6955765f44-rsvhh                   1/1     Running   0          3d2h
coredns-6955765f44-sp7xv                   1/1     Running   0          3d2h
etcd-k8s-master7                           1/1     Running   0          3d2h
kube-apiserver-k8s-master7                 1/1     Running   0          3d2h
kube-controller-manager-k8s-master7        1/1     Running   0          3d2h
kube-proxy-67cgt                           1/1     Running   0          3d2h
kube-proxy-nctns                           1/1     Running   0          3d2h
kube-scheduler-k8s-master7                 1/1     Running   0          3d2h
vsphere-cloud-controller-manager-7b44g     1/1     Running   0          3d2h
vsphere-csi-controller-0                   5/5     Running   0          146m
vsphere-csi-node-q6pr4                     3/3     Running   0          13m
Other checks:
ubuntu@k8s-master:~$ kubectl get csidrivers
NAME                     CREATED AT
csi.vsphere.vmware.com   2020-01-04T00:48:57Z
ubuntu@k8s-master:~$ kubectl describe nodes | grep "ProviderID"
ProviderID:                   vsphere://a8f157cf-3607-43a2-8209-60200817677f
ProviderID:                   vsphere://0c723a5d-cb66-4aae-87d2-7fe673728fee

Likely Problems

If some of the pods aren't ready, you've got problems. The biggest one I had was a permission problem when trying to load CSI. The CSI driver uses a secret created from a configuration file instead of straight YAML. I'm not sure if that's part of the cause, but it seems to have trouble with backslashes and potentially other special characters. This means you can't use an account in domain\user format and can't have any backslashes in your password. The error looks like this:
ubuntu@k8s-master:~$ kubectl logs -n kube-system vsphere-csi-controller-0 vsphere-csi-controller
time="2020-01-06T21:45:27Z" level=fatal msg="grpc failed" error="ServerFaultCode: Cannot complete login due to an incorrect user name or password."
I've filed a project bug (#121), but I'm not sure everyone will agree it's a bug or that it'll be fixed. For now, my solution is to use double quotes around all strings and keep your special characters limited to the benign ones (no asterisks, backslashes, single or double quotes, or tick marks).

If you need to reload your CSI secret file you can do so like this:
ubuntu@k8s-master:~$ kubectl delete secret -n kube-system vsphere-config-secret
ubuntu@k8s-master:~$ kubectl delete statefulset -n kube-system vsphere-csi-controller
And then re-upload with
ubuntu@k8s-master:~$ kubectl create secret generic vsphere-config-secret --from-file=csi-vsphere.conf -n kube-system
ubuntu@k8s-master:~$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/master/manifests/1.14/deploy/vsphere-csi-controller-ss.yaml

Using Storage

First off, the fancy Cloud Native Storage for persistent volumes seems to only apply if you have vSAN, so break out your cheque book. For the rest of us, you will need to create a Storage Policy; in my case I did this based on tags. Basically, you create the tag, assign it to whatever datastores you'd like to use, and then create the policy based on that tag.
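If you'd rather script the tagging part than click through the vSphere UI, govc can do it. A rough sketch, assuming govc is already pointed at your vCenter; the category (k8s), tag (k8s-storage), and datastore name (datastore1) are made-up examples, and the storage policy itself (k8s-default below) is still created in the vSphere UI against that tag:
ubuntu@k8s-master:~$ govc tags.category.create -d "kubernetes storage" k8s
ubuntu@k8s-master:~$ govc tags.create -c k8s -d "datastores for kubernetes" k8s-storage
ubuntu@k8s-master:~$ govc tags.attach k8s-storage /Vancouver/datastore/datastore1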

Once that's done, we still need to define a storage class within the kubernetes cluster. This is similar to when we added a storage class in my original blog post, but we no longer need to specify a disk format or datastore target. Instead, we just reference the storage policy we created; in my case, I called it k8s-default. The name of the storage class, in this example vsphere-ssd, is what the rest of the kubernetes cluster will know the storage as. We also mark it as the default, so if a user doesn't request a storage class by name, this is what they get.
ubuntu@k8s-master:~$ cat vmware-storage.yaml 
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: vsphere-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  storagepolicyname: "k8s-default"
provisioner: csi.vsphere.vmware.com
ubuntu@k8s-master:~$ kubectl create -f vmware-storage.yaml 
And then using it is just like any other PVC on any other cluster:
ubuntu@k8s-master:~$ cat pvc_test.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
ubuntu@k8s-master:~$ kubectl create -f pvc_test.yaml
ubuntu@k8s-master:~$ kubectl get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test   Bound    pvc-bf37afa8-0065-47a9-8c7c-89db7907d538   5Gi        RWO            vsphere-ssd    32m

Storage Problems

If, when you try to attach a PV to a pod, you get a message like this:
Warning  FailedMount 8s (x7 over 40s) kubelet, k8s-worker2 MountVolume.MountDevice failed for volume "pvc-194d886b-e988-48c4-8802-04da2015db4b" : rpc error: code = Internal desc = Error trying to read attached disks: open /dev/disk/by-id: no such file or directory
This is likely because you forgot to set disk.enableUUID on the virtual machine. Have a look at Node Setup in my earlier blog post. You can also run govc, if you have that set up on your machine, with the following command:
govc vm.change -vm k8s-worker2 -e="disk.enableUUID=1"
You'll need to reboot the worker node and recreate any PVCs.
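If you want to confirm the setting took, govc can also dump the VM's extra configuration (the -e flag shows the ExtraConfig entries):
govc vm.info -e k8s-worker2 | grep -i disk.enableUUID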

Monday, January 6, 2020

Kubernetes with vSphere CSI and CPI Part 1

About a year ago I wrote an article outlining steps to follow to get vSphere and Kubernetes working together. At that time I mentioned that cloud providers within Kubernetes had been deprecated but the replacements weren't ready yet. Well, that's changed, so I thought I'd write an updated article outlining the new steps.
Unlike the last time I did this, there's decent documentation out there, and I'd encourage you to have a read through it, but a couple of things bothered me.
  • The document uses outdated APIs. Kubernetes still has a long way to go (in my opinion) and the pace of change is remarkable, so I guess this is to be expected
  • It isn't explained why some of the operations need to be done, so I'm going to try to explain the why with the minimum required steps

Documentation Links


Kubernetes Installation

I'm actually going to assume you have at least two Linux machines available to install kubernetes on, one master and one worker. If you follow the guide above you should be in a pretty good position to deploy kubernetes, so to start, we'll need a configuration file. This is required because you can't specify a cloud provider from the command line; everything has to go in a config file. What I'd really like to do is just add a "--cloud-provider" flag and be done, but no, at least not for now.

The vSphere guide includes a lot of things I don't like. It pins specific etcd and CoreDNS versions, and also a specific kubernetes version, none of which I wanted. Here is a minimal configuration. It uses the current apiVersion, v1beta2, although you can check for a later one under the kubeadm package documentation.
ubuntu@k8s-master:~$ cat kubeadminit.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  podSubnet: "192.168.0.0/16"
We've actually got two configuration documents in this file: an InitConfiguration, which simply tells the kubelet to use an external cloud provider, and a ClusterConfiguration, which sets the pod subnet because I want to use Calico with 192.168.0.0/16. The official guide also has a bootstrap token, which you can specify if you like (we'll need it later), but I let kubeadm generate one that we can use when joining nodes. It doesn't matter where this file goes; your home folder is just fine.

We then use this config file to initialize the cluster
ubuntu@k8s-master:~$ sudo kubeadm init --config kubeadminit.yaml
You do need to run this as root (or sudo in this case) and need to pay attention to a couple of things in the output
  • Set up kubectl, which will look like this; do this now, as you'll need kubectl for all the other commands:
  • mkdir -p $HOME/.kube
    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config
    
  • The join command which has our token
  • kubeadm join 10.2.44.53:6443 --token fa5p9m.j4qygsv5t601ug62 \
        --discovery-token-ca-cert-hash sha256:c1653ee75b86dcff36cd006730d5989048ab54e29c30290e8826aeaa752b3428 
    
Note the token that was generated for you; if you specified one, it should also be listed here.

Normally, we'd just run the cluster join command on all the workers, but because we need to tell them to use an external cloud provider, we have a chicken and egg problem as outlined in the Kubernetes Cloud Controller Manager link above. To get around this, we need to export discovery information from the master, which includes address and certificate information, with this command:
ubuntu@k8s-master:~$ kubectl -n kube-public get configmap cluster-info -o jsonpath='{.data.kubeconfig}' > discovery.yaml
This will produce a file that looks something like this:
ubuntu@k8s-master:~$ cat discovery.yaml
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <long_cert_will_be_here>
    server: https://10.2.44.53:6443
  name: ""
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
Now you'll need to scp that to all worker nodes. Again, it doesn't matter where it goes. It can also technically be on a web server over https if you'd rather.
ubuntu@k8s-master:~$ scp discovery.yaml ubuntu@k8s-worker1:/home/ubuntu/

Joining Worker Nodes

Like the master, we need to specify an external cloud provider, and because there isn't a command line option, we need a new configuration file. There are three important parts to this file:
  • A path to our discovery file
  • The TLS bootstrap token from when we initialized the cluster
  • A flag to tell the worker to use an external cloud provider (the point of all of this)
To do that we'll have a file like this:
ubuntu@k8s-master:~$ cat kubeadminitworker.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  file:
    kubeConfigPath: /home/ubuntu/discovery.yaml
  tlsBootstrapToken: fa5p9m.j4qygsv5t601ug62
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
And then it's a simple command to join:
ubuntu@k8s-worker-1:~$ sudo kubeadm join --config /home/ubuntu/kubeadminitworker.yaml

Node Verification

Back on the master, make sure any new nodes show up and that they have a taint applied:
ubuntu@k8s-master:~$ kubectl describe nodes | egrep "Taints:|Name:"
Name:               k8s-master
Taints:             node-role.kubernetes.io/master:NoSchedule
Name:               k8s-worker1
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Name:               k8s-worker2
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Not quite in a running state but we'll finish things off in part 2 of this post.

Friday, August 30, 2019

Graphing Prometheus Data With Grafana

Monitoring data isn't much use without the ability to display it. If you want to learn how to set up prometheus on your kubernetes cluster, have a look at this previous blog post, Monitoring Kubernetes With Prometheus. There are also lots of quasi pre-built dashboards over at Grafana, but as I've found, you're invariably going to need to build your own. This post is going to look at how to put together some of the more problematic queries and results without losing your sanity.

Problem Statement

I want the ability to drill into my cluster, starting with the cluster, to the nodes, to pods on that node, and information about those pods, ending up with a table of nodes, pods, and details like persistent volume capacity.
We'll focus in on the persistent volume information for now as this is where things get complicated. For all the other metrics you can key off the pod name; every query has that in the result and Grafana will find the common names and match them up. With a persistent volume, we need to make multiple queries following a path to get the final capacity numbers. There are a couple of ways to do this, but I followed the query logic below. The max() ... by (...) aggregation just tells prometheus to collapse the result down to the labels listed in by() rather than returning every label.

Query #1 - max(kube_pod_info{node="$node"}) by (pod)
-- This will return a list of pods for a given node
Query #2 - max(kube_pod_spec_volumes_persistentvolumeclaims_info) by (persistentvolumeclaim, pod, volume)
-- Will return a list of persistent volume claims for each pod
Query #3 - max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim)
-- Will return a list of persistent volume claims and their size as the value
I end up with a query path from node, to pod, to persistent volume claim, to capacity.
The problem is, if you query all three of these individually, Grafana won't know how to assemble your table as there is no linkage between all three. If you use the pod name then the PVC capacity doesn't have a match. If you use the persistent volume claim then the list of pods doesn't have a match.

Solution

Query #1 is fine. It's pulling a full list of pods, but somehow I need a combination of query #2 and query #3 where we take the labels of query #2 and merge them with the value of query #3. Without that combination there's no way to match the capacity all the way back up to the node and you get a very ugly table. The closest explanation I found was on Stack Overflow, but it still took a bit to translate that to my requirements, so I'm going to try to show this with the results from each query.
  • the value_metric - max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim)
    • {persistentvolumeclaim="datadir-etcd-dns-0"} 1073741824
  • the info_metric - max(kube_pod_spec_volumes_persistentvolumeclaims_info) by (persistentvolumeclaim, pod, volume)
    • {persistentvolumeclaim="datadir-etcd-dns-0",pod="etcd-dns-0",volume="datadir"} 1
We'll use two prometheus operators: on(), which specifies how to match the queries up, and group_left(), which specifies the labels to pull from the info metric. Then we use the following format:
<value_metric> * on (<match_label>) group_left(<info_labels>) <info_metric>
And we end up with the following query:
max(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (persistentvolumeclaim) * on (persistentvolumeclaim) group_left(pod,volume) max(kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"$pod"}) by (persistentvolumeclaim, pod, volume)
Now when the query inspector is used we get an object containing 3 labels (pod, volume, and persistentvolumeclaim), and the value has a timestamp and our capacity information. This can now be paired up with the other queries containing a pod name because there's a common element.
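For illustration, combining the two sample results above, the joined series carries all three labels from the info metric with the capacity from the value metric, so (assuming the etcd-dns-0 pod is matched by $pod) it looks something like this:
{persistentvolumeclaim="datadir-etcd-dns-0", pod="etcd-dns-0", volume="datadir"} 1073741824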

Wednesday, July 24, 2019

Windows Kubernetes Nodes

It's happened: someone has asked you for a windows container. The first piece of advice to give: find a way, any way possible, to run the service within a linux container. Not being knowledgeable with linux shouldn't be an excuse, as most containers require very little linux expertise anyway. Help the windows developers migrate to linux; everyone will be happier.

That being said, sometimes re-coding a service for linux just isn't practical. And while things are pretty bad now, Microsoft is a really big company that seems to want into this market; they'll put effort into making things better over time. As an example of this, their documentation is quite good, way better than most of the linux documentation in my opinion. Have a look at it, as portions of this guide are lifted directly from it [https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/getting-started-kubernetes-windows]

Cluster Setup

You'll need at least one linux box to serve as a master, although we're going to use a few more to host other infrastructure services. You can follow my previous blog post with one exception: you can't run Calico as the Container Network Interface (CNI). Well, technically you can, but Calico for windows is provided only as a subscription service and Microsoft only documents networking with Flannel as the CNI, so that's what we'll use here.

**When you initialize the cluster, be sure to use Flannel's default pod cidr of 10.244.0.0/16 or you'll have problems when setting up microsoft nodes**
[root@kube-master ~]# kubeadm init --pod-network-cidr=10.244.0.0/16
Setting up Flannel is pretty easy: you'll download the flannel yaml file and make some Microsoft-specific changes, notably the VNI and port number, as documented on github and by Microsoft.
[root@kube-master ~]# wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Within the ConfigMap section you'll have the following under net-conf.json:
net-conf.json: |
{
  "Network": "10.244.0.0/16",
  "Backend": {
   "Type": "vxlan"
  }
}
We need to add the required VNI and Port information like this:
net-conf.json: |
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan",
    "VNI": 4096,
    "Port": 4789
  }
}
And install Flannel like this:
[root@kube-master ~]# kubectl create -f kube-flannel.yml

CNI Notes

So what's wrong with running Flannel for all clusters? Flannel allocates a separate private network to each node, which is then encapsulated within UDP and passed to other nodes within the cluster. Microsoft supports two Flannel modes: vxlan mode (documented here), which creates a virtual overlay network to handle routes between nodes automatically, and host-gateway mode, which seems insane to me as you need a static route on each node to every other node's pod subnet; so I don't recommend that.

Calico, on the other hand, uses simple L3 routing within the cluster so it's much easier to see where traffic is going and where it came from. I like the idea of Calico better, but it isn't a real option without a subscription, so I'll stick with Flannel on my windows cluster. There are a few decent articles out there on the differences between the two.

Windows Nodes

You'll need to install Windows 2019. I use 2019 standard with desktop experience as I like to RDP to the box, but maybe you're an extreme windows guru and can do all of this without it. I've disabled the firewall and installed vmware tools. Joining a domain is entirely optional as we aren't going to use any of the domain services. If you do join, make sure you treat this as a high performance server, so take care with patch schedules and extra windows features like virus scanning. You'll also need to ensure your patch level is high enough; I recommend running Microsoft Update again and again, as you'll get new patches after a reboot. The version you're running should be at least 17763.379 as provided by KB4489899. You can find this by running winver.

As mentioned before, Microsoft has done a good job documenting the steps so feel free to follow along there too. Everything should be done with an elevated powershell prompt (run as administrator). These first steps will add the repository and install docker.
PS C:\Users\Administrator> Install-Module -Name DockerMsftProvider -Repository PSGallery
PS C:\Users\Administrator> Install-Package -Name Docker -ProviderName DockerMsftProvider
Reboot the machine and check docker is running properly
PS C:\Users\Administrator> Restart-Computer
PS C:\Users\Administrator> docker version
Client: Docker Engine - Enterprise
 Version:           19.03.0
 API version:       1.40
 Go version:        go1.12.5
 Git commit:        87b1f470ad
 Built:             07/16/2019 23:41:30
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.12.5
  Git commit:       87b1f470ad
  Built:            07/16/2019 23:39:21
  OS/Arch:          windows/amd64
  Experimental:     false
If you get an error here that looks like this:
error during connect: Get http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.39/version: open //./pipe/docker_engine: The system cannot find the file specified. In the default daemon configuration on Windows, the docker client must be run elevated to connect. This error may also indicate that the docker daemon is not running.
It just means that docker service didn't start on boot. Start it from services or from powershell using Start-Service docker

Create Pause Image

A pause image is also run on your linux nodes, but automatically; here we need to do that manually, including downloading it, tagging it, and checking that it runs correctly.
PS C:\Users\Administrator> docker pull mcr.microsoft.com/windows/nanoserver:1809
PS C:\Users\Administrator> docker tag mcr.microsoft.com/windows/nanoserver:1809 microsoft/nanoserver:latest
PS C:\Users\Administrator> docker run microsoft/nanoserver:latest
Microsoft Windows [Version 10.0.17763.615]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\>

Download Node Binaries

You'll need several binaries available from Kubernetes' github page. The version should match the server as closely as possible. The official skew policy can be found at kubernetes.io, and if you want to see your client and server versions you can use this command:
[root@kube-master ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
This is saying my client and server are version 1.15.1. To download the corresponding node binaries, you can use this link [https://github.com/kubernetes/kubernetes/releases/], select the CHANGELOG-<version>.md link, and download the node binaries for windows. In this case the latest is 1.15.1, so that works out well.

I've used unix to expand the node binaries; either a mac or your master node would work fine using tar zxvf kubernetes-node-windows-amd64.tar.gz, but you can also use windows with expand-archive. Once that's done, you'll need to copy all the executables under the expanded kubernetes/node/bin/* to c:\k. I know lots of people will want to change that \k folder, but don't; Microsoft has hard-coded it into many scripts we'll be using, so save yourself the headache and just go with it.

You'll also need to grab /etc/kubernetes/admin.conf from the master node, place it in c:\k as a file named config (you can see it in the directory listing below), and download Microsoft's start script. For all of these, I used a shared folder within my RDP session, but WinSCP is also a good tool if you don't mind installing more software on your worker nodes. It should look like this when you're done:
PS C:\Users\Administrator> mkdir c:\k
PS C:\Users\Administrator> wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/start.ps1 -o c:\k\start.ps1
<download and transfer kubernetes node binaries and config file>
PS C:\k> dir
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        7/23/2019   2:12 PM           5447 config
-a----        7/18/2019   2:55 AM       40072704 kube-proxy.exe
-a----        7/18/2019   2:55 AM       40113152 kubeadm.exe
-a----        7/18/2019   2:55 AM       43471360 kubectl.exe
-a----        7/18/2019   2:55 AM      116192256 kubelet.exe
-a----        7/23/2019   2:01 PM           2447 start.ps1

Joining A Windows Node

You're finally ready to join a windows node! Again you can have a look at the documentation but if you've been following along, you'll only need two options.
  • ManagementIP - this is unfortunate as it'll require more scripting when you're ready to automate. It's the IP address of this worker node which you can get from ipconfig on your windows node
  • NetworkMode - we're using vxlan and the default is l2bridge so this will need to be set to overlay
Other fields should be fine with defaults, but you can check them with these commands:
  • ServiceCIDR - verify with kubectl cluster-info dump | grep -i service-cluster
  • ClusterCIDR - check with kubectl cluster-info dump | grep -i cluster-cidr
  • KubeDnsServiceIP - verify the default (10.96.0.10) with kubectl get svc -n kube-system. Cluster-IP is the field you're interested in.
When you run the start.ps1 script it'll download a lot of additional scripts and binaries, eventually spawning a few new powershell windows and leaving the logging one open, which can be very helpful at this stage. Run the following, replacing the IP with your local windows server IP address (from ipconfig):
PS C:\k> .\start.ps1 -ManagementIP 10.9.176.94 -NetworkMode overlay

Initial Problems

I had trouble getting the kubelet process to start. You'll notice the node doesn't go ready, and if you look at the processes it will have flannel and kube-proxy but no kubelet. It seems the start-kubelet.ps1 script that's downloaded uses outdated flags, so to fix that, remove the --allow-privileged=true line shown below from start-kubelet.ps1.
$kubeletArgs = @(
    "--hostname-override=$(hostname)"
    '--v=6'
    '--pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0'
    '--resolv-conf=""'
    '--allow-privileged=true'
    '--enable-debugging-handlers'
    "--cluster-dns=$KubeDnsServiceIp"
    '--cluster-domain=cluster.local'
    '--kubeconfig=c:\k\config'
    '--hairpin-mode=promiscuous-bridge'
    '--image-pull-progress-deadline=20m'
    '--cgroups-per-qos=false'
    "--log-dir=$LogDir"
    '--logtostderr=false'
    '--enforce-node-allocatable=""'
    '--network-plugin=cni'
    '--cni-bin-dir="c:\k\cni"'
    '--cni-conf-dir="c:\k\cni\config"'
    "--node-ip=$(Get-MgmtIpAddress)"
)

I also had a problem when provisioning persistent volumes even though they weren't for the windows node. If kubernetes can't identify all nodes in the cluster, it won't do anything. The error looks like this
I0723 23:57:11.379621       1 event.go:258] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"715df13c-8eeb-4ba4-9be1-44c8a5f03071", APIVersion:"v1", ResourceVersion:"480073", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vsphere-ssd": No VM found
E0723 23:59:26.375664       1 datacenter.go:78] Unable to find VM by UUID. VM UUID: 
E0723 23:59:26.375705       1 nodemanager.go:431] Error "No VM found" node info for node "kube-w2" not found
E0723 23:59:26.375718       1 vsphere_util.go:130] Error while obtaining Kubernetes node nodeVmDetail details. error : No VM found
E0723 23:59:26.375727       1 vsphere.go:1291] Failed to get shared datastore: No VM found
E0723 23:59:26.375787       1 goroutinemap.go:150] Operation for "provision-default/test[715df13c-8eeb-4ba4-9be1-44c8a5f03071]" failed. No retries permitted until 2019-07-24 00:01:28.375767669 +0000 UTC m=+355638.918509528 (durationBeforeRetry 2m2s). Error: "No VM found"
I0723 23:59:26.376127       1 event.go:258] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"715df13c-8eeb-4ba4-9be1-44c8a5f03071", APIVersion:"v1", ResourceVersion:"480073", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vsphere-ssd": No VM found
And my eventual solution was to reboot the master node. Sad, yes.

Updating A Node UUID

Like our linux nodes, you'll need to patch the node spec with the UUID of the node. Under windows you can retrieve that UUID with the following command, but you'll need to reformat it:
PS C:\k> wmic bios get serialnumber
SerialNumber
VMware-42 3c fe 01 af 23 a9 a5-65 45 50 a3 db db 9d 69
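The patch wants that as a standard UUID: drop the VMware- prefix, strip the spaces and the embedded dash, uppercase it, and re-group it as 8-4-4-4-12. A quick bash sketch of the conversion:
[root@k8s-master ~]# SERIAL="VMware-42 3c fe 01 af 23 a9 a5-65 45 50 a3 db db 9d 69"
[root@k8s-master ~]# HEX=$(echo "$SERIAL" | sed 's/^VMware-//; s/[ -]//g' | tr 'a-f' 'A-F')
[root@k8s-master ~]# echo "${HEX:0:8}-${HEX:8:4}-${HEX:12:4}-${HEX:16:4}-${HEX:20:12}"
423CFE01-AF23-A9A5-6545-50A3DBDB9D69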
And back on our kubernetes master node we'd patch the node like this
[root@k8s-master ~]# kubectl patch node <node_name> -p '{"spec":{"providerID":"vsphere://423CFE01-AF23-A9A5-6545-50A3DBDB9D69"}}'

Patching DaemonSets

A DaemonSet gets a pod pushed to every node in the cluster. This is generally bad because most things don't run on windows, so to prevent that you'll need to patch existing sets and use a node selector for the applications you produce. You can download the patch from Microsoft or create your own file; it's pretty basic. If you've been following along with the code provided on github, those files already have the node selector set.
[root@kube-master ~]# wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/l2bridge/manifests/node-selector-patch.yml
[root@kube-master t]# cat node-selector-patch.yml 
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
We'll need to apply it to existing DaemonSets, notably kube-proxy and kube-flannel-ds-amd64.
[root@kube-master ~]# kubectl patch ds/kube-flannel-ds-amd64 --patch "$(cat node-selector-patch.yml)" -n=kube-system
[root@kube-master ~]# kubectl patch ds/kube-proxy --patch "$(cat node-selector-patch.yml)" -n=kube-system
If you've been getting errors on your windows node from flannel saying things like Error response from daemon: network host not found and Error: no such container, those should now stop.

Deploying A Test Pod

I'd suggest using the Microsoft-provided yaml file, although I reduced the number of replicas to 1 to simplify any troubleshooting.
[root@kube-master ~]# wget https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/l2bridge/manifests/simpleweb.yml -O win-webserver.yaml
[root@kube-master ~]# kubectl apply -f win-webserver.yaml
[root@kube-master ~]# kubectl get pods -o wide

Registering A Service

Every time you reboot you'll need to run the start command manually, which isn't all that useful. Microsoft has created some excellent instructions and a script to register the required services using the Non-Sucking Service Manager. Follow the instructions provided by Microsoft, which basically amount to placing both the sample script, called register-svc.ps1, and the nssm.exe binary into c:\k.
PS C:\k> wget https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/flannel/register-svc.ps1 -o c:\k\register-svc.ps1
I did have problems with the default script, as it seems to reference an incorrect pause image and has a problem with the allow-privileged statement as indicated above. To fix that, edit register-svc.ps1 and, under the kubelet registration, replace --pod-infra-container-image=kubeletwin/pause with mcr.microsoft.com/k8s/core/pause:1.0.0 and remove --allow-privileged=true. It should be line 25 and will look like this when you're done:
.\nssm.exe set $KubeletSvc AppParameters --hostname-override=$Hostname --v=6 --pod-infra-container-image=mcr.microsoft.com/k8s/core/pause:1.0.0 --resolv-conf="" --enable-debugging-handlers --cluster-dns=$KubeDnsServiceIP --cluster-domain=cluster.local --kubeconfig=c:\k\config --hairpin-mode=promiscuous-bridge --image-pull-progress-deadline=20m --cgroups-per-qos=false  --log-dir=$LogDir --logtostderr=false --enforce-node-allocatable="" --network-plugin=cni --cni-bin-dir=c:\k\cni --cni-conf-dir=c:\k\cni\config
Once that's fixed, you can register your services with this command where ManagementIP is the windows node IP.
PS C:\k> .\register-svc.ps1 -ManagementIP <windows_node_ip> -NetworkMode overlay
You should see the services registered and running. If you get errors like these, it's probably because register-svc.ps1 wasn't edited correctly.
Service "flanneld" installed successfully!
Set parameter "AppParameters" for service "flanneld".
Set parameter "AppEnvironmentExtra" for service "flanneld".
Set parameter "AppDirectory" for service "flanneld".
flanneld: START: The operation completed successfully.
Service "kubelet" installed successfully!
Set parameter "AppParameters" for service "kubelet".
Set parameter "AppDirectory" for service "kubelet".
kubelet: Unexpected status SERVICE_PAUSED in response to START control.
Service "kube-proxy" installed successfully!
Set parameter "AppDirectory" for service "kube-proxy".
Set parameter "AppParameters" for service "kube-proxy".
Set parameter "DependOnService" for service "kube-proxy".
kube-proxy: START: The operation completed successfully.
If you've already added the services and need to make changes, you can do that by either editing the service or removing them and re-registering with the commands listed below.
PS C:\k> .\nssm.exe edit kubelet
PS C:\k> .\nssm.exe edit kube-proxy
PS C:\k> .\nssm.exe edit flanneld
PS C:\k> .\nssm.exe remove kubelet confirm
PS C:\k> .\nssm.exe remove kube-proxy confirm
PS C:\k> .\nssm.exe remove flanneld confirm
Reboot to verify your node re-registers with kubernetes correctly and that you can deploy a pod using the test above.

Deleting/Re-adding A Windows Node

If you delete a windows node, such as with kubectl delete node <node_name>, adding it back is pretty easy. Because the windows nodes have the kubernetes config file, they re-register automatically on every service start. You might need to remove the existing flannel configuration files and then reboot:
PS C:\k> Remove-Item C:\k\SourceVip.json
PS C:\k> Remove-Item C:\k\SourceVipRequest.json
PS C:\k> Restart-Computer

Broken Kubernetes Things With Windows Nodes

Pretty much everything is broken. You'll be able to deploy a windows container to a windows node using the node selector spec entry like we did when patching the daemonsets above; just use windows as the OS type instead of linux. Here's a list of things that are broken, which I'll update when possible:
  • Persistent Volumes - you need to ensure the node is registered properly with vsphere or nothing will be able to use a persistent volume. This is because vsphere ensures all nodes can see a datastore without making a distinction between windows and linux. I can get a PV to appear on a windows node but I can't get it to initialize properly
  • Node Ports - this is a documented limitation, you can't access a node port service from the node hosting the pod. Strange, yes, but you should be able to use any linux nodes as an entry for any windows pods
  • Load Balancer - With version 0.8.0 it includes the os selector patch, and it should work by forwarding connections through available linux nodes, but I haven't had any success yet
  • DNS - untested as of yet because of load balancer problems
  • Logging - should be possible as fluent bit has beta support for windows but untested yet
    • Fluent bit does have some documentation to install under windows, maybe under the node itself as there isn't a docker container readily available, but none of the links work. Perhaps not yet (Aug 2019)
  • Monitoring - should also be possible using WMI exporter rather than a node exporter, again, untested at this time

Friday, July 12, 2019

Kubernetes Infrastructure Overview

I've posted several blog entries to set up various parts of an on-premise Kubernetes installation. This is meant as a summary referencing code posted to github for easy access. You can clone the entire repository, edit the required files and use the deploy.sh/cleanup.sh scripts, or run the deployment directly from github as documented below. Each of the headers below is a link to the corresponding blog post describing the process in detail.

If you'd like to clone the code, run this command:
[root@kube-master ~]# git clone https://github.com/mike-england/kubernetes-infra.git

Cluster Install

While this can be automated through templates or tools like terraform, for now, I recommend following the post specifically for this.

Logging

This setup can be almost entirely automated, but unfortunately you'll need to modify the elasticsearch output in the config file; there's a sketch of that section after these commands.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-role.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-configmap.yaml
<modify output server entry elasticsearch.prod.int.com entry and index to match your kubernetes cluster name>
[root@kube-master ~]# kubectl create -f fluent-bit-configmap.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-daemon-set.yaml
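For reference, the piece you're editing is the Elasticsearch [OUTPUT] block inside fluent-bit-configmap.yaml. The exact file in the repo may differ a little, but it looks roughly like this, with Host and Index being the values to change:
    [OUTPUT]
        Name            es
        Match           *
        Host            elasticsearch.prod.int.com
        Port            9200
        Index           my-cluster-name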

Load Balancing

Installation from metallb is straightforward. As with logging, you'll need to modify the config map, this time changing the IP range; there's a sketch of what that config map looks like after these commands. If you're running a cluster with windows nodes, be sure to patch the metallb daemonset so it doesn't get deployed to any of those nodes.
[root@kube-master ~]# kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/metal-config.yaml
<modify ip address range>
[root@kube-master ~]# kubectl create -f metal-config.yaml
If you're running a mixed cluster with windows nodes:
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/node-selector-patch.yaml
[root@kube-master ~]# kubectl patch ds/speaker --patch "$(cat node-selector-patch.yaml)" -n=metallb-system
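For reference, the range you're editing in metal-config.yaml sits in a MetalLB address pool. A rough sketch of that config map for layer2 mode (the address range here is just an example):
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.9.176.190-10.9.176.199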

Monitoring

Assuming you have the load balancer installed above, you should be able to deploy monitoring without any changes.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-prometheus.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-config-map.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-server.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-node-exporter.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-kube-state.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-kube-state.yaml

DNS Services

Again, with the load balancer in place, this should be deployable as is.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/dns-namespace.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/etcd.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/external-dns.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/coredns.yaml

Tuesday, July 2, 2019

External DNS For Kubernetes Services

A service isn't useful if you can't access it, and while IP addresses are nice, they don't really help deliver user-facing services. Really we want DNS, but given the dynamic nature of kubernetes it's impractical to implement the static configurations of the past. To solve that, we're going to implement ExternalDNS for kubernetes, which will scan services and ingress points to automatically create and destroy DNS records for the cluster. Of course, nothing is completely simple in kubernetes, so we'll need a few pieces in place:
  • ExternalDNS - the scanning engine to create and destroy DNS records
  • CoreDNS - a lightweight kubernetes based DNS server to respond to client requests
  • Etcd - a key/value store to hold DNS records

Namespace

The first thing we're going to need is a namespace to put things in. I normally keep this with one of the key pieces but felt it was better as a separate file in this case.
$ cat dns-namespace.yaml 
apiVersion: v1
kind: Namespace
metadata:
  name: dns

Etcd Cluster Setup

Technically we only need one etcd node as we don't really need the data to persist (it'd just be regenerated on the next scan), but losing it would halt all non-cached dns queries, so I opted to create 3 instances. I didn't want to use an external etcd discovery service, so I needed predictable pod names, and to get that we need a stateful set rather than a deployment. If we lose a pod in the stateful set, it won't rejoin the cluster without a persistent volume containing its configuration information, which is why we have a small pv for each.

If you're going to change any of the names, make sure the service name "etcd-dns" exactly matches the stateful set name. If it doesn't, kubernetes won't create an internal DNS record and the nodes won't be able to find each other; speaking from experience.
$ cat etcd.yaml 
apiVersion: v1
kind: Service
metadata:
  name: etcd-dns
  namespace: dns
spec:
  ports:
  - name: etcd-client
    port: 2379
    protocol: TCP
  - name: etcd-peer
    port: 2380
    protocol: TCP
  selector:
    app: etcd-dns
  publishNotReadyAddresses: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-dns
  namespace: dns
  labels:
    app: etcd-dns
spec:
  serviceName: "etcd-dns"
  replicas: 3
  selector:
    matchLabels:
      app: etcd-dns
  template:
    metadata:
      labels:
        app: etcd-dns
    spec:
      containers:
      - name: etcd-dns
        image: quay.io/coreos/etcd:latest
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        env:
        - name: CLUSTER_SIZE
          value: "3"
        - name: SET_NAME
          value: "etcd-dns"
        volumeMounts:
        - name: datadir
          mountPath: /var/run/etcd
        command:
          - /bin/sh
          - -c
          - |
            IP=$(hostname -i)
            PEERS=""
            for i in $(seq 0 $((${CLUSTER_SIZE} - 1))); do
                PEERS="${PEERS}${PEERS:+,}${SET_NAME}-${i}=http://${SET_NAME}-${i}.${SET_NAME}:2380"
            done

            exec /usr/local/bin/etcd --name ${HOSTNAME} \
              --listen-peer-urls http://${IP}:2380 \
              --listen-client-urls http://${IP}:2379,http://127.0.0.1:2379 \
              --advertise-client-urls http://${HOSTNAME}.${SET_NAME}:2379 \
              --initial-advertise-peer-urls http://${HOSTNAME}.${SET_NAME}:2380 \
              --initial-cluster-token etcd-cluster-1 \
              --initial-cluster ${PEERS} \
              --initial-cluster-state new \
              --data-dir /var/run/etcd/default.etcd
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
Cluster initialization is the more complicated part in this set. We're running some shell commands within the newly booted pod to fill in the required values, with the PEERS variable looking like this when it's done. Could you hard code it? Sure, but that would complicate things if you change the set name or number of replicas. You can also do lots and lots of fancy stuff to remove, add, or rejoin nodes, but we don't really need more than an initial static value (three in this case), so I'll leave things simple. You can check out the links in the notes section for more complicated examples.
etcd-dns-0=http://etcd-dns-0.etcd-dns:2380,etcd-dns-1=http://etcd-dns-1.etcd-dns:2380,etcd-dns-2=http://etcd-dns-2.etcd-dns:2380
If you'd like to enable https on your etcd cluster, you can easily do so by adding --auto-tls and --peer-auto-tls, but this will create problems getting coredns and external-dns to connect without adding the certs there too; a sketch of the change is below.
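As a sketch (untested here), the exec line in the stateful set would end up looking something like this, remembering that the PEERS list built above would also need its scheme changed to https:
            exec /usr/local/bin/etcd --name ${HOSTNAME} \
              --auto-tls --peer-auto-tls \
              --listen-peer-urls https://${IP}:2380 \
              --listen-client-urls https://${IP}:2379,https://127.0.0.1:2379 \
              --advertise-client-urls https://${HOSTNAME}.${SET_NAME}:2379 \
              --initial-advertise-peer-urls https://${HOSTNAME}.${SET_NAME}:2380 \
              --initial-cluster-token etcd-cluster-1 \
              --initial-cluster ${PEERS} \
              --initial-cluster-state new \
              --data-dir /var/run/etcd/default.etcd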

CoreDNS Setup

As the endpoint that actually serves client requests, this is also an important piece to keep running; however, we don't really care about the data as it's backed by etcd. So, to handle this, we'll use a 3-pod deployment with a front-end service. The service is of type LoadBalancer, making it easily available to clients, so make sure you have that available. If you don't, see my previous post to install and configure MetalLB.

You might also notice that we're opening up both the TCP and UDP DNS ports but only exposing UDP from the load balancer. This is largely because a load balancer service can't implement both UDP and TCP at the same time, so feel free to remove TCP if you like. I have hope that at some point multi-protocol load balancers will be easier to manage, so for now I'm leaving it in.
$ cat coredns.yaml 
apiVersion: v1
kind: Service
metadata:
  name: coredns
  namespace: dns
spec:
  ports:
  - name: coredns
    port: 53
    protocol: UDP
    targetPort: 53
  selector:
    app: coredns
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: dns
data:
  Corefile: |
    . {
        errors
        health
        log
        etcd {
           endpoint http://etcd-dns:2379
        }
        cache 30
        prometheus 0.0.0.0:9153
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: dns
  labels:
    app: coredns
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coredns
  template:
    metadata:
      labels:
        app: coredns
        k8s_app: kube-dns
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9153"
        prometheus.io/path: /metrics
    spec:
      containers:
      - name: coredns
        image: coredns/coredns:latest
        imagePullPolicy: IfNotPresent
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
There are quite a few plugins [https://coredns.io/plugins/] you can apply to your coredns implementation, some of which you might want to play with. The documentation for these is quite good and they're easy to implement; they'd go in the ConfigMap alongside the errors and health entries. Just add the plugin name and any parameters it takes on a line and you're good to go. You may want to remove the log entry if your dns server is really busy or you don't want to see the continual stream of dns queries.

I'll also make special mention of the . { } block in the config map. This tells coredns to accept queries for any domain, which might not be to your liking. In my opinion this provides the most flexibility, as this shouldn't be your site's primary DNS server; requests for a specific domain or subdomain should be forwarded here from your primary DNS. However, if you want to change this, you'd simply enter one or more blocks such as example.org { } instead of . { }, as in the sketch below.
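For example, to serve only example.org, the Corefile in the config map above would become:
  Corefile: |
    example.org {
        errors
        health
        log
        etcd {
           endpoint http://etcd-dns:2379
        }
        cache 30
        prometheus 0.0.0.0:9153
    }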

External DNS

Finally, the reason we're here: deploying external-dns to our cluster. A couple of notes. I've chosen to scan the cluster for new or missing services every 15 seconds, which makes the DNS system feel very snappy when creating a service but might be too frequent or not frequent enough for your environment. I found the documentation particularly frustrating here; the closest coredns example I could find leverages minikube, with confusing options and commands to diff a helm chart, which doesn't feel very complete or intuitive to me.
$ cat external-dns.yaml 
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get","watch","list"]
- apiGroups: ["extensions"]
  resources: ["ingresses"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: registry.opensource.zalan.do/teapot/external-dns:latest
        args:
        - --source=service
        - --source=ingress
        - --provider=coredns
        - --registry=txt
        - --log-level=info
        - --interval=15s
        env:
          - name: ETCD_URLS 
            value: http://etcd-dns:2379
I've left the log-level entry in even though info is the default, as it's a helpful placeholder when you want or need to change it. The log levels, which I couldn't find documented and had to dig out of the code, are: panic, debug, info, warning, error, fatal. You'll also notice a reference to our etcd cluster service here, so if you've changed that name make sure you change it here too.
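If you're chasing a problem, bumping that level is a one line change to the container args, for example:
        args:
        - --source=service
        - --source=ingress
        - --provider=coredns
        - --registry=txt
        - --log-level=debug
        - --interval=15s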

Deployment and Cleanup Scripts

As I like to do, here are some quick deployment and cleanup scripts which can be helpful when testing over and over again:
$ cat deploy.sh 
kubectl create -f dns-namespace.yaml
kubectl create -f etcd.yaml
kubectl create -f external-dns.yaml
kubectl create -f coredns.yaml
As a reminder, deleting the namespace will clean up all the persistent volumes too. All of the data will be recreated on the fly, but it does mean a few extra seconds for the system to reclaim the volumes and recreate them on the next deploy.
$ cat cleanup.sh 
kubectl delete namespace dns
kubectl delete clusterrole external-dns
kubectl delete clusterrolebinding external-dns-viewer
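If you want to confirm the claims and volumes have actually been released before redeploying, a quick check is:
$ kubectl get pvc -n dns
$ kubectl get pv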

Success State

I also had trouble finding out what good looks like, so here's what you should see in the logs:
$ kubectl logs -n dns external-dns-57959dcfd8-fgqpn
time="2019-06-27T01:45:21Z" level=error msg="context deadline exceeded"
time="2019-06-27T01:45:31Z" level=info msg="Add/set key /skydns/org/example/nginx/66eeb21d to Host=10.9.176.196, Text=\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/default/nginx-frontend\", TTL=0"
The actual pod name will be different for you since we used a Deployment; you can get the exact name with kubectl get pods -n dns. In this example, the "context deadline exceeded" entry is bad: it means external-dns wasn't able to register the entry with etcd, in this case because that cluster was still booting. The last line shows a successful update into etcd.
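If you'd rather not chase the generated pod name at all, selecting by the Deployment's label works too:
$ kubectl logs -n dns -l app=external-dns --tail=50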

Etcd logs far too much to post in full here, but you'll see entries indicating it can't resolve the other hosts as they boot up, and potentially several MsgVote requests as the services start on all the pods. In the end it should establish a peer connection with all of the nodes and indicate the api is enabled.
$ kubectl logs -n dns etcd-dns-0
2019-06-27 01:45:15.124897 W | rafthttp: health check for peer c77fa62c6a3a8c7e could not connect: dial tcp: lookup etcd-dns-1.etcd-dns on 10.96.0.10:53: no such host
2019-06-27 01:45:15.128194 W | rafthttp: health check for peer dcb7067c28407ab9 could not connect: dial tcp: lookup etcd-dns-2.etcd-dns on 10.96.0.10:53: no such host

2019-06-27 01:45:15.272084 I | raft: 7300ad5a4b7e21a6 received MsgVoteResp from 7300ad5a4b7e21a6 at term 4
2019-06-27 01:45:15.272096 I | raft: 7300ad5a4b7e21a6 [logterm: 1, index: 3] sent MsgVote request to c77fa62c6a3a8c7e at term 4
2019-06-27 01:45:15.272105 I | raft: 7300ad5a4b7e21a6 [logterm: 1, index: 3] sent MsgVote request to dcb7067c28407ab9 at term 4
2019-06-27 01:45:17.127836 E | etcdserver: publish error: etcdserver: request timed out

2019-06-27 01:45:41.087147 I | rafthttp: peer dcb7067c28407ab9 became active
2019-06-27 01:45:41.087174 I | rafthttp: established a TCP streaming connection with peer dcb7067c28407ab9 (stream Message writer)
2019-06-27 01:45:41.098636 I | rafthttp: established a TCP streaming connection with peer dcb7067c28407ab9 (stream MsgApp v2 writer)
2019-06-27 01:45:42.350041 N | etcdserver/membership: updated the cluster version from 3.0 to 3.3
2019-06-27 01:45:42.350158 I | etcdserver/api: enabled capabilities for version 3.3
If your cluster won't start or ends up in a CrashLoopBackOff, most of the time I found the problem to be host resolution (dns). You can try changing the PEER entry from ${SET_NAME}-${i}.${SET_NAME} to just ${SET_NAME}. This won't let the cluster form properly, but it should get you far enough to see what's going on inside the pod. I'd also recommend setting the replicas to 1 while troubleshooting.
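A couple of commands I kept reaching for while debugging those crash loops, the second of which shows the logs from the previous (crashed) container:
$ kubectl describe pod -n dns etcd-dns-0
$ kubectl logs -n dns etcd-dns-0 --previous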

CoreDNS is pretty straightforward. It'll log a startup banner and then client queries, which look like the examples below: the first response, for nginx.example.org, returns NOERROR (this is good), and the second, for nginx2.example.org, returns NXDOMAIN, meaning the record doesn't exist. Again, if you want to cut down on these messages, remove the log line from the Corefile as mentioned above.
$ kubectl logs -n dns coredns-6c8d7c7d79-6jm5l
.:53
2019-06-27T01:44:44.570Z [INFO] CoreDNS-1.5.0
2019-06-27T01:44:44.570Z [INFO] linux/amd64, go1.12.2, e3f9a80
CoreDNS-1.5.0
linux/amd64, go1.12.2, e3f9a80
2019-06-27T02:11:43.552Z [INFO] 192.168.215.64:58369 - 10884 "A IN nginx.example.org. udp 35 false 512" NOERROR qr,aa,rd 68 0.002999881s
2019-06-27T02:13:08.448Z [INFO] 192.168.215.64:64219 - 40406 "A IN nginx2.example.org. udp 36 false 512" NXDOMAIN qr,aa,rd 87 0.007469218s

Using External DNS

To actually have a DNS name registered by external-dns, you need to add an annotation to your service. Here's one for nginx that requests an external load balancer and registers its IP under the name nginx.example.org:
$ cat nginx-service.yaml 
apiVersion: v1
kind: Service
metadata:
  name: nginx-frontend
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "nginx.example.org"
spec:
  ports:
  - name: "web"
    port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
From a Linux or Mac host, you can use nslookup to verify the entry, where 10.9.176.212 is the IP of my coredns service.
$ kubectl get svc -n dns
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)             AGE
coredns    LoadBalancer   10.100.208.145   10.9.176.212   53:31985/UDP        20h
etcd-dns   ClusterIP      10.100.83.154    <none>         2379/TCP,2380/TCP   20h
$ nslookup nginx.example.org 10.9.176.212
Server:  10.9.176.212
Address: 10.9.176.212#53

Name: nginx.example.org
Address: 10.9.176.213
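dig works as well if you prefer it; it should return the same 10.9.176.213 address shown above:
$ dig @10.9.176.212 nginx.example.org +short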

Notes

Kubernetes already comes with an etcd and, in newer releases, coredns, so why not use those? Well, you probably can, but in my opinion those are meant for core cluster functions and we shouldn't be messing around with them; they're also secured with https, so you'd need to go through the process of getting certificates set up. While I didn't find any links that really suited my needs, here are some that helped me along; maybe they'll help you too.