
Friday, July 12, 2019

Kubernetes Infrastructure Overview

I've posted several blog entries on setting up various parts of an on-premises Kubernetes installation. This is meant as a summary referencing code posted to GitHub for easy access. You can clone the entire repository, edit the required files, and use the deploy.sh/cleanup.sh scripts, or run the deployment directly from GitHub as documented below. Each of the headers below is a link to the corresponding blog post describing the process in detail.

If you'd like to clone the code, run this command.
[root@kube-master ~]# git clone https://github.com/mike-england/kubernetes-infra.git

Cluster Install

While this can be automated through templates or tools like Terraform, for now I recommend following the cluster install post itself.








Logging

This setup can be almost entirely automated, but unfortunately you'll need to modify the Elasticsearch output in the config file first.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-role.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-configmap.yaml
<modify the output server entry elasticsearch.prod.int.com and the index to match your Kubernetes cluster name>
[root@kube-master ~]# kubectl create -f fluent-bit-configmap.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/logging/fluent-bit-daemon-set.yaml
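If you'd rather script the change than open an editor, a quick sed works; elasticsearch.example.com below is just a placeholder for your own Elasticsearch host, and the grep confirms the result and helps you spot the index entry to adjust:
[root@kube-master ~]# sed -i 's/elasticsearch\.prod\.int\.com/elasticsearch.example.com/' fluent-bit-configmap.yaml
[root@kube-master ~]# grep -n -i 'elasticsearch\|index' fluent-bit-configmap.yaml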

Load Balancing

Installation of MetalLB is straightforward. As with logging, you'll need to modify the config map, this time changing the IP address range. If you're running a cluster with Windows nodes, be sure to patch the MetalLB DaemonSet so it doesn't get deployed to any of those nodes.
[root@kube-master ~]# kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/metal-config.yaml
<modify ip address range>
[root@kube-master ~]# kubectl create -f metal-config.yaml
If you're running a mixed cluster with Windows nodes, also apply this patch:
[root@kube-master ~]# wget https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/load_balancer/node-selector-patch.yaml
[root@kube-master ~]# kubectl patch ds/speaker --patch "$(cat node-selector-patch.yaml)" -n=metallb-system
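For reference, MetalLB v0.7 expects a layer2 address pool in that config map. A minimal sketch of the data section, where the pool name and address range are placeholders you'd swap for free IPs on your network:
address-pools:
- name: default
  protocol: layer2
  addresses:
  - 192.168.1.240-192.168.1.250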

Monitoring

Assuming you have the load balancer installed above, you should be able to deploy monitoring without any changes.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-prometheus.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-config-map.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-server.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-node-exporter.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/clusterRole-kube-state.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/monitoring/prometheus-kube-state.yaml

DNS Services

Again, with the load balancer in place, this should be deployable as is.
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/dns-namespace.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/etcd.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/external-dns.yaml
[root@kube-master ~]# kubectl create -f https://raw.githubusercontent.com/mike-england/kubernetes-infra/master/external_dns/coredns.yaml
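As a quick sanity check once everything is running, you can query CoreDNS directly through its load-balanced address; the IP and hostname below are placeholders, and I'm assuming the CoreDNS service has picked up an external IP from MetalLB:
[root@kube-master ~]# kubectl get svc --all-namespaces | grep coredns
[root@kube-master ~]# dig @10.0.50.53 myservice.k8s.example.com +short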

Wednesday, June 19, 2019

Monitoring Kubernetes With Prometheus

There are several ways to monitor a Kubernetes cluster: some free, some paid, and some specific to a vendor's cluster implementation. If you're looking for the easy button, look at purchasing a solution; there are many out there. For this article, however, we'll be deploying something slightly more involved but free, using Prometheus as the monitoring solution.
Let's start with a few concepts:
  • Prometheus pulls data through a process called a scrape. Scraping is a handy approach because you don't need agents pushing data everywhere, but it can limit your scalability
  • Prometheus scrapes metric endpoints, which are configured using jobs. We'll be installing a couple of endpoints to look at specific pieces of the infrastructure, but if you want your applications to be included, they'll need to expose Prometheus metrics and carry specific annotations in their deployment file (see the sketch after this list)
  • There's lots of documentation about Prometheus' ability to self-monitor and automatically pick up new endpoints; this is awesome
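To expand on the application endpoint point above: the kubernetes-pods scrape job in the config map further down keys off a few pod annotations. A sketch of what a deployment's pod template would carry, where the port and path are placeholders for wherever your app serves metrics:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"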
There are many different endpoints to choose from and things can become confusing very quickly, so hopefully this guide gives you a base implementation to expand on as you see fit for your environment. To do this, there are three base components we'll be setting up:
  • prometheus - the collector and repository of metric data, obviously
  • kube-state-metrics - deep Kubernetes metrics such as pod and node utilization. This is kind of like a more detailed version of metrics-server, which is itself a replacement for Heapster. It's confusing, I know.
  • node_exporter - collects detailed node metrics. There will be one endpoint per node in the cluster

Namespace And Role Setup

I like the idea of keeping my monitoring pieces in their own namespace, so we'll create a monitoring namespace and some roles for the different components to use. I've included the namespace creation in the prometheus cluster role setup; if you're doing things differently, make sure to take that into account. They're long (sorry) and better suited to GitHub, but I wanted to make this as easy to follow as possible:
$ cat clusterRole-prometheus.yaml 
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
And our cluster role and service account setup for kube-state-metrics.
$ cat clusterRole-kube-state.yaml 
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
- apiGroups: ["policy"]
  resources:
  - poddisruptionbudgets
  verbs: ["list", "watch"]
- apiGroups: ["certificates.k8s.io"]
  resources:
  - certificatesigningrequests
  verbs: ["list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources:
  - storageclasses
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring

Prometheus Configuration

The next thing we'll need is a configuration map for Prometheus itself. I've elected to collect data every minute; there are examples where people collect every 5 seconds, so pick what makes sense to you. The scrape jobs are largely specific to the type of endpoint in use. If you're within the same cluster, the certificates and bearer tokens will automatically be pulled into the appropriate containers, but if not, you'll need to reference them directly, which is out of scope for this article.
$ cat prometheus-config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 1m
      evaluation_interval: 1m
      scrape_timeout: 10s
    rule_files:
      - /etc/prometheus/prometheus.rules
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-nodes-cadvisor'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name

Prometheus Deployment

Once the configuration file has been set up, it's time to deploy Prometheus itself. This configuration uses a Deployment; you could easily convert it to a StatefulSet if you like. Please note the persistent volume created here belongs to the monitoring namespace, so if you clean out that namespace, for example to reload the environment, you will wipe the old data. Also keep in mind the storage class in use: I'm using vsphere-ssd from my original blog post, which might not be suitable for your environment. I'm also using a NodePort as it's always available and doesn't require additional network components, but the inbound port will change every time you deploy, so that might not be ideal long term.
$ cat prometheus-server.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
  labels:
    app: prometheus
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: "vsphere-ssd"
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      securityContext:
        fsGroup: 65534
      containers:
      - name: prometheus
        image: prom/prometheus:v2.10.0
        volumeMounts:
          - name: prometheus-config-volume
            mountPath: /etc/prometheus/prometheus.yml
            subPath: prometheus.yml
          - name: data
            mountPath: /prometheus
        ports:
        - containerPort: 9090
      volumes:
        - name: prometheus-config-volume
          configMap:
           name: prometheus-server-conf
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
      serviceAccountName: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
spec:
  selector:
    app: prometheus-server
  ports:
  - name: promui
    protocol: TCP
    port: 9090
    targetPort: 9090
  type: NodePort

End Point Deployment

Our last two configuration files set up a deployment and internal service for kube-state-metrics, and a DaemonSet and internal service for node-exporter. Because node-exporter runs as a DaemonSet, every node automatically gets a copy when it joins the cluster. Self-managed monitoring!
$ cat prometheus-kube-state.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.6.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
$ cat prometheus-node-exporter.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    k8s-app: node-exporter
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.18.1
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
      version: v0.18.1
  updateStrategy:
    type: OnDelete
  template:
    metadata:
      labels:
        k8s-app: node-exporter
        version: v0.18.1
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      containers:
        - name: prometheus-node-exporter
          image: prom/node-exporter:v0.18.1
          imagePullPolicy: "IfNotPresent"
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - name: metrics
              containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly:  true
            - name: sys
              mountPath: /host/sys
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
      hostNetwork: true
      hostPID: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "NodeExporter"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
      targetPort: 9100
  selector:
    k8s-app: node-exporter

Deployment and Cleanup Scripts

There are a lot of YAML files here, and things get annoying when you want to deploy, test, clean up, and repeat, so here are two simple scripts that will deploy all of these files and then clean everything up again if you need them:
$ cat deploy.sh 
kubectl create -f clusterRole-prometheus.yaml
kubectl create -f prometheus-config-map.yaml
kubectl create -f prometheus-server.yaml
kubectl create -f prometheus-node-exporter.yaml
kubectl create -f clusterRole-kube-state.yaml
kubectl create -f prometheus-kube-state.yaml
The bulk of the cleanup happens when you delete the namespace, but remember, this will also delete the persistent volume hosting the prometheus data.
$ cat cleanup.sh 
kubectl delete namespace monitoring
kubectl delete clusterrolebinding prometheus
kubectl delete clusterrole prometheus
kubectl delete clusterrolebinding kube-state-metrics
kubectl delete clusterrole kube-state-metrics

Accessing Prometheus

You can check the pods and get your access point using these commands. In my cluster, I've got two worker nodes so I've got two node-exporters.
$ kubectl get pods -n monitoring
NAME                                  READY   STATUS    RESTARTS   AGE
kube-state-metrics-699fdf75f8-cqq5t   1/1     Running   0          18h
node-exporter-88297                   1/1     Running   0          18h
node-exporter-wb2lk                   1/1     Running   0          18h
prometheus-server-6f9d9d86d4-m8x4f    1/1     Running   0          18h
And the all-important services are here.
$ kubectl get svc -n monitoring
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
kube-state-metrics   ClusterIP   10.110.207.159   <none>        8080/TCP,8081/TCP   18h
node-exporter        ClusterIP   None             <none>        9100/TCP            18h
prometheus-service   NodePort    10.101.114.89    <none>        9090:30044/TCP      18h
You can see our NodePort service, which means I can hit either node on port 30044 and be redirected to the pod on port 9090. Let's try that in a browser: http://k8s-n2.itlab.domain.com:30044/graph, where you should be presented with the Prometheus dashboard.
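If you prefer the command line, the same NodePort serves Prometheus' HTTP API; the hostname and port below are from my cluster, so substitute your own. The up query simply reports 1 or 0 for each scrape target:
$ curl 'http://k8s-n2.itlab.domain.com:30044/api/v1/query?query=up'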

If you click on Status > Targets you should see everything Prometheus is scraping, and if you hover over the Labels you will see a lot of information including job="job_name". This can be particularly useful for tying back to your config map, especially when some things show as down.

If you have been following my logging article, there are annotations in there for Prometheus, which become useful now, as any metrics Fluent Bit provides are automatically added. Every endpoint on the Targets page should show a state of up.

Grafana Integration

The last step in this guide is where the real work begins. If you don't have a Grafana instance up and running, I'd suggest setting one up on a separate Linux box. There are some excellent guides with essentially one command to run: https://grafana.com/docs/installation/rpm/.

After that's done, add a data source of type Prometheus with the web URL you used to access the dashboard above (in my case, http://k8s-n2.itlab.domain.com:30044) and you can start creating dashboards. There are also some prebuilt ones publicly hosted on Grafana's website, https://grafana.com/dashboards, which you can add by simply placing the dashboard ID into the Grafana import as documented here: https://grafana.com/docs/reference/export_import/.
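If you'd rather script the data source than click through the UI, Grafana's HTTP API will take it; the admin credentials, Grafana host, and Prometheus URL below are all placeholders for your own environment:
$ curl -X POST -H 'Content-Type: application/json' \
    http://admin:admin@grafana.example.com:3000/api/datasources \
    -d '{"name":"Prometheus","type":"prometheus","url":"http://k8s-n2.itlab.domain.com:30044","access":"proxy"}'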

Notes

This is just scratching the surface of monitoring, but hopefully it gives you enough of a framework to build from. Alertmanager is a key piece yet to be discussed, as are long-term storage and, of course, dashboards, lots and lots of dashboards.

Sunday, October 30, 2011

Misaligned I/Os

While not unique to virtualization, misaligned I/O generally doesn't cause much of a problem until you consolidate a whole bunch of poorly set up partitions onto one array; that's when things tend to go from all right to a really bad day. The fundamental problem is a mismatch between where the OS places data and where the storage array ultimately keeps it. Both work in logical chunks of data and both present a virtual view of this to the higher layers. There are two specific cases I'd like to address: one that you can fix, and one that you can only manage.

Storage Alignment
Let's start with a simple illustration of the problem.
Most legacy operating systems (Linux included) like to include a 63-sector offset at the beginning of a drive. This is a real problem, as now every read and write overlaps the block boundaries of the physical array. It doesn't matter whether you are using VMFS or NFS to host a datastore; it's the same problem. Yes, an NFS repository will always be aligned, but remember this is a virtual representation to the OS, which happily messes everything up by offsetting its first partition.

Misalignment is bad enough when we read; the storage array will pull two blocks when one is requested. But it is of particular importance when we write data. Most arrays use some sort of parity to manage redundancy, and if you need to deal with two blocks for every write request, the system overhead can be enormous. It's also important to keep in mind that every storage vendor has this issue. Even a raw, single drive can benefit from aligned partitions, especially when we consider most new drives ship with a 4KB sector size called Advanced Format.

The impact to each vendor will be slightly different. For example, EMC uses a 64KB block size, so not every write will be unaligned. NetApp uses a 4KB block, which means every write will be unaligned but they handle writes quite a bit differently as the block doesn't have to go back to the same place it came from. Pick your poison.
As you can see, when the OS blocks are aligned, everything through the stack can run at its optimal rate, where one OS request translates to one storage request.

Correctable Block Alignment
Fortunately, most modern operating systems have recognized this problem and there is little to do. For example, Windows 2008 now uses a 2048-sector offset (1MB), as do most current Linux distributions. For Linux, it is easy enough to check.
# fdisk -lu /dev/sdb

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x77cbefef

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048  3907029167  1953513560   83  Linux
As you can see from this example, my starting sector is 2048. After that everything will align, including subsequent partitions. The -u option tells fdisk to display units in sectors rather than cylinders. I generally recommend this option when creating the partition as well, although it seems to be the default for fdisk 2.19. You can also check your file system block size to better visualize how this relates through the stack. The following shows that my ext4 partition is using a 4KB block size:
# dumpe2fs /dev/sdb1 | grep -i 'block size'
dumpe2fs 1.41.14 (22-Dec-2010)
Block size:               4096
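If parted is available, it can also confirm alignment directly against what the device reports as its optimal I/O size; this checks the first partition on /dev/sdb:
# parted /dev/sdb align-check optimal 1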

Uncorrectable Block Alignment
VMware has come out with a nifty feature, the linked clone, which I first saw in VDI (now called View) and which later made its way into their cloud offering. Basically it allows you to quickly create a copy of a machine using very little disk space, because it reads common data from the original source and writes new data to a new location. Sounds a lot like snapshots, doesn't it?

Well, the problem with this approach is that every block written requires a little header to tell VMware where the new block belongs in the grand scheme of things. This is similar to our 63-sector offset, but now for every block, nice. It's a good idea to start your master image off with an aligned file system, as it will help with reading data, but it doesn't amount to much when you write. And does a Windows desktop ever like to write: just to exist (no workload), our testing has shown Windows does 1-2 write IOs per second. Linux isn't completely off the hook, but generally does 1/3 to 1/2 of that, and it isn't that common with linked clones yet, as they aren't supported in VDI but will get kicked around in vCloud.

Managing Bad Block Alignment

When using linked clones, there are a few steps you can take to minimize the impact:
  • Refresh your images as often as you can. This will keep the journal file to a minimum and corresponding system overhead. If you can't refresh images, you probably shouldn't be using linked clones.
  • Don't turn on extra features for those volumes like array based snapshots or de-duplication. The resources needed to track changes for both of these features can cause significant overhead. Use linked clones or de-dupe, not both.
  • Monitor your progress on a periodic basis. I do this 3 times a day so we can track changes over time. If you can't measure it, you can't fix it.
In the future, VAAI is promised to save us by both VMware and every storage vendor that can spell. Its intent is to perform the same linked clone API call but let the storage array figure out the best method of managing the problem. I've yet to see it work in practice, it's still "in the next release", but I have hope.

Tuesday, June 29, 2010

Dynamic CPU Cores

A neat trick I learned to disable and re-enable a CPU core dynamically in Linux. Handy for testing.
Disable a core
# echo 0 > /sys/devices/system/cpu/cpu1/online
Enable a core
# echo 1 > /sys/devices/system/cpu/cpu1/online
You can't disable CPU0 but all others are fair game.
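You can see which cores are currently online, and with a small loop take everything except CPU0 offline at once (handy for single-core testing):
# cat /sys/devices/system/cpu/online
# for cpu in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0 > $cpu; done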

Sunday, June 20, 2010

Linux Virtual File System (VFS)

Every file system under Linux is represented to a user process not directly, but through a virtual file system layer. This allows the underlying structure to change, for example from reiserfs to xfs to ext4, without having to change any application code. For each file system there is either a loadable or an integrated kernel module. This module is responsible for the low-level operations and also for providing standard information back to the VFS layer. You can see which modules have registered by looking at /proc/filesystems.
# cat /proc/filesystems
nodev   sysfs
nodev   rootfs
nodev   bdev
nodev   proc
nodev   tmpfs
nodev   devtmpfs
nodev   debugfs
nodev   securityfs
nodev   sockfs
nodev   usbfs
nodev   pipefs
nodev   anon_inodefs
nodev   inotifyfs
nodev   devpts
        ext3
        ext2
nodev   ramfs
nodev   hugetlbfs
        iso9660
nodev   mqueue
        ext4
nodev   fuse
        fuseblk
nodev   fusectl
nodev   vmblock
The first column indicates whether the file system requires a block device (nodev means it doesn't). The second is the file system name as it is registered with the kernel.

When a filesystem is mounted, the mount command always passes three pieces of information to the kernel: the physical block device, the mount point, and the file system type. However, we generally don't specify the file system type, at least on the command line, and looking at man mount(8) shows that this information is optional. So how does the kernel know which module to load? As it turns out, mount makes a library call to libblkid, which is capable of determining quite a range of file system types. There is also a user space program that uses libblkid, aptly named blkid. Feel free to have a look at the source for blkid to see the full file system list. You can also run it against your local system to see the results it produces.
# blkid /dev/sdb1
/dev/sdb1: UUID="06749374749364E9" TYPE="ntfs"
# blkid /dev/sda1
/dev/sda1: UUID="207abd21-25b1-43bb-81d3-1c8dd17a0600" TYPE="swap"
# blkid /dev/sda2
/dev/sda2: UUID="67ea3939-e60b-4056-9465-6102df51c532" TYPE="ext4"
Of course, if blkid isn't able to determine the type (shown by the error "mount: you must specify the filesystem type"), it has to be specified by hand with the -t option. Now if we look at an strace of a mount command we can see the system call in action. The first example is a standard file system requiring a block device, the second is sysfs. Notice how mount still passes the three options.
# strace mount -o loop /test.img /mnt
...
stat("/sbin/mount.vfat", 0x7fff1bd75b80) = -1 ENOENT (No such file or directory)
mount("/dev/loop0", "/mnt", "vfat", MS_MGC_VAL, NULL) = 0
...

# strace mount -t sysfs sys /sys
...
stat("/sbin/mount.sysfs", 0x7fff21628c30) = -1 ENOENT (No such file or directory)
mount("/sys", "/sys", "sysfs", MS_MGC_VAL, NULL) = 0
...
Looking at the system call mount(2), we can see there are actually five required arguments: source, target, file system type, mount flags, and data. The mount flag in this case is MS_MGC_VAL, which is ignored as of the 2.4 kernel, but there are several other options that will look familiar. Have a look at the man page for a full list.

The kernel can now request the proper driver (loaded by kerneld) which is able to query the superblock from the physical device and initialize its internal variables. There are several fundamental data types held within VFS as well as multiple caches to speed data access.

Superblock
Every mounted file system has a VFS superblock which contains key records to enable retrieval of full file system information. It identifies the device the file system lives on, its block size, the file system type, a pointer to the first inode of this file system (a dentry pointer), and a pointer to file-system-specific methods. These methods provide a mapping between generic functions and file-system-specific ones. For example, a read inode call can be referenced generically under VFS but issue a file-system-specific command. Applications are able to make common system calls regardless of the underlying structure. It also means VFS is able to cache certain lookup data for performance and provide generic features like chroot for all file systems.

Inodes
An index node (inode) contains the metadata for a file, and in Linux, everything is a file. Each VFS inode is kept only in the kernel's memory and its contents are built from the underlying file system. It contains the following attributes: device, inode number, access mode (permissions), usage count, user id (owner), group id (group), rdev (if it's a special file), access time, modify time, create time, size, blocks, block size, a lock, and a dirty flag.
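Most of these attributes can be seen from user space with stat, which retrieves them through the VFS inode; for example:
# stat -c 'inode=%i links=%h uid=%u gid=%g size=%s blocks=%b blocksize=%B' /etc/passwd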

A combination of the inode number and the mounted device is used to create a hash table for quick lookups. When a command like ls makes a request for an inode that is already cached, its usage counter is increased and operations continue. If it's not found, a free VFS inode must be obtained so that the file system can read it into memory. To do this there are two options: new memory space can be provisioned, or, if all the available inode cache is in use, an existing inode with a usage count of zero can be reused. Once an inode is allocated, a file-system-specific method is called to read it from disk and populate the data as required.

Dentries
A directory entry (dentry) is responsible for managing the file system tree structure. The contents of a dentry are a list of inodes and corresponding file names, as well as the parent (containing) directory, the superblock for the file system, and a list of subdirectories. With both the parent and a list of subdirectories kept in each dentry, a chain in either direction can be referenced, allowing commands to quickly traverse the full tree. As with inodes, directory entries are cached for quick lookups, although instead of a usage count the cache uses a least recently used model. There is also an in-depth article on locking and scalability of the directory entry cache.

Data Cache
Another vital service VFS provides is the ability to cache file-level data as a series of memory pages. A page is a fixed-size unit of memory and is the smallest unit for performing both memory allocation and transfers between main memory and a data store such as a hard drive. Generally this is 4KB for an x64-based system; however, huge pages are supported in the 2.6 kernel, providing sizes as large as 1GB. You can find the page size for your system by typing getconf PAGESIZE; the result is in bytes.
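For example, on a typical x64 system:
# getconf PAGESIZE
4096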

When the overall memory of a system becomes strained, the kernel may decide to swap portions out to available disk. This of course can have a serious impact on application performance; however, there is a way to control it: swappiness. Having a look at /proc/sys/vm/swappiness will show the current value; a lower number means the system will swap less, a higher one means it will swap more. To prevent swapping altogether, type:
# echo 0 > /proc/sys/vm/swappiness
To make this change persistent across a reboot, edit /etc/sysctl.conf and add the following line:
vm.swappiness=0
Of course, you may not want to turn swap off entirely, so some testing to find the right balance may be in order.
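You can check the current value and apply the sysctl.conf change without a reboot using the sysctl command:
# sysctl vm.swappiness
# sysctl -p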

A second layer of caching available to VFS is the buffer cache. Its job is to store copies of physical blocks from disk. With the 2.4 kernel, a cache entry (referenced by a buffer_head) would contain a copy of one physical block; since version 2.6, however, a new structure called a BIO has been introduced. While the fundamentals remain the same, a BIO is also able to point to other buffers as a chain. This means blocks can be logically grouped into a larger entity such as an entire file. This improves performance for common application functions and allows the underlying systems to make better allocation choices.

The Big Picture
Here are the components described above put together.

Controlling VFS
vmstat
vmstat gives us lots of little gems about how the overall memory, CPU, and file system cache are behaving.
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 6987224  30792 313956    0    0    85    53  358  651  3  1 95  1  0
Of particular interest to VFS is the amount of free memory. From the discussion above, buff refers to memory holding cached block data (the buffer cache), and cache refers to memory holding cached file data (the page cache). The amount of swap used and the active swap operations, which can have a significant performance impact, are also shown here: si is memory swapped in from disk and so is memory swapped out.

Other items shown are r for the number of processes waiting to be executed (the run queue) and b for the number of processes blocking on I/O. Under system, in shows the number of interrupts per second, and cs shows the number of context switches per second. Under io, bi and bo show the number of blocks read in and written out to physical disk. The block size for a given file system can be shown using stat -f or tune2fs -l against the device.

Flushing VFS
It is possible to manually request a flush of clean entries from the VFS caches through the /proc file system.
Free page cache
# echo 1 > /proc/sys/vm/drop_caches
Free dentries and inodes
# echo 2 > /proc/sys/vm/drop_caches
Free page cache, dentries, and inodes
# echo 3 > /proc/sys/vm/drop_caches
While not required, it is a good idea to first run sync to force any dirty blocks to disk. An unmount and remount will also flush out all cache entries, but that can be disruptive depending on other system functions. This can be a useful tool when performing disk-based benchmark exercises.
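Putting that together, a typical reset before a benchmark run looks like this:
# sync; echo 3 > /proc/sys/vm/drop_caches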

slabtop
slabtop (which reads /proc/slabinfo) shows overall kernel memory allocation, and within that includes some specific statistics pertaining to VFS. Items such as the number of inodes, dentries, and buffer_heads (a wrapper to the BIO) are available.
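For a quick look at just the VFS-related slabs, you can run slabtop in one-shot mode and filter, or read /proc/slabinfo directly:
# slabtop -o | grep -E 'dentry|inode_cache|buffer_head'
# grep -E 'dentry|inode_cache|buffer_head' /proc/slabinfo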