Tuesday, June 29, 2010

Dynamic CPU Cores

A neat trick I learned for disabling and re-enabling a CPU core dynamically in Linux. Handy for testing.
Disable a core
# echo 0 > /sys/devices/system/cpu/cpu1/online
Enable a core
# echo 1 > /sys/devices/system/cpu/cpu1/online
You can't disable CPU0, but all other cores are fair game.
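To confirm the change took effect, the same sysfs file can be read back, and /proc/cpuinfo will reflect the new count. A quick sketch, where the counts will obviously depend on your machine:
# cat /sys/devices/system/cpu/cpu1/online
0
# grep -c ^processor /proc/cpuinfo
3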

Sunday, June 20, 2010

Linux Virtual File System (VFS)

Every file system under Linux is presented to a user process not directly, but through a virtual file system layer. This allows the underlying structure to change, for example from reiserfs to xfs to ext4, without having to change any application code. For each supported file system there is either a loadable or a built-in kernel module. This module is responsible for the low-level operations and also for providing standard information back to the VFS layer. You can see which modules have registered by looking at /proc/filesystems.
# cat /proc/filesystems
nodev   sysfs
nodev   rootfs
nodev   bdev
nodev   proc
nodev   tmpfs
nodev   devtmpfs
nodev   debugfs
nodev   securityfs
nodev   sockfs
nodev   usbfs
nodev   pipefs
nodev   anon_inodefs
nodev   inotifyfs
nodev   devpts
        ext3
        ext2
nodev   ramfs
nodev   hugetlbfs
        iso9660
nodev   mqueue
        ext4
nodev   fuse
        fuseblk
nodev   fusectl
nodev   vmblock
The first column indicates whether the file system requires a block device (nodev means it does not); the second is the file system name as it is registered with the kernel.

When a file system is mounted, the mount command always passes three pieces of information to the kernel: the physical block device, the mount point, and the file system type. However, we generally don't specify the file system type on the command line, and the man page for mount(8) shows that this information is optional. So how does the kernel know which module to load? As it turns out, mount makes a library call to libblkid, which is capable of determining quite a range of file system types. There is also a user space program that uses libblkid, aptly named blkid. Feel free to have a look at the source for blkid to see the full file system list. You can also run it against your local system to see the results it produces.
# blkid /dev/sdb1
/dev/sdb1: UUID="06749374749364E9" TYPE="ntfs"
# blkid /dev/sda1
/dev/sda1: UUID="207abd21-25b1-43bb-81d3-1c8dd17a0600" TYPE="swap"
# blkid /dev/sda2
/dev/sda2: UUID="67ea3939-e60b-4056-9465-6102df51c532" TYPE="ext4"
Of course, if blkid isn't able to determine the type (shown with the error mount: you must specify the filesystem type), it has to be specified by hand with the -t option. Now if we look at an strace of a mount command we can see the system call in action. The first example is a standard file system requiring a block device; the second is from sysfs. Notice how mount still passes the three options.
# strace mount -o loop /test.img /mnt
...
stat("/sbin/mount.vfat", 0x7fff1bd75b80) = -1 ENOENT (No such file or directory)
mount("/dev/loop0", "/mnt", "vfat", MS_MGC_VAL, NULL) = 0
...

# strace mount -t sysfs sys /sys
...
stat("/sbin/mount.sysfs", 0x7fff21628c30) = -1 ENOENT (No such file or directory)
mount("/sys", "/sys", "sysfs", MS_MGC_VAL, NULL) = 0
...
Looking at the system call mount(2), we can see there are actually five required arguments: source, target, file system type, mount flags, and data. The mount flag in this case is MS_MGC_VAL, which is ignored as of the 2.4 kernel, but there are several other options that will look familiar. Have a look at the man page for a full list.
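If you want to see some of the other flags in action, strace can be filtered down to just the mount call; a read-only remount of the loop device mounted above, for instance, should show MS_RDONLY and MS_REMOUNT in the flags argument. This is only a sketch, the exact arguments will vary with your mount and kernel versions.
# strace -e trace=mount mount -o remount,ro /mnt
mount("/dev/loop0", "/mnt", "vfat", MS_RDONLY|MS_REMOUNT, NULL) = 0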

The kernel can now request the proper driver (loaded by kerneld) which is able to query the superblock from the physical device and initialize its internal variables. There are several fundamental data types held within VFS as well as multiple caches to speed data access.

Superblock
Every mounted file system has a VFS superblock which contains key records to enable retrieval of full file system information. It identifies the device the file system lives on, its block size, the file system type, a pointer to the first inode of this file system (a dentry pointer), and a pointer to file system specific methods. These methods provide a mapping between generic functions and file system specific ones. For example, a read inode call can be referenced generically under VFS but issue a file system specific command. Applications are able to make common system calls regardless of the underlying structure. It also means VFS is able to cache certain lookup data for performance and provide generic features like chroot for all file systems.
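Some of this superblock information is visible from user space; stat -f issues a statfs call that VFS answers from the mounted superblock. The output below is only illustrative, the numbers are made up:
# stat -f /
  File: "/"
    ID: f1ba8e38e398b18b Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 9612197    Free: 2434555    Available: 1946265
Inodes: Total: 2445984    Free: 2137621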

Inodes
An index node (inode) contains the metadata for a file, and in Linux, everything is a file. Each VFS inode is kept only in the kernel's memory and its contents are built from the underlying file system. It contains the following attributes: device, inode number, access mode (permissions), usage count, user id (owner), group id (group), rdev (if it's a special file), access time, modify time, change time, size, blocks, block size, a lock, and a dirty flag.
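Most of those attributes can be seen from user space with stat (and ls -i for just the inode number). The file and values below are purely illustrative:
# ls -i /etc/hosts
393284 /etc/hosts
# stat /etc/hosts
  File: `/etc/hosts'
  Size: 265         Blocks: 8          IO Block: 4096   regular file
Device: 802h/2050d  Inode: 393284      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-06-20 10:15:01.000000000 -0400
Modify: 2010-06-20 10:15:01.000000000 -0400
Change: 2010-06-20 10:15:01.000000000 -0400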

A combination of the inode number and the mounted device is used to create a hash table for quick lookups. When a command like ls makes a request for an inode, its usage counter is increased and operations continue. If it's not found, a free VFS inode must be located so that the file system can read it into memory. To do this there are two options: new memory space can be provisioned, or, if all the available inode cache is used, an existing entry can be reused, selecting from those with a usage count of zero. Once an inode is found, a file system specific method is called to read it from disk and the data is populated as required.
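A rough view of the inode cache is exposed under /proc; the first number is the total number of inodes the kernel has allocated and the second is how many of those are currently unused (the values here are illustrative):
# cat /proc/sys/fs/inode-nr
54237   12044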

Dentries
A directory entry (dentry) is responsible for managing the file system tree structure. The contents of a dentry are a list of inodes and corresponding file names as well as the parent (containing) directory, the superblock for the file system, and a list of subdirectories. With both the parent and a list of subdirectories kept in each dentry, a chain in either direction can be referenced, allowing commands to quickly traverse the full tree. As with inodes, directory entries are cached for quick lookups, although instead of a usage count the cache uses a Least Recently Used model. There is also an in-depth article on the locking and scalability of the directory entry cache worth reading.
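A similar summary exists for the dentry cache; the first two fields are the total number of dentries and the number sitting on the unused (LRU) list, followed by the age limit in seconds (again, illustrative numbers):
# cat /proc/sys/fs/dentry-state
87338   74121   45      0       0       0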

Data Cache
Another vital service VFS provides is the ability to cache file level data as a series of memory pages. A page is a fixed size of memory and is the smallest unit for performing both memory allocation and transfers between main memory and a data store such as a hard drive. Generally this is 4KB for an x64 based system; however, huge pages are supported in the 2.6 kernel, providing sizes as large as 1GB. You can find the page size for your system by typing getconf PAGESIZE; the result is in bytes.
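For example, on a typical x86_64 box (the huge page numbers will depend entirely on whether you have configured any):
# getconf PAGESIZE
4096
# grep Huge /proc/meminfo
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB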

When the overall memory of a system becomes strained, VFS may decide to swap out portions to available disk. This of course can have a serious impact on application performance; however, there is a way to control this: swappiness. Having a look at /proc/sys/vm/swappiness will show the current value; a lower number means the system will swap less, a higher one will swap more. To prevent swapping altogether type:
# echo 0 > /proc/sys/vm/swappiness
To make this change persistent across a reboot, edit /etc/sysctl.conf with the following line:
vm.swappiness=0
Of course you may not want to turn swapping off entirely, so some testing to find the right balance may be in order.
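The sysctl command can be used instead of writing to /proc directly; a quick sketch, where the value 10 is just an example of swapping less aggressively:
# sysctl vm.swappiness
vm.swappiness = 60
# sysctl -w vm.swappiness=10
vm.swappiness = 10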

A second layer of caching available to VFS is the buffer cache. Its job is to store copies of physical blocks from a disk. With the 2.4 kernel, a cache entry (referenced by a buffer_head) would contain a copy of one physical block; since version 2.6, however, a new structure called a BIO has been introduced. While the fundamentals remain the same, a BIO is also able to point to other buffers as a chain. This means blocks can be logically grouped into a larger entity such as an entire file. This improves performance for common application functions and allows the underlying systems to make better allocation choices.
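The free command reports the two caches as separate columns, which makes it easy to watch them grow and shrink. The numbers below are only illustrative:
# free -m
             total       used       free     shared    buffers     cached
Mem:          7872       1024       6848          0         30        306
-/+ buffers/cache:        688       7184
Swap:         2047          0       2047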

The Big Picture
Here are the components described above put together.

Controlling VFS
vmstat
vmstat gives us lots of little gems about how the overall memory, CPU, and file system cache are behaving.
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 6987224  30792 313956    0    0    85    53  358  651  3  1 95  1  0
Of particular interest to VFS is the amount of free memory. From the discussion above, buff refers to the amount of memory used for the buffer cache (block data) and cache refers to the amount used for the page cache (file data), both reported in kilobytes by default. The amount of swap used and active swap operations can have a significant performance impact and are also shown here as swpd (swap in use), si (memory swapped in from disk), and so (memory swapped out to disk).

Other items shown are r for the number of processes waiting to be executed (the run queue) and b for the number of processes blocked on I/O. Under system, in shows the number of interrupts per second, and cs shows the number of context switches per second. Under io, bi and bo show the number of blocks in and out from physical disk. The block size for a given file system can be shown using stat -f, or tune2fs -l against a physical device.
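For example, grepping the tune2fs output against one of the ext4 devices from earlier should show the file system block size (the device name and value here are just for illustration):
# tune2fs -l /dev/sda2 | grep "Block size"
Block size:               4096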

Flushing VFS
It is possible to manually request a flush of clean entries from the VFS caches through the /proc file system.
Free page cache
# echo 1 > /proc/sys/vm/drop_caches
Free dentries and inodes
# echo 2 > /proc/sys/vm/drop_caches
Free page cache, dentries, and inodes
# echo 3 > /proc/sys/vm/drop_caches
While not required, it is a good idea to first run sync to force any dirty blocks to disk. An unmount and remount will also flush out all cache entries, but that can be disruptive depending on other system functions. This can be a useful tool when performing disk based benchmark exercises.
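Putting that together, a typical sequence before a read benchmark might look something like the following; the test file path is just a placeholder:
# sync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/test.img of=/dev/null bs=1M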

slabtop
slabtop provides a view of overall kernel memory allocation, and within that are some statistics specific to VFS: items such as the number of inodes, dentries, and buffer_heads (a wrapper around the BIO) are all available.
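Running slabtop once in batch mode and filtering for the VFS related caches gives a quick snapshot; the counts and even the exact cache names will vary by kernel and file system, so treat this as illustrative:
# slabtop -o | egrep 'dentry|inode_cache|buffer_head'
 95543  72108  75%    0.19K   4550       21     18200K dentry
 31220  28904  92%    0.98K   7805        4     31220K ext4_inode_cache
 46878  40311  86%    0.10K   1202       39      4808K buffer_head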

Saturday, June 5, 2010

Linux Logical Volume Management

There are a few reasons for using Logical Volume Management: extending the capacity of a file system beyond the available physical spindles by spanning disks, having more dynamic control over disk capacity, for example by adding or removing a drive, or creating backups in the form of snapshots. LVM can be applied against any block device such as a physical drive, a software raid, or an external hardware raid device. The file system is still separate; however, it must be managed in conjunction with LVM to make use of the available blocks appropriately.

In general there are three basic components:

Physical Disk
  • Initially, each drive is simply marked as available for use in a volume group. This writes a Universally Unique Identifier (UUID) to the initial sectors of the disk and prepares it to receive a volume group

Volume Group
  • A collection of physical disks (or partitions if desired). When created this will designate physical extents to all of its member disks, the default being 4MB. It will also record information about all other physical disks in the group and any logical volumes present.

Logical Volume
  • Most of the work happens at this layer. A logical volume is a mapping between a set of physical extents (PEs) from the disk and a set of logical extents (LEs). These are always the same size and generally the quantity matches one to one. However, it is possible to have two PEs mapping to one LE if mirroring is used.

In the example shown, there is one volume group with two physical drives and two logical volumes mapped. Physical extents that are not assigned to a logical volume are free and can be used to expand either logical volume at a later time.
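On a live system, the quickest way to see this physical-to-logical layout is with the summary commands pvs, vgs, and lvs. The volume and device names below are made up purely to mirror the example just described, and the output format can vary slightly between lvm2 versions:
# pvs
  PV         VG   Fmt  Attr PSize  PFree
  /dev/sdb   vg01 lvm2 a-   10.00G  2.00G
  /dev/sdc   vg01 lvm2 a-   10.00G  4.00G
# vgs
  VG   #PV #LV #SN Attr   VSize  VFree
  vg01   2   2   0 wz--n- 19.99G  6.00G
# lvs
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%
  lv01 vg01 -wi-ao 8.00G
  lv02 vg01 -wi-ao 6.00G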



Creating a Logical Drive
I am not going to bother with mirrored or striped volumes. You could make a case for a stripe to increase performance; however, in general I believe it is better to use either the hardware or software raid functions available, as they are better suited for that purpose. The steps are fairly simple: mark the device with pvcreate, create a volume group, and then assign a logical volume. Depending on how big your volume group is, you may want to consider altering the default physical extent size. The man page for vgcreate states that if the volume group metadata uses the lvm2 format those restrictions [65534 extents in each logical volume] do not apply, but having a large number of extents will slow down the tools while having no impact on I/O performance to the logical volume. So if I were creating a terabyte or larger volume, it's probably a good idea to increase this to 64MB or even 128MB.

# pvcreate /dev/sdb
No physical volume label read from /dev/sdb
Physical volume "/dev/sdb" successfully created
# pvcreate /dev/sdc
No physical volume label read from /dev/sdc
Physical volume "/dev/sdc" successfully created
# vgcreate -s 16M datavg /dev/sdb /dev/sdc
Volume group "datavg" successfully created
# pvdisplay /dev/sdb
--- Physical volume ---
PV Name /dev/sdb
VG Name datavg
PV Size 10.00GB / not Usable 16.00MB
Allocatable Yes
PE Size (KByte) 16384
Total PE 639
Free PE 639
Allocated PE 0
PV UUID apk7wQ-V9B2-vHVo-L5Yz-81U0-orx7-F8J0MI
# vgdisplay datavg
--- Volume group ---
VG Name datavg
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 2
Act PV 2
VG Size 19.97GB
PE Size 16.00MB
Total PE 1278
Alloc PE / Size 0 / 0
Free PE / Size 1278 / 19.97GB
VG UUID Glyv9C-qRog-YZVk-08nR-csMe-quMp-A3Ksby

As you can see in this example, the volume group named datavg has two member disks, each 10GB in size. I selected a different physical extent size not because I had to, but just to show how it is done. You will also notice that the usable PE space is one extent short of the total drive space. This is to accommodate the volume group metadata mentioned earlier. You can actually read this data yourself if you like.
# dd if=/dev/sdb of=vg_metadata bs=16M count=1
# strings vg_metadata

The last step is to create the logical volume itself. There are a myriad of options available depending on what you want to accomplish; the important ones are:

-L size[KMGTPE]
  • Specifies a size in kilobytes, megabytes, gigabytes, terabytes, petabytes, or exabytes. Let me know if you actually use the last two.

-l size
  • Specifies the size in extents, in this case 16MB each. You can also specify it as a percentage of the volume group, of the free space in the volume group, or of the free space on the physical volumes with %VG, %FREE, or %PVS respectively.

-n string
  • Gives a name to your logical volume

-i Stripes
  • Number of stripes to use. As I mentioned earlier, you should probably use raid to perform this functionality, but if you must, this should be equal to the number of spindles present in the volume group

-I stripeSize
  • The stripe depth in KB to use for each disk

Here is an example of a simple volume, and then a striped volume.

# lvcreate -L 5G -n datalv datavg
Logical volume "datalv" created
# lvdisplay
--- Logical volume ---
LV Name /dev/datavg/datalv
VG Name datavg
LV UUID fCoaFl-7aQY-CX5U-zDwO-at52-udkI-ke6CZn
LV Write Access read/write
# open 0
LV size 5.00GB
Current LE 320
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 1024
Block device 253:0
# lvcreate -L 10G -i 2 -I 64 -n stripedlv datavg

If you are going to use striped volumes, you should probably stick to striped volumes for the entire volume group, as striping requires the proper number of free extents on each physical volume. Once we have a volume we need a file system. For this exercise I am going to use ext4, but you can use what you like.

# mkfs.ext4 /dev/datavg/datalv
# mkdir /data
# mount /dev/datavg/datalv /data

Expanding a Logical Volume
# pvcreate /dev/sdd
# vgextend datavg /dev/sdd
Volume group "datavg" successfully extended
# lvresize -L 15G /dev/datavg/datalv
Extending logical volume datalv to 15.00 GB
Logical volume datalv successfully resized
# resize2fs /dev/datavg/datalv
resize2fs 1.41.9 (22-Aug-2009)
Resizing the filesystem on /dev/datavg/datalv to 3932160 (4k) blocks.
The filesystem on /dev/datavg/datalv is now 3932160 blocks long.

Depending on the state of your file system, you may not be able to expand online. You can check the output of tune2fs to ensure reserved GDT blocks have been set aside; without those you will certainly have to resize offline. For example, tune2fs -l /dev/datavg/datalv. You may also get a warning to run e2fsck first. The man page warns against running this online, so again you are probably best served by unmounting the file system first. If this is a system disk, that generally means dropping back down to single user mode.
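Grepping that output for the GDT line makes the check easy; a non-zero count like the one below (the number itself is illustrative) means online growth should be possible:
# tune2fs -l /dev/datavg/datalv | grep -i gdt
Reserved GDT blocks:      255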

Reducing a Logical Volume
Before embarking on this journey, ensure you shrink the file system first, which, for the ext series anyway, means you have to have it unmounted. Once that is done you can go ahead and shrink the logical volume as shown here.

# umount /data
# resize2fs /dev/datavg/datalv 10g
resize2fs 1.41.9 (22-Aug-2009)
Resizing the filesystem on /dev/datavg/datalv to 2621440 (4k) blocks.
The filesystem on /dev/datavg/datalv is now 2621440 blocks long.
# lvreduce -L 10g /dev/datavg/datalv
WARNING: Reducing active and open logical volume to 10.00 GB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce datalv? [y/n]: y
Reducing logical volume datalv to 10.00 GB
Logical volume datalv successfully resized

Again you may be prompted to check your file system, but it's unmounted anyway, so it shouldn't be a problem. If the file system is highly fragmented the resize process can take quite a while, so be prepared.

Snapshots
Another benefit of LVM is the ability to take point-in-time images of your file system. Snaps use a copy-on-write technology where a block that is about to be overwritten or changed is first copied to a new location and then allowed to be altered. This can cause a performance problem on writes, which can compound as more snaps are added, so bear that in mind. You will also have to set aside some space within the volume group for this purpose. The amount really depends on how many changes you are making, but 10-20% is probably a good starting point. For this example I am going to use 1G as I don't expect many changes.

# lvcreate -L 1g -s -n datasnap1 /dev/datavg/datalv 
Logical volume "datasnap" created

Notice the -s option for snapshot and that the target isn't the volume group but rather the desired logical volume. It appears there is a bug in OpenSuSE, which may be present in other distributions, that prevents the snap from being registered with the event monitor that alerts when it is full or reaching capacity. If you get the message below you will have to upgrade both the lvm2 and device-mapper packages, as lvm2 was compiled against the wrong library versions.

OpenSuSE error:
datavg-datasnap: event registration failed: 10529:3 libdevmapper-event-lvm2snapshot.so.2.02 dlopen failed: /lib64/libdevmapper-event-lvm2snapshot.so.2.02: undefined symbol: lvm2_run
datavg/snapshot0: snapshot segment monitoring function failed.


To use your new snap, simply mount it like you would any other file system with mount /dev/datavg/datasnap1 /datasnap. You can view the snap usage through lvdisplay via the Allocated to snapshot field.

# lvdisplay /dev/datavg/datasnap1
--- Logical volume ---
LV Name /dev/datavg/datasnap1
VG Name datavg
LV UUID 82IA4M-Md6s-MEI6-iNPW-6wFb-8pzD-eCQqmS
LV Write Access read/write
LV snapshot status active destination for /dev/datavg/datalv
LV Status available
# open 0
LV Size 20.00 GB
Current LE 5120
COW-table size 1.00 GB
COW-table LE 256
Allocated to snapshot 68.68%
Snapshot chunk size 4.00 KB
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0

If the snap reserve space fills completely it will not be deleted, but it will be marked invalid and cannot be read from, even if it is currently mounted. Snaps aren't good forever, but as a point-in-time image they can be invaluable for specific backup scenarios like quick reference points for database backups. Instead of streaming the active file system to tape you can quiesce the database, snap it, return it to normal operations, and then perform a backup from the snapshot.
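A minimal sketch of that flow might look like the following. The snapshot size, mount point, and archive path are all assumptions, and the quiesce step is whatever your database requires:
# lvcreate -L 2G -s -n dbsnap /dev/datavg/datalv
# mkdir -p /mnt/dbsnap
# mount -o ro /dev/datavg/dbsnap /mnt/dbsnap
# tar czf /backup/data-snapshot.tar.gz -C /mnt/dbsnap .
# umount /mnt/dbsnap
# lvremove -f /dev/datavg/dbsnap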

Moving Volume Groups
A handy utility that I have used many times under AIX is also available under Linux: the ability to move a volume group from one system to the next.

# umount /data
# vgchange -an datavg
0 logical volume(s) in volume group "datavg" now active
# vgexport datavg
Volume group "datavg" successfully exported

Shut down the machine before removing the disks and assigning them to another machine.
# pvscan
PV /dev/sdb is in exported VG datavg [10.00 GB / 0 free]
PV /dev/sdc is in exported VG datavg [10.00 GB / 1.99 GB free]
Total: 2 [19.99 GB] / in use: 2 [19.99 GB] / in no VG: 0 [0 ]
# vgscan
Reading all physical volumes. This may take a while...
Found exported volume group "datavg" using metadata type lvm2
# vgimport datavg
Volume group "datavg" successfully imported

After activating the volume group, you should be able to mount your file system on the new machine.
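The activation step is just vgchange with the -a flag; something along these lines (the exact message format may differ):
# vgchange -ay datavg
  1 logical volume(s) in volume group "datavg" now active
# mount /dev/datavg/datalv /data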

Other Commands
Some other important commands for volume management:
# lvremove logical_volume_path
removes a logical volume (or snapshot)
e.g. lvremove /dev/datavg/datasnap1
# pvmove device
moves data from an existing drive to free extents on other disks in the volume group
e.g. pvmove /dev/sdc
# vgreduce volume_group device
removes a device from a volume group
e.g. vgreduce datavg /dev/sdc
# pvremove device
removes a physical device from lvm
e.g. pvremove /dev/sdc