Sunday, June 20, 2010

Linux Virtual File System (VFS)

Every file system under Linux is presented to a user process not directly but through a virtual file system (VFS) layer. This allows the underlying file system to change, for example from reiserfs to xfs to ext4, without having to change any application code. Each available file system is backed by a kernel driver, either built into the kernel or loaded as a module. The driver is responsible for the low-level operations and for providing standard information back to the VFS layer. You can see which file systems have registered by looking at /proc/filesystems.
# cat /proc/filesystems
nodev   sysfs
nodev   rootfs
nodev   bdev
nodev   proc
nodev   tmpfs
nodev   devtmpfs
nodev   debugfs
nodev   securityfs
nodev   sockfs
nodev   usbfs
nodev   pipefs
nodev   anon_inodefs
nodev   inotifyfs
nodev   devpts
        ext3
        ext2
nodev   ramfs
nodev   hugetlbfs
        iso9660
nodev   mqueue
        ext4
nodev   fuse
        fuseblk
nodev   fusectl
nodev   vmblock
The first column indicates whether the file system requires a block device: nodev marks those that do not. The second column is the file system name as registered with the kernel.
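
The two-column layout is simple enough to parse by hand. As a minimal sketch (the function name parse_filesystems and the SAMPLE text are illustrative, not part of any library), the following turns each line into a name and a flag saying whether a block device is required:

```python
def parse_filesystems(text):
    """Return {fs_name: needs_block_device} from /proc/filesystems content."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Lines are either "nodev <name>" or just "<name>".
        if fields[0] == "nodev" and len(fields) == 2:
            result[fields[1]] = False
        else:
            result[fields[-1]] = True
    return result


# Shortened sample in the same format as the listing above.
SAMPLE = "nodev\tsysfs\nnodev\tproc\n\text3\n\text4\nnodev\tfuse\n\tfuseblk\n"

if __name__ == "__main__":
    # On a Linux system read the live file; otherwise fall back to SAMPLE.
    try:
        with open("/proc/filesystems") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    for name, needs_dev in sorted(parse_filesystems(text).items()):
        print(f"{name:15} {'block device' if needs_dev else 'nodev'}")
```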

When a file system is mounted, the mount command always passes three pieces of information to the kernel: the physical block device, the mount point, and the file system type. However, we generally don't specify the file system type on the command line, and the mount(8) man page shows that it is optional. So how does the kernel know which module to load? As it turns out, mount makes a library call to libblkid, which is capable of detecting quite a range of file system types. There is also a user space program built on libblkid, aptly named blkid. Feel free to have a look at the source for blkid to see the full file system list. You can also run it against your local system to see the results it produces.
# blkid /dev/sdb1
/dev/sdb1: UUID="06749374749364E9" TYPE="ntfs"
# blkid /dev/sda1
/dev/sda1: UUID="207abd21-25b1-43bb-81d3-1c8dd17a0600" TYPE="swap"
# blkid /dev/sda2
/dev/sda2: UUID="67ea3939-e60b-4056-9465-6102df51c532" TYPE="ext4"
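
Detection of this kind works by looking for known signatures at fixed offsets on the device. The sketch below is a rough illustration of that idea only, not how libblkid is actually implemented; detect_fs_type is a made-up name, and the handful of signatures it checks (NTFS, XFS, Linux swap, and the ext family) are the widely documented on-disk magic values. It cannot tell ext2, ext3, and ext4 apart, which requires inspecting feature flags.

```python
import struct


def detect_fs_type(data, page_size=4096):
    """Guess a file system type from the first bytes of a device image."""
    # NTFS: OEM ID "NTFS    " at byte offset 3 of the boot sector.
    if data[3:11] == b"NTFS    ":
        return "ntfs"
    # XFS: magic "XFSB" at the very start of the superblock.
    if data[0:4] == b"XFSB":
        return "xfs"
    # Linux swap: signature in the last 10 bytes of the first page.
    if data[page_size - 10:page_size] in (b"SWAPSPACE2", b"SWAP-SPACE"):
        return "swap"
    # ext2/3/4: little-endian 0xEF53 at offset 56 of the superblock,
    # which itself starts 1024 bytes into the device.
    if len(data) >= 1082:
        (magic,) = struct.unpack_from("<H", data, 1024 + 56)
        if magic == 0xEF53:
            return "ext2/3/4"
    return None
```

To try it on a real device, read the first 8 KB (this normally needs root) and pass the bytes in, for example detect_fs_type(open("/dev/sda2", "rb").read(8192)).
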
Of course, if blkid isn't able to determine the type, mount fails with the error "mount: you must specify the filesystem type" and the type has to be given by hand with the -t option. Now if we look at an strace of a mount command we can see the system call in action. The first example is a standard file system requiring a block device; the second is sysfs. Notice how mount still passes the three options.
# strace mount -o loop /test.img /mnt
...
stat("/sbin/mount.vfat", 0x7fff1bd75b80) = -1 ENOENT (No such file or directory)
mount("/dev/loop0", "/mnt", "vfat", MS_MGC_VAL, NULL) = 0
...

# strace mount -t sysfs sys /sys
...
stat("/sbin/mount.sysfs", 0x7fff21628c30) = -1 ENOENT (No such file or directory)
mount("/sys", "/sys", "sysfs", MS_MGC_VAL, NULL) = 0
...
Looking at the system call mount(2), we can see there are actually five required arguments: source, target, file system type, mount flags, and data. The mount flag in this case is MS_MGC_VAL, a magic number that has been ignored since the 2.4 kernel, but there are several other flags that will look familiar. Have a look at the man page for a full list.
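
To make the flags argument less opaque, here is a small sketch that decodes a mountflags value into names. The constants are the values from the Linux headers, but only a subset is listed, and decode_mount_flags is a hypothetical helper written for this post rather than an existing API.

```python
MS_MGC_VAL = 0xC0ED0000  # legacy magic number carried in the top 16 bits
MS_MGC_MSK = 0xFFFF0000

# A subset of the flags documented in mount(2).
MOUNT_FLAGS = {
    "MS_RDONLY": 1,
    "MS_NOSUID": 2,
    "MS_NODEV": 4,
    "MS_NOEXEC": 8,
    "MS_SYNCHRONOUS": 16,
    "MS_REMOUNT": 32,
    "MS_NOATIME": 1024,
    "MS_BIND": 4096,
}


def decode_mount_flags(flags):
    """Return the names of the known flags set in a mountflags value."""
    # Discard the legacy magic number if it is present.
    if (flags & MS_MGC_MSK) == MS_MGC_VAL:
        flags &= ~MS_MGC_MSK
    return [name for name, bit in MOUNT_FLAGS.items() if flags & bit]
```

With this, the MS_MGC_VAL seen in the strace output decodes to an empty list, which matches the plain read-write mount that was requested.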

The kernel can now request the proper driver (loaded on demand through modprobe if it is not already present; the kerneld daemon that once handled this was retired in the 2.2 kernel), which reads the superblock from the physical device and initializes its internal variables. There are several fundamental data types held within VFS as well as multiple caches to speed data access.

Superblock
Every mounted file system has a VFS superblock, which holds the key records needed to retrieve full file system information. It identifies the device the file system lives on, its block size, the file system type, a pointer to the root directory of the file system (a dentry pointer), and a pointer to file system specific methods. These methods map generic functions onto file system specific ones. For example, a read inode call can be referenced generically under VFS but carried out by a file system specific routine. Applications are therefore able to make common system calls regardless of the underlying structure. It also means VFS is able to cache certain lookup data for performance and provide generic features like chroot for all file systems.
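
The mapping from a generic call to a file system specific one can be pictured as a table of functions hanging off the superblock. The following is a toy model only: every name in it (SuperBlock, vfs_read_inode, ext4_read_inode, xfs_read_inode) is invented for illustration, and the real kernel does this in C with structures of function pointers.

```python
class SuperBlock:
    """Toy stand-in for a VFS superblock."""

    def __init__(self, device, block_size, fs_type, ops):
        self.device = device
        self.block_size = block_size
        self.fs_type = fs_type
        self.ops = ops  # table of file system specific methods


def ext4_read_inode(sb, ino):
    return f"ext4 inode {ino} from {sb.device}"


def xfs_read_inode(sb, ino):
    return f"xfs inode {ino} from {sb.device}"


def vfs_read_inode(sb, ino):
    """Generic entry point: dispatch to whatever the file system registered."""
    return sb.ops["read_inode"](sb, ino)


if __name__ == "__main__":
    root = SuperBlock("/dev/sda2", 4096, "ext4", {"read_inode": ext4_read_inode})
    data = SuperBlock("/dev/sdc1", 4096, "xfs", {"read_inode": xfs_read_inode})
    # The caller uses the same function for both file systems.
    print(vfs_read_inode(root, 2))
    print(vfs_read_inode(data, 128))
```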

Inodes
An index node (inode) contains the metadata for a file, and in Linux everything is a file. Each VFS inode is kept only in the kernel's memory, and its contents are built from the underlying file system. It contains the following attributes: device, inode number, access mode (permissions), usage count, user id (owner), group id (group), rdev (if it's a special file), access time, modify time, change time (ctime, often mistaken for a creation time), size, blocks, block size, a lock, and a dirty flag.
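
Most of these attributes are visible from user space through the stat family of system calls. As a short sketch (inode_summary is just a helper name chosen here), Python's os.stat exposes them directly. The in-kernel usage count, lock, and dirty flag are not exported, and the link count shown is the number of hard links, which is a different thing from the usage count.

```python
import os
import stat
import tempfile


def inode_summary(path):
    """Collect the inode attributes that stat(2) reports for a path."""
    st = os.stat(path)
    return {
        "device": st.st_dev,
        "inode": st.st_ino,
        "mode": stat.filemode(st.st_mode),
        "links": st.st_nlink,        # hard links, not the kernel usage count
        "uid": st.st_uid,
        "gid": st.st_gid,
        "rdev": st.st_rdev,          # only meaningful for special files
        "size": st.st_size,
        "blocks": st.st_blocks,      # counted in 512-byte units
        "block_size": st.st_blksize,
        "atime": st.st_atime,
        "mtime": st.st_mtime,
        "ctime": st.st_ctime,        # last inode change, not creation
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello")
        f.flush()
        for key, value in inode_summary(f.name).items():
            print(f"{key:10} {value}")
```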

The inode number combined with the mounted device is used as the key into a hash table for quick lookups. When a command like ls requests an inode that is already cached, its usage counter is increased and operations continue. If it is not found, a free VFS inode must be obtained so that the file system can read it into memory. There are two options: new memory can be allocated, or, if the inode cache is full, an existing inode can be reused, chosen from those with a usage count of zero. Once a VFS inode is available, a file system specific method is called to read the inode from disk and populate its data as required.
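
The lookup and reuse logic just described can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the kernel's algorithm: InodeCache, get, put, and the read_from_disk callback are all hypothetical names, and a plain dictionary keyed by (device, inode number) stands in for the hash table.

```python
class InodeCache:
    """Toy inode cache keyed by (device, inode number)."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk  # stands in for the fs method
        self.table = {}                       # (device, ino) -> [data, count]

    def get(self, device, ino):
        key = (device, ino)
        entry = self.table.get(key)
        if entry is None:
            # Cache miss: make room if needed, then read from the file system.
            if len(self.table) >= self.capacity:
                self._reuse_unused()
            entry = [self.read_from_disk(device, ino), 0]
            self.table[key] = entry
        entry[1] += 1  # one more user of this inode
        return entry[0]

    def put(self, device, ino):
        """Release one reference; the inode stays cached for later reuse."""
        self.table[(device, ino)][1] -= 1

    def _reuse_unused(self):
        # Pick any cached inode whose usage count has dropped to zero.
        victim = next((k for k, (_, n) in self.table.items() if n == 0), None)
        if victim is None:
            raise MemoryError("no unused inode available for reuse")
        del self.table[victim]
```

A quick way to exercise it is to create a cache with a capacity of two, fetch two inodes, release one with put, and fetch a third: the released inode is the one that gets replaced.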

Dentries
A directory entry (dentry) is responsible for managing the file system tree structure. Each dentry ties a file name to its inode and also records the parent (containing) directory, the superblock for the file system, and a list of subdirectories. With both the parent and the list of subdirectories kept in each dentry, the chain can be followed in either direction, allowing commands to traverse the full tree quickly. As with inodes, directory entries are cached for quick lookups, although instead of a usage count the cache uses a Least Recently Used model. There is also an in-depth article on locking and scalability of the directory entry cache found here.
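
To illustrate how the parent and child links allow traversal in both directions, here is a toy model. The Dentry class and lookup function are illustrative names only, and caching and locking are left out entirely. As in the kernel, the root entry is treated as its own parent.

```python
class Dentry:
    """Toy directory entry: a name, an inode number, and tree links."""

    def __init__(self, name, inode, parent=None):
        self.name = name
        self.inode = inode
        # The root entry points to itself as its parent.
        self.parent = parent if parent is not None else self
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Walk upward through the parent links to build the full path."""
        parts = []
        node = self
        while node.parent is not node:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def lookup(root, path):
    """Walk downward through the child links, one component at a time."""
    node = root
    for part in path.strip("/").split("/"):
        if part:
            node = node.children[part]
    return node
```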

Data Cache
Another vital service VFS provides is the ability to cache file level data as a series of memory pages. A page is a fixed size block of memory and is the smallest unit for both memory allocation and transfer between main memory and a data store such as a hard drive. Generally this is 4KB on an x64 based system; however, huge pages are supported in the 2.6 kernel, providing sizes as large as 1GB. You can find the page size for your system by typing getconf PAGESIZE; the result is in bytes.
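
The same value is available programmatically. A small sketch: os.sysconf("SC_PAGE_SIZE") returns the page size in bytes, and the helper pages_needed (a name made up here) shows the round-up arithmetic for how many pages a file of a given size would occupy in the cache.

```python
import os


def pages_needed(size_bytes, page_size):
    """Number of whole pages required to hold size_bytes of file data."""
    # Round up: a partially filled page still occupies a whole page.
    return (size_bytes + page_size - 1) // page_size


if __name__ == "__main__":
    page_size = os.sysconf("SC_PAGE_SIZE")
    print("page size:", page_size, "bytes")
    print("pages for a 10000 byte file:", pages_needed(10_000, page_size))
```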

When the overall memory of a system becomes strained, the kernel may decide to swap portions of memory out to disk. This can have a serious impact on application performance; however, there is a way to influence it: swappiness. Looking at /proc/sys/vm/swappiness shows the current value; a lower number means the system will swap less, a higher number means it will swap more. To tell the kernel to avoid swapping as far as possible, type:
# echo 0 > /proc/sys/vm/swappiness
To make this change persistent across a reboot, add the following line to /etc/sysctl.conf:
vm.swappiness=0
Note that a value of 0 does not turn swap off; the kernel can still swap when memory is nearly exhausted. You may not want to go that far in any case, so some testing to find the right balance may be in order.

A second layer of caching available to VFS is the buffer cache. Its job is to store copies of physical blocks from a disk. With the 2.4 kernel, a cache entry (referenced by a buffer_head) would contain a copy of one physical block; since version 2.6, however, a new structure called a BIO has been introduced. While the fundamentals remain the same, a BIO is also able to point to other buffers as a chain. This means blocks can be logically grouped into a larger entity such as an entire file, which improves performance for common application functions and allows the underlying systems to make better allocation choices.

The Big Picture
Here are the components described above put together.

Controlling VFS
vmstat
vmstat gives us lots of little gems about how the overall memory, CPU, and file system cache are behaving.
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 6987224  30792 313956    0    0    85    53  358  651  3  1 95  1  0
Of particular interest to VFS is the amount of free memory. From the discussion above, buff is the amount of memory holding cached block data and cache is the amount holding file data in the page cache; by default vmstat reports both in kilobytes rather than in bytes or pages. The amount of swap used (swpd) and active swap operations can have a significant performance impact and are also shown here, as memory swapped in from disk (si) and swapped out to disk (so) per second.

Other items shown are r for the number of processes waiting to be executed (run queue) and b for the number of processes blocked on I/O. Under system, in shows the number of interrupts per second and cs the number of context switches per second. Under io, bi and bo show the number of blocks in and out from physical disk. The block size for a given file system can be shown using stat -f, or tune2fs -l against a physical device.
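
When collecting these numbers over time it helps to turn the output into named fields. The sketch below does that for the plain vmstat output shown above; parse_vmstat is a hypothetical helper, and it assumes the default layout where the last two non-empty lines are the column names and one row of values.

```python
def parse_vmstat(output):
    """Map vmstat column names to the integer values on the last line."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    # The second-to-last line holds column names, the last one the values.
    names = lines[-2].split()
    values = [int(v) for v in lines[-1].split()]
    return dict(zip(names, values))


# The sample run from this post.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 6987224  30792 313956    0    0    85    53  358  651  3  1 95  1  0
"""

if __name__ == "__main__":
    stats = parse_vmstat(SAMPLE)
    print("free:", stats["free"], "buff:", stats["buff"], "cache:", stats["cache"])
```

On a live system the input could come from subprocess.run(["vmstat"], capture_output=True, text=True).stdout instead of SAMPLE.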

Flushing VFS
It is possible to manually request a flush of clean blocks from the VFS cache through the /proc file system.
Free page cache
# echo 1 > /proc/sys/vm/drop_caches
Free dentries and inodes
# echo 2 > /proc/sys/vm/drop_caches
Free page cache, dentries, and inodes
# echo 3 > /proc/sys/vm/drop_caches
While not required, it is a good idea to run sync first to force any dirty blocks to disk, since only clean entries are dropped. An unmount and remount will also flush out all cache entries but can be disruptive depending on other system functions. This can be a useful tool when performing disk based benchmark exercises.
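
For benchmark scripts, the sync and the write can be wrapped in one helper. This is a minimal sketch: drop_caches is a made-up function name, the path parameter exists only so the helper can be pointed at another file for a dry run, and writing to the real /proc file requires root.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(mode, path=DROP_CACHES):
    """Sync dirty data, then ask the kernel to drop clean cache entries.

    mode 1 frees the page cache, 2 frees dentries and inodes, 3 frees both.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()                   # write dirty data out first
    with open(path, "w") as f:  # needs root for the real /proc file
        f.write(f"{mode}\n")
```

Calling drop_caches(3) between benchmark runs gives each run a cold cache; drop_caches(3, path=os.devnull) exercises the helper without touching the system.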

slabtop
slabtop displays kernel slab allocation statistics, taken from /proc/slabinfo, and within that includes some statistics specific to VFS. Items such as the number of inodes, dentries, and buffer_head entries (a wrapper to BIO) are available.
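
The VFS related rows can also be pulled straight out of /proc/slabinfo. The sketch below assumes the version 2.1 layout, where the first four columns are the cache name, active objects, total objects, and object size. parse_slabinfo is a hypothetical helper, and the numbers in SAMPLE are invented purely to show the format. Note that many file systems keep their inodes in their own caches (ext4_inode_cache, for example), so the generic inode_cache row is not the whole picture.

```python
def parse_slabinfo(text, wanted=("dentry", "inode_cache", "buffer_head")):
    """Extract object counts for selected slab caches."""
    stats = {}
    for line in text.splitlines():
        # Skip the version banner and the column description line.
        if line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) >= 4 and fields[0] in wanted:
            stats[fields[0]] = {
                "active_objs": int(fields[1]),
                "num_objs": int(fields[2]),
                "objsize": int(fields[3]),
            }
    return stats


# Made-up values in the slabinfo 2.1 format, for illustration only.
SAMPLE = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
buffer_head        10000  10140    104   39    1 : tunables    0    0    0 : slabdata    260    260      0
dentry             50000  50400    192   21    1 : tunables    0    0    0 : slabdata   2400   2400      0
inode_cache         8000   8100    592   27    4 : tunables    0    0    0 : slabdata    300    300      0
"""

if __name__ == "__main__":
    for name, row in parse_slabinfo(SAMPLE).items():
        kb = row["num_objs"] * row["objsize"] // 1024
        print(f"{name:12} {row['active_objs']:>7} active, about {kb} KB")
```

Reading the live /proc/slabinfo usually requires root.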
