The Linux kernel implements the concept of Virtual File System (VFS, originally Virtual Filesystem Switch), so that it is (to a large degree) possible to separate actual "low-level" filesystem code from the rest of the kernel. The API of a filesystem is described below.
This API was designed with things closely related to the ext2 filesystem in mind. For very different filesystems, like NFS, there are all kinds of problems.
Four main objects: superblock, dentries, inodes, files
The kernel keeps track of files using in-core inodes ("index nodes"), usually derived by the low-level filesystem from on-disk inodes.
A file may have several names, and there is a layer of dentries ("directory entries") that represent pathnames, speeding up the lookup operation.
Several processes may have the same file open for reading or writing, and file structures contain the required information such as the current file position.
Access to a filesystem starts by mounting it. This operation takes a filesystem type (like ext2, vfat, iso9660, nfs) and a device and produces the in-core superblock that contains the information required for operations on the filesystem; a third ingredient, the mount point, specifies what pathname refers to the root of the filesystem.
Auxiliary objects
We have filesystem types, used to connect the name of the filesystem to the routines for setting it up (at mount time) or tearing it down (at umount time).
A struct vfsmount represents a subtree in the big file hierarchy - basically a pair (device, mountpoint).
A struct nameidata represents the result of a lookup.
A struct address_space gives the mapping between the blocks in a file and blocks on disk. It is needed for I/O.
Various objects play a role here. There are file systems, organized collections of files, usually on some disk partition. And there are filesystem types, abstract descriptions of the way data is organized in a filesystem of that type, like FAT16 or ext2. And there is code, perhaps a module, that implements the handling of file systems of a given type. Sometimes this code is called a low-level filesystem, low-level since it sits below the VFS just like low-level SCSI drivers sit below the higher SCSI layers.
A module implementing a filesystem type must announce its presence so that it can be used. Its task is (i) to have a name, (ii) to know how it is mounted, (iii) to know how to lookup files, (iv) to know how to find (read, write) file contents.
This announcing is done using the call register_filesystem(), either at kernel initialization time or when the module is inserted. There is a single argument, a struct that contains the name of the filesystem type (so that the kernel knows when to invoke it) and a routine that can produce a superblock. The struct is of type struct file_system_type. Here the 2.2.17 version:
struct file_system_type {
    const char *name;
    int fs_flags;
    struct super_block *(*read_super) (struct super_block *, void *, int);
    struct file_system_type *next;
};
The call register_filesystem() hangs this struct in the chain with head file_systems, and unregister_filesystem() removes it again. Accesses to this chain are protected by the spinlock file_systems_lock. There are no other writers. The main reader is of course the mount() system call (via get_fs_type()). Other readers are get_filesystem_list(), used for /proc/filesystems, and the sysfs system call. The code is in fs/filesystems.c.
static struct file_system_type tue_fs_type = {
    .owner = THIS_MODULE,
    .name = "tue",
    .get_sb = tue_get_sb,
    .kill_sb = kill_block_super,
    .fs_flags = FS_REQUIRES_DEV,
};

static int __init init_tue_fs(void)
{
    return register_filesystem(&tue_fs_type);
}

static void __exit exit_tue_fs(void)
{
    unregister_filesystem(&tue_fs_type);
}
struct file_system_type {
    const char *name;
    int fs_flags;
    struct super_block *(*get_sb)(struct file_system_type *, int,
                                  char *, void *, struct vfsmount *);
    void (*kill_sb) (struct super_block *);
    struct module *owner;
    struct file_system_type *next;
    struct list_head fs_supers;
    struct lock_class_key s_lock_key;
    struct lock_class_key s_umount_key;
};
(In 2.4 there was no kill_sb(), and the role of get_sb() was taken by read_super(). The final parameter of get_sb() and the lock_class_key fields are present since 2.6.18.)
Let us look at the fields of the struct file_system_type. The name field gives the filesystem type its name ("tue"), so that the kernel can find it when someone does mount -t tue /dev/foo /dir. (The name is the third parameter of the mount system call.) It must be non-NULL. The name string lives in module space. Access must be protected either by a reference to the module, or by the file_systems_lock.
At mount time the kernel calls the fstype->get_sb() routine, which initializes things and sets up a superblock. It must be non-NULL. Typically this is a 1-line routine that calls one of get_sb_bdev, get_sb_single, get_sb_nodev, get_sb_pseudo.
The routines get_sb_single and get_sb_nodev are almost identical. Both are for virtual filesystems. The former is used when there can be at most one instance of the filesystem. (An old instance is reused if there is one, but its flags may be changed.)
At umount time the kernel calls the fstype->kill_sb() routine to clean up. It must be non-NULL. Typically it is one of kill_block_super, kill_anon_super, kill_litter_super. The first is normal for filesystems backed by block devices. The second is for virtual filesystems, where the information is generated on the fly. The third is for in-memory filesystems without backing store - they need an additional dget() when a file is created (so that their dentries always have a nonzero reference count and are not garbage collected), and the d_genocide() that is the difference between kill_anon_super and kill_litter_super does the balancing dput().
The fs_flags field of a struct file_system_type is a bitmap, an OR of several possible flags, most with rather obscure uses. The flags are defined in fs.h. This field was introduced in 2.1.43. The number of flags, and their meanings, varies. In 2.6.19 there are four flags: FS_REQUIRES_DEV, FS_BINARY_MOUNTDATA, FS_REVAL_DOT, FS_RENAME_DOES_D_MOVE.
The FS_REQUIRES_DEV flag (since 2.1.43) says that this is not a virtual filesystem - an actual underlying block device is required. It is used in only two places: when /proc/filesystems is generated, its absence causes the filesystem type name to be prefixed by "nodev"; and in fs/nfsd/export.c this flag is tested in the process of determining whether the filesystem can be exported via NFS. Earlier there were more uses.
The FS_BINARY_MOUNTDATA flag (since 2.6.5) is set to tell the selinux code that the mount data is binary, and cannot be handled by the standard option parser. (This flag is set for afs, coda, nfs, smbfs.)
The FS_REVAL_DOT flag (since 2.6.0test4) is set to tell the VFS code (in namei.c) to revalidate the paths "/", ".", ".." since they might have gone stale. (This flag is set for NFS.)
The FS_RENAME_DOES_D_MOVE flag (since 2.6.19) says that the low-level filesystem will handle d_move() during a rename(). Earlier (2.4.0test6-2.6.19) this was called FS_ODD_RENAME and was used for NFS only, but now this is also useful for ocfs2. See also the discussion of silly rename.
The FS_NOMOUNT flag (2.3.99pre7-2.5.22) says that this filesystem must never be mounted from userland, but is used only kernel-internally. This was used, for example, for pipefs, the implementation of Unix pipes using a kernel-internal filesystem (see fs/pipe.c). Even though the flag has disappeared, the concept remains, and is now represented by the MS_NOUSER flag.
The FS_LITTER flag (2.4.0test3-2.5.7) says that after umount a d_genocide() is needed. This will remove one reference from all dentries in that tree, probably killing all of them, which is necessary in case the dentries already got reference count 1 at creation time. (This is typically done for an in-core filesystem where dentries cannot be recreated when needed.) This flag disappeared in Linux 2.5.7 when the explicit kill_super method kill_litter_super was introduced.
The FS_SINGLE flag (2.3.99pre7-2.5.4) says that there is only a single superblock for this filesystem type, so that only a single instance of this filesystem may exist, possibly mounted in several places.
The FS_IBASKET flag was defined in 2.1.43 but never used, and the definition disappeared in 2.3.99pre4.
The FS_NO_DCACHE and FS_NO_PRELIM flags were introduced in 2.1.43, but were a mistake and disappeared again in 2.1.44. However, the definitions survived until Linux 2.5.22. For the intended purpose of these flags, see the comment in 2.1.43:dcache.c.
The owner field of a struct file_system_type points at the module that owns this struct. When doing things that might sleep, we must make sure that the module is not unloaded while we are using its data, and we do this with try_inc_mod_count(owner). If this fails then the module was just unloaded. If it succeeds we have incremented a reference count so that the module will not go away before we are done. This field is NULL for filesystems compiled into the kernel.
There exists a strange SYSV system call sysfs that will return (i) a sequence number given a filesystem type, (ii) a filesystem type given a sequence number, and (iii) the total number of filesystem types registered now. This call is not supported by libc or glibc. These sequence numbers are rather meaningless since they may change at any moment. Still, this call allows one to get a snapshot of the list of filesystem types without looking at /proc/filesystems.
For example, the program
#include <stdio.h>
#include <stdlib.h>
#include <linux/unistd.h>

/* define the 3-arg version of sysfs() */
static _syscall3(int, sysfs, int, option, unsigned int, fsindex, char *, buf);

/* define the 1-arg version of sysfs() */
static int sysfs1(int i) {
    return sysfs(i, 0, NULL);
}

int main(void) {
    int i, tot;
    char buf[100];      /* how long is a filesystem type name?? */

    tot = sysfs1(3);
    if (tot == -1) {
        perror("sysfs(3)");
        exit(1);
    }
    for (i = 0; i < tot; i++) {
        if (sysfs(2, i, buf)) {
            perror("sysfs(2)");
            exit(1);
        }
        printf("%2d: %s\n", i, buf);
    }
    return 0;
}
might give output like
0: ext2
1: minix
2: romfs
3: msdos
4: vfat
5: proc
6: nfs
7: smbfs
8: iso9660
The kernel code for copying the names to user space is instructive:
static int fs_name(unsigned int index, char *buf)
{
    struct file_system_type *tmp;
    int len, res;

    read_lock(&file_systems_lock);
    for (tmp = file_systems; tmp; tmp = tmp->next, index--)
        if (index <= 0 && try_inc_mod_count(tmp->owner))
            break;
    read_unlock(&file_systems_lock);
    if (!tmp)
        return -EINVAL;

    /* OK, we got the reference, so we can safely block */
    len = strlen(tmp->name) + 1;
    res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
    put_filesystem(tmp);
    return res;
}
In order to walk safely along a linked list we need the read lock. The routines that change links (like register_filesystem) need a write lock. Once the filesystem name with the desired index is found, we cannot just copy this name to user space. Maybe the page we want to copy to was swapped out, and getting it back in core takes some time, and maybe the module is unloaded just at that point, and then, when we want to read the name, we would reference memory that is no longer present. The routine try_inc_mod_count() first takes the module unload lock, then checks whether the module is still present; if so, it increments the module's refcount and returns 1 (after releasing the unload lock); otherwise it returns 0. After a successful return of try_inc_mod_count() we own a reference to the module, so that it cannot disappear while we are doing copy_to_user(). The put_filesystem() decreases the module's refcount again.
So this is how the owner field is used: it tells which module must be pinned when we do something with this struct. A module stays as long as its refcount is positive, but can disappear at any moment when the refcount becomes zero.
In fs/filesystems.c there is a global variable

    static struct file_system_type *file_systems;

that is the head of the list of known filesystem types. A register_filesystem adds the filesystem to the linked list, an unregister_filesystem removes it again.
The field next is the link in this singly linked list. It must be NULL when register_filesystem is called, and is reset to NULL by unregister_filesystem. The list is protected by the file_systems_lock.
The fs_supers field of a struct file_system_type is the head of a list of all superblocks of this type. In each superblock the corresponding link is called s_instances. This list is protected by the spinlock sb_lock. It is used in sget() for filesystems like NFS where we get a filehandle and must check for each superblock of the given type whether it is the right one.
The s_lock_key and s_umount_key fields are used when CONFIG_LOCKDEP is defined, and take no space otherwise. They are used for lock validation.
The mount system call attaches a filesystem to the big file hierarchy at some indicated point. Ingredients needed: (i) a device that carries the filesystem (disk, partition, floppy, CDROM, SmartMedia card, ...), (ii) a directory where the filesystem on that device must be attached, (iii) a filesystem type.
In many cases it is possible to guess (iii) given the bits on the device, but heuristics fail in rare cases. Moreover, sometimes there is no difference on the device, as for example in the case where a FAT filesystem without long filenames must be mounted. Is it msdos? or vfat? That information is only in the user's head. If it must be used later in an environment that cannot handle long filenames it should be mounted as msdos; if files with long names are going to be copied to it, as vfat.
The kernel does not guess (except perhaps at boot time, when the root device has to be found), and requires the three ingredients. In fact the mount system call has five parameters: there are also mount flags (like "read-only") and options, like for ext2 the choice between errors=continue, errors=remount-ro and errors=panic.
The code for sys_mount() is found in fs/namespace.c and fs/super.c. The connection with the filesystem type name is made in do_kern_mount():

    struct file_system_type *type = get_fs_type(fstype);
    struct super_block *sb;

    if (!type)
        return ERR_PTR(-ENODEV);
    sb = type->get_sb(type, flags, name, data);

and this is the only call of the get_sb() routine.
The code for sys_umount() is found in fs/namespace.c and fs/super.c. The counterpart of the just quoted code is the cleanup in deactivate_super():

    fs->kill_sb(s);

and this is the only call of the kill_sb() routine.
The superblock gives global information on a filesystem: the device on which it lives, its block size, its type, the dentry of the root of the filesystem, the methods it has, etc., etc.
struct super_block {
    dev_t s_dev;
    unsigned long s_blocksize;
    struct file_system_type *s_type;
    struct super_operations *s_op;
    struct dentry *s_root;
    ...
}
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*read_inode) (struct inode *);
    void (*dirty_inode) (struct inode *);
    void (*write_inode) (struct inode *, int);
    void (*put_inode) (struct inode *);
    void (*drop_inode) (struct inode *);
    void (*delete_inode) (struct inode *);
    void (*put_super) (struct super_block *);
    void (*write_super) (struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    void (*write_super_lockfs) (struct super_block *);
    void (*unlockfs) (struct super_block *);
    int (*statfs) (struct super_block *, struct statfs *);
    int (*remount_fs) (struct super_block *, int *, char *);
    void (*clear_inode) (struct inode *);
    void (*umount_begin) (struct super_block *);
    int (*show_options)(struct seq_file *, struct vfsmount *);
};
This is enough to get started: the dentry of the root directory tells us the inode of this root directory (and in particular its i_ino), and sb->s_op->read_inode(inode) will read this inode from disk. Now inode->i_op->lookup() allows us to find names in the root directory, etc.
Each superblock is on six lists, with links through the fields s_list, s_dirty, s_io, s_anon, s_files, s_instances, respectively.
All superblocks are collected in a list super_blocks with links in the fields s_list. This list is protected by the spinlock sb_lock. The main use is in super.c:get_super() or user_get_super() to find the superblock for a given block device. (Both routines are identical, except that one takes a bdev, the other a dev_t.) This list is also used in various places where all superblocks must be sync'ed or all dirty inodes must be written out.
All superblocks of a given type are collected in a list headed by the fs_supers field of the struct file_system_type, with links in the fields s_instances. Also this list is protected by the spinlock sb_lock. See above.
All open files belonging to a given superblock are chained in a list headed by the s_files field of the superblock, with links in the fields f_list of the files. These lists are protected by the spinlock files_lock. This list is used for example in fs_may_remount_ro() to check that there are no files currently open for writing. See also below.
Normally, all dentries are connected to root. However, when NFS filehandles are used this need not be the case. Dentries that are roots of subtrees potentially unconnected to root are chained in a list headed by the s_anon field of the superblock, with links in the fields d_hash. These lists are protected by the spinlock dcache_lock. They are grown in dcache.c:d_alloc_anon() and shrunk in super.c:generic_shutdown_super(). See the discussion in Documentation/filesystems/Exporting.
There are lists of inodes to be written out, headed at the s_dirty (resp. s_io) field of the superblock, with links in the fields i_list. These lists are protected by the spinlock inode_lock. See fs/fs-writeback.c.
An (in-core) inode contains the metadata of a file: its serial number, its protection (mode), its owner, its size, the dates of last access, creation and last modification, etc. It also points to the superblock of the filesystem the file is in, the methods for this file, and the dentries (names) for this file.
struct inode {
    unsigned long i_ino;
    umode_t i_mode;
    uid_t i_uid;
    gid_t i_gid;
    kdev_t i_rdev;
    loff_t i_size;
    struct timespec i_atime;
    struct timespec i_ctime;
    struct timespec i_mtime;
    struct super_block *i_sb;
    struct inode_operations *i_op;
    struct address_space *i_mapping;
    struct list_head i_dentry;
    ...
}
In early times, struct inode would end with a union

    union {
        struct minix_inode_info minix_i;
        struct ext2_inode_info ext2_i;
        struct ext3_inode_info ext3_i;
        struct hpfs_inode_info hpfs_i;
        ...
    } u;

to store the filesystem-type specific stuff. One could go from inode to e.g. struct ext3_inode_info via inode->u.ext3_i.
This setup was rather unsatisfactory, since it meant that a core data structure had to know about all possible filesystem types (even possible out-of-tree ones) and reserve enough room for the largest of the structs foofs_inode_info. It also wasted memory.
In Linux 2.5.3 this system was changed, and instead of a big struct inode having a filesystem-type dependent part, we now have big filesystem-type dependent inodes, with a VFS part. Thus, struct ext3_inode_info has as its last field struct inode vfs_inode, and given the VFS inode inode one finds the ext3 information via EXT3_I(inode), defined as container_of(inode, struct ext3_inode_info, vfs_inode). See also the discussion of container_of.
The methods of an inode are given in the struct inode_operations.
struct inode_operations {
    int (*create) (struct inode *, struct dentry *, int);
    struct dentry * (*lookup) (struct inode *, struct dentry *);
    int (*link) (struct dentry *, struct inode *, struct dentry *);
    int (*unlink) (struct inode *, struct dentry *);
    int (*symlink) (struct inode *, struct dentry *, const char *);
    int (*mkdir) (struct inode *, struct dentry *, int);
    int (*rmdir) (struct inode *, struct dentry *);
    int (*mknod) (struct inode *, struct dentry *, int, dev_t);
    int (*rename) (struct inode *, struct dentry *,
                   struct inode *, struct dentry *);
    int (*readlink) (struct dentry *, char *, int);
    int (*follow_link) (struct dentry *, struct nameidata *);
    void (*truncate) (struct inode *);
    int (*permission) (struct inode *, int);
    int (*setattr) (struct dentry *, struct iattr *);
    int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
    int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
    ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*removexattr) (struct dentry *, const char *);
};
Each inode is on four lists, with links through the fields i_hash, i_list, i_dentry, i_devices.
All dentries belonging to this inode (names for this file) are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.
All inodes live in a hash table, with hash collision chains through the field i_hash of the inode. These lists are protected by the spinlock inode_lock. The appropriate head is found by a hash function; it will be an element of the inode_hashtable[] array when the inode belongs to a superblock, or anon_hash_chain if not.
Inodes are collected into lists that use the i_list field as link field. The lists are protected by the spinlock inode_lock. An inode is either unused, and then on the chain with head inode_unused, or in use but not dirty, and then on the chain with head inode_in_use, or dirty, and then on one of the per-superblock lists with heads s_dirty or s_io, see above.
Inodes belonging to a given block device are collected into a list headed by the bd_inodes field of the block device, with links in the inode i_devices fields. The list is protected by the bdev_lock spinlock. It is used to set the i_bdev field to NULL and to reset i_mapping when the block device goes away.
The dentries encode the filesystem tree structure, the names of the files. Thus, the main parts of a dentry are the inode (if any) that belongs to it, the name (the final part of the pathname), and the parent (the name of the containing directory). There are also the superblocks, the methods, a list of subdirectories, etc.
struct dentry {
    struct inode *d_inode;
    struct dentry *d_parent;
    struct qstr d_name;
    struct super_block *d_sb;
    struct dentry_operations *d_op;
    struct list_head d_subdirs;
    ...
}
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, int);
    int (*d_hash) (struct dentry *, struct qstr *);
    int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
    int (*d_delete)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *);
};
Here the strings are given by

    struct qstr {
        const unsigned char *name;
        unsigned int len;
        unsigned int hash;
    };
Each dentry is on five lists, with links through the fields d_hash, d_lru, d_child, d_subdirs, d_alias. Some of these names were badly chosen, and lead to confusion. We should do a global replace changing d_subdirs into d_children and d_child into d_sibling.
The pathname represented by a dentry is the concatenation of the name of its parent d_parent, a slash character, and its own name d_name. However, if the dentry is the root of a mounted filesystem (i.e., if dentry->d_covers != dentry), then its pathname is the pathname of the mount point d_covers. Finally, the pathname of the root of the filesystem (with dentry->d_parent == dentry) is "/", and this is also its d_name.
The d_mounts and d_covers fields of a dentry point back to the dentry itself, except that the d_covers field of the dentry for the root of a mounted filesystem points back to the dentry for the mount point, while the d_mounts field of the dentry for the mount point points at the dentry for the root of the mounted filesystem.
The d_parent field of a dentry points back to the dentry for the directory in which it lives. It points back to the dentry itself in the case of the root of a filesystem.
A dentry is called negative if it does not have an associated inode, i.e., if it is a name only.
We see that although a dentry represents a pathname, there may be several dentries for the same pathname, namely when overmounting has taken place. Such dentries have different inodes.
Of course the converse, an inode with several dentries, can also occur.
The above description, with d_mounts and d_covers, was for 2.4. In 2.5 these fields have disappeared, and we only have the integer d_mounted that indicates how many filesystems have been mounted at that point. In case it is nonzero (this is what d_mountpoint() tests), a hash table lookup can find the actual mounted filesystem.
Dentries are used to speed up the lookup operation. A hash table dentry_hashtable is used, with an index that is a hash of the name and the parent. The hash collision chain has links through the dentry fields d_hash. This chain is protected by the spinlock dcache_lock.
All unused dentries are collected in a list dentry_unused with links in the dentry fields d_lru. This list is protected by the spinlock dcache_lock.
All subdirectories of a given directory are collected in a list headed by the dentry field d_subdirs with links in the dentry fields d_child. These lists are protected by the spinlock dcache_lock.
All dentries belonging to the same inode are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.
File structures represent open files, that is, an inode together with a current (reading/writing) offset. The offset can be set by the lseek() system call. Note that instead of a pointer to the inode we have a pointer to the dentry - that means that the name used to open a file is known. In particular, system calls like getcwd() are possible.
struct file {
    struct dentry *f_dentry;
    struct vfsmount *f_vfsmnt;
    struct file_operations *f_op;
    mode_t f_mode;
    loff_t f_pos;
    struct fown_struct f_owner;
    unsigned int f_uid, f_gid;
    unsigned long f_version;
    ...
}
Here the f_owner field gives the owner to use for async I/O signals.
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, char *, size_t, loff_t);
    ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
    ssize_t (*aio_write) (struct kiocb *, const char *, size_t, loff_t);
    int (*readdir) (struct file *, void *, filldir_t);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, struct dentry *, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long,
                                       unsigned long, unsigned long, unsigned long);
};
Each file is on two lists, with links through the fields f_list and f_ep_links.
The list with links through f_list was discussed above. It is the list of all files belonging to a given superblock. There is a second use: the tty driver collects all files that are opened instances of a tty in a list headed by tty->tty_files with links through the file field f_list. Conversely, these files point back at the tty via their field private_data.
(This field private_data is also used elsewhere. For example, the proc code uses it to attach a struct seq_file to a file.)
All event poll items belonging to a given file are collected in a list with head f_ep_links, protected by the file field f_ep_lock. (For event poll stuff, see epoll_ctl(2).)
A struct vfsmount describes a mount. The definition lives in mount.h:
struct vfsmount {
    struct list_head mnt_hash;
    struct vfsmount *mnt_parent;    /* fs we are mounted on */
    struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
    struct dentry *mnt_root;        /* root of the mounted tree */
    struct super_block *mnt_sb;     /* pointer to superblock */
    struct list_head mnt_mounts;    /* list of children, anchored here */
    struct list_head mnt_child;     /* and going through their mnt_child */
    atomic_t mnt_count;
    int mnt_flags;
    char *mnt_devname;              /* Name of device e.g. /dev/dsk/hda1 */
    struct list_head mnt_list;
};
Long ago (1.3.46) it was introduced as part of the quota code. There was a linked list of struct vfsmounts that contained a device number, device name, mount point name, mount flags, superblock pointer, semaphore, file pointers to quota files and time limits for how long an over-quota situation would be allowed. Nowadays quotas have independent bookkeeping, and a struct vfsmount only describes a mount.
These structs are allocated by alloc_vfsmnt() and released by free_vfsmnt() in namespace.c. Vfsmounts live in a hash headed by mount_hashtable[]. The field mnt_hash is the link in the collision chain. This list does not seem to be protected by a lock. They are put into the hash by attach_mnt(), found there by lookup_mnt(), and removed again by detach_mnt(), all from namespace.c.
The field mnt_parent is the vfsmount for the parent. The field mnt_mountpoint is the dentry for the mountpoint. The pair (mnt_mountpoint, mnt_parent) (returned by follow_up()) will be the dentry and vfsmount for the parent. It is used e.g. in d_path to return the pathname of a dentry.
The field mnt_root is the dentry for the root of the mounted tree.
The field mnt_sb points at the superblock of the mounted filesystem.
The field mnt_mounts of a struct vfsmount is the head of a cyclic list of all submounts (mounts on top of some path relative to the present mount). The remaining links of this cyclic list are stored in the mnt_child fields of its submounting vfsmounts. (And each of these points back at us with its mnt_parent field.) It is used in autofs4/expire.c and namespace.c (and nowhere else).
The field mnt_count keeps track of the users of this structure. It is incremented by mntget, decremented by mntput. Initially 1. It will be 2 for a mount that may be unmounted. (Autofs also uses this to test whether a tree is busy.)
The field mnt_flags holds the mount flags, like MNT_NODEV, MNT_NOEXEC, MNT_NOSUID. Earlier there were also MS_RDONLY (now stored in sb->s_flags) and MNT_VISIBLE (came in 2.4.0-test5, went in 2.4.5), which told whether this entry should be visible in /proc/mounts.
The field mnt_devname is the name used in /proc/mounts.
There used to be a global cyclic list vfsmntlist containing all mounts, used only to create the contents of /proc/mounts. These days we have per-process namespaces, and the global vfsmntlist has been replaced by current->namespace->list. This list is ordered by the order in which the mounts were done, so that one can do the umounts in reverse order. The field mnt_list contains the pointers for this cyclic list.
A struct fs_struct determines the interpretation of pathnames referred to by a process (and also, somewhat illogically, contains the umask). The typical reference is current->fs. The definition lives in fs_struct.h:
struct fs_struct {
    atomic_t count;
    rwlock_t lock;
    int umask;
    struct dentry *root, *pwd, *altroot;
    struct vfsmount *rootmnt, *pwdmnt, *altrootmnt;
};
The semantics of root and pwd are clear. It remains to discuss altroot.
In order to support emulation of different operating systems like BSD and SunOS and Solaris, a small wart has been added to the walk_init_root code that finds the root directory for a name lookup. The altroot field of an fs_struct is usually NULL. It is a function of the personality and the current root, and the sys_personality and sys_chroot system calls call set_fs_altroot().
The effect is determined at kernel compile time. One can define __emul_prefix() in <asm/namei.h> as some pathname, say "usr/gnemul/myOS/". The default is NULL, but some architectures have a definition depending on current->personality. If this prefix is non-NULL, and the corresponding file is found, then set_fs_altroot() will set the altroot and altrootmnt fields of current->fs to the dentry and vfsmnt of that file.
A subsequent lookup of a pathname starting with '/' will now first try to use the altroot. If that fails the usual root is used.
A struct nameidata represents the result of a lookup. The definition lives in fs.h:
struct nameidata {
    struct dentry *dentry;
    struct vfsmount *mnt;
    struct qstr last;
    unsigned int flags;
    int last_type;
};
The typical use is:

    struct nameidata nd;

    error = user_path_walk(filename, &nd);
    if (!error)
        path_release(&nd);
where path_release() does

    dput(nd->dentry);
    mntput(nd->mnt);
The core of the routines user_path_walk_link and user_path_walk (which call __user_walk without or with the LOOKUP_FOLLOW flag) is the fragment
    if (path_init(name, flags, nd))
        error = path_walk(name, nd);
So the basic routines handling a nameidata are path_init and path_walk. The former finds the start of the walk, the latter does the walking. (However, the former returns 0 in case it did the walking itself already.)
The routine path_init initialises the four fields dentry, mnt, flags, last_type. The flags field was given as an argument, and dentry and mnt are initialised to those of the current directory or those of the root directory, depending on whether the name starts with a '/' or not. It will always return 1 except in a certain obscure case discussed below, where the return 0 means that the complete lookup was done already. (And this case cannot occur for sys_chroot; that is why the code there need not check the return value.)
(path_init will always return 1, except when the name starts with a '/', in which case it returns whatever walk_init_root returns. walk_init_root will always return 1, except when current->fs->altroot is non-NULL and nd->flags does not contain LOOKUP_NOALT (for sys_chroot it does) and __emul_lookup_dentry succeeds, which it does when path_walk succeeds - in this case no further path_walk is required anymore.)