Commits · fb045adb99d9b7c562dc7fef834857f78249daa1 · Bricked / flo

07 Jan, 2011 6 commits

fs: dcache reduce branches in lookup path · fb045adb

Nick Piggin authored 14 years ago

Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
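
A minimal user-space sketch of the idea, not the kernel code: the flag bits, the `demo_` names and the struct layout are all hypothetical. It shows a d_set_d_op()-style setter caching which hooks exist in a flags word, so the common path tests one flag instead of loading `d_op` and then the hook pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define DEMO_OP_HASH       0x1u
#define DEMO_OP_COMPARE    0x2u
#define DEMO_OP_REVALIDATE 0x4u

struct demo_dentry;

struct demo_dentry_ops {
    int (*d_hash)(const struct demo_dentry *);
    int (*d_compare)(const struct demo_dentry *, const char *);
    int (*d_revalidate)(struct demo_dentry *);
};

struct demo_dentry {
    unsigned int flags;
    const struct demo_dentry_ops *d_op;
};

/* Install the ops table and record which hooks it provides. */
void demo_set_d_op(struct demo_dentry *d, const struct demo_dentry_ops *op)
{
    assert(d != NULL);
    d->d_op = op;
    d->flags &= ~(DEMO_OP_HASH | DEMO_OP_COMPARE | DEMO_OP_REVALIDATE);
    if (op == NULL)
        return;
    if (op->d_hash)       d->flags |= DEMO_OP_HASH;
    if (op->d_compare)    d->flags |= DEMO_OP_COMPARE;
    if (op->d_revalidate) d->flags |= DEMO_OP_REVALIDATE;
}

/* Lookup-path check: d_op is only dereferenced when the flag is set. */
int demo_revalidate(struct demo_dentry *d)
{
    if (!(d->flags & DEMO_OP_REVALIDATE))
        return 1;                      /* common case: nothing to call */
    return d->d_op->d_revalidate(d);
}

/* Example hook that always reports the dentry as stale. */
int demo_always_stale(struct demo_dentry *d)
{
    (void)d;
    return 0;
}
```

Because the flags are derived from the table, every assignment has to go through the setter, which is presumably why the sed script above rewrites direct `d_op` assignments.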

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fb045adb

fs: fs_struct use seqlock · c28cc364

Nick Piggin authored 14 years ago


Use a seqlock in the fs_struct to enable us to take an atomic copy of the
complete cwd and root paths. Use this in the RCU lookup path to avoid a
thread-shared spinlock in RCU lookup operations.
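
As a rough illustration of how a sequence counter gives readers a consistent copy of two fields without taking a lock, here is a user-space sketch using C11 atomics. The `demo_` names and the integer stand-ins for root and cwd are assumptions. It also assumes a single writer, whereas a real seqlock serialises writers with a spinlock.

```c
#include <assert.h>
#include <stdatomic.h>

struct demo_fs {
    atomic_uint seq;   /* even: stable, odd: update in progress */
    atomic_int  root;  /* stand-in for the root path */
    atomic_int  pwd;   /* stand-in for the cwd path */
};

/* Writer side (single writer assumed): bump seq around the update. */
void demo_fs_set(struct demo_fs *fs, int root, int pwd)
{
    assert(fs);
    atomic_fetch_add(&fs->seq, 1u);    /* now odd */
    atomic_store(&fs->root, root);
    atomic_store(&fs->pwd, pwd);
    atomic_fetch_add(&fs->seq, 1u);    /* even again */
}

/* Reader side: retry until both fields were read under one even seq. */
void demo_fs_snapshot(struct demo_fs *fs, int *root, int *pwd)
{
    unsigned int start;

    assert(fs && root && pwd);
    do {
        start = atomic_load(&fs->seq);
        *root = atomic_load(&fs->root);
        *pwd  = atomic_load(&fs->pwd);
    } while ((start & 1u) || atomic_load(&fs->seq) != start);
}
```

The sketch uses sequentially consistent atomics throughout for simplicity. Readers never write to shared memory, which is the property the commit relies on for threads that share one fs_struct.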

Multi-threaded apps may now perform path lookups with scalability matching
multi-process apps. Operations such as stat(2) become very scalable for
multi-threaded workloads.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

c28cc364

fs: rcu-walk for path lookup · 31e6b01f

Nick Piggin authored 14 years ago


Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
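
The retry protocol described above can be sketched in user space roughly as follows. This is a single-threaded toy: the `demo_` names and the flag value are hypothetical, locking is omitted, and the check uses the sequence number sampled during the walk.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_LOOKUP_RCU 0x1u   /* hypothetical value for this sketch */

struct demo_dentry {
    unsigned int seq;   /* bumped whenever name/parent/inode change */
    int refcount;
};

/* Try to leave rcu-walk at d: succeeds only if d is unchanged since
 * seen_seq was sampled, otherwise asks the caller to start over. */
int demo_drop_rcu(struct demo_dentry *d, unsigned int seen_seq)
{
    assert(d);
    if (d->seq != seen_seq)
        return -ECHILD;
    d->refcount++;      /* from here on the dentry is pinned */
    return 0;
}

/* One walk attempt in either mode. */
int demo_walk(struct demo_dentry *d, unsigned int flags, unsigned int seen_seq)
{
    if (flags & DEMO_LOOKUP_RCU)
        return demo_drop_rcu(d, seen_seq);
    d->refcount++;      /* ref-walk takes its reference directly */
    return 0;
}

/* Top level: optimistic rcu-walk first, full ref-walk on -ECHILD. */
int demo_lookup(struct demo_dentry *d, unsigned int seen_seq)
{
    int err = demo_walk(d, DEMO_LOOKUP_RCU, seen_seq);

    if (err == -ECHILD)
        err = demo_walk(d, 0u, seen_seq);
    return err;
}
```

The caller never sees -ECHILD here. It only pays for a second walk when the optimistic one could not be completed.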

Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

31e6b01f

fs: dcache remove dcache_lock · b5c84bf6

Nick Piggin authored 14 years ago


dcache_lock no longer protects anything. Remove it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

b5c84bf6

fs: dcache scale dentry refcount · b7ab39f6

Nick Piggin authored 14 years ago

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
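
A hedged user-space sketch of why a lock-protected plain counter makes "zero stays zero" easy to enforce. The `demo_` names are hypothetical and an `atomic_flag` spinlock stands in for d_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_dentry {
    atomic_flag lock;   /* stand-in for d_lock */
    int count;          /* plain int, only touched with lock held */
};

static void demo_lock(struct demo_dentry *d)
{
    while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
        ;               /* spin until the holder releases */
}

static void demo_unlock(struct demo_dentry *d)
{
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Take a reference only if the dentry is still in use. The test and
 * the increment are one critical section, so 0 can never become 1. */
bool demo_get_if_live(struct demo_dentry *d)
{
    bool ok;

    assert(d);
    demo_lock(d);
    ok = d->count > 0;
    if (ok)
        d->count++;
    demo_unlock(d);
    return ok;
}

/* Drop a reference; returns true when this was the last one. */
bool demo_put(struct demo_dentry *d)
{
    bool last;

    demo_lock(d);
    assert(d->count > 0);
    last = (--d->count == 0);
    demo_unlock(d);
    return last;
}
```

Other fields guarded by the same lock can be checked or updated in the same critical section as the count, which matches the "fairly natural" remark above.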
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

b7ab39f6

fs: change d_hash for rcu-walk · b1e6a015

Nick Piggin authored 14 years ago


Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.

For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

b1e6a015

07 Dec, 2010 1 commit

fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called · b1085ba8

Lino Sanfilippo authored 14 years ago


Unsetting FMODE_NONOTIFY in fsnotify_open() is too late, since fsnotify_perm()
is called before it. If FMODE_NONOTIFY is set, fsnotify_perm() will skip permission
checks, so a user can still disable permission checks by setting this flag
in an open() call.
This patch corrects this by unsetting the flag before fsnotify_perm is called.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>

b1085ba8

29 Oct, 2010 1 commit

fix open/umount race · d893f1bc

Al Viro authored 14 years ago


nameidata_to_filp() drops nd->path or transfers it to opened
file.  In the former case it's a Bad Idea(tm) to do mnt_drop_write()
on nd->path.mnt, since we might race with umount and vfsmount in
question might be gone already.

Fix: don't drop it, then...  IOW, have nameidata_to_filp() grab nd->path
in case it transfers it to file and do path_drop() in callers.  After
they are through with accessing nd->path...
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d893f1bc

26 Oct, 2010 2 commits

new helper: ihold() · 7de9c6ee

Al Viro authored 14 years ago


Clones an existing reference to inode; caller must already hold one.
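
A small user-space sketch of that contract, with hypothetical `demo_` names. The helper may only raise a count that is already at least one, so it can never resurrect an inode that is being torn down.

```c
#include <assert.h>
#include <stdatomic.h>

struct demo_inode {
    atomic_int i_count;
};

/* Clone a reference the caller already owns. */
void demo_ihold(struct demo_inode *inode)
{
    int old = atomic_fetch_add(&inode->i_count, 1);

    /* old == 0 would mean the caller had no reference to clone. */
    assert(old >= 1);
    (void)old;          /* keeps NDEBUG builds warning-free */
}
```
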
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7de9c6ee

fs: move permission check back into __lookup_hash · 81fca444

Christoph Hellwig authored 14 years ago


The caller that didn't need it is gone.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

81fca444

18 Aug, 2010 4 commits

fs: brlock vfsmount_lock · 99b7db7b

Nick Piggin authored 14 years ago


Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.

A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.

The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability will not be
much improved in common cases yet, due to other locks (ie. dcache_lock) getting
in the way. However path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).
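
To make the read/write asymmetry concrete, here is a toy user-space model of a "big reader" lock. It assumes one spinlock per CPU, with readers taking only their own CPU's lock and writers taking all of them. The CPU count and `demo_` names are hypothetical, and the caller passes the CPU index explicitly.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NR_CPUS 4   /* hypothetical CPU count */

struct demo_brlock {
    atomic_flag cpu_lock[DEMO_NR_CPUS];
};

static void demo_spin_lock(atomic_flag *f)
{
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
        ;                /* spin */
}

static void demo_spin_unlock(atomic_flag *f)
{
    atomic_flag_clear_explicit(f, memory_order_release);
}

void demo_br_init(struct demo_brlock *l)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        atomic_flag_clear(&l->cpu_lock[i]);
}

/* Read side: touch only the local CPU's lock (no shared cacheline). */
void demo_br_read_lock(struct demo_brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    demo_spin_lock(&l->cpu_lock[cpu]);
}

void demo_br_read_unlock(struct demo_brlock *l, int cpu)
{
    demo_spin_unlock(&l->cpu_lock[cpu]);
}

/* Write side: take every CPU's lock in a fixed order, excluding all
 * readers. The cost grows with the number of CPUs. */
void demo_br_write_lock(struct demo_brlock *l)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        demo_spin_lock(&l->cpu_lock[i]);
}

void demo_br_write_unlock(struct demo_brlock *l)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        demo_spin_unlock(&l->cpu_lock[i]);
}
```

In this model a writer's cost is linear in the number of CPUs, while readers on different CPUs never contend with each other.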

The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
took 6.8s, afterwards took 7.1s, about 5% slower.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

99b7db7b

fs: remove extra lookup in __lookup_hash · b04f784e

Nick Piggin authored 14 years ago


Optimize lookup for create operations, where finding no dentry is often the
common case. In cases where it is not, such as unlink, the added overhead
is much smaller than the overhead removed.

Also, move comments about __d_lookup raciness to the __d_lookup call site.
d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
vein, add kerneldoc comments to __d_lookup and clean up some of the comments:

- We are interested in how the RCU lookup works here, particularly with
  renames. Make that explicit, and point to the document where it is explained
  in more detail.
- RCU is pretty standard now, and macros make implementations pretty mindless.
  If we want to know about RCU barrier details, we look in RCU code.
- Delete some boring legacy comments because we don't care much about how the
  code used to work, more about the interesting parts of how it works now. So
  comments about lazy LRU may be interesting, but would better be done in the
  LRU or refcount management code.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b04f784e

fs: dentry allocation consolidation · baa03890

Nick Piggin authored 14 years ago


There are 2 duplicate copies of code in dentry allocation in path lookup.
Consolidate them into a single function.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

baa03890

fs: fix do_lookup false negative · 2e2e88ea

Nick Piggin authored 14 years ago


In do_lookup, if we initially find no dentry, we take the directory i_mutex and
re-check the lookup. If we find a dentry there, then we revalidate it if
needed. However if that revalidate asks for the dentry to be invalidated, we
return -ENOENT from do_lookup. What should happen instead is an attempt to
allocate and lookup a new dentry.

This is probably not noticed because it is rare. It is only reached if a
concurrent create races in first (in which case, the dentry probably won't be
invalidated anyway), or if the racy __d_lookup has failed due to a
false-negative (which is very rare).

Fix this by removing code and having it use the normal reval path.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2e2e88ea

11 Aug, 2010 1 commit

vfs: add helpers to get root and pwd · f7ad3c6b

Miklos Szeredi authored 14 years ago


Add three helpers that retrieve a refcounted copy of the root and cwd
from the supplied fs_struct.

 get_fs_root()
 get_fs_pwd()
 get_fs_root_and_pwd()
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

f7ad3c6b

02 Aug, 2010 2 commits

security: make LSMs explicitly mask off permissions · d09ca739

Eric Paris authored 14 years ago


SELinux needs to pass the MAY_ACCESS flag so it can handle auditing
correctly.  Presently the masking of MAY_* flags is done in the VFS.  In
order to allow LSMs to decide what flags they care about and what flags
they don't, just pass them all and have each LSM mask off what it doesn't
need.  This patch should contain no functional changes to either the VFS or
any LSM.
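
A hedged sketch of the division of labour described above, in plain C. The `DEMO_MAY_*` values and the two example modules are assumptions for illustration, not the kernel's definitions or any real LSM.

```c
#include <assert.h>

/* Illustrative bit values; treat them as assumptions of this sketch. */
#define DEMO_MAY_EXEC   0x01
#define DEMO_MAY_WRITE  0x02
#define DEMO_MAY_READ   0x04
#define DEMO_MAY_APPEND 0x08
#define DEMO_MAY_ACCESS 0x10
#define DEMO_MAY_OPEN   0x20

/* VFS side after the change: hand the full mask to the module. */
int demo_vfs_permission(int mask, int (*lsm_hook)(int mask))
{
    assert(lsm_hook);
    return lsm_hook(mask);
}

/* A module that only understands rwx and append masks off the rest.
 * Returns the bits it would actually evaluate. */
int demo_simple_lsm(int mask)
{
    mask &= DEMO_MAY_READ | DEMO_MAY_WRITE | DEMO_MAY_EXEC | DEMO_MAY_APPEND;
    return mask;
}

/* A module that also wants to see MAY_ACCESS (e.g. for auditing). */
int demo_auditing_lsm(int mask)
{
    mask &= DEMO_MAY_READ | DEMO_MAY_WRITE | DEMO_MAY_EXEC |
            DEMO_MAY_APPEND | DEMO_MAY_ACCESS;
    return mask;
}
```

With the mask applied inside each hook, adding a flag that one module cares about does not change what the others see.
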
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>

d09ca739

LSM: Remove unused arguments from security_path_truncate(). · ea0d3ab2

Tetsuo Handa authored 14 years ago

When commit be6d3e56 "introduce new LSM hooks where vfsmount is available."
was proposed, regarding security_path_truncate(), only "struct file *" argument
(which AppArmor wanted to use) was removed.
But length and time_attrs arguments are not used by TOMOYO nor AppArmor.
Thus, let's remove these arguments.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: James Morris <jmorris@namei.org>

ea0d3ab2

28 Jul, 2010 1 commit

fsnotify: use unsigned char * for dentry->d_name.name · 59b0df21

Eric Paris authored 15 years ago


fsnotify was using char * when it passed around the d_name.name string
internally but it is actually an unsigned char *.  This patch switches
fsnotify to use unsigned and should silence some pointer signedness warnings
which have popped out of xfs.  I do not add -Wpointer-sign to the fsnotify
code as there are still issues with kstrdup and strlen which would pop
out needless warnings.
Signed-off-by: Eric Paris <eparis@redhat.com>

59b0df21

28 May, 2010 1 commit

VFS: fix recent breakage of FS_REVAL_DOT · 176306f5

Neil Brown authored 14 years ago

Commit 1f36f774 broke FS_REVAL_DOT semantics.

In particular, before this patch, the command
   ls -l
in an NFS mounted directory would always check if the directory on the server
had changed and if so would flush and refill the pagecache for the dir.
After this patch, the same "ls -l" will repeatedly return stale data until
the cached attributes for the directory time out.

The following patch fixes this by ensuring the d_revalidate is called by
do_last when "." is being looked-up.
link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
is not set so nfs_lookup_verify_inode chooses not to do any validation.

The following patch restores the original behaviour.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

176306f5

21 May, 2010 1 commit

namei.c : update mnt when it needed · 9a229683

Huang Shijie authored 15 years ago


update the mnt of the path when it is not equal to the new one.
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9a229683

15 May, 2010 1 commit

Fix the regression created by "set S_DEAD on unlink()..." commit · d83c49f3

Al Viro authored 14 years ago


1) i_flags simply doesn't work for mount/unlink race prevention;
we may have many links to file and rm on one of those obviously
shouldn't prevent bind on top of another later on.  To fix it
the right way we need to mark _dentry_ as unsuitable for mounting
upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
i_mutex on the inode in question.  Set it (with dont_mount(dentry))
in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
in namespace.c that used to check for S_DEAD.  Setting S_DEAD
is still needed in places where we used to set it (for directories
getting killed), since we rely on it for readdir/rmdir race
prevention.

2) rename()/mount() protection has another bogosity - we unhash
the target before we'd checked that it's not a mountpoint.  Fixed.

3) ancient bogosity in pivot_root() - we locked i_mutex on the
right directory, but checked S_DEAD on the different (and wrong)
one.  Noticed and fixed.
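
Point 1 can be illustrated with a toy model of two hard links to one inode. The `demo_` names and the flag bit are hypothetical, and the locking the message mentions is left out. Marking the dentry rather than the inode keeps the surviving link mountable.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_CANT_MOUNT 0x1u   /* hypothetical flag bit */

struct demo_inode {
    int nlink;
};

struct demo_dentry {
    unsigned int flags;
    struct demo_inode *inode;
};

/* Mark this name, not the underlying inode, as unsuitable for mounting. */
void demo_dont_mount(struct demo_dentry *d)
{
    assert(d);
    d->flags |= DEMO_CANT_MOUNT;
}

bool demo_cant_mount(const struct demo_dentry *d)
{
    return (d->flags & DEMO_CANT_MOUNT) != 0;
}

/* Removing one name only poisons that dentry. */
void demo_unlink(struct demo_dentry *d)
{
    demo_dont_mount(d);
    d->inode->nlink--;
}
```

A flag stored in the shared inode would have blocked mounts on every remaining link, which is the problem point 1 describes.
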
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d83c49f3

13 May, 2010 1 commit

vfs: Fix O_NOFOLLOW behavior for paths with trailing slashes · 002baeec

Jan Kara authored 14 years ago


According to specification

	mkdir d; ln -s d a; open("a/", O_NOFOLLOW | O_RDONLY)

should return success but currently it returns ELOOP.  This is a
regression caused by path lookup cleanup patch series.

Fix the code to ignore O_NOFOLLOW in case the provided path has trailing
slashes.
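
The reproducer above can be turned into a small C check. This is a sketch that assumes a POSIX system with a writable /tmp. The helper name and return convention are made up for illustration, and the result depends on the kernel it runs on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Builds d and a -> d in a scratch directory, then opens "a" and "a/"
 * with O_NOFOLLOW. Returns 0 if open("a/") succeeded, else its errno.
 * *plain_errno receives the errno of open("a") (0 on success). */
int demo_nofollow_trailing_slash(int *plain_errno)
{
    char tmpl[] = "/tmp/nofollow-demo-XXXXXX";
    char dir[64], lnk[64], lnk_slash[64];
    int fd, result;

    assert(plain_errno);
    *plain_errno = 0;
    if (!mkdtemp(tmpl))
        return errno;
    snprintf(dir, sizeof dir, "%s/d", tmpl);
    snprintf(lnk, sizeof lnk, "%s/a", tmpl);
    snprintf(lnk_slash, sizeof lnk_slash, "%s/a/", tmpl);

    if (mkdir(dir, 0700) != 0 || symlink("d", lnk) != 0) {
        result = errno;
        goto cleanup;
    }

    /* No trailing slash: O_NOFOLLOW refuses the symlink itself. */
    fd = open(lnk, O_NOFOLLOW | O_RDONLY);
    *plain_errno = fd < 0 ? errno : 0;
    if (fd >= 0)
        close(fd);

    /* Trailing slash: the path names the directory behind the link. */
    fd = open(lnk_slash, O_NOFOLLOW | O_RDONLY);
    result = fd < 0 ? errno : 0;
    if (fd >= 0)
        close(fd);

cleanup:
    unlink(lnk);
    rmdir(dir);
    rmdir(tmpl);
    return result;
}
```

On a kernel with this fix the function should return 0. On an affected kernel it would return ELOOP for the trailing-slash case.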

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

002baeec

26 Mar, 2010 1 commit

Restore LOOKUP_DIRECTORY hint handling in final lookup on open() · 3e297b61

Al Viro authored 15 years ago


	Lose want_dir argument, while we are at it - since now
nd->flags & LOOKUP_DIRECTORY is equivalent to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3e297b61

06 Mar, 2010 1 commit

Fix a dumb typo - use of & instead of && · 781b1677

Al Viro authored 15 years ago

We managed to lose O_DIRECTORY testing due to a stupid typo in commit
1f36f774 ("Switch !O_CREAT case to use of do_last()")
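
The log does not include the diff, so the following only illustrates this class of bug; it does not reproduce the actual line. The flag value and the `demo_` names are assumptions. A bitwise `&` between a high flag bit and the 0/1 result of `!` is always zero, so the check silently never fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_O_DIRECTORY 0200000   /* illustrative flag value */

/* Buggy form: bitwise AND of a high bit with a 0-or-1 value. */
bool demo_reject_buggy(int open_flag, const void *lookup_op)
{
    int want_dir = open_flag & DEMO_O_DIRECTORY;

    /* The flag does not occupy bit 0, so it never overlaps !ptr. */
    assert((DEMO_O_DIRECTORY & 1) == 0);
    return want_dir & !lookup_op;      /* always 0 */
}

/* Fixed form: logical AND treats both sides as truth values. */
bool demo_reject_fixed(int open_flag, const void *lookup_op)
{
    int want_dir = open_flag & DEMO_O_DIRECTORY;

    return want_dir && !lookup_op;
}
```

Here `lookup_op == NULL` stands for "this object cannot be used as a directory". Only the fixed form rejects the open in that case.
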
Reported-by: Walter Sheets <w41ter@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

781b1677

05 Mar, 2010 16 commits

Switch !O_CREAT case to use of do_last() · 1f36f774

Al Viro authored 15 years ago


... and now we have all intents crap well localized
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1f36f774

Get rid of symlink body copying · def4af30

Al Viro authored 15 years ago


Now that nd->last stays around until ->put_link() is called, we can
just postpone that ->put_link() in do_filp_open() a bit and don't
bother with copying.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

def4af30

Finish pulling of -ESTALE handling to upper level in do_filp_open() · 3866248e

Al Viro authored 15 years ago


Don't bother with path_walk() (and its retry loop); link_path_walk()
will do it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3866248e

Turn do_link spaghetty into a normal loop · 806b681c

Al Viro authored 15 years ago

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

806b681c

Unify exits in O_CREAT handling · 10fa8e62

Al Viro authored 15 years ago

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

10fa8e62

Kill is_link argument of do_last() · 9e67f361

Al Viro authored 15 years ago


We set it to 1 iff we return NULL
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9e67f361

Pull handling of LAST_BIND into do_last(), clean up ok: part in do_filp_open() · 67ee3ad2

Al Viro authored 15 years ago


Note that in case of !O_CREAT we know that nd.root has already been given up
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

67ee3ad2

Leave mangled flag only for setting nd.intent.open.flag · 4296e2cb

Al Viro authored 15 years ago

Nothing else uses it anymore
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4296e2cb

Get rid of passing mangled flag to do_last() · 5b369df8

Al Viro authored 15 years ago

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5b369df8

Don't pass mangled open_flag to finish_open() · 9a66179e

Al Viro authored 15 years ago

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9a66179e

pull more into do_last() · a2c36b45

Al Viro authored 15 years ago


Handling of LAST_DOT/LAST_ROOT/LAST_DOTDOT/terminating slash
can be pulled in as well
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a2c36b45

bail out with ELOOP earlier in do_link loop · c99658fe

Al Viro authored 15 years ago


If we'd passed through 32 trailing symlinks already, there's
no sense following the 33rd - we'll bail out anyway.  Better
bugger off earlier.

It *does* change behaviour, after a fashion - if the 33rd happens
to be a procfs-style symlink, original code *would* allow it.
This one will not.  Cry me a river if that hurts you.  Please, do.
And post a video of that, while you are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c99658fe

pull the common predecessors into do_last() · a1e28038

Al Viro authored 15 years ago

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a1e28038

postpone __putname() until after do_last() · c41c1405

Al Viro authored 15 years ago


Since do_last() doesn't mangle nd->last_name, we can safely postpone
__putname() done in handling of trailing symlinks until after the
call of do_last()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c41c1405

unroll do_last: loop in do_filp_open() · 27bff343

Al Viro authored 15 years ago

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

27bff343

Shift releasing nd->root from do_last() to its caller · 3343eb82

Al Viro authored 15 years ago

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3343eb82