1. 11 Mar, 2006 1 commit
  2. 08 Mar, 2006 2 commits
    • Dipankar Sarma's avatar
      [PATCH] fix file counting · 529bf6be
      Dipankar Sarma authored
      
      I have benchmarked this on an x86_64 NUMA system and see no significant
      performance difference on kernbench.  Tested on both x86_64 and powerpc.
      
      The way we do file struct accounting is not very suitable for batched
      freeing.  For scalability reasons, file accounting was
      constructor/destructor based.  This meant that nr_files was decremented
      only when the object was removed from the slab cache.  This is susceptible
      to slab fragmentation.  With RCU based file structure, consequent batched
      freeing and a test program like Serge's, we just speed this up and end up
      with a very fragmented slab -
      
      llm22:~ # cat /proc/sys/fs/file-nr
      587730  0       758844
      
      At the same time, I see only a 2000+ objects in filp cache.  The following
      patch I fixes this problem.
      
      This patch changes the file counting by removing the filp_count_lock.
      Instead we use a separate percpu counter, nr_files, for now and all
      accesses to it are through get_nr_files() api.  In the sysctl handler for
      nr_files, we populate files_stat.nr_files before returning to user.
      
      Counting files as an when they are created and destroyed (as opposed to
      inside slab) allows us to correctly count open files with RCU.
      Signed-off-by: default avatarDipankar Sarma <dipankar@in.ibm.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      529bf6be
    • Linus Torvalds's avatar
      Mark the pipe file operations static · a19cbd4b
      Linus Torvalds authored
      
      They aren't used (nor even really usable) outside of pipe.c anyway
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      a19cbd4b
  3. 01 Feb, 2006 1 commit
  4. 19 Jan, 2006 1 commit
    • Ulrich Drepper's avatar
      [PATCH] vfs: *at functions: core · 5590ff0d
      Ulrich Drepper authored
      
      Here is a series of patches which introduce in total 13 new system calls
      which take a file descriptor/filename pair instead of a single file
      name.  These functions, openat etc, have been discussed on numerous
      occasions.  They are needed to implement race-free filesystem traversal,
      they are necessary to implement a virtual per-thread current working
      directory (think multi-threaded backup software), etc.
      
      We have in glibc today implementations of the interfaces which use the
      /proc/self/fd magic.  But this code is rather expensive.  Here are some
      results (similar to what Jim Meyering posted before).
      
      The test creates a deep directory hierarchy on a tmpfs filesystem.  Then
      rm -fr is used to remove all directories.  Without syscall support I get
      this:
      
      real    0m31.921s
      user    0m0.688s
      sys     0m31.234s
      
      With syscall support the results are much better:
      
      real    0m20.699s
      user    0m0.536s
      sys     0m20.149s
      
      The interfaces are for obvious reasons currently not much used.  But they'll
      be used.  coreutils (and Jeff's posixutils) are already using them.
      Furthermore, code like ftw/fts in libc (maybe even glob) will also start using
      them.  I expect a patch to make follow soon.  Every program which is walking
      the filesystem tree will benefit.
      Signed-off-by: default avatarUlrich Drepper <drepper@redhat.com>
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Acked-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      5590ff0d
  5. 17 Jan, 2006 1 commit
  6. 15 Jan, 2006 1 commit
  7. 10 Jan, 2006 3 commits
  8. 09 Jan, 2006 7 commits
  9. 06 Jan, 2006 2 commits
    • J. Bruce Fields's avatar
      NLM: Further cancel fixes · 64a318ee
      J. Bruce Fields authored
      
       If the server receives an NLM cancel call and finds no waiting lock to
       cancel, then chances are the lock has already been applied, and the client
       just hadn't yet processed the NLM granted callback before it sent the
       cancel.
      
       The Open Group text, for example, perimts a server to return either success
       (LCK_GRANTED) or failure (LCK_DENIED) in this case.  But returning an error
       seems more helpful; the client may be able to use it to recognize that a
       race has occurred and to recover from the race.
      
       So, modify the relevant functions to return an error in this case.
      Signed-off-by: default avatarJ. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      64a318ee
    • Badari Pulavarty's avatar
      [PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing store · f6b3ec23
      Badari Pulavarty authored
      
      Here is the patch to implement madvise(MADV_REMOVE) - which frees up a
      given range of pages & its associated backing store.  Current
      implementation supports only shmfs/tmpfs and other filesystems return
      -ENOSYS.
      
      "Some app allocates large tmpfs files, then when some task quits and some
      client disconnect, some memory can be released.  However the only way to
      release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli
      
      Databases want to use this feature to drop a section of their bufferpool
      (shared memory segments) - without writing back to disk/swap space.
      
      This feature is also useful for supporting hot-plug memory on UML.
      
      Concerns raised by Andrew Morton:
      
      - "We have no plan for holepunching!  If we _do_ have such a plan (or
        might in the future) then what would the API look like?  I think
        sys_holepunch(fd, start, len), so we should start out with that."
      
      - Using madvise is very weird, because people will ask "why do I need to
        mmap my file before I can stick a hole in it?"
      
      - None of the other madvise operations call into the filesystem in this
        manner.  A broad question is: is this capability an MM operation or a
        filesytem operation?  truncate, for example, is a filesystem operation
        which sometimes has MM side-effects.  madvise is an mm operation and with
        this patch, it gains FS side-effects, only they're really, really
        significant ones."
      
      Comments:
      
      - Andrea suggested the fs operation too but then it's more efficient to
        have it as a mm operation with fs side effects, because they don't
        immediatly know fd and physical offset of the range.  It's possible to
        fixup in userland and to use the fs operation but it's more expensive,
        the vmas are already in the kernel and we can use them.
      
      Short term plan &  Future Direction:
      
      - We seem to need this interface only for shmfs/tmpfs files in the short
        term.  We have to add hooks into the filesystem for correctness and
        completeness.  This is what this patch does.
      
      - In the future, plan is to support both fs and mmap apis also.  This
        also involves (other) filesystem specific functions to be implemented.
      
      - Current patch doesn't support VM_NONLINEAR - which can be addressed in
        the future.
      Signed-off-by: default avatarBadari Pulavarty <pbadari@us.ibm.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f6b3ec23
  10. 03 Jan, 2006 1 commit
    • Zach Brown's avatar
      [PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE · 994fc28c
      Zach Brown authored
      
      readpage(), prepare_write(), and commit_write() callers are updated to
      understand the special return code AOP_TRUNCATED_PAGE in the style of
      writepage() and WRITEPAGE_ACTIVATE.  AOP_TRUNCATED_PAGE tells the caller that
      the callee has unlocked the page and that the operation should be tried again
      with a new page.  OCFS2 uses this to detect and work around a lock inversion in
      its aop methods.  There should be no change in behaviour for methods that don't
      return AOP_TRUNCATED_PAGE.
      
      WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
      made enums so that kerneldoc can be used to document their semantics.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      994fc28c
  11. 09 Nov, 2005 2 commits
  12. 08 Nov, 2005 6 commits
  13. 07 Nov, 2005 2 commits
  14. 31 Oct, 2005 1 commit
  15. 28 Oct, 2005 1 commit
    • Al Viro's avatar
      [PATCH] gfp_t: fs/* · 27496a8c
      Al Viro authored
      
       - ->releasepage() annotated (s/int/gfp_t), instances updated
       - missing gfp_t in fs/* added
       - fixed misannotation from the original sweep caught by bitwise checks:
         XFS used __nocast both for gfp_t and for flags used by XFS allocator.
         The latter left with unsigned int __nocast; we might want to add a
         different type for those but for now let's leave them alone.  That,
         BTW, is a case when __nocast use had been actively confusing - it had
         been used in the same code for two different and similar types, with
         no way to catch misuses.  Switch of gfp_t to bitwise had caught that
         immediately...
      
      One tricky bit is left alone to be dealt with later - mapping->flags is
      a mix of gfp_t and error indications.  Left alone for now.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      27496a8c
  16. 10 Sep, 2005 1 commit
  17. 09 Sep, 2005 1 commit
    • Dipankar Sarma's avatar
      [PATCH] files: files struct with RCU · ab2af1f5
      Dipankar Sarma authored
      
      Patch to eliminate struct files_struct.file_lock spinlock on the reader side
      and use rcu refcounting rcuref_xxx api for the f_count refcounter.  The
      updates to the fdtable are done by allocating a new fdtable structure and
      setting files->fdt to point to the new structure.  The fdtable structure is
      protected by RCU thereby allowing lock-free lookup.  For fd arrays/sets that
      are vmalloced, we use keventd to free them since RCU callbacks can't sleep.  A
      global list of fdtable to be freed is not scalable, so we use a per-cpu list.
      If keventd is already handling the current cpu's work, we use a timer to defer
      queueing of that work.
      
      Since the last publication, this patch has been re-written to avoid using
      explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
      premitives instead.  This required that the fd information is kept in a
      separate structure (fdtable) and updated atomically.
      Signed-off-by: default avatarDipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ab2af1f5
  18. 07 Sep, 2005 4 commits
  19. 20 Aug, 2005 1 commit
    • Linus Torvalds's avatar
      Fix nasty ncpfs symlink handling bug. · cc314eef
      Linus Torvalds authored
      
      This bug could cause oopses and page state corruption, because ncpfs
      used the generic page-cache symlink handlign functions.  But those
      functions only work if the page cache is guaranteed to be "stable", ie a
      page that was installed when the symlink walk was started has to still
      be installed in the page cache at the end of the walk.
      
      We could have fixed ncpfs to not use the generic helper routines, but it
      is in many ways much cleaner to instead improve on the symlink walking
      helper routines so that they don't require that absolute stability.
      
      We do this by allowing "follow_link()" to return a error-pointer as a
      cookie, which is fed back to the cleanup "put_link()" routine.  This
      also simplifies NFS symlink handling.
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      cc314eef
  20. 27 Jul, 2005 1 commit
    • Peter Staubach's avatar
      [PATCH] stale POSIX lock handling · c293621b
      Peter Staubach authored
      
      I believe that there is a problem with the handling of POSIX locks, which
      the attached patch should address.
      
      The problem appears to be a race between fcntl(2) and close(2).  A
      multithreaded application could close a file descriptor at the same time as
      it is trying to acquire a lock using the same file descriptor.  I would
      suggest that that multithreaded application is not providing the proper
      synchronization for itself, but the OS should still behave correctly.
      
      SUS3 (Single UNIX Specification Version 3, read: POSIX) indicates that when
      a file descriptor is closed, that all POSIX locks on the file, owned by the
      process which closed the file descriptor, should be released.
      
      The trick here is when those locks are released.  The current code releases
      all locks which exist when close is processing, but any locks in progress
      are handled when the last reference to the open file is released.
      
      There are three cases to consider.
      
      One is the simple case, a multithreaded (mt) process has a file open and
      races to close it and acquire a lock on it.  In this case, the close will
      release one reference to the open file and when the fcntl is done, it will
      release the other reference.  For this situation, no locks should exist on
      the file when both the close and fcntl operations are done.  The current
      system will handle this case because the last reference to the open file is
      being released.
      
      The second case is when the mt process has dup(2)'d the file descriptor.
      The close will release one reference to the file and the fcntl, when done,
      will release another, but there will still be at least one more reference
      to the open file.  One could argue that the existence of a lock on the file
      after the close has completed is okay, because it was acquired after the
      close operation and there is still a way for the application to release the
      lock on the file, using an existing file descriptor.
      
      The third case is when the mt process has forked, after opening the file
      and either before or after becoming an mt process.  In this case, each
      process would hold a reference to the open file.  For each process, this
      degenerates to first case above.  However, the lock continues to exist
      until both processes have released their references to the open file.  This
      lock could block other lock requests.
      
      The changes to release the lock when the last reference to the open file
      aren't quite right because they would allow the lock to exist as long as
      there was a reference to the open file.  This is too long.
      
      The new proposed solution is to add support in the fcntl code path to
      detect a race with close and then to release the lock which was just
      acquired when such as race is detected.  This causes locks to be released
      in a timely fashion and for the system to conform to the POSIX semantic
      specification.
      
      This was tested by instrumenting a kernel to detect the handling locks and
      then running a program which generates case #3 above.  A dangling lock
      could be reliably generated.  When the changes to detect the close/fcntl
      race were added, a dangling lock could no longer be generated.
      
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c293621b