1. 24 Sep, 2009 4 commits
  2. 23 Sep, 2009 1 commit
    • fs: turn iprune_mutex into rwsem · 88e0fbc4
      Nick Piggin authored
      
      We have had a report of bad memory allocation latency during DVD-RAM (UDF)
      writing.  This is causing the user's desktop session to become unusable.
      
      Jan tracked the cause of this down to UDF inode reclaim blocking:
      
      gnome-screens D ffff810006d1d598     0 20686      1
       ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
       ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
       ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
      Call Trace:
       [<ffffffff804477f3>] io_schedule+0x63/0xa5
       [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
       [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
       [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
       [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
       [<ffffffff802c442a>] __bread+0x70/0x86
       [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
       [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
       [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
       [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
       [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
       [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
       [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
       [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
       [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
       [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
       [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
       [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
       [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
       [<ffffffff802822d8>] __do_fault+0x10c/0x417
       [<ffffffff80284232>] handle_mm_fault+0x466/0x940
       [<ffffffff8044b922>] do_page_fault+0x676/0xabf
      
      This blocks with iprune_mutex held, which then blocks other reclaimers:
      
      X             D ffff81009d47c400     0 17285  14831
       ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
       ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
       ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
      Call Trace:
       [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
       [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
       [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
       [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
       [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
       [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
       [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
       [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
       [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
       [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
       [<ffffffff802ad949>] do_select+0x308/0x506
       [<ffffffff802adced>] core_sys_select+0x1a6/0x254
       [<ffffffff802ae0b7>] sys_select+0xb5/0x157
      
Now I think the main problem is having the filesystem block (and do IO) in
inode reclaim.  The trouble is that this work isn't accounted to whoever
generated it, so a random allocator gets penalized with a big latency
spike caused by work generated elsewhere.
      
I think the best idea would be to avoid this: by design if possible, or
by deferring the hard work to an asynchronous context.  In the latter
case, the fs would probably want to throttle creation of new work based
on the queue size of the deferred work, but let's not get into those
details.
      
Anyway, the other obvious thing we looked at is the iprune_mutex, which
is causing the cascading blocking.  We could turn this into an rwsem to
improve concurrency.  It is unreasonable to totally ban all potentially
slow or blocking operations in inode reclaim, so I think this is a cheap
way to get a small improvement.
      
      This doesn't solve the whole problem of course.  The process doing inode
      reclaim will still take the latency hit, and concurrent processes may end
      up contending on filesystem locks.  So fs developers should keep these
      problems in mind.
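
Schematically, the conversion looks like the sketch below (a minimal
sketch assuming the standard kernel rwsem API; the real call sites in
fs/inode.c are simplified here):

	/* was: static DEFINE_MUTEX(iprune_mutex); */
	static DECLARE_RWSEM(iprune_sem);

	static void prune_icache(int nr_to_scan)
	{
		/* reclaimers may now prune concurrently with each other */
		down_read(&iprune_sem);
		/* ... scan inode_unused and dispose_list() the victims ... */
		up_read(&iprune_sem);
	}

	static void exclude_all_pruners(void)	/* hypothetical caller */
	{
		/* paths that must exclude every pruner (e.g. umount-time
		 * invalidation) take the sem for writing instead */
		down_write(&iprune_sem);
		/* ... */
		up_write(&iprune_sem);
	}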
Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 22 Sep, 2009 2 commits
  4. 16 Sep, 2009 1 commit
  5. 07 Aug, 2009 2 commits
    • vfs: add __destroy_inode · 2e00c97e
      Christoph Hellwig authored
      
When we want to tear down an inode that lost the add-to-cache race in
XFS, we must not call into ->destroy_inode, because that would delete the
inode that won the race from the inode cache radix tree.
      
This patch provides the __destroy_inode helper needed to fix this;
the actual fix will be in the next patch.  As XFS was the only reason
destroy_inode was exported, we shift the export to the new __destroy_inode.
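
A rough sketch of the resulting split (teardown details elided; an
outline of the helper's shape rather than the exact code):

	void __destroy_inode(struct inode *inode)
	{
		/* generic teardown only: no ->destroy_inode callback, so a
		 * per-fs index such as the XFS radix tree is never touched */
		security_inode_free(inode);
	}
	EXPORT_SYMBOL(__destroy_inode);

	void destroy_inode(struct inode *inode)
	{
		__destroy_inode(inode);
		if (inode->i_sb->s_op->destroy_inode)
			inode->i_sb->s_op->destroy_inode(inode);
		else
			kmem_cache_free(inode_cachep, inode);
	}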
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
    • vfs: fix inode_init_always calling convention · 54e34621
      Christoph Hellwig authored
      
Currently inode_init_always calls into ->destroy_inode if the additional
initialization fails.  That's not only counter-intuitive, because
inode_init_always did not allocate the inode structure, but in the case
of XFS it's actively harmful, as ->destroy_inode might delete the inode
from a radix tree to which it has never been added.  This in turn might
end up deleting the inode for the same inum that has been instantiated
by another process and cause lots of subtle problems.
      
Also, in the case of re-initializing a reclaimable inode in XFS, it would
free an inode we still want to keep alive.
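
With the fixed convention, inode_init_always() reports failure instead of
tearing anything down, and the caller that actually allocated the inode
undoes the allocation.  A sketch, assuming an error-returning
inode_init_always() (the patch's exact signature may differ):

	static struct inode *alloc_inode(struct super_block *sb)
	{
		struct inode *inode;

		if (sb->s_op->alloc_inode)
			inode = sb->s_op->alloc_inode(sb);
		else
			inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
		if (!inode)
			return NULL;

		if (unlikely(inode_init_always(sb, inode))) {
			/* alloc_inode allocated it, so alloc_inode frees it;
			 * callers like XFS that re-initialize an existing
			 * inode handle the error themselves instead */
			if (sb->s_op->destroy_inode)
				sb->s_op->destroy_inode(inode);
			else
				kmem_cache_free(inode_cachep, inode);
			return NULL;
		}
		return inode;
	}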
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
  6. 24 Jun, 2009 1 commit
  7. 22 Jun, 2009 1 commit
    • vfs: Set special lockdep map for dirs only if not set by fs · 9a7aa12f
      Jan Kara authored
      
Some filesystems need to set the lockdep map for i_mutex differently for
different directories.  For example, OCFS2 has system directories (for
orphan inode tracking and for gathering all system files like journal
or quota files into a single place) which have different locking
rules than standard directories.  For a filesystem, setting the lockdep
map is naturally done when the inode is read, but we have to modify
unlock_new_inode() not to overwrite the lockdep map the filesystem
has set.
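
The check in unlock_new_inode() then becomes conditional, along these
lines (a sketch; the exact condition and the mutex re-initialization
details may differ from the patch):

	#ifdef CONFIG_DEBUG_LOCK_ALLOC
	if (S_ISDIR(inode->i_mode)) {
		struct file_system_type *type = inode->i_sb->s_type;

		/* switch to the directory key only if i_mutex still carries
		 * the default key from inode initialization, i.e. the
		 * filesystem has not installed its own class */
		if (lockdep_match_class(&inode->i_mutex, &type->i_mutex_key))
			lockdep_set_class(&inode->i_mutex,
					  &type->i_mutex_dir_key);
	}
	#endif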
      
      Acked-by: peterz@infradead.org
      CC: mingo@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
  8. 12 Jun, 2009 2 commits
    • trivial: fs/inode: Fix typo in file_update_time nanodoc · 2eadfc0e
      Wolfram Sang authored
      
      The advertised flag for not updating the time was wrong.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
    • fs: introduce mnt_clone_write · 96029c4e
      npiggin@suse.de authored
      
This patch speeds up the lmbench lat_mmap test by about another 2% after
the first patch.
      
      Before:
       avg = 462.286
       std = 5.46106
      
      After:
       avg = 453.12
       std = 9.58257
      
      (50 runs of each, stddev gives a reasonable confidence)
      
      It does this by introducing mnt_clone_write, which avoids some heavyweight
      operations of mnt_want_write if called on a vfsmount which we know already
      has a write count; and mnt_want_write_file, which can call mnt_clone_write
      if the file is open for write.
      
      After these two patches, mnt_want_write and mnt_drop_write go from 7% on
      the profile down to 1.3% (including mnt_clone_write).
      
      [AV: mnt_want_write_file() should take file alone and derive mnt from it;
      not only all callers have that form, but that's the only mnt about which
      we know that it's already held for write if file is opened for write]
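
A sketch of the file-based helper as described (simplified; per the text
above, mnt_clone_write() only bumps the write count and skips the
heavyweight parts of mnt_want_write()):

	int mnt_want_write_file(struct file *file)
	{
		/* a file opened for write implies the mount already holds
		 * a write count, so the cheap clone path is safe */
		if (file->f_mode & FMODE_WRITE)
			return mnt_clone_write(file->f_path.mnt);
		return mnt_want_write(file->f_path.mnt);
	}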
      
      Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 11 Jun, 2009 2 commits
  10. 06 Jun, 2009 2 commits
  11. 09 May, 2009 1 commit
  12. 15 Apr, 2009 1 commit
    • splice: add helpers for locking pipe inode · 61e0d47c
      Miklos Szeredi authored
      
      There are lots of sequences like this, especially in splice code:
      
      	if (pipe->inode)
      		mutex_lock(&pipe->inode->i_mutex);
      	/* do something */
      	if (pipe->inode)
      		mutex_unlock(&pipe->inode->i_mutex);
      
      so introduce helpers which do the conditional locking and unlocking.
      Also replace the inode_double_lock() call with a pipe_double_lock()
      helper to avoid spreading the use of this functionality beyond the
      pipe code.
      
      This patch is just a cleanup, and should cause no behavioral changes.
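
The helpers are essentially the conditional pattern above factored out
(sketch):

	void pipe_lock(struct pipe_inode_info *pipe)
	{
		/* pipe->inode is NULL for pipes without a backing inode */
		if (pipe->inode)
			mutex_lock(&pipe->inode->i_mutex);
	}

	void pipe_unlock(struct pipe_inode_info *pipe)
	{
		if (pipe->inode)
			mutex_unlock(&pipe->inode->i_mutex);
	}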
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  13. 27 Mar, 2009 1 commit
    • fs: avoid I_NEW inodes · aabb8fdb
      Nick Piggin authored
      
To be on the safe side, it should be less fragile to exclude I_NEW inodes
from inode list scans by default (unless there is an important reason to
include them).

Normally they will get excluded anyway (eg. by zero refcount or
writecount, etc), however it is a bit fragile for list walkers to have to
know exactly which parts of the inode state are set up and valid to test
when in I_NEW.  So along these lines, move the I_NEW checks upward as
well (sometimes taking I_FREEING etc checks with them too -- this
shouldn't be a problem, should it?)
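
The resulting pattern in the list walkers looks roughly like this
(illustrative only):

	list_for_each_entry(inode, &inode_in_use, i_list) {
		/* skip inodes whose fields are not yet fully set up, and
		 * those already on their way out */
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
			continue;
		/* ... the rest of the inode is now safe to inspect ... */
	}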
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 26 Mar, 2009 2 commits
  15. 12 Mar, 2009 1 commit
    • fs: new inode i_state corruption fix · 7ef0d737
      Nick Piggin authored

There was a report of a data corruption at
http://lkml.org/lkml/2008/11/14/121.  There is a script included there
to reproduce the problem.
      
During testing, I encountered a number of strange things with ext3, so I
tried ext2 to attempt to reduce the complexity of the problem.  I found
that fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to
be cleared, even though instrumentation showed that unlock_new_inode had
already been called for that inode.  This points to a memory scribble or
a synchronisation problem.
      
      i_state of I_NEW inodes is not protected by inode_lock because other
      processes are not supposed to touch them until I_LOCK (and I_NEW) is
      cleared.  Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
      i_state revealed that generic_sync_sb_inodes is picking up new inodes from
      the inode lists and passing them to __writeback_single_inode without
      waiting for I_NEW.  Subsequently modifying i_state causes corruption.  In
      my case it would look like this:
      
      CPU0                            CPU1
      unlock_new_inode()              __sync_single_inode()
       reg <- inode->i_state
       reg -> reg & ~(I_LOCK|I_NEW)   reg <- inode->i_state
       reg -> inode->i_state          reg -> reg | I_SYNC
                                      reg -> inode->i_state
      
      Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.
      
The fix for this is, rather than waiting for I_NEW inodes, to just skip
over them: inodes concurrently being created are not subject to data
integrity operations, and should not significantly contribute to dirty
memory either.
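
In the writeback loop this amounts to something like the following
(paraphrased; assuming the loop's existing requeue step for inodes it
cannot handle yet):

	if (inode->i_state & I_NEW) {
		/* being created concurrently: not subject to data
		 * integrity writeback, so skip it rather than wait */
		requeue_io(inode);
		continue;
	}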
      
After this change, I'm unable to reproduce any of the added warnings or
hangs after ~1 hour of running.  Previously, the new warnings would start
immediately and a hang would happen in under 5 minutes.
      
      I'm also testing on ext3 now, and so far no problems there either.  I
      don't know whether this fixes the problem reported above, but it fixes a
      real problem for me.
      
      Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 05 Feb, 2009 1 commit
  17. 09 Jan, 2009 1 commit
  18. 07 Jan, 2009 1 commit
  19. 06 Jan, 2009 2 commits
  20. 05 Jan, 2009 1 commit
  21. 31 Dec, 2008 1 commit
    • nfsd/create race fixes, infrastructure · 261bca86
      Al Viro authored
      
New helpers: insert_inode_locked() and insert_inode_locked4().
They hash a new inode, making sure that there's no such inode in the
icache already.  If there is one and it does not end up unhashed (as
would happen if we have nfsd trying to resolve a bogus fhandle), fail.
Otherwise insert our inode into the hash and succeed.

In either case, i_state is left set to new+locked; cleanup ends up
being simpler with such calling conventions.
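
Typical use in a filesystem's create path would look something like this
(a sketch; the error value and the label are filesystem-specific and
hypothetical):

	inode->i_ino = ino;
	if (insert_inode_locked(inode) < 0) {
		/* a live inode with this number is already hashed, so
		 * back out of the creation instead of duplicating it */
		err = -EIO;
		goto fail_put_inode;
	}
	/* ... finish setup; i_state is new+locked throughout ... */
	unlock_new_inode(inode);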
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  22. 10 Nov, 2008 1 commit
  23. 30 Oct, 2008 3 commits
  24. 15 Aug, 2008 1 commit
    • fs/inode.c: properly init address_space->writeback_index · 7d455e00
      Chris Mason authored
      
write_cache_pages() uses i_mapping->writeback_index to pick up where it
left off the last time a given inode was found by pdflush or
balance_dirty_pages (or anyone else who sets wbc->range_cyclic).
      
alloc_inode() should set it to a sane value so that writeback doesn't
start in the middle of a file.  It is somewhat difficult to notice the
bug, since write_cache_pages will loop around to the start of the file
and the elevator helps hide the resulting seeks.
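
The fix itself is a one-line initialization during inode allocation,
roughly:

	/* in alloc_inode(), with mapping = &inode->i_data: */
	mapping->writeback_index = 0;	/* start range_cyclic writeback at
					 * the head of the file rather than
					 * at a stale, uninitialized offset */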
      
For whatever reason, Btrfs hits this often.  Unpatched, untarring 30
copies of the Linux kernel in series runs at 47MB/s on a single SATA
drive.  With this fix, it jumps to 62MB/s.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 26 Jul, 2008 2 commits
  26. 06 May, 2008 2 commits