1. 25 May, 2011 3 commits
  2. 19 May, 2011 1 commit
  3. 10 May, 2011 1 commit
  4. 31 Mar, 2011 1 commit
  5. 25 Mar, 2011 1 commit
    • xfs: introduce inode cluster buffer trylocks for xfs_iflush · 1bfd8d04
      Dave Chinner authored
      
      There is an ABBA deadlock between synchronous inode flushing in
      xfs_reclaim_inode and xfs_ifree_cluster. xfs_ifree_cluster locks the
      buffer, then takes inode ilocks, whilst synchronous reclaim takes
      the ilock followed by the buffer lock in xfs_iflush().
      
      To avoid this deadlock, separate the inode cluster buffer locking
      semantics from the synchronous inode flush semantics, allowing
      callers to attempt to lock the buffer but still issue synchronous IO
      if they can get it. This requires xfs_iflush() calls that currently
      use non-blocking semantics to pass SYNC_TRYLOCK rather than 0 as the
      flags parameter.
      
      This allows xfs_reclaim_inode to avoid the deadlock on the buffer
      lock and detect the failure so that it can drop the inode ilock and
      restart the reclaim attempt on the inode. This allows
      xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
      release it and hence defuse the deadlock situation. It also has the
      pleasant side effect of avoiding IO in xfs_reclaim_inode when it
      tries to next reclaim the inode as it is now marked stale.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      1bfd8d04
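
      The locking pattern behind this fix can be modelled outside the
      kernel. Below is a minimal userspace sketch of the trylock back-off
      using POSIX threads; it is not the XFS code, and every name in it is
      illustrative:

      /* Reclaim takes the inode lock then *tries* the buffer lock,
       * backing off instead of blocking, which breaks the ABBA cycle
       * with the cluster-free path (buffer lock -> inode lock). */
      #include <pthread.h>
      #include <stdbool.h>

      static pthread_mutex_t inode_ilock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t cluster_buf_lock = PTHREAD_MUTEX_INITIALIZER;

      /* reclaim path: ilock first, then try the buffer lock */
      bool reclaim_inode_sync(void)
      {
          pthread_mutex_lock(&inode_ilock);
          if (pthread_mutex_trylock(&cluster_buf_lock) != 0) {
              /* buffer held by the cluster-free path: drop the ilock
               * and let the caller restart instead of deadlocking */
              pthread_mutex_unlock(&inode_ilock);
              return false;
          }
          /* ... issue the synchronous flush IO here ... */
          pthread_mutex_unlock(&cluster_buf_lock);
          pthread_mutex_unlock(&inode_ilock);
          return true;
      }

      /* cluster-free path: buffer lock first, then the ilock */
      void free_inode_cluster(void)
      {
          pthread_mutex_lock(&cluster_buf_lock);
          pthread_mutex_lock(&inode_ilock);
          /* ... mark the inode stale ... */
          pthread_mutex_unlock(&inode_ilock);
          pthread_mutex_unlock(&cluster_buf_lock);
      }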
  6. 06 Mar, 2011 4 commits
  7. 23 Feb, 2011 1 commit
    • xfs: more sensible inode refcounting for ialloc · ec3ba85f
      Christoph Hellwig authored
      
      Currently we return inodes from xfs_ialloc with just a single reference
      held. But we need two references, as one is dropped during transaction
      commit and the second needs to be transferred to the VFS. Change
      xfs_ialloc to use xfs_iget plus xfs_trans_ijoin_ref to grab two
      references to the inode, and remove the now superfluous IHOLD calls
      from all callers. This also greatly simplifies the error handling in
      xfs_create and allows us to remove xfs_trans_iget, as no other callers
      are left.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      ec3ba85f
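
      The two-reference scheme can be illustrated with a plain refcount.
      This is a hedged model of the idea, not the XFS API, and every name
      below is hypothetical:

      #include <stdatomic.h>
      #include <stdlib.h>

      struct obj {
          atomic_int refcount;
      };

      /* allocation hands back an object holding two references: one owned
       * by the transaction (dropped at commit), one for the eventual owner
       * (the VFS, in the commit above) */
      struct obj *obj_alloc_for_trans(void)
      {
          struct obj *o = malloc(sizeof(*o));
          if (o)
              atomic_init(&o->refcount, 2);
          return o;
      }

      void obj_put(struct obj *o)
      {
          if (atomic_fetch_sub(&o->refcount, 1) == 1)
              free(o);
      }

      /* commit drops the transaction's reference; the caller keeps the
       * other, so no separate "hold" call is needed after allocation */
      void trans_commit(struct obj *o)
      {
          obj_put(o);
      }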
  8. 17 Dec, 2010 1 commit
    • xfs: convert inode cache lookups to use RCU locking · 1a3e8f3d
      Dave Chinner authored
      
      With delayed logging greatly increasing the sustained parallelism of inode
      operations, the inode cache locking is showing significant read vs write
      contention when inode reclaim runs at the same time as lookups. There is
      also a lot more write lock acquisitions than there are read locks (a
      4:1 ratio), so the read locking is not really buying us much in the
      way of parallelism.
      
      To avoid the read vs write contention, change the cache to use RCU
      locking on the read side. To avoid needing to RCU-free every single
      inode, use the built-in slab RCU freeing mechanism. This requires us
      to be able to detect lookups of freed inodes, so ensure that every
      freed inode has an inode number of zero and the XFS_IRECLAIM flag
      set. We already check the XFS_IRECLAIM flag in the cache hit lookup
      path, but also add a check for a zero inode number as well.
      
      We can then convert all the read-locking lookups to use RCU read-side
      locking and hence remove all read-side locking.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      1a3e8f3d
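
      The lookup rule this commit describes can be sketched with userspace
      RCU (liburcu) for illustration. The tree, flag, and structure below
      are hypothetical stand-ins, not the XFS types:

      #include <urcu.h>          /* userspace RCU; reader threads must call
                                    rcu_register_thread() beforehand */
      #include <stdbool.h>
      #include <stdint.h>

      #define FLAG_RECLAIM 0x1   /* hypothetical "being torn down" flag */

      struct cached_inode {
          uint64_t ino;          /* zeroed when the inode is freed */
          unsigned int flags;
      };

      /* hypothetical index lookup; may return a node that is mid-reuse */
      extern struct cached_inode *tree_lookup(uint64_t ino);

      bool lookup_inode(uint64_t ino, struct cached_inode **out)
      {
          bool ok = false;

          rcu_read_lock();
          struct cached_inode *ip = tree_lookup(ino);
          /* slab memory may have been freed and reused under us, so
           * confirm identity: right inode number, not being reclaimed */
          if (ip && ip->ino == ino && !(ip->flags & FLAG_RECLAIM)) {
              *out = ip;
              ok = true;
          }
          rcu_read_unlock();
          return ok;
      }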
  9. 02 Dec, 2010 1 commit
  10. 18 Oct, 2010 3 commits
  11. 24 Aug, 2010 1 commit
    • xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE · 5b3eed75
      Dave Chinner authored
      
      Under heavy parallel metadata loads (e.g. dbench), we can fail to
      mark all the inodes in a cluster being freed as XFS_ISTALE, as we
      skip inodes we cannot get the XFS_ILOCK_EXCL or the flush lock on.
      When this happens and the inode cluster buffer has already been
      marked stale and freed, inode reclaim can try to write the inode out
      because it is dirty and not marked stale. This can result in writing
      the metadata to a freed extent, or, where the extent has already been
      overwritten, it can trigger a magic number check failure and return
      an EUCLEAN error such as:
      
      Filesystem "ram0": inode 0x442ba1 background reclaim flush failed with 117
      
      Fix this by ensuring that we hoover up all in memory inodes in the
      cluster and mark them XFS_ISTALE when freeing the cluster.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      5b3eed75
  12. 26 Jul, 2010 9 commits
  13. 24 Jun, 2010 2 commits
  14. 03 Jun, 2010 1 commit
    • xfs: fix race in inode cluster freeing failing to stale inodes · 5b257b4a
      Dave Chinner authored
      
      When an inode cluster is freed, it needs to mark all inodes in memory as
      XFS_ISTALE before marking the buffer as stale. This is needed because the inodes
      have a different life cycle to the buffer, and once the buffer is torn down
      during transaction completion, we must ensure none of the inodes get written
      back (which is what XFS_ISTALE does).
      
      Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being
      marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these
      inodes either during inode reclaim or tail pushing on the AIL.  The buffer is
      read back, but no longer contains inodes and so triggers assert failures and
      shutdowns. This was reproducible with a run.dbench10 invocation from xfstests.
      
      There are two main causes of xfs_ifree_cluster() failing. The first
      is simple: it checks in-memory inodes it finds in the per-AG inode
      cache to see if they are clean without holding the flush lock. If
      they are clean, it skips them completely. However, if an inode has
      been flushed delwri, it will appear clean but is not guaranteed to be
      written back until the flush lock has been dropped. Hence we may have
      raced on the clean check, and the inode may actually be dirty. Hence,
      always mark inodes found in memory stale before properly checking
      whether they are clean.
      
      The second is more complex, and makes the first problem easier to
      hit. Basically, the in-memory inode scan is done with full knowledge
      that it can race with inode flushing and AIL tail pushing, which
      means that inodes it cannot get the flush lock on might no longer be
      attached to the buffer after the in-memory inode scan, due to IO
      completion occurring. This is actually documented in the code as
      "needs better interlocking"; i.e. this is a zero-day bug.
      
      Effectively, the in-memory scan must be done while the inode buffer is locked
      and IO cannot be issued on it while we do the in-memory inode scan. This
      ensures that inodes we couldn't get the flush lock on are guaranteed to be
      attached to the cluster buffer, so we can then catch all in-memory inodes and
      mark them stale.
      
      Now that the inode cluster buffer is locked before the in-memory scan is done,
      there is no need for the two-phase update of the in-memory inodes, so simplify
      the code into two loops and remove the allocation of the temporary buffer used
      to hold locked inodes across the phases.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      5b257b4a
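
      Structurally, the fixed ordering can be sketched as follows. The
      helpers are hypothetical and the real xfs_ifree_cluster does
      considerably more (it also walks the buffer's log items), but the
      lock-then-sweep shape is the point:

      #include <stddef.h>

      struct buf;
      struct inode_mem;

      extern struct buf *read_and_lock_cluster_buf(unsigned long cluster);
      extern struct inode_mem *next_cached_inode(unsigned long cluster,
                                                 size_t *pos);
      extern void mark_inode_stale(struct inode_mem *ip);
      extern void mark_buf_stale_and_unlock(struct buf *bp);

      void free_inode_cluster(unsigned long cluster)
      {
          /* 1: lock the buffer first, so IO completion cannot detach
           *    inodes from it while we scan */
          struct buf *bp = read_and_lock_cluster_buf(cluster);

          /* 2: sweep the in-memory inodes, marking each stale *before*
           *    any cleanliness check, so a racing delwri flush cannot
           *    write one back to the freed extent */
          size_t pos = 0;
          struct inode_mem *ip;
          while ((ip = next_cached_inode(cluster, &pos)) != NULL)
              mark_inode_stale(ip);

          /* 3: stale the buffer itself; the transaction tears it down */
          mark_buf_stale_and_unlock(bp);
      }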
  15. 28 May, 2010 1 commit
    • xfs: fix access to upper inodes without inode64 · fb3b504a
      Christoph Hellwig authored
      
      If a filesystem is mounted without the inode64 mount option, we
      should still be able to access inodes not fitting into 32 bits, just
      not create new ones. For this to work we need to make sure the inode
      cache radix tree is initialized for all allocation groups, not just
      those we plan to allocate inodes from. This patch makes sure we
      initialize the inode cache radix tree for all allocation groups, and
      also cleans up xfs_initialize_perag a bit to separate the inode32
      logic from the general perag structure setup.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      fb3b504a
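
      The setup split can be sketched like this; the types and the function
      are hypothetical, mirroring the idea rather than the actual
      xfs_initialize_perag:

      #include <stdbool.h>
      #include <stdlib.h>

      struct perag {
          void *icache_root;       /* inode cache index for this AG */
          bool  alloc_inodes;      /* may new inodes be placed here? */
      };

      struct perag *init_perag(unsigned agcount, unsigned max_inode32_ag,
                               bool inode64)
      {
          struct perag *pags = calloc(agcount, sizeof(*pags));
          if (!pags)
              return NULL;

          for (unsigned agno = 0; agno < agcount; agno++) {
              /* the cache index exists for every AG, so lookups of
               * pre-existing high inodes always have somewhere to go... */
              pags[agno].icache_root = NULL;   /* empty, ready for inserts */
              /* ...but without inode64, only low AGs get new inodes */
              pags[agno].alloc_inodes = inode64 || agno <= max_inode32_ag;
          }
          return pags;
      }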
  16. 19 May, 2010 1 commit
  17. 01 Mar, 2010 2 commits
  18. 06 Feb, 2010 2 commits
    • xfs: Use delayed write for inodes rather than async V2 · c854363e
      Dave Chinner authored
      
      We currently do background inode flushing asynchronously, resulting
      in inodes being written in whatever order the background writeback
      issues them. Not only that, there are also blocking and non-blocking
      asynchronous inode flushes, depending on where the flush comes from.
      
      This patch completely removes asynchronous inode writeback. It
      removes all the strange writeback modes and replaces them with
      either a synchronous flush or a non-blocking delayed write flush.
      That is, inode flushes will only issue IO directly if they are
      synchronous, and background flushing may do nothing if the operation
      would block (e.g. on a pinned inode or buffer lock).
      
      Delayed write flushes will now result in the inode buffer sitting in
      the delwri queue of the buffer cache to be flushed by either an AIL
      push or by the xfsbufd timing out the buffer. This will allow
      accumulation of dirty inode buffers in memory and allow optimisation
      of inode cluster writeback at the xfsbufd level where we have much
      greater queue depths than the block layer elevators. We will also
      get adjacent inode cluster buffer IO merging for free when a later
      patch in the series allows sorting of the delayed write buffers
      before dispatch.
      
      This effectively means that any inode that is written back by
      background writeback will be seen as flush locked during AIL
      pushing, and will result in the buffers being pushed from there.
      This writeback path is currently non-optimal, but the next patch
      in the series will fix that problem.
      
      A side effect of this delayed write mechanism is that background
      inode reclaim will no longer directly flush inodes, nor can it wait
      on the flush lock. The result is that inode reclaim must leave the
      inode in the reclaimable state until it is clean. Hence attempts to
      reclaim a dirty inode in the background will simply skip the inode
      until it is clean and this allows other mechanisms (i.e. xfsbufd) to
      do more optimal writeback of the dirty buffers. As a result, the
      inode reclaim code has been rewritten so that it no longer relies on
      the ambiguous return values of xfs_iflush() to determine whether it
      is safe to reclaim an inode.
      
      Portions of this patch are derived from patches by Christoph
      Hellwig.
      
      Version 2:
      - cleanup reclaim code as suggested by Christoph
      - log background reclaim inode flush errors
      - just pass sync flags to xfs_iflush
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c854363e
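
      The resulting two flush modes can be sketched as follows. The flag
      name echoes the XFS sync flags, but the helpers below are
      hypothetical:

      #include <stdbool.h>
      #include <stddef.h>

      #define SYNC_WAIT 0x1

      struct inode_mem;
      struct buf;

      extern bool inode_pinned(struct inode_mem *ip);
      extern struct buf *lock_cluster_buf(struct inode_mem *ip);    /* blocks */
      extern struct buf *trylock_cluster_buf(struct inode_mem *ip); /* or NULL */
      extern void write_buf_sync(struct buf *bp);
      extern void queue_buf_delwri(struct buf *bp);

      int flush_inode(struct inode_mem *ip, int flags)
      {
          struct buf *bp;

          if (!(flags & SYNC_WAIT)) {
              /* background flush: never block; skip pinned inodes and
               * contended buffers, leaving the inode for a later pass */
              if (inode_pinned(ip) || !(bp = trylock_cluster_buf(ip)))
                  return -1;
              queue_buf_delwri(bp);  /* AIL push or xfsbufd writes it later */
              return 0;
          }

          /* synchronous flush: blocking is fine, issue the IO directly */
          bp = lock_cluster_buf(ip);
          write_buf_sync(bp);
          return 0;
      }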
    • xfs: Make inode reclaim states explicit · 777df5af
      Dave Chinner authored
      
      A.K.A.: don't rely on xfs_iflush() return value in reclaim
      
      We have gradually been moving checks out of the reclaim code because
      they are duplicated in xfs_iflush(). We've had a history of problems
      in this area, and many of them stem from the overloading of the
      return values from xfs_iflush() and interaction with inode flush
      locking to determine if the inode is safe to reclaim.
      
      With the desire to move to delayed write flushing of inodes and
      non-blocking inode tree reclaim walks, the overloading of the
      return value of xfs_iflush makes it very difficult to determine
      the correct thing to do next.
      
      This patch explicitly re-adds the checks to the inode reclaim code,
      removing the reliance on the return value of xfs_iflush() to
      determine what to do next. It also means that we can clearly
      document all the inode states that reclaim must handle and hence
      we can easily see that we handled all the necessary cases.
      
      This also removes the need for the xfs_inode_clean() check in
      xfs_iflush() as all callers now check this first (safely).
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      777df5af
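
      The explicit state checks can be sketched as a simple decision
      ladder; the predicates below are hypothetical stand-ins for the
      inode state tests the reclaim code performs:

      #include <stdbool.h>

      struct inode_mem;

      extern bool try_flush_lock(struct inode_mem *ip);
      extern void drop_flush_lock(struct inode_mem *ip);
      extern bool inode_is_stale(struct inode_mem *ip);
      extern bool inode_is_clean(struct inode_mem *ip);
      extern bool inode_is_pinned(struct inode_mem *ip);
      extern void flush_inode_now(struct inode_mem *ip); /* writes, unlocks */
      extern void reclaim_now(struct inode_mem *ip);     /* frees the inode */

      void reclaim_inode(struct inode_mem *ip, bool sync)
      {
          if (!try_flush_lock(ip))
              return;                  /* being flushed elsewhere: skip */
          if (inode_is_stale(ip) || inode_is_clean(ip)) {
              drop_flush_lock(ip);
              reclaim_now(ip);         /* nothing to write back */
              return;
          }
          if (inode_is_pinned(ip) && !sync) {
              drop_flush_lock(ip);     /* background pass: revisit later */
              return;
          }
          flush_inode_now(ip);         /* dirty: write; reclaim once clean */
      }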
  19. 21 Jan, 2010 2 commits
  20. 15 Jan, 2010 2 commits