1. 04 Nov, 2009 1 commit
  2. 14 Sep, 2009 1 commit
    • vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode · 148f948b
      Jan Kara authored
      
      Introduce a new function for generic inode syncing (vfs_fsync_range) and use
      it from the fsync() path. Also introduce a new helper for syncing after a sync
      write (generic_write_sync), built on the generic function.
      
      Use these new helpers for syncing from generic VFS functions. This makes
      O_SYNC writes to block devices acquire i_mutex for syncing. If we really
      care about this, we can make block_fsync() drop the i_mutex and reacquire
      it before it returns.
      
      CC: Evgeniy Polyakov <zbr@ioremap.net>
      CC: ocfs2-devel@oss.oracle.com
      CC: Joel Becker <joel.becker@oracle.com>
      CC: Felix Blyakher <felixb@sgi.com>
      CC: xfs@oss.sgi.com
      CC: Anton Altaparmakov <aia21@cantab.net>
      CC: linux-ntfs-dev@lists.sourceforge.net
      CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      CC: linux-ext4@vger.kernel.org
      CC: tytso@mit.edu
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
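The helper pair described in this commit can be pictured with a small userspace model. This is only a sketch under assumptions: `model_file`, `model_fsync_range` and `model_write_sync` are hypothetical names invented here, not the kernel's `struct file`, `vfs_fsync_range()` or `generic_write_sync()`, and the inclusive end offset `pos + count - 1` is an assumption about how the written range would be passed on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a file; not the kernel's struct file. */
struct model_file {
	bool o_sync;          /* models O_SYNC being set on the open file */
	bool inode_sync;      /* models IS_SYNC(inode) being true */
	long long last_start; /* first byte of the last range synced */
	long long last_end;   /* last byte (inclusive) of that range */
	int sync_calls;       /* how many times a sync was issued */
};

/* Models a ranged fsync: records the byte range it was asked to flush. */
int model_fsync_range(struct model_file *f, long long start, long long end)
{
	assert(start <= end);
	f->last_start = start;
	f->last_end = end;
	f->sync_calls++;
	return 0;
}

/* Models the post-write helper: sync only if sync semantics were requested. */
int model_write_sync(struct model_file *f, long long pos, long long count)
{
	if (count <= 0)
		return 0; /* nothing was written, nothing to flush */
	if (!f->o_sync && !f->inode_sync)
		return 0; /* ordinary write: leave the data to normal writeback */
	return model_fsync_range(f, pos, pos + count - 1);
}
```

A caller would invoke `model_write_sync()` right after a successful write, passing the starting offset and the number of bytes written, so that every write path shares one place that decides whether to sync.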
  3. 11 Sep, 2009 1 commit
  4. 19 May, 2009 1 commit
    • splice: fix kmaps in default_file_splice_write() · b2858d7d
      Miklos Szeredi authored
      
      Unfortunately, multiple kmap() calls within a single thread can deadlock,
      so writing out multiple buffers with writev() isn't possible.
      
      Change the implementation so that it does a separate write() for each
      buffer.  This actually simplifies the code a lot since the
      splice_from_pipe() helper can be used.
      
      This limitation is caused by HIGHMEM pages, and so only affects a
      subset of architectures and configurations.  In the future it may be
      worth implementing default_file_splice_write() in a more efficient way
      on configs that allow it.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
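The change from one writev() over all buffers to a separate write() per buffer can be illustrated from userspace. This is a minimal sketch, not the kernel code: `write_all` and `write_buffers_one_by_one` are hypothetical helper names, and the kmap() constraint itself does not exist in userspace, so only the shape of the loop carries over.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write every byte of one buffer, retrying on short writes and EINTR. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Issue one write per buffer instead of a single writev() over all of them. */
ssize_t write_buffers_one_by_one(int fd, const struct iovec *iov, int iovcnt)
{
	ssize_t total = 0;

	assert(iov != NULL && iovcnt >= 0);
	for (int i = 0; i < iovcnt; i++) {
		ssize_t n = write_all(fd, iov[i].iov_base, iov[i].iov_len);
		if (n < 0)
			return -1;
		total += n;
	}
	return total;
}
```

Only one buffer is in flight at any time, which is the property the commit relies on; the cost is one write call per buffer rather than one call for the whole vector.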
  5. 14 May, 2009 1 commit
  6. 13 May, 2009 1 commit
  7. 11 May, 2009 3 commits
    • splice: implement default splice_write method · 0b0a47f5
      Miklos Szeredi authored
      
      If f_op->splice_write() is not implemented, fall back to a plain write.
      Use vfs_writev() to write from the pipe buffers.
      
      This will allow splice on all filesystems and file types.  This
      includes "direct_io" files in fuse which bypass the page cache.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • splice: implement default splice_read method · 6818173b
      Miklos Szeredi authored
      
      If f_op->splice_read() is not implemented, fall back to a plain read.
      Use vfs_readv() to read into previously allocated pages.
      
      This will allow splice and functions using splice, such as the loop
      device, to work on all filesystems.  This includes "direct_io" files
      in fuse which bypass the page cache.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
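The fallback in the two commits above follows a common pattern: call the specialised method if the file provides one, otherwise route through the plain one. A minimal userspace model of the write side, assuming hypothetical names throughout (`model_ops`, `model_file` and the three functions are invented for illustration and are not the kernel's `struct file_operations` or its splice entry points):

```c
#include <assert.h>
#include <stddef.h>

struct model_file;

/* Hypothetical method table; the real one is struct file_operations. */
struct model_ops {
	/* Optional fast path; NULL when the file type does not provide it. */
	long (*splice_write)(struct model_file *f, const char *buf, size_t len);
	/* Plain write; assumed to be present on every writable file. */
	long (*write)(struct model_file *f, const char *buf, size_t len);
};

struct model_file {
	const struct model_ops *ops;
	size_t bytes_written; /* bytes accepted by the plain write path */
	int used_fallback;    /* set when the default path was taken */
};

/* A plain write method for the model: it only counts bytes. */
long model_plain_write(struct model_file *f, const char *buf, size_t len)
{
	(void)buf;
	f->bytes_written += len;
	return (long)len;
}

/* Default splice_write for the model: fall back to the plain write. */
long model_default_splice_write(struct model_file *f, const char *buf, size_t len)
{
	f->used_fallback = 1;
	return f->ops->write(f, buf, len);
}

/* Dispatch: prefer the file's own splice_write, else use the default. */
long model_do_splice_write(struct model_file *f, const char *buf, size_t len)
{
	assert(f != NULL && f->ops != NULL && f->ops->write != NULL);
	if (f->ops->splice_write != NULL)
		return f->ops->splice_write(f, buf, len);
	return model_default_splice_write(f, buf, len);
}
```

With this shape, a file type that supplies only `write` still works as a splice target, which is the effect the commit messages describe for files that lack their own splice methods.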
    • splice: implement pipe to pipe splicing · 7c77f0b3
      Miklos Szeredi authored
      
      Allow splice(2) to work when both the input and the output are pipes.
      
      Based on the implementation of the tee(2) syscall, but instead of
      duplicating the buffer references, move the buffers from the input pipe
      to the output pipe.
      
      Moving the whole buffer only succeeds if the full length of the buffer
      is spliced.  Otherwise duplicate the buffer, just like tee(2), set the
      length of the output buffer and advance the offset on the input
      buffer.
      
      Since splice is operating on two pipes, special care needs to be taken
      with locking to prevent an ABBA deadlock.  Again this is done
      similarly to the tee(2) syscall, first preparing the input and output
      pipes so there's data to consume and space for that data, and then
      doing the move operation while holding both locks.
      
      If other processes are doing I/O on the same pipes parallel to the
      splice, then by the time both inodes are locked there might be no
      buffers left to move, or no space to move them to.  In this case retry
      the whole operation, including the preparation phase.  This could lead
      to starvation, but I'm not sure if that's serious enough to worry
      about.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
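From userspace, the pipe to pipe case described above is reached by passing two pipe descriptors to splice(2). A minimal sketch, assuming a Linux kernel that includes this change; `pipe_to_pipe` is a hypothetical wrapper name, and both offset arguments are NULL because pipes have no file offset:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for the splice() declaration in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move up to len bytes from the read end of one pipe to the write end of
 * another. Returns the number of bytes moved, or -1 with errno set.
 */
ssize_t pipe_to_pipe(int in_rd, int out_wr, size_t len)
{
	assert(len > 0);
	for (;;) {
		ssize_t n = splice(in_rd, NULL, out_wr, NULL, len, SPLICE_F_MOVE);
		if (n < 0 && errno == EINTR)
			continue; /* interrupted before any data moved: retry */
		return n;
	}
}
```

A caller would create two pipes with pipe(2), write into the first, call `pipe_to_pipe()` with the first pipe's read end and the second pipe's write end, and then read the data from the second pipe.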
  8. 17 Apr, 2009 1 commit
  9. 15 Apr, 2009 6 commits
  10. 07 Apr, 2009 1 commit
    • splice: fix deadlock in splicing to file · 7bfac9ec
      Miklos Szeredi authored
      
      There's a possible deadlock in generic_file_splice_write(),
      splice_from_pipe() and ocfs2_file_splice_write():
      
       - task A calls generic_file_splice_write()
       - this calls inode_double_lock(), which locks i_mutex on both
         pipe->inode and target inode
       - ordering depends on inode pointers, can happen that pipe->inode is
         locked first
       - __splice_from_pipe() needs more data, calls pipe_wait()
       - this releases lock on pipe->inode, goes to interruptible sleep
       - task B calls generic_file_splice_write(), similarly to the first
       - this locks pipe->inode, then tries to lock inode, but that is
         already held by task A
       - task A is interrupted, it tries to lock pipe->inode, but fails, as
         it is already held by task B
       - ABBA deadlock
      
      Fix this by explicitly ordering locks: the outer lock must be on
      target inode and the inner lock (which is later unlocked and relocked)
      must be on pipe->inode.  This is OK because pipe inodes and target inodes
      form two nonoverlapping sets: generic_file_splice_write() and friends
      are not called with a target which is a pipe.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Acked-by: Jens Axboe <jens.axboe@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
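The fixed ordering described in this commit (outer lock on the target, inner lock on the pipe, with only the inner lock dropped while waiting) can be modelled with two pthread mutexes. This is a sketch under assumptions: `splice_locks` and both functions are hypothetical names, the mutexes stand in for the two i_mutex instances, and the trylock probes exist only to make the lock state during the wait observable. The probes assume default (non-recursive) mutexes, where trylock on a held mutex reports EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

struct splice_locks {
	pthread_mutex_t target;      /* outer lock: models the target inode */
	pthread_mutex_t pipe;        /* inner lock: models pipe->inode */
	int pipe_was_free_in_wait;   /* probe: pipe lock was available while waiting */
	int target_was_held_in_wait; /* probe: target lock stayed held while waiting */
};

/* Models the wait for more data: drop only the inner lock, then retake it. */
void wait_for_pipe_data(struct splice_locks *l)
{
	pthread_mutex_unlock(&l->pipe);

	/* Probe what another task would observe while this one sleeps. */
	if (pthread_mutex_trylock(&l->pipe) == 0) {
		l->pipe_was_free_in_wait = 1;
		pthread_mutex_unlock(&l->pipe);
	}
	l->target_was_held_in_wait =
		(pthread_mutex_trylock(&l->target) == EBUSY);

	pthread_mutex_lock(&l->pipe);
}

/* Always take the target lock first and the pipe lock second. */
int ordered_splice_write(struct splice_locks *l)
{
	assert(l != NULL);
	pthread_mutex_lock(&l->target);
	pthread_mutex_lock(&l->pipe);

	wait_for_pipe_data(l); /* may release and reacquire the pipe lock only */

	pthread_mutex_unlock(&l->pipe);
	pthread_mutex_unlock(&l->target);
	return 0;
}
```

Because every caller takes the target lock before the pipe lock, and a waiter never holds the pipe lock while blocked on the target lock, the A-then-B versus B-then-A interleaving from the scenario above cannot occur in this model.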
  11. 03 Apr, 2009 1 commit
  12. 14 Jan, 2009 1 commit
  13. 08 Jan, 2009 1 commit
    • memcg: synchronized LRU · 08e552c6
      KAMEZAWA Hiroyuki authored
      
      A big patch for changing memcg's LRU semantics.
      
      Now,
        - page_cgroup is linked to its mem_cgroup's own LRU (per zone).
      
        - LRU of page_cgroup is not synchronous with global LRU.
      
        - page and page_cgroup are one-to-one and statically allocated.
      
        - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup as
          - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
      
        - SwapCache is handled.
      
      And when we handle the LRU list of page_cgroup, we do the following.
      
      	pc = lookup_page_cgroup(page);
      	lock_page_cgroup(pc); .....................(1)
      	mz = page_cgroup_zoneinfo(pc);
      	spin_lock(&mz->lru_lock);
      	.....add to LRU
      	spin_unlock(&mz->lru_lock);
      	unlock_page_cgroup(pc);
      
      But (1) is a spin_lock, and we have to worry about deadlock with zone->lru_lock.
      So trylock() is currently used at (1). Without (1), we can't trust that "mz" is correct.
      
      This is an attempt to remove this dirty nesting of locks.
      This patch changes mz->lru_lock to be zone->lru_lock.
      Then the above sequence can be written as
      
      	spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      	mem_cgroup_add/remove/etc_lru() {
      		pc = lookup_page_cgroup(page);
      		mz = page_cgroup_zoneinfo(pc);
      		if (PageCgroupUsed(pc)) {
      			....add to LRU
      		}
      	}
      	spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      
      This is much simpler.
      (*) We're safe even if we don't take lock_page_cgroup(pc), because:
          1. pc->mem_cgroup can be modified
             - at charge.
             - at account_move().
          2. at charge
             the PCG_USED bit is not set before pc->mem_cgroup is fixed.
          3. at account_move()
             the page is isolated and not on LRU.
      
      Pros.
        - easier to maintain.
        - memcg can make use of the laziness of pagevec.
        - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
        - LRU status of memcg will be synchronized with that of the global LRU.
        - the number of locks is reduced.
        - account_move() is greatly simplified.
      Cons.
        - may increase the cost of LRU rotation.
          (no impact if memcg is not configured.)
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 30 Oct, 2008 1 commit
  15. 09 Oct, 2008 1 commit
    • Don't allow splice() to files opened with O_APPEND · efc968d4
      Linus Torvalds authored
      
      This is debatable, but while we're debating it, let's disallow the
      combination of splice and an O_APPEND destination.
      
      It's not entirely clear what the semantics of O_APPEND should be, and
      POSIX apparently expects pwrite() to ignore O_APPEND, for example.  So
      we could make up any semantics we want, including the old ones.
      
      But Miklos convinced me that we should at least give it some thought,
      and that accepting writes at arbitrary offsets is wrong at least for
      IS_APPEND() files (which always have O_APPEND set, even if the reverse
      isn't true: you can obviously have O_APPEND set on a regular file).
      
      So disallow O_APPEND entirely for now.  I doubt anybody cares, and this
      way we have one less gray area to worry about.
      Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu>
      Acked-by: Jens Axboe <ens.axboe@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
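The restriction above is visible from userspace: on a kernel that includes this change, splicing from a pipe into a descriptor with O_APPEND set is expected to fail with EINVAL. A minimal sketch, where `splice_pipe_to_fd` and `open_append_tempfile` are hypothetical helper names and the `/tmp` path template is an assumption about the environment:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for the splice() declaration in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Splice up to len bytes from a pipe into out_fd; 0 on success, -errno on failure. */
int splice_pipe_to_fd(int pipe_rd, int out_fd, size_t len)
{
	assert(len > 0);
	ssize_t n = splice(pipe_rd, NULL, out_fd, NULL, len, 0);
	return n < 0 ? -errno : 0;
}

/* Create an unlinked temporary file and switch it to O_APPEND; fd or -1. */
int open_append_tempfile(void)
{
	char path[] = "/tmp/splice_append_XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path); /* the open descriptor keeps the file alive */

	int flags = fcntl(fd, F_GETFL);
	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_APPEND) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A program that needs append semantics together with splice would have to fall back to read() and write() for such descriptors, since the kernel refuses the combination rather than picking a meaning for it.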
  16. 05 Aug, 2008 1 commit
  17. 27 Jul, 2008 1 commit
  18. 26 Jul, 2008 1 commit
    • splice: use get_user_pages_fast · bc40d73c
      Nick Piggin authored
      
      Use get_user_pages_fast in splice.  This reverts some mmap_sem batching
      there; however, the biggest problem with mmap_sem tends to be hold times
      blocking out other threads rather than cacheline bouncing.  Further, on
      architectures that implement get_user_pages_fast without locks, mmap_sem
      can be avoided completely anyway.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 04 Jul, 2008 1 commit
  20. 28 May, 2008 2 commits
  21. 08 May, 2008 1 commit
  22. 07 May, 2008 1 commit
  23. 29 Apr, 2008 1 commit
  24. 10 Apr, 2008 1 commit
  25. 03 Apr, 2008 1 commit
    • splice: use mapping_gfp_mask · 4cd13504
      Hugh Dickins authored
      
      The loop block driver is careful to mask __GFP_IO|__GFP_FS out of its
      mapping_gfp_mask, to avoid hangs under memory pressure.  But nowadays
      it uses splice, usually going through __generic_file_splice_read.  That
      must use mapping_gfp_mask instead of GFP_KERNEL to avoid those hangs.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 04 Mar, 2008 1 commit
  27. 10 Feb, 2008 1 commit
  28. 08 Feb, 2008 1 commit
  29. 01 Feb, 2008 1 commit
  30. 29 Jan, 2008 1 commit
  31. 28 Jan, 2008 1 commit
  32. 25 Jan, 2008 1 commit