1. 24 Mar, 2011 2 commits
  2. 15 Mar, 2011 2 commits
  3. 15 Feb, 2011 1 commit
  4. 12 Feb, 2011 2 commits
    • Eric Sandeen's avatar
      ext4: serialize unaligned asynchronous DIO · e9e3bcec
      Eric Sandeen authored
      
      ext4 has a data corruption case when doing non-block-aligned
      asynchronous direct IO into a sparse file, as demonstrated
      by xfstest 240.
      
      The root cause is that while ext4 preallocates space in the
      hole, mappings of that space still look "new" and 
      dio_zero_block() will zero out the unwritten portions.  When
      more than one AIO thread is going, they both find this "new"
      block and race to zero out their portion; this is uncoordinated
      and causes data corruption.
      
      Dave Chinner fixed this for xfs by simply serializing all
      unaligned asynchronous direct IO.  I've done the same here.
      The difference is that we only wait on conversions, not all IO.
      This is a very big hammer, and I'm not very pleased with
      stuffing this into ext4_file_write().  But since ext4 is
      DIO_LOCKING, we need to serialize it at this high level.
      
      I tried to move this into ext4_ext_direct_IO, but by then
      we have the i_mutex already, and we will wait on the
      work queue to do conversions - which must also take the
      i_mutex.  So that won't work.
      
      This was originally exposed by qemu-kvm installing to
      a raw disk image with a normal sector-63 alignment.  I've
      tested a backport of this patch with qemu, and it does
      avoid the corruption.  It is also quite a lot slower
      (14 min for package installs, vs. 8 min for well-aligned)
      but I'll take slow correctness over fast corruption any day.
      
      Mingming suggested that we can track outstanding
      conversions, and wait on those so that non-sparse
      files won't be affected, and I've implemented that here;
      unaligned AIO to nonsparse files won't take a perf hit.
      
      [tytso@mit.edu: Keep the mutex as a hashed array instead
       of bloating the ext4 inode]
      
      [tytso@mit.edu: Fix up namespace issues so that global
       variables are protected with an "ext4_" prefix.]
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      e9e3bcec
    • Eric Sandeen's avatar
      ext4: make grpinfo slab cache names static · 2892c15d
      Eric Sandeen authored
      In 2.6.37 I was running into oopses with repeated module
      loads & unloads.  I tracked this down to:
      
      fb1813f4
      
       ext4: use dedicated slab caches for group_info structures
      
      (this was in addition to the features advert unload problem)
      
      The kstrdup & subsequent kfree of the cache name was causing
      a double free.  In slub, at least, if I read it right it allocates
      & frees the name itself, slab seems to do something different...
      so in slub I think we were leaking -our- cachep->name, and double
      freeing the one allocated by slub.
      
      After getting lost in slab/slub/slob a bit, I just looked at other
      sized-caches that get allocated.  jbd2, biovec, sgpool all do it
      more or less the way jbd2 does.  Below patch follows the jbd2
      method of dynamically allocating a cache at mount time from
      a list of static names.
      
      (This might also possibly fix a race creating the caches with
      parallel mounts running).
      
      [Folded in a fix from Dan Carpenter which fixed an off-by-one error in
      the original patch]
      
      Cc: stable@kernel.org
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      2892c15d
  5. 07 Feb, 2011 1 commit
    • Curt Wohlgemuth's avatar
      ext4: Fix data corruption with multi-block writepages support · d50bdd5a
      Curt Wohlgemuth authored
      This fixes a corruption problem with the multi-block
      writepages submittal change for ext4, from commit
      bd2d0210 ("ext4: use bio
      layer instead of buffer layer in mpage_da_submit_io").
      
      (Note that this corruption is not present in 2.6.37 on
      ext4, because the corruption was detected after the
      feature was merged in 2.6.37-rc1, and so it was turned
      off by adding a non-default mount option,
      mblk_io_submit.  With this commit, which hopefully
      fixes the last of the bugs with this feature, we'll be
      able to turn on this performance feature by default in
      2.6.38, and remove the mblk_io_submit option.)
      
      The ext4 code path to bundle multiple pages for
      writeback in ext4_bio_write_page() had a bug: we should
      be clearing buffer head dirty flags *before* we submit
      the bio, not in the completion routine.
      
      The patch below was tested on 2.6.37 under KVM with the
      postgresql script which was submitted by Jon Nelson as
      documented in commit 1449032b
      
      .
      
      Without the patch, I'd hit the corruption problem about
      50-70% of the time.  With the patch, I executed the
      script > 100 times with no corruption seen.
      
      I also fixed a bug to make sure ext4_end_bio() doesn't
      dereference the bio after the bio_put() call.
      Reported-by: default avatarJon Nelson <jnelson@jamponi.net>
      Reported-by: default avatarMatthias Bayer <jackdachef@gmail.com>
      Signed-off-by: default avatarCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      d50bdd5a
  6. 03 Feb, 2011 3 commits
  7. 01 Feb, 2011 2 commits
    • Eric Paris's avatar
      fs/vfs/security: pass last path component to LSM on inode creation · 2a7dba39
      Eric Paris authored
      
      SELinux would like to implement a new labeling behavior of newly created
      inodes.  We currently label new inodes based on the parent and the creating
      process.  This new behavior would also take into account the name of the
      new object when deciding the new label.  This is not the (supposed) full path,
      just the last component of the path.
      
      This is very useful because creating /etc/shadow is different than creating
      /etc/passwd but the kernel hooks are unable to differentiate these
      operations.  We currently require that userspace realize it is doing some
      difficult operation like that and than userspace jumps through SELinux hoops
      to get things set up correctly.  This patch does not implement new
      behavior, that is obviously contained in a seperate SELinux patch, but it
      does pass the needed name down to the correct LSM hook.  If no such name
      exists it is fine to pass NULL.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      2a7dba39
    • Tejun Heo's avatar
      ext4: convert to alloc_workqueue() · fd89d5f2
      Tejun Heo authored
      
      Convert create_workqueue() to alloc_workqueue().  This is an identity
      conversion.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: linux-ext4@vger.kernel.org
      fd89d5f2
  8. 21 Jan, 2011 1 commit
  9. 17 Jan, 2011 2 commits
    • Christoph Hellwig's avatar
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig authored
      
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
    • Christoph Hellwig's avatar
      make the feature checks in ->fallocate future proof · 64c23e86
      Christoph Hellwig authored
      
      Instead of various home grown checks that might need updates for new
      flags just check for any bit outside the mask of the features supported
      by the filesystem.  This makes the check future proof for any newly
      added flag.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      64c23e86
  10. 13 Jan, 2011 2 commits
  11. 12 Jan, 2011 1 commit
    • Jan Kara's avatar
      quota: Fix deadlock during path resolution · f00c9e44
      Jan Kara authored
      
      As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
      is prone to deadlocks. We hold s_umount semaphore for reading during the
      path resolution and resolution itself may need to acquire the semaphore
      for writing when e. g. autofs mountpoint is passed.
      
      Solve the problem by performing the resolution before we get hold of the
      superblock (and thus s_umount semaphore). The whole thing is complicated
      by the fact that some filesystems (OCFS2) ignore the path argument. So to
      distinguish between filesystem which want the path and which do not we
      introduce new .quota_on_meta callback which does not get the path. OCFS2
      then uses this callback instead of old .quota_on.
      
      CC: Al Viro <viro@ZenIV.linux.org.uk>
      CC: Christoph Hellwig <hch@lst.de>
      CC: Ted Ts'o <tytso@mit.edu>
      CC: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      f00c9e44
  12. 11 Jan, 2011 2 commits
  13. 10 Jan, 2011 19 commits
    • Eric Sandeen's avatar
      ext4: don't pass entire map to check_eofblocks_fl · d002ebf1
      Eric Sandeen authored
      
      Since check_eofblocks_fl() only uses the m_lblk portion of the map
      structure, we may as well pass that directly, rather than passing the
      entire map, which IMHO obfuscates what parameters check_eofblocks_fl()
      cares about.  Not a big deal, but seems tidier and less confusing, to
      me.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      d002ebf1
    • Theodore Ts'o's avatar
      ext4: fix memory leak in ext4_free_branches · 1c5b9e90
      Theodore Ts'o authored
      Commit 40389687
      
       moved a call to ext4_forget() out of
      ext4_free_branches and let ext4_free_blocks() handle calling
      bforget().  But that change unfortunately did not replace the call to
      ext4_forget() with brelse(), which was needed to drop the in-use count
      of the indirect block's buffer head, which lead to a memory leak when
      deleting files that used indirect blocks.  Fix this.
      
      Thanks to Hugh Dickins for pointing this out.
      
      Cc: stable@kernel.org
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      1c5b9e90
    • Theodore Ts'o's avatar
      ext4: remove ext4_mb_return_to_preallocation() · a5196f8c
      Theodore Ts'o authored
      
      This function was never implemented, except for a BUG_ON which was
      tripping when ext4 is run without a journal.  The problem is that
      although the comment asserts that "truncate (which is the only way to
      free block) discards all preallocations", ext4_free_blocks() is also
      called in various error recovery paths when blocks have been
      allocated, but for various reasons, we were not able to use those data
      blocks (for example, because we ran out of memory while trying to
      manipulate the extent tree, or some other similar situation).
      
      In addition to the fact that this function isn't implemented except
      for the incorrect BUG_ON, the single caller of this function,
      ext4_free_blocks(), doesn't use it all if the journal is enabled.
      
      So remove the (stub) function entirely for now.  If we decide it's
      better to add it back, it's only going to be useful with a relatively
      large number of code changes anyway.
      
      Google-Bug-Id: 3236408
      
      Cc: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      a5196f8c
    • Jiaying Zhang's avatar
      ext4: flush the i_completed_io_list during ext4_truncate · 3889fd57
      Jiaying Zhang authored
      
      Ted first found the bug when running 2.6.36 kernel with dioread_nolock
      mount option that xfstests #13 complained about wrong file size during fsck.
      However, the bug exists in the older kernels as well although it is
      somehow harder to trigger.
      
      The problem is that ext4_end_io_work() can happen after we have truncated an
      inode to a smaller size. Then when ext4_end_io_work() calls 
      ext4_convert_unwritten_extents(), we may reallocate some blocks that have 
      been truncated, so the inode size becomes inconsistent with the allocated
      blocks. 
      
      The following patch flushes the i_completed_io_list during truncate to reduce 
      the risk that some pending end_io requests are executed later and convert 
      already truncated blocks to initialized. 
      
      Note that although the fix helps reduce the problem a lot there may still 
      be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
      problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
      held, it can race with an ongoing write request so that the io_end request is
      processed later when the corresponding blocks have been truncated.
      
      Ted and I have discussed the problem offline and we saw a few ways to fix
      the race completely:
      
      a) We guarantee that i_mutex lock and i_alloc_sem write lock are both hold 
      whenever vmtruncate() is called. The i_mutex lock prevents any new write
      requests from entering writeback and the i_alloc_sem prevents the race
      from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
      is called from do_truncate(), which is probably the most common case.
      However, there are places where we may call vmtruncate() without holding
      either i_mutex or i_alloc_sem. I would like to ask for other people's
      opinions on what locks are expected to be held before calling vmtruncate().
      There seems a disagreement among the callers of that function.
      
      b) We change the ext4 write path so that we change the extent tree to contain 
      the newly allocated blocks and update i_size both at the same time --- when 
      the write of the data blocks is completed.
      
      c) We add some additional locking to synchronize vmtruncate() and 
      ext4_end_io_work(). This approach may have performance implications so we
      need to be careful.
      
      All of the above proposals may require more substantial changes, so
      we may consider to take the following patch as a bandaid.
      Signed-off-by: default avatarJiaying Zhang <jiayingz@google.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      3889fd57
    • Theodore Ts'o's avatar
      ext4: add error checking to calls to ext4_handle_dirty_metadata() · b4097142
      Theodore Ts'o authored
      
      Call ext4_std_error() in various places when we can't bail out
      cleanly, so the file system can be marked as in error.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      b4097142
    • Jan Kara's avatar
      ext4: fix trimming of a single group · ca6e909f
      Jan Kara authored
      
      When ext4_trim_fs() is called to trim a part of a single group, the
      logic will wrongly set last block of the interval to 'len' instead
      of 'first_block + len'. Thus a shorter interval is possibly trimmed.
      Fix it.
      
      CC: Lukas Czerner <lczerner@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      ca6e909f
    • Andrew Morton's avatar
      ext4: fix uninitialized variable in ext4_register_li_request · 6c5a6cb9
      Andrew Morton authored
      
      fs/ext4/super.c: In function 'ext4_register_li_request':
      fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function
      
      It looks buggy to me, too.
      
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      6c5a6cb9
    • Theodore Ts'o's avatar
      ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary · 8aefcd55
      Theodore Ts'o authored
      
      Replace the jbd2_inode structure (which is 48 bytes) with a pointer
      and only allocate the jbd2_inode when it is needed --- that is, when
      the file system has a journal present and the inode has been opened
      for writing.  This allows us to further slim down the ext4_inode_info
      structure.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      8aefcd55
    • Theodore Ts'o's avatar
      ext4: drop i_state_flags on architectures with 64-bit longs · 353eb83c
      Theodore Ts'o authored
      
      We can store the dynamic inode state flags in the high bits of
      EXT4_I(inode)->i_flags, and eliminate i_state_flags.  This saves 8
      bytes from the size of ext4_inode_info structure, which when
      multiplied by the number of the number of in the inode cache, can save
      a lot of memory.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      353eb83c
    • Theodore Ts'o's avatar
      ext4: reorder ext4_inode_info structure elements to remove unneeded padding · 8a2005d3
      Theodore Ts'o authored
      
      By reordering the elements in the ext4_inode_info structure, we can
      reduce the padding needed on an x86_64 system by 16 bytes.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      8a2005d3
    • Theodore Ts'o's avatar
      ext4: drop ec_type from the ext4_ext_cache structure · b05e6ae5
      Theodore Ts'o authored
      
      We can encode the ec_type information by using ee_len == 0 to denote
      EXT4_EXT_CACHE_NO, ee_start == 0 to denote EXT4_EXT_CACHE_GAP, and if
      neither is true, then the cache type must be EXT4_EXT_CACHE_EXTENT.
      This allows us to reduce the size of ext4_ext_inode by another 8
      bytes.  (ec_type is 4 bytes, plus another 4 bytes of padding)
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      b05e6ae5
    • Theodore Ts'o's avatar
      ext4: use ext4_lblk_t instead of sector_t for logical blocks · 01f49d0b
      Theodore Ts'o authored
      
      This fixes a number of places where we used sector_t instead of
      ext4_lblk_t for logical blocks, which for ext4 are still 32-bit data
      types.  No point wasting space in the ext4_inode_info structure, and
      requiring 64-bit arithmetic on 32-bit systems, when it isn't
      necessary.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      01f49d0b
    • Theodore Ts'o's avatar
      ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED · f2321097
      Theodore Ts'o authored
      
      Remove the short element i_delalloc_reserved_flag from the
      ext4_inode_info structure and replace it a new bit in i_state_flags.
      Since we have an ext4_inode_info for every ext4 inode cached in the
      inode cache, any savings we can produce here is a very good thing from
      a memory utilization perspective.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f2321097
    • Kazuya Mio's avatar
      ext4: fix 32bit overflow in ext4_ext_find_goal() · ad4fb9ca
      Kazuya Mio authored
      
      ext4_ext_find_goal() returns an ideal physical block number that the block
      allocator tries to allocate first. However, if a required file offset is
      smaller than the existing extent's one, ext4_ext_find_goal() returns
      a wrong block number because it may overflow at
      "block - le32_to_cpu(ex->ee_block)". This patch fixes the problem.
      
      ext4_ext_find_goal() will also return a wrong block number in case
      a file offset of the existing extent is too big. In this case,
      the ideal physical block number is fixed in ext4_mb_initialize_context(),
      so it's no problem.
      
      reproduce:
      # dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync
      # dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync
      # filefrag -v /mnt/mp1/file
      Filesystem type is: ef53
      File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
       ext logical physical expected length flags
         0     128    67456             128 eof
      /mnt/mp1/file: 2 extents found
      # rm -rf /mnt/mp1/tmp
      # echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req
      # dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc
      
      result (linux-2.6.37-rc2 + ext4 patch queue):
      # filefrag -v /mnt/mp1/file
      Filesystem type is: ef53
      File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
       ext logical physical expected length flags
         0       0    33280             128 
         1     128    67456    33407    128 eof
      /mnt/mp1/file: 2 extents found
      
      result(apply this patch):
      # filefrag -v /mnt/mp1/file
      Filesystem type is: ef53
      File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
       ext logical physical expected length flags
         0       0    66560             128 
         1     128    67456    66687    128 eof
      /mnt/mp1/file: 2 extents found
      Signed-off-by: default avatarKazuya Mio <k-mio@sx.jp.nec.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      ad4fb9ca
    • Namhyung Kim's avatar
      ext4: add more error checks to ext4_mkdir() · dabd991f
      Namhyung Kim authored
      
      Check return value of ext4_journal_get_write_access,
      ext4_journal_dirty_metadata and ext4_mark_inode_dirty. Move brelse()
      under 'out_stop' to release bh properly in case of journal error.
      Signed-off-by: default avatarNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      dabd991f
    • Eric Paris's avatar
      ext4: ext4_ext_migrate should use NULL not 0 · f1dffc4c
      Eric Paris authored
      
      ext4_ext_migrate() calls ext4_new_inode() and passes 0 instead of a pointer
      to a struct qstr.  This patch uses NULL, to make it obvious to the caller
      that this was a pointer.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f1dffc4c
    • Theodore Ts'o's avatar
      ext4: Use ext4_error_file() to print the pathname to the corrupted inode · f7c21177
      Theodore Ts'o authored
      
      Where the file pointer is available, use ext4_error_file() instead of
      ext4_error_inode().
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f7c21177
    • Dan Carpenter's avatar
      ext4: use IS_ERR() to check for errors in ext4_error_file · f9a62d09
      Dan Carpenter authored
      
      d_path() returns an ERR_PTR and it doesn't return NULL.  This is in
      ext4_error_file() and no one actually calls ext4_error_file().
      Signed-off-by: default avatarDan Carpenter <error27@gmail.com>
      f9a62d09
    • Dan Carpenter's avatar
      ext4: test the correct variable in ext4_init_pageio() · 13195184
      Dan Carpenter authored
      
      This is a copy and paste error.  The intent was to check
      "io_page_cachep".  We tested "io_page_cachep" earlier.
      Signed-off-by: default avatarDan Carpenter <error27@gmail.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      13195184