- 31 Mar, 2009 1 commit
-
-
Theodore Ts'o authored
Remove tuning knobs in /proc/fs/ext4/<dev/* since they have been replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 28 Mar, 2009 1 commit
-
-
Aneesh Kumar K.V authored
With delayed allocation we should not/cannot discard inode prealloc space during file close. We would still have dirty pages for which we haven't allocated blocks yet. With this fix after each get_blocks request we check whether we have zero reserved blocks and if yes and we don't have any writers on the file we discard inode prealloc space. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 24 Feb, 2009 1 commit
-
-
Theodore Ts'o authored
When closing a file that had been previously truncated, force any delay allocated blocks that to be allocated so that if the filesystem is mounted with data=ordered, the data blocks will be pushed out to disk along with the journal commit. Many application programs expect this, so we do this to avoid zero length files if the system crashes unexpectedly. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 26 Feb, 2009 1 commit
-
-
Theodore Ts'o authored
Add an ioctl which forces all of the delay allocated blocks to be allocated. This also provides a function ext4_alloc_da_blocks() which will be used by the following commits to force files to be fully allocated to preserve application-expected ext3 behaviour. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 23 Feb, 2009 3 commits
-
-
Theodore Ts'o authored
The mpage_da_writepages() function is only used in one place, so inline it to simplify the call stack and make the code easier to understand. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
Theodore Ts'o authored
Struct mpage_da_data and mpage_add_bh_to_extent() use a fake struct buffer_head which is 104 bytes on an x86_64 system, but only use 24 bytes of the structure. On systems that use a spinlock for atomic_t, the stack savings will be even greater. It turns out that using a fake struct buffer_head doesn't even save that much code, and it makes the code more confusing since it's not used as a "real" buffer head. So just store pass b_size and b_state in mpage_add_bh_to_extent(), and store b_size, b_state, and b_block_nr in the mpage_da_data structure. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
Theodore Ts'o authored
This parameter was always set to ext4_da_get_block_write(). Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 27 Mar, 2009 1 commit
-
-
Aneesh Kumar K.V authored
Make sure we validate extent details only when read from the disk. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 12 Mar, 2009 1 commit
-
-
Theodore Ts'o authored
The find_group_flex() inode allocator is now only used if the filesystem is mounted using the "oldalloc" mount option. It is replaced with the original Orlov allocator that has been updated for flex_bg filesystems (it should behave the same way if flex_bg is disabled). The inode allocator now functions by taking into account each flex_bg group, instead of each block group, when deciding whether or not it's time to allocate a new directory into a fresh flex_bg. The block allocator has also been changed so that the first block group in each flex_bg is preferred for use for storing directory blocks. This keeps directory blocks close together, which is good for speeding up e2fsck since large directories are more likely to look like this: debugfs: stat /home/tytso/Maildir/cur Inode: 1844562 Type: directory Mode: 0700 Flags: 0x81000 Generation: 1132745781 Version: 0x00000000:0000ad71 User: 15806 Group: 15806 Size: 1060864 File ACL: 0 Directory ACL: 0 Links: 2 Blockcount: 2072 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009 atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009 mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009 crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009 Size of extra inode fields: 28 BLOCKS: (0):7348651, (1-258):7348654-7348911 TOTAL: 259 Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 26 Mar, 2009 2 commits
-
-
Jan Kara authored
Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by:
Jan Kara <jack@suse.cz> Acked-by:
Mingming Cao <cmm@us.ibm.com> CC: linux-ext4@vger.kernel.org
-
Mingming Cao authored
Uses quota reservation/claim/release to handle quota properly for delayed allocation in the three steps: 1) quotas are reserved when data being copied to cache when block allocation is defered 2) when new blocks are allocated. reserved quotas are converted to the real allocated quota, 2) over-booked quotas for metadata blocks are released back. Signed-off-by:
Mingming Cao <cmm@us.ibm.com> Acked-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Jan Kara <jack@suse.cz>
-
- 26 Feb, 2009 1 commit
-
-
Eric Sandeen authored
Running without a journal, I oopsed when I ran out of space, because we called jbd2_journal_force_commit_nested() from ext4_should_retry_alloc() without a journal. This should take care of it, I think. Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 23 Feb, 2009 1 commit
-
-
Jan Kara authored
Functions ext4_write_begin() and ext4_da_write_begin() call grab_cache_page_write_begin() without AOP_FLAG_NOFS. Thus it can happen that page reclaim is triggered in that function and it recurses back into the filesystem (or some other filesystem). But this can lead to various problems as a transaction is already started at that point. Add the necessary flag. http://bugzilla.kernel.org/show_bug.cgi?id=11688 Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 14 Feb, 2009 1 commit
-
-
Aneesh Kumar K.V authored
With delayed allocation we lock the page in write_cache_pages() and try to build an in memory extent of contiguous blocks. This is needed so that we can get large contiguous blocks request. If range_cyclic mode is enabled, write_cache_pages() will loop back to the 0 index if no I/O has been done yet, and try to start writing from the beginning of the range. That causes an attempt to take the page lock of lower index page while holding the page lock of higher index page, which can cause a dead lock with another writeback thread. The solution is to implement the range_cyclic behavior in ext4_da_writepages() instead. http://bugzilla.kernel.org/show_bug.cgi?id=12579 Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 10 Feb, 2009 1 commit
-
-
Jan Kara authored
If we race with commit code setting i_transaction to NULL, we could possibly dereference it. Proper locking requires the journal pointer (to access journal->j_list_lock), which we don't have. So we have to change the prototype of the function so that filesystem passes us the journal pointer. Also add a more detailed comment about why the function jbd2_journal_begin_ordered_truncate() does what it does and how it should be used. Thanks to Dan Carpenter <error27@gmail.com> for pointing to the suspitious code. Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Acked-by:
Joel Becker <joel.becker@oracle.com> CC: linux-ext4@vger.kernel.org CC: ocfs2-devel@oss.oracle.com CC: mfasheh@suse.de CC: Dan Carpenter <error27@gmail.com>
-
- 30 Jan, 2009 1 commit
-
-
Theodore Ts'o authored
The code to support journal-less ext4 operation added a BUG to ext4_bmap() which fired if there was no journal and the EXT4_STATE_JDATA bit was set in the i_state field. This caused running the filefrag program (which uses the FIMBAP ioctl) to trigger a BUG(). The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's harmless for the bit to be set. We could add a check in __ext4_journalled_writepage() and ext4_journalled_write_end() to only set the EXT4_STATE_JDATA bit if the journal is present, but that adds an extra test and jump instruction. It's easier to simply remove the BUG check. http://bugzilla.kernel.org/show_bug.cgi?id=12568 Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
-
- 20 Jan, 2009 1 commit
-
-
Theodore Ts'o authored
When trying to unlink a file with indirect blocks on a filesystem without a journal, the "circular indirect block" sanity test was getting falsely triggered. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 17 Jan, 2009 1 commit
-
-
Theodore Ts'o authored
Directories are not allowed to be bigger than 2GB, so don't use i_size_high for anything other than regular files. E2fsck should complain about these inodes, but the simplest thing to do for the kernel is to only use i_size_high for regular files. This prevents an intentially corrupted filesystem from causing the kernel to burn a huge amount of CPU and issuing error messages such as: EXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max Thanks to David Maciejak from Fortinet's FortiGuard Global Security Research Team for reporting this issue. http://bugzilla.kernel.org/show_bug.cgi?id=12375 Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
-
- 06 Jan, 2009 1 commit
-
-
Eric Dumazet authored
For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS Considering more and more distros are using high NR_CPUS values, it makes sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS. A sensible value is 2*num_online_cpus(), with a minimum value of 32 (This minimum value helps branch prediction in __percpu_counter_add()) We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically. We rename FBC_BATCH to percpu_counter_batch since its not a constant anymore. Signed-off-by:
Eric Dumazet <dada1@cosmosbay.com> Acked-by:
David S. Miller <davem@davemloft.net> Acked-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 04 Jan, 2009 2 commits
-
-
Nick Piggin authored
With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by:
Nick Piggin <npiggin@suse.de> Reviewed-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Theodore Ts'o authored
Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 06 Jan, 2009 1 commit
-
-
Aneesh Kumar K.V authored
Rename the lower bits with suffix _lo and add helper to access the values. Also rename bg_itable_unused_hi to bg_pad as in e2fsprogs. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 31 Dec, 2008 1 commit
-
-
Duane Griffin authored
Ensure fast symlink targets are NUL-terminated, even if corrupted on-disk. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: adilger@sun.com Cc: linux-ext4@vger.kernel.org Signed-off-by:
Duane Griffin <duaneg@dghda.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- 07 Nov, 2008 1 commit
-
-
Aneesh Kumar K.V authored
We need to make sure we mark the buffer_heads as dirty and uptodate so that block_write_full_page write them correctly. This fixes mmap corruptions that can occur in low memory situations. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 22 Nov, 2008 1 commit
-
-
Aneesh Kumar K.V authored
* Change EXT4_HAS_*_FEATURE to return a boolean * Add a function prototype for ext4_fiemap() in ext4.h * Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions * Add lock annotations to mb_free_blocks() Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 06 Nov, 2008 1 commit
-
-
Theodore Ts'o authored
This fixes a 2.6.27 regression which was introduced in commit a02908f1. We weren't passing the chunk parameter down to the two subections, ext4_indirect_trans_blocks() and ext4_ext_index_trans_blocks(), with the result that massively overestimate the amount of credits needed by ext4_da_writepages, especially in the non-extents case. This causes failures especially on /boot partitions, which tend to be small and non-extent using since GRUB doesn't handle extents. This patch fixes the bug reported by Joseph Fannin at: http://bugzilla.kernel.org/show_bug.cgi?id=11964 Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 05 Nov, 2008 1 commit
-
-
Theodore Ts'o authored
Convert the unsigned longs that are most responsible for bloating the stack usage on 64-bit systems. Nearly all places in the ext3/4 code which uses "unsigned long" is probably a bug, since on 32-bit systems a ulong a 32-bits, which means we are wasting stack space on 64-bit systems. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 07 Jan, 2009 1 commit
-
-
Frank Mayhar authored
A few weeks ago I posted a patch for discussion that allowed ext4 to run without a journal. Since that time I've integrated the excellent comments from Andreas and fixed several serious bugs. We're currently running with this patch and generating some performance numbers against both ext2 (with backported reservations code) and ext4 with and without a journal. It just so happens that running without a journal is slightly faster for most everything. We did iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2 which creates 4 threads, each of which create and do reads and writes on a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens to bypass the page cache. Results: ext2 ext4, default ext4, no journal initial writes 13.0 MB/s 15.4 MB/s 15.7 MB/s rewrites 13.1 MB/s 15.6 MB/s 15.9 MB/s reads 15.2 MB/s 16.9 MB/s 17.2 MB/s re-reads 15.3 MB/s 16.9 MB/s 17.2 MB/s random readers 5.6 MB/s 5.6 MB/s 5.7 MB/s random writers 5.1 MB/s 5.3 MB/s 5.4 MB/s So it seems that, so far, this was a useful exercise. Signed-off-by:
Frank Mayhar <fmayhar@google.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 06 Jan, 2009 1 commit
-
-
Aneesh Kumar K.V authored
When iterating through the pages which have mapped buffer_heads, we failed to update the b_state value. This results in allocating blocks at logical offset 0. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
-
- 05 Nov, 2008 1 commit
-
-
Theodore Ts'o authored
If the filesystem has errors, ext4_da_writepages() will return a *lot* of errors, including lots and lots of stack dumps. While it's true that we are dropping user data on the floor, which is unfortunate, the stack dumps aren't helpful, and they tend to obscure the true original root cause of the problem. So in the case where the filesystem has aborted, return an EROFS right away. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 02 Jan, 2009 1 commit
-
-
Theodore Ts'o authored
There was only one caller of the compatibility function ext4_new_blocks(), in balloc.c's ext4_alloc_blocks(). Change it to call ext4_mb_new_blocks() directly, and remove ext4_new_blocks() altogether. This cleans up the code, by removing two extra functions from the call chain, and hopefully saving some stack usage. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 29 Oct, 2008 1 commit
-
-
Alexander Beregalov authored
fs/ext4/balloc.c:607: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1822: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1824: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' Signed-off-by:
Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by:
Theodore Ts'o <tytso@mit.edu>
-
- 17 Oct, 2008 1 commit
-
-
Theodore Ts'o authored
If the HUGE_FILE feature flag is not set, don't allow the creation of large files, instead of automatically enabling the feature flag. Recent versions of mke2fs will set the HUGE_FILE flag automatically anyway for ext4 filesystems. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 16 Oct, 2008 1 commit
-
-
Aneesh Kumar K.V authored
The range_cyclic writeback mode uses the address_space writeback_index as the start index for writeback. With delayed allocation we were updating writeback_index wrongly resulting in highly fragmented file. This patch reduces the number of extents reduced from 4000 to 27 for a 3GB file. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
Theodore Ts'o <tytso@mit.edu>
-
- 14 Oct, 2008 1 commit
-
-
Aneesh Kumar K.V authored
This enables us to drop the range_cont writeback mode use from ext4_da_writepages. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
-
- 11 Oct, 2008 1 commit
-
-
Theodore Ts'o authored
The ext4 filesystem is getting stable enough that it's time to drop the "dev" prefix. Also remove the requirement for the TEST_FILESYS flag. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 07 Oct, 2008 1 commit
-
-
Eric Sandeen authored
ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org
-
- 10 Oct, 2008 2 commits
-
-
Theodore Ts'o authored
Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
Theodore Ts'o authored
With modern hard drives, reading 64k takes roughly the same time as reading a 4k block. So request readahead for adjacent inode table blocks to reduce the time it takes when iterating over directories (especially when doing this in htree sort order) in a cold cache case. With this patch, the time it takes to run "git status" on a kernel tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches" is reduced by 21%. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
- 13 Sep, 2008 1 commit
-
-
Aneesh Kumar K.V authored
With delayed allocation we use i_data_sem to update i_disksize. We need to update i_disksize only if the new size specified is greater than the current value and we need to make sure we don't race with other i_disksize update. With delayed allocation we will switch to the write_begin function for non-delayed allocation if we are low on free blocks. This means the write_begin function for non-delayed allocation also needs to use the same locking. We also need to check and update i_disksize even if the new size is less that inode.i_size because of delayed allocation. Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-