1. 20 Oct, 2009 1 commit
    • Andreas Gruenbacher's avatar
      dnotify: ignore FS_EVENT_ON_CHILD · 94552684
      Andreas Gruenbacher authored
      Mask off FS_EVENT_ON_CHILD in dnotify_handle_event().  Otherwise, when there
      is more than one watch on a directory and dnotify_should_send_event()
      succeeds, events with FS_EVENT_ON_CHILD set will trigger all watches and cause
      spurious events.
      
      This case was overlooked in commit e42e2773
      
      .
      
      	#define _GNU_SOURCE
      
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <signal.h>
      	#include <sys/types.h>
      	#include <sys/stat.h>
      	#include <fcntl.h>
      	#include <string.h>
      
      	static void create_event(int s, siginfo_t* si, void* p)
      	{
      		printf("create\n");
      	}
      
      	static void delete_event(int s, siginfo_t* si, void* p)
      	{
      		printf("delete\n");
      	}
      
      	int main (void) {
      		struct sigaction action;
      		char *tmpdir, *file;
      		int fd1, fd2;
      
      		sigemptyset (&action.sa_mask);
      		action.sa_flags = SA_SIGINFO;
      
      		action.sa_sigaction = create_event;
      		sigaction (SIGRTMIN + 0, &action, NULL);
      
      		action.sa_sigaction = delete_event;
      		sigaction (SIGRTMIN + 1, &action, NULL);
      
      	#	define TMPDIR "/tmp/test.XXXXXX"
      		tmpdir = malloc(strlen(TMPDIR) + 1);
      		strcpy(tmpdir, TMPDIR);
      		mkdtemp(tmpdir);
      
      	#	define TMPFILE "/file"
      		file = malloc(strlen(tmpdir) + strlen(TMPFILE) + 1);
      		sprintf(file, "%s/%s", tmpdir, TMPFILE);
      
      		fd1 = open (tmpdir, O_RDONLY);
      		fcntl(fd1, F_SETSIG, SIGRTMIN);
      		fcntl(fd1, F_NOTIFY, DN_MULTISHOT | DN_CREATE);
      
      		fd2 = open (tmpdir, O_RDONLY);
      		fcntl(fd2, F_SETSIG, SIGRTMIN + 1);
      		fcntl(fd2, F_NOTIFY, DN_MULTISHOT | DN_DELETE);
      
      		if (fork()) {
      			/* This triggers a create event */
      			creat(file, 0600);
      			/* This triggers a create and delete event (!) */
      			unlink(file);
      		} else {
      			sleep(1);
      			rmdir(tmpdir);
      		}
      
      		return 0;
      	}
      Signed-off-by: default avatarAndreas Gruenbacher <agruen@suse.de>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      94552684
  2. 18 Oct, 2009 2 commits
    • Wei Yongjun's avatar
      inotify: fix coalesce duplicate events into a single event in special case · 3de0ef4f
      Wei Yongjun authored
      
      If we do rename a dir entry, like this:
      
        rename("/tmp/ino7UrgoJ.rename1", "/tmp/ino7UrgoJ.rename2")
        rename("/tmp/ino7UrgoJ.rename2", "/tmp/ino7UrgoJ")
      
      The duplicate events should be coalesced into a single event. But those two
      events do not be coalesced into a single event, due to some bad check in
      event_compare(). It can not match the two NULL inodes as the same event.
      Signed-off-by: default avatarWei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      3de0ef4f
    • Eric Paris's avatar
      fsnotify: do not set group for a mark before it is on the i_list · 9f0d793b
      Eric Paris authored
      
      fsnotify_add_mark is supposed to add a mark to the g_list and i_list and to
      set the group and inode for the mark.  fsnotify_destroy_mark_by_entry uses
      the fact that ->group != NULL to know if this group should be destroyed or
      if it's already been done.
      
      But fsnotify_add_mark sets the group and inode before it actually adds the
      mark to the i_list and g_list.  This can result in a race in inotify, it
      requires 3 threads.
      
      sys_inotify_add_watch("file")	sys_inotify_add_watch("file")	sys_inotify_rm_watch([a])
      inotify_update_watch()
      inotify_new_watch()
      inotify_add_to_idr()
         ^--- returns wd = [a]
      				inotfiy_update_watch()
      				inotify_new_watch()
      				inotify_add_to_idr()
      				fsnotify_add_mark()
      				   ^--- returns wd = [b]
      				returns to userspace;
      								inotify_idr_find([a])
      								   ^--- gives us the pointer from task 1
      fsnotify_add_mark()
         ^--- this is going to set the mark->group and mark->inode fields, but will
      return -EEXIST because of the race with [b].
      								fsnotify_destroy_mark()
      								   ^--- since ->group != NULL we call back
      									into inotify_freeing_mark() which calls
      								inotify_remove_from_idr([a])
      
      since fsnotify_add_mark() failed we call:
      inotify_remove_from_idr([a])     <------WHOOPS it's not in the idr, this could
      					have been any entry added later!
      
      The fix is to make sure we don't set mark->group until we are sure the mark is
      on the inode and fsnotify_add_mark will return success.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      9f0d793b
  3. 28 Aug, 2009 3 commits
  4. 27 Aug, 2009 4 commits
  5. 17 Aug, 2009 3 commits
  6. 21 Jul, 2009 7 commits
    • Eric Paris's avatar
      inotify: use GFP_NOFS under potential memory pressure · f44aebcc
      Eric Paris authored
      
      inotify can have a watchs removed under filesystem reclaim.
      
      =================================
      [ INFO: inconsistent lock state ]
      2.6.31-rc2 #16
      ---------------------------------
      inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
      khubd/217 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (iprune_mutex){+.+.?.}, at: [<c10ba899>] invalidate_inodes+0x20/0xe3
      {IN-RECLAIM_FS-W} state was registered at:
        [<c10536ab>] __lock_acquire+0x2c9/0xac4
        [<c1053f45>] lock_acquire+0x9f/0xc2
        [<c1308872>] __mutex_lock_common+0x2d/0x323
        [<c1308c00>] mutex_lock_nested+0x2e/0x36
        [<c10ba6ff>] shrink_icache_memory+0x38/0x1b2
        [<c108bfb6>] shrink_slab+0xe2/0x13c
        [<c108c3e1>] kswapd+0x3d1/0x55d
        [<c10449b5>] kthread+0x66/0x6b
        [<c1003fdf>] kernel_thread_helper+0x7/0x10
        [<ffffffff>] 0xffffffff
      
      Two things are needed to fix this.  First we need a method to tell
      fsnotify_create_event() to use GFP_NOFS and second we need to stop using
      one global IN_IGNORED event and allocate them one at a time.  This solves
      current issues with multiple IN_IGNORED on a queue having tail drop
      problems and simplifies the allocations since we don't have to worry about
      two tasks opperating on the IGNORED event concurrently.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      f44aebcc
    • Eric Paris's avatar
      fsnotify: fix inotify tail drop check with path entries · c05594b6
      Eric Paris authored
      
      fsnotify drops new events when they are the same as the tail event on the
      queue to be sent to userspace.  The problem is that if the event comes with
      a path we forget to break out of the switch statement and fall into the
      code path which matches on events that do not have any type of file backed
      information (things like IN_UNMOUNT and IN_Q_OVERFLOW).  The problem is
      that this code thinks all such events should be dropped.  Fix is to add a
      break.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      c05594b6
    • Eric Paris's avatar
      inotify: check filename before dropping repeat events · 4a148ba9
      Eric Paris authored
      
      inotify drops events if the last event on the queue is the same as the
      current event.  But it does 2 things wrong.  First it is comparing old->inode
      with new->inode.  But after an event if put on the queue the ->inode is no
      longer allowed to be used.  It's possible between the last event and this new
      event the inode could be reused and we would falsely match the inode's memory
      address between two differing events.
      
      The second problem is that when a file is removed fsnotify is passed the
      negative dentry for the removed object rather than the postive dentry from
      immediately before the removal.  This mean the (broken) inotify tail drop code
      was matching the NULL ->inode of differing events.
      
      The fix is to check the file name which is stored with events when doing the
      tail drop instead of wrongly checking the address of the stored ->inode.
      Reported-by: default avatarScott James Remnant <scott@ubuntu.com>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      4a148ba9
    • Eric Paris's avatar
      fsnotify: use def_bool in kconfig instead of letting the user choose · 520dc2a5
      Eric Paris authored
      
      fsnotify doens't give the user anything.  If someone chooses inotify or
      dnotify it should build fsnotify, if they don't select one it shouldn't be
      built.  This patch changes fsnotify to be a def_bool=n and makes everything
      else select it.  Also fixes the issue people complained about on lwn where
      gdm hung because they didn't have inotify and they didn't get the inotify
      build option.....
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      520dc2a5
    • Eric Paris's avatar
      inotify: fix error paths in inotify_update_watch · 7e790dd5
      Eric Paris authored
      
      inotify_update_watch could leave things in a horrid state on a number of
      error paths.  We could try to remove idr entries that didn't exist, we
      could send an IN_IGNORED to userspace for watches that don't exist, and a
      bit of other stupidity.  Clean these up by doing the idr addition before we
      put the mark on the inode since we can clean that up on error and getting
      off the inode's mark list is hard.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      7e790dd5
    • Eric Paris's avatar
      inotify: do not leak inode marks in inotify_add_watch · 75fe2b26
      Eric Paris authored
      
      inotify_add_watch had a couple of problems.  The biggest being that if
      inotify_add_watch was called on the same inode twice (to update or change the
      event mask) a refence was taken on the original inode mark by
      fsnotify_find_mark_entry but was not being dropped at the end of the
      inotify_add_watch call.  Thus if inotify_rm_watch was called although the mark
      was removed from the inode, the refcnt wouldn't hit zero and we would leak
      memory.
      Reported-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      75fe2b26
    • Eric Paris's avatar
      inotify: drop user watch count when a watch is removed · 5549f7cd
      Eric Paris authored
      
      The inotify rewrite forgot to drop the inotify watch use cound when a watch
      was removed.  This means that a single inotify fd can only ever register a
      maximum of /proc/sys/fs/max_user_watches even if some of those had been
      freed.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      5549f7cd
  7. 02 Jul, 2009 1 commit
  8. 19 Jun, 2009 1 commit
    • Eric Paris's avatar
      inotify: inotify_destroy_mark_entry could get called twice · 528da3e9
      Eric Paris authored
      
      inotify_destroy_mark_entry could get called twice for the same mark since it
      is called directly in inotify_rm_watch and when the mark is being destroyed for
      another reason.  As an example assume that the file being watched was just
      deleted so inotify_destroy_mark_entry would get called from the path
      fsnotify_inoderemove() -> fsnotify_destroy_marks_by_inode() ->
      fsnotify_destroy_mark_entry() -> inotify_destroy_mark_entry().  If this
      happened at the same time as userspace tried to remove a watch via
      inotify_rm_watch we could attempt to remove the mark from the idr twice and
      could thus double dec the ref cnt and potentially could be in a use after
      free/double free situation.  The fix is to have inotify_rm_watch use the
      generic recursive safe fsnotify_destroy_mark_by_entry() so we are sure the
      inotify_destroy_mark_entry() function can only be called one.
      
      This patch also renames the function to inotify_ingored_remove_idr() so it is
      clear what is actually going on in the function.
      
      Hopefully this fixes:
      [   20.342058] idr_remove called for id=20 which is not allocated.
      [   20.348000] Pid: 1860, comm: udevd Not tainted 2.6.30-tip #1077
      [   20.353933] Call Trace:
      [   20.356410]  [<ffffffff811a82b7>] idr_remove+0x115/0x18f
      [   20.361737]  [<ffffffff8134259d>] ? _spin_lock+0x6d/0x75
      [   20.367061]  [<ffffffff8111640a>] ? inotify_destroy_mark_entry+0xa3/0xcf
      [   20.373771]  [<ffffffff8111641e>] inotify_destroy_mark_entry+0xb7/0xcf
      [   20.380306]  [<ffffffff81115913>] inotify_freeing_mark+0xe/0x10
      [   20.386238]  [<ffffffff8111410d>] fsnotify_destroy_mark_by_entry+0x143/0x170
      [   20.393293]  [<ffffffff811163a3>] inotify_destroy_mark_entry+0x3c/0xcf
      [   20.399829]  [<ffffffff811164d1>] sys_inotify_rm_watch+0x9b/0xc6
      [   20.405850]  [<ffffffff8100bcdb>] system_call_fastpath+0x16/0x1b
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Tested-by: default avatarPeter Ziljlstra <peterz@infradead.org>
      528da3e9
  9. 11 Jun, 2009 15 commits
    • Eric Paris's avatar
      fsnotify: allow groups to set freeing_mark to null · a092ee20
      Eric Paris authored
      
      Most fsnotify listeners (all but inotify) do not care about marks being
      freed.  Allow groups to set freeing_mark to null and do not call any
      function if it is set that way.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      a092ee20
    • Eric Paris's avatar
      inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD · e42e2773
      Eric Paris authored
      
      inotify and dnotify will both indicate that they want any event which came
      from a child inode.  The fix is to mask off FS_EVENT_ON_CHILD when deciding
      if inotify or dnotify is interested in a given event.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      e42e2773
    • Eric Paris's avatar
      dnotify: do not bother to lock entry->lock when reading mask · ce61856b
      Eric Paris authored
      
      entry->lock is needed to make sure entry->mask does not change while
      manipulating it.  In dnotify_should_send_event() we don't care if we get an
      old or a new mask value out of this entry so there is no point it taking
      the lock.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      ce61856b
    • Eric Paris's avatar
      dnotify: do not use ?true:false when assigning to a bool · 5ac697b7
      Eric Paris authored
      
      dnotify_should send event assigned a bool using ?true:false when computing
      a bit operation.  This is poitless and the bool type does this for us.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      5ac697b7
    • Eric Paris's avatar
      inotify: reimplement inotify using fsnotify · 63c882a0
      Eric Paris authored
      
      Reimplement inotify_user using fsnotify.  This should be feature for feature
      exactly the same as the original inotify_user.  This does not make any changes
      to the in kernel inotify feature used by audit.  Those patches (and the eventual
      removal of in kernel inotify) will come after the new inotify_user proves to be
      working correctly.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      63c882a0
    • Eric Paris's avatar
      fsnotify: handle filesystem unmounts with fsnotify marks · 164bc619
      Eric Paris authored
      
      When an fs is unmounted with an fsnotify mark entry attached to one of its
      inodes we need to destroy that mark entry and we also (like inotify) send
      an unmount event.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      164bc619
    • Eric Paris's avatar
      fsnotify: fsnotify marks on inodes pin them in core · 1ef5f13c
      Eric Paris authored
      
      This patch pins any inodes with an fsnotify mark in core.  The idea is that
      as soon as the mark is removed from the inode->fsnotify_mark_entries list
      the inode will be iput.  In reality is doesn't quite work exactly this way.
      The igrab will happen when the mark is added to an inode, but the iput will
      happen when the inode pointer is NULL'd inside the mark.
      
      It's possible that 2 racing things will try to remove the mark from
      different directions.  One may try to remove the mark because of an
      explicit request and one might try to remove it because the inode was
      deleted.  It's possible that the removal because of inode deletion will
      remove the mark from the inode's list, but the removal by explicit request
      will actually set entry->inode == NULL; and call the iput.  This is safe.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      1ef5f13c
    • Eric Paris's avatar
      fsnotify: allow groups to add private data to events · e4aff117
      Eric Paris authored
      
      inotify needs per group information attached to events.  This patch allows
      groups to attach private information and implements a callback so that
      information can be freed when an event is being destroyed.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      e4aff117
    • Eric Paris's avatar
      fsnotify: add correlations between events · 47882c6f
      Eric Paris authored
      
      As part of the standard inotify events it includes a correlation cookie
      between two dentry move operations.  This patch includes the same behaviour
      in fsnotify events.  It is needed so that inotify userspace can be
      implemented on top of fsnotify.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      47882c6f
    • Eric Paris's avatar
      fsnotify: include pathnames with entries when possible · 62ffe5df
      Eric Paris authored
      
      When inotify wants to send events to a directory about a child it includes
      the name of the original file.  This patch collects that filename and makes
      it available for notification.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      62ffe5df
    • Eric Paris's avatar
      fsnotify: generic notification queue and waitq · a2d8bc6c
      Eric Paris authored
      
      inotify needs to do asyc notification in which event information is stored
      on a queue until the listener is ready to receive it.  This patch
      implements a generic notification queue for inotify (and later fanotify) to
      store events to be sent at a later time.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      a2d8bc6c
    • Eric Paris's avatar
      dnotify: reimplement dnotify using fsnotify · 3c5119c0
      Eric Paris authored
      
      Reimplement dnotify using fsnotify.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      3c5119c0
    • Eric Paris's avatar
      fsnotify: parent event notification · c28f7e56
      Eric Paris authored
      
      inotify and dnotify both use a similar parent notification mechanism.  We
      add a generic parent notification mechanism to fsnotify for both of these
      to use.  This new machanism also adds the dentry flag optimization which
      exists for inotify to dnotify.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      c28f7e56
    • Eric Paris's avatar
      fsnotify: add marks to inodes so groups can interpret how to handle those inodes · 3be25f49
      Eric Paris authored
      
      This patch creates a way for fsnotify groups to attach marks to inodes.
      These marks have little meaning to the generic fsnotify infrastructure
      and thus their meaning should be interpreted by the group that attached
      them to the inode's list.
      
      dnotify and inotify  will make use of these markings to indicate which
      inodes are of interest to their respective groups.  But this implementation
      has the useful property that in the future other listeners could actually
      use the marks for the exact opposite reason, aka to indicate which inodes
      it had NO interest in.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      3be25f49
    • Eric Paris's avatar
      fsnotify: unified filesystem notification backend · 90586523
      Eric Paris authored
      fsnotify is a backend for filesystem notification.  fsnotify does
      not provide any userspace interface but does provide the basis
      needed for other notification schemes such as dnotify.  fsnotify
      can be extended to be the backend for inotify or the upcoming
      fanotify.  fsnotify provides a mechanism for "groups" to register for
      some set of filesystem events and to then deliver those events to
      those groups for processing.
      
      fsnotify has a number of benefits, the first being actually shrinking the size
      of an inode.  Before fsnotify to support both dnotify and inotify an inode had
      
              unsigned long           i_dnotify_mask; /* Directory notify events */
              struct dnotify_struct   *i_dnotify; /* for directory notifications */
              struct list_head        inotify_watches; /* watches on this inode */
              struct mutex            inotify_mutex;  /* protects the watches list
      
      But with fsnotify this same functionallity (and more) is done with just
      
              __u32                   i_fsnotify_mask; /* all events for this inode */
              struct hlist_head       i_fsnotify_mark_entries; /* marks on this inode */
      
      That's right, inotify, dnotify, and fanotify all in 64 bits.  We used that
      much space just in inotify_watches alone, before this patch set.
      
      fsnotify object lifetime and locking is MUCH better than what we have today.
      inotify locking is incredibly complex.  See 8f7b0ba1 as an example of
      what's been busted since inception.  inotify needs to know internal semantics
      of superblock destruction and unmounting to function.  The inode pinning and
      vfs contortions are horrible.
      
      no fsnotify implementers do allocation under locks.  This means things like
      f04b30de
      
       which (due to an overabundance of caution) changes GFP_KERNEL to
      GFP_NOFS can be reverted.  There are no longer any allocation rules when using
      or implementing your own fsnotify listener.
      
      fsnotify paves the way for fanotify.  In brief fanotify is a notification
      mechanism that delivers the lisener both an 'event' and an open file descriptor
      to the object in question.  This means that fanotify is pathname agnostic.
      Some on lkml may not care for the original companies or users that pushed for
      TALPA, but fanotify was designed with flexibility and input for other users in
      mind.  The readahead group expressed interest in fanotify as it could be used
      to profile disk access on boot without breaking the audit system.  The desktop
      search groups have also expressed interest in fanotify as it solves a number
      of the race conditions and problems present with managing inotify when more
      than a limited number of specific files are of interest.  fanotify can provide
      for a userspace access control system which makes it a clean interface for AV
      vendors to hook without trying to do binary patching on the syscall table,
      LSM, and everywhere else they do their things today.  With this patch series
      fanotify can be implemented in less than 1200 lines of easy to review code.
      Almost all of which is the socket based user interface.
      
      This patch series builds fsnotify to the point that it can implement
      dnotify and inotify_user.  Patches exist and will be sent soon after
      acceptance to finish the in kernel inotify conversion (audit) and implement
      fanotify.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      90586523
  10. 06 May, 2009 1 commit
    • Wu Fengguang's avatar
      inotify: use GFP_NOFS in kernel_event() to work around a lockdep false-positive · 381a80e6
      Wu Fengguang authored
      
      There is what we believe to be a false positive reported by lockdep.
      
      inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
      kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
      dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock
      
      The plan is to fix this via lockdep annotation, but that is proving to be
      quite involved.
      
      The patch flips the allocation over to GFP_NFS to shut the warning up, for
      the 2.6.30 release.
      
      Hopefully we will fix this for real in 2.6.31.  I'll queue a patch in -mm
      to switch it back to GFP_KERNEL so we don't forget.
      
        =================================
        [ INFO: inconsistent lock state ]
        2.6.30-rc2-next-20090417 #203
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&inode->inotify_mutex){+.+.?.}, at: [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
        {RECLAIM_FS-ON-W} state was registered at:
          [<ffffffff81079188>] mark_held_locks+0x68/0x90
          [<ffffffff810792a5>] lockdep_trace_alloc+0xf5/0x100
          [<ffffffff810f5261>] __kmalloc_node+0x31/0x1e0
          [<ffffffff81130652>] kernel_event+0xe2/0x190
          [<ffffffff81130826>] inotify_dev_queue_event+0x126/0x230
          [<ffffffff8112f096>] inotify_inode_queue_event+0xc6/0x110
          [<ffffffff8110444d>] vfs_create+0xcd/0x140
          [<ffffffff8110825d>] do_filp_open+0x88d/0xa20
          [<ffffffff810f6b68>] do_sys_open+0x98/0x140
          [<ffffffff810f6c50>] sys_open+0x20/0x30
          [<ffffffff8100c272>] system_call_fastpath+0x16/0x1b
          [<ffffffffffffffff>] 0xffffffffffffffff
        irq event stamp: 690455
        hardirqs last  enabled at (690455): [<ffffffff81564fe4>] _spin_unlock_irqrestore+0x44/0x80
        hardirqs last disabled at (690454): [<ffffffff81565372>] _spin_lock_irqsave+0x32/0xa0
        softirqs last  enabled at (690178): [<ffffffff81052282>] __do_softirq+0x202/0x220
        softirqs last disabled at (690157): [<ffffffff8100d50c>] call_softirq+0x1c/0x50
      
        other info that might help us debug this:
        2 locks held by kswapd0/380:
         #0:  (shrinker_rwsem){++++..}, at: [<ffffffff810d0bd7>] shrink_slab+0x37/0x180
         #1:  (&type->s_umount_key#17){++++..}, at: [<ffffffff8110cfbf>] shrink_dcache_memory+0x11f/0x1e0
      
        stack backtrace:
        Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
        Call Trace:
         [<ffffffff810789ef>] print_usage_bug+0x19f/0x200
         [<ffffffff81018bff>] ? save_stack_trace+0x2f/0x50
         [<ffffffff81078f0b>] mark_lock+0x4bb/0x6d0
         [<ffffffff810799e0>] ? check_usage_forwards+0x0/0xc0
         [<ffffffff8107b142>] __lock_acquire+0xc62/0x1ae0
         [<ffffffff810f478c>] ? slob_free+0x10c/0x370
         [<ffffffff8107c0a1>] lock_acquire+0xe1/0x120
         [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
         [<ffffffff81562d43>] mutex_lock_nested+0x63/0x420
         [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
         [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
         [<ffffffff81012fe9>] ? sched_clock+0x9/0x10
         [<ffffffff81077165>] ? lock_release_holdtime+0x35/0x1c0
         [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
         [<ffffffff8110c9dc>] dentry_iput+0xbc/0xe0
         [<ffffffff8110cb23>] d_kill+0x33/0x60
         [<ffffffff8110ce23>] __shrink_dcache_sb+0x2d3/0x350
         [<ffffffff8110cffa>] shrink_dcache_memory+0x15a/0x1e0
         [<ffffffff810d0cc5>] shrink_slab+0x125/0x180
         [<ffffffff810d1540>] kswapd+0x560/0x7a0
         [<ffffffff810ce160>] ? isolate_pages_global+0x0/0x2c0
         [<ffffffff81065a30>] ? autoremove_wake_function+0x0/0x40
         [<ffffffff8107953d>] ? trace_hardirqs_on+0xd/0x10
         [<ffffffff810d0fe0>] ? kswapd+0x0/0x7a0
         [<ffffffff8106555b>] kthread+0x5b/0xa0
         [<ffffffff8100d40a>] child_rip+0xa/0x20
         [<ffffffff8100cdd0>] ? restore_args+0x0/0x30
         [<ffffffff81065500>] ? kthread+0x0/0xa0
         [<ffffffff8100d400>] ? child_rip+0x0/0x20
      
      [eparis@redhat.com: fix audit too]
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      381a80e6
  11. 27 Mar, 2009 1 commit
    • Nick Piggin's avatar
      fs: avoid I_NEW inodes · aabb8fdb
      Nick Piggin authored
      
      To be on the safe side, it should be less fragile to exclude I_NEW inodes
      from inode list scans by default (unless there is an important reason to
      have them).
      
      Normally they will get excluded (eg.  by zero refcount or writecount etc),
      however it is a bit fragile for list walkers to know exactly what parts of
      the inode state is set up and valid to test when in I_NEW.  So along these
      lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc
      checks with them too -- this shouldn't be a problem should it?)
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      aabb8fdb
  12. 18 Feb, 2009 1 commit
    • Ingo Molnar's avatar
      inotify: fix GFP_KERNEL related deadlock · f04b30de
      Ingo Molnar authored
      
      Enhanced lockdep coverage of __GFP_NOFS turned up this new lockdep
      assert:
      
      [ 1093.677775]
      [ 1093.677781] =================================
      [ 1093.680031] [ INFO: inconsistent lock state ]
      [ 1093.680031] 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
      [ 1093.680031] ---------------------------------
      [ 1093.680031] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      [ 1093.680031] kswapd0/308 [HC0[0]:SC0[0]:HE1:SE1] takes:
      [ 1093.680031]  (&inode->inotify_mutex){+.+.?.}, at: [<c0205942>] inotify_inode_is_dead+0x20/0x80
      [ 1093.680031] {RECLAIM_FS-ON-W} state was registered at:
      [ 1093.680031]   [<c01696b9>] mark_held_locks+0x43/0x5b
      [ 1093.680031]   [<c016baa4>] lockdep_trace_alloc+0x6c/0x6e
      [ 1093.680031]   [<c01cf8b0>] kmem_cache_alloc+0x20/0x150
      [ 1093.680031]   [<c040d0ec>] idr_pre_get+0x27/0x6c
      [ 1093.680031]   [<c02056e3>] inotify_handle_get_wd+0x25/0xad
      [ 1093.680031]   [<c0205f43>] inotify_add_watch+0x7a/0x129
      [ 1093.680031]   [<c020679e>] sys_inotify_add_watch+0x20f/0x250
      [ 1093.680031]   [<c010389e>] sysenter_do_call+0x12/0x35
      [ 1093.680031]   [<ffffffff>] 0xffffffff
      [ 1093.680031] irq event stamp: 60417
      [ 1093.680031] hardirqs last  enabled at (60417): [<c018d5f5>] call_rcu+0x53/0x59
      [ 1093.680031] hardirqs last disabled at (60416): [<c018d5b9>] call_rcu+0x17/0x59
      [ 1093.680031] softirqs last  enabled at (59656): [<c0146229>] __do_softirq+0x157/0x16b
      [ 1093.680031] softirqs last disabled at (59651): [<c0106293>] do_softirq+0x74/0x15d
      [ 1093.680031]
      [ 1093.680031] other info that might help us debug this:
      [ 1093.680031] 2 locks held by kswapd0/308:
      [ 1093.680031]  #0:  (shrinker_rwsem){++++..}, at: [<c01b0502>] shrink_slab+0x36/0x189
      [ 1093.680031]  #1:  (&type->s_umount_key#4){+++++.}, at: [<c01e6d77>] shrink_dcache_memory+0x110/0x1fb
      [ 1093.680031]
      [ 1093.680031] stack backtrace:
      [ 1093.680031] Pid: 308, comm: kswapd0 Not tainted 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
      [ 1093.680031] Call Trace:
      [ 1093.680031]  [<c016947a>] valid_state+0x12a/0x13d
      [ 1093.680031]  [<c016954e>] mark_lock+0xc1/0x1e9
      [ 1093.680031]  [<c016a5b4>] ? check_usage_forwards+0x0/0x3f
      [ 1093.680031]  [<c016ab74>] __lock_acquire+0x2c6/0xac8
      [ 1093.680031]  [<c01688d9>] ? register_lock_class+0x17/0x228
      [ 1093.680031]  [<c016b3d3>] lock_acquire+0x5d/0x7a
      [ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
      [ 1093.680031]  [<c08824c4>] __mutex_lock_common+0x3a/0x4cb
      [ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
      [ 1093.680031]  [<c08829ed>] mutex_lock_nested+0x2e/0x36
      [ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
      [ 1093.680031]  [<c0205942>] inotify_inode_is_dead+0x20/0x80
      [ 1093.680031]  [<c01e6672>] dentry_iput+0x90/0xc2
      [ 1093.680031]  [<c01e67a3>] d_kill+0x21/0x45
      [ 1093.680031]  [<c01e6a46>] __shrink_dcache_sb+0x27f/0x355
      [ 1093.680031]  [<c01e6dc5>] shrink_dcache_memory+0x15e/0x1fb
      [ 1093.680031]  [<c01b05ed>] shrink_slab+0x121/0x189
      [ 1093.680031]  [<c01b0d12>] kswapd+0x39f/0x561
      [ 1093.680031]  [<c01ae499>] ? isolate_pages_global+0x0/0x233
      [ 1093.680031]  [<c0157eae>] ? autoremove_wake_function+0x0/0x43
      [ 1093.680031]  [<c01b0973>] ? kswapd+0x0/0x561
      [ 1093.680031]  [<c0157daf>] kthread+0x41/0x82
      [ 1093.680031]  [<c0157d6e>] ? kthread+0x0/0x82
      [ 1093.680031]  [<c01043ab>] kernel_thread_helper+0x7/0x10
      
      inotify_handle_get_wd() does idr_pre_get() which does a
      kmem_cache_alloc() without __GFP_FS - and is hence deadlockable under
      extreme MM pressure.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: MinChan Kim <minchan.kim@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f04b30de