1. 07 Mar, 2012 1 commit
      vm: avoid using find_vma_prev() unnecessarily · 097d5910
      Linus Torvalds authored
      Several users of "find_vma_prev()" were not in fact interested in the
      previous vma if there was no primary vma to be found either.  And in
      those cases, we're much better off just using the regular "find_vma()",
      and then "prev" can be looked up by just checking vma->vm_prev.
      
      The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
      commit 83cd904d: "mm: fix find_vma_prev"), and the whole "return
      prev by reference" means that it generates worse code too.
      
      Thus this "let's avoid using this inconvenient and clearly too subtle
      interface when we don't really have to" patch.
      
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 01 Nov, 2011 2 commits
  3. 31 Oct, 2011 1 commit
  4. 26 May, 2011 1 commit
  5. 05 May, 2011 1 commit
      VM: skip the stack guard page lookup in get_user_pages only for mlock · a1fde08c
      Linus Torvalds authored
      
      The logic in __get_user_pages() used to skip the stack guard page lookup
      whenever the caller wasn't interested in seeing what the actual page
      was.  But Michel Lespinasse points out that there are cases where we
      don't care about the physical page itself (so 'pages' may be NULL), but
      do want to make sure a page is mapped into the virtual address space.
      
      So using the existence of the "pages" array as an indication of whether
      to look up the guard page or not isn't actually so great, and we really
      should just use the FOLL_MLOCK bit.  But because that bit was only set
      for the VM_LOCKED case (and not all vma's necessarily have it, even for
      mlock()), we couldn't do that originally.
      
      Fix that by moving the VM_LOCKED check deeper into the call-chain, which
      actually simplifies many things.  Now mlock() gets simpler, and we can
      also check for FOLL_MLOCK in __get_user_pages() and the code ends up
      much more straightforward.
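
      A hedged sketch of the resulting test in __get_user_pages() (helper
      and variable names follow the description above; treat the exact
      form as an assumption):

              /* before: the guard-page lookup was skipped whenever the
               * caller passed no 'pages' array */
              if (!pages && stack_guard_page(vma, start))
                      goto next_page;

              /* after: key off the caller's mlock intent instead */
              if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
                      goto next_page;
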
      Reported-and-reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 12 Apr, 2011 1 commit
      vm: fix mlock() on stack guard page · 95042f9e
      Linus Torvalds authored
      Commit 53a7706d ("mlock: do not hold mmap_sem for extended periods
      of time") changed mlock() to care about the exact number of pages
      that __get_user_pages() had brought in.  Before, it would only care
      about errors.
      
      And that doesn't work, because we also handled one page specially in
      __mlock_vma_pages_range(), namely the stack guard page.  So when that
      case was handled, the number of pages that the function returned was off
      by one.  In particular, it could be zero, and then the caller would end
      up not making any progress at all.
      
      Rather than try to fix up that off-by-one error for the mlock case
      specially, this just moves the logic to handle the stack guard page
      into __get_user_pages() itself, thus making all the counts come out
      right automatically.
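
      A hedged sketch of why the counts now come out right: the guard page
      is skipped inside the walk but still falls through to the common
      per-page accounting (label and variables are assumptions):

              if (!pages && stack_guard_page(vma, start))
                      goto next_page;         /* no page, but still counted */

              /* ... normal page lookup/fault happens here ... */

      next_page:
              if (vmas)
                      vmas[i] = vma;
              i++;
              start += PAGE_SIZE;
              nr_pages--;
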
      Reported-by: Robert Święcki <robert@swiecki.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 23 Mar, 2011 1 commit
  8. 01 Feb, 2011 1 commit
  9. 14 Jan, 2011 5 commits
      mlock: do not hold mmap_sem for extended periods of time · 53a7706d
      Michel Lespinasse authored
      
      __get_user_pages gets a new 'nonblocking' parameter to signal that the
      caller is prepared to re-acquire mmap_sem and retry the operation if
      needed.  This is used to split off long operations if they are going to
      block on a disk transfer, or when we detect contention on the mmap_sem.
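
      A hedged sketch of the caller-side retry pattern this enables
      (__get_user_pages() clears *nonblocking when it had to drop
      mmap_sem; the loop shape and names are assumptions):

              int locked = 1;

              down_read(&mm->mmap_sem);
              while (nr_pages > 0) {
                      if (!locked) {
                              /* gup dropped mmap_sem for a disk wait:
                               * re-take it and re-look-up the vma */
                              locked = 1;
                              down_read(&mm->mmap_sem);
                      }
                      ret = __get_user_pages(tsk, mm, start, nr_pages,
                                             gup_flags, NULL, NULL, &locked);
                      if (ret <= 0)
                              break;
                      start += ret * PAGE_SIZE;
                      nr_pages -= ret;
              }
              if (locked)
                      up_read(&mm->mmap_sem);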
      
      [akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: move VM_LOCKED check to __mlock_vma_pages_range() · 5fdb2002
      Michel Lespinasse authored
      
      Use a single code path for faulting in pages during mlock.
      
      The reason to have it in this patch series is that I did not want to
      update both code paths in a later change that releases mmap_sem when
      blocking on disk.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: add FOLL_MLOCK follow_page flag. · 110d74a9
      Michel Lespinasse authored
      
      Move the code to mlock pages from __mlock_vma_pages_range() to
      follow_page().
      
      This allows __mlock_vma_pages_range() to not have to break down work into
      16-page batches.
      
      An additional motivation for doing this within the present patch series is
      that it'll make it easier for a later change to drop mmap_sem when
      blocking on disk (we'd like to be able to resume at the page that was read
      from disk instead of at the start of a 16-page batch).
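
      A hedged sketch of the follow_page() side (shape follows the
      description above; details are assumptions):

              if (flags & FOLL_MLOCK) {
                      if (page->mapping && trylock_page(page)) {
                              lru_add_drain();        /* push cached pages to the LRU */
                              mlock_vma_page(page);   /* move to the unevictable list */
                              unlock_page(page);
                      }
              }
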
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mlock: only hold mmap_sem in shared mode when faulting in pages · fed067da
      Michel Lespinasse authored
      
      Currently mlock() holds mmap_sem in exclusive mode while the pages get
      faulted in.  In the case of a large mlock, this can potentially take a
      very long time, during which various commands such as 'ps auxw' will
      block.  This makes sysadmins unhappy:
      
      real    14m36.232s
      user    0m0.003s
      sys     0m0.015s
      (output from 'time ps auxw' while a 20GB file was being mlocked without
      being previously preloaded into page cache)
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages, in small batches, releasing mmap_sem when we
      block on disk access or when we detect some contention.
      
      This patch:
      
      Before this change, mlock() holds mmap_sem in exclusive mode while the
      pages get faulted in.  In the case of a large mlock, this can potentially
      take a very long time.  Various things will block while mmap_sem is held,
      including 'ps auxw'.  This can make sysadmins angry.
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages with mmap_sem held for reads only.  We need to
      recheck the vma flags after we re-acquire mmap_sem, but this is easy.
      
      In the case where a vma has been munlocked before mlock completes, pages
      that were already marked as PageMlocked() are handled by the munlock()
      call, and mlock() is careful to not mark new page batches as PageMlocked()
      after the munlock() call has cleared the VM_LOCKED vma flags.  So, the end
      result will be identical to what'd happen if munlock() had executed after
      the mlock() call.
      
      In a later change, I will allow the second pass to release mmap_sem when
      blocking on disk accesses or when it is otherwise contended, so that it
      won't be held for long periods of time even in shared mode.
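
      A hedged sketch of the two-pass structure this series builds
      (helper names like do_mlock()/do_mlock_pages() are illustrative):

              /* pass 1: flag the vmas VM_LOCKED under the exclusive lock */
              down_write(&mm->mmap_sem);
              error = do_mlock(start, len, 1);
              up_write(&mm->mmap_sem);
              if (error)
                      return error;

              /* pass 2: fault the pages in with the lock held shared;
               * vma flags must be rechecked after re-acquiring mmap_sem */
              down_read(&mm->mmap_sem);
              error = do_mlock_pages(start, len);
              up_read(&mm->mmap_sem);
              return error;
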
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mlock: avoid dirtying pages and triggering writeback · 5ecfda04
      Michel Lespinasse authored
      
      When faulting in pages for mlock(), we want to break COW for anonymous
      or file pages within VM_WRITE, non-VM_SHARED vmas.  However, there is no
      need to write-fault into VM_SHARED vmas since shared file pages can be
      mlocked first and dirtied later, when/if they actually get written to.
      Skipping the write fault is desirable, as we don't want to unnecessarily
      cause these pages to be dirtied and queued for writeback.
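
      A sketch of the flag computation this implies in the mlock fault
      path (close in shape to the description; treat details as
      assumptions):

              int gup_flags = FOLL_TOUCH | FOLL_GET;

              /* write-fault (break COW) only when the vma is writable
               * and not shared */
              if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                      gup_flags |= FOLL_WRITE;
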
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Ts'o <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 09 Sep, 2010 1 commit
  11. 21 Aug, 2010 1 commit
  12. 15 Aug, 2010 1 commit
      mm: fix up some user-visible effects of the stack guard page · d7824370
      Linus Torvalds authored
      
      This commit makes the stack guard page somewhat less visible to user
      space. It does this by:
      
       - not showing the guard page in /proc/<pid>/maps

         It looks like lvm-tools will actually read /proc/self/maps to figure
         out where all its mappings are, and effectively do a specialized
         "mlockall()" in user space.  By not showing the guard page as part of
         the mapping (by just adding PAGE_SIZE to the start for grows-down
         pages; see the sketch after this list), lvm-tools ends up not being
         aware of it.
      
       - by also teaching the _real_ mlock() functionality not to try to lock
         the guard page.
      
         That would just expand the mapping down to create a new guard page,
         so there really is no point in trying to lock it in place.
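
      A hedged sketch of the first point, as it would look in the
      /proc/<pid>/maps formatting code (grows-up architectures would need
      the mirror-image adjustment at vm_end):

              unsigned long start = vma->vm_start;

              /* don't show the stack guard page in /proc/<pid>/maps */
              if (vma->vm_flags & VM_GROWSDOWN)
                      start += PAGE_SIZE;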
      
      It would perhaps be nice to show the guard page specially in
      /proc/<pid>/maps (or at least mark grow-down segments some way), but
      let's not open ourselves up to more breakage by user space from programs
      that depend on the exact details of the 'maps' file.
      
      Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
      source code to see what was going on with the whole new warning.
      
      Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be>
      Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 26 Mar, 2010 1 commit
      x86, perf, bts, mm: Delete the never used BTS-ptrace code · faa4602e
      Peter Zijlstra authored
      
      Support for the PMU's BTS features has been upstreamed in
      v2.6.32, but we still have the old and disabled ptrace-BTS,
      as Linus noticed not so long ago.
      
      It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
      regard for other uses (perf) and doesn't provide the flexibility
      needed for perf either.
      
      Its users are ptrace-block-step and ptrace-bts; ptrace-bts was never
      used, and ptrace-block-step can be implemented using a much simpler
      approach.
      
      So axe all 3000 lines of it. That includes the *locked_memory*()
      APIs in mm/mlock.c as well.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Markus Metzger <markus.t.metzger@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <20100325135413.938004390@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 06 Mar, 2010 1 commit
  15. 15 Dec, 2009 3 commits
  16. 22 Sep, 2009 3 commits
      mm: m(un)lock avoid ZERO_PAGE · 6e919717
      Hugh Dickins authored
      
      I'm still reluctant to clutter __get_user_pages() with another flag, just
      to avoid touching ZERO_PAGE count in mlock(); though we can add that later
      if it shows up as an issue in practice.
      
      But when mlocking, we can test page->mapping slightly earlier, to avoid
      the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
      didn't lock_page in olden ZERO_PAGE days, so we might have regressed.
      
      And when munlocking, it turns out that FOLL_DUMP coincidentally does
      what's needed to avoid all updates to ZERO_PAGE, so use that here also.
      Plus add comment suggested by KAMEZAWA Hiroyuki.
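
      A hedged sketch of the munlock side described above (FOLL_DUMP
      making follow_page() fail on ZERO_PAGE is the documented effect;
      the loop body is illustrative):

              page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
              if (page && !IS_ERR(page)) {    /* ZERO_PAGE never gets here */
                      lock_page(page);
                      munlock_vma_page(page);
                      unlock_page(page);
                      put_page(page);
              }
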
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: FOLL flags for GUP flags · 58fa879e
      Hugh Dickins authored
      
      __get_user_pages() has been taking its own GUP flags, then processing
      them into FOLL flags for follow_page().  Though oddly named, the FOLL
      flags are more widely used, so pass them to __get_user_pages() now.
      Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.
      
      (The patch to __get_user_pages() looks peculiar, with both gup_flags
      and foll_flags: the gup_flags remain constant; but as before there's
      an exceptional case, out of scope of the patch, in which foll_flags
      per page have FOLL_WRITE masked off.)
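
      A hedged sketch of a caller after this change, passing FOLL_* bits
      straight in (the flag combination and exact signature are
      assumptions for illustration):

              ret = __get_user_pages(current, mm, start, nr_pages,
                                     FOLL_TOUCH | FOLL_GET | FOLL_WRITE,
                                     pages, NULL);
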
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: munlock use follow_page · 408e82b7
      Hugh Dickins authored
      
      Hiroaki Wakabayashi points out that when mlock() has been interrupted
      by SIGKILL, the subsequent munlock() takes unnecessarily long because
      its use of __get_user_pages() insists on faulting in all the pages
      which mlock() never reached.
      
      It's worse than slowness if mlock() is terminated by Out Of Memory kill:
      the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
      pages which mlock() could not find memory for; so innocent bystanders are
      killed too, and perhaps the system hangs.
      
      __get_user_pages() does a lot that's silly for munlock(): so remove the
      munlock option from __mlock_vma_pages_range(), and use a simple loop of
      follow_page()s in munlock_vma_pages_range() instead; ignoring absent
      pages, and not marking present pages as accessed or dirty.
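
      A hedged sketch of that loop's shape (illustrative, not the patch
      verbatim):

              for (addr = start; addr < end; addr += PAGE_SIZE) {
                      struct page *page = follow_page(vma, addr, FOLL_GET);

                      if (page) {     /* absent pages are simply skipped */
                              lock_page(page);
                              if (page->mapping)
                                      munlock_vma_page(page);
                              unlock_page(page);
                              put_page(page);
                      }
                      cond_resched();
              }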
      
      (Change munlock() to only go so far as mlock() reached?  That does not
      work out, given the convention that mlock() claims complete success even
      when it has to give up early - in part so that an underlying file can be
      extended later, and those pages locked which earlier would give SIGBUS.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: <stable@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Hiroaki Wakabayashi <primulaelatior@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 17 Jun, 2009 1 commit
  18. 24 Apr, 2009 1 commit
  19. 08 Apr, 2009 1 commit
  20. 07 Apr, 2009 1 commit
      mm, x86, ptrace, bts: defer branch trace stopping · e2b371f0
      Markus Metzger authored
      
      When a ptraced task is unlinked, we need to stop branch tracing for
      that task.
      
      Since the unlink is called with interrupts disabled, and we need
      interrupts enabled to stop branch tracing, we defer the work.
      
      Collect all branch tracing related stuff in a branch tracing context.
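
      A hedged sketch of the generic defer-to-process-context pattern
      used here (all names are illustrative, not the patch's):

              static void bts_free_work(struct work_struct *work)
              {
                      /* runs later with interrupts enabled, so it is
                       * safe to stop branch tracing here */
              }

              /* at unlink time, with interrupts disabled: */
              INIT_WORK(&context->work, bts_free_work);
              schedule_work(&context->work);
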
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: roland@redhat.com
      Cc: eranian@googlemail.com
      Cc: juan.villacis@intel.com
      Cc: ak@linux.jf.intel.com
      LKML-Reference: <20090403144550.712401000@intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  21. 11 Feb, 2009 1 commit
  22. 08 Feb, 2009 1 commit
  23. 01 Feb, 2009 1 commit
      Manually revert "mlock: downgrade mmap sem while populating mlocked regions" · 27421e21
      Linus Torvalds authored
      This essentially reverts commit 8edb08ca.
      
      It downgraded our mmap semaphore to a read-lock while mlocking pages, in
      order to allow other threads (and external accesses like "ps" et al) to
      walk the vma lists and take page faults etc.  Which is a nice idea, but
      the implementation does not work.
      
      Because we cannot upgrade the lock back to a write lock without
      releasing the mmap semaphore, the code had to release the lock entirely
      and then re-take it as a writelock.  However, that meant that the caller
      possibly lost the vma chain that it was following, since now another
      thread could come in and mmap/munmap the range.
      
      The code tried to work around that by just looking up the vma again and
      erroring out if that happened, but quite frankly, that was just a buggy
      hack that doesn't actually protect against anything (the other thread
      could just have replaced the vma with another one instead of totally
      unmapping it).
      
      The only way to downgrade to a read map _reliably_ is to do it at the
      end, which is likely the right thing to do: do all the 'vma' operations
      with the write-lock held, then downgrade to a read after completing them
      all, and then do the "populate the newly mlocked regions" while holding
      just the read lock.  And then just drop the read-lock and return to user
      space.
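
      A hedged sketch of that ordering (the rwsem primitives are real;
      the body is illustrative):

              down_write(&mm->mmap_sem);
              /* ... all vma manipulation under the write lock ... */
              downgrade_write(&mm->mmap_sem); /* atomic write -> read;
                                               * no writer can slip in */
              /* ... populate the newly mlocked regions, read lock held ... */
              up_read(&mm->mmap_sem);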
      
      The (perhaps somewhat simpler) alternative is to just make all the
      callers of mlock_vma_pages_range() know that the mmap lock got dropped,
      and just re-grab the mmap semaphore if it needs to mlock more than one
      vma region.
      
      So we can do this "downgrade mmap sem while populating mlocked regions"
      thing right, but the way it was done here was absolutely not correct.
      Thus the revert, in the expectation that we will do it all correctly
      some day.
      
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 14 Jan, 2009 2 commits
  25. 06 Jan, 2009 1 commit
      mm: make get_user_pages() interruptible · 4779280d
      Ying Han authored
      
      The initial implementation of checking TIF_MEMDIE covers the OOM-kill
      case: if the process has been OOM killed, TIF_MEMDIE is set and
      get_user_pages() returns immediately.  This patch includes:

      1.  add the case where the SIGKILL is sent by a user process.  A
         process can keep taking unlimited memory through get_user_pages()
         even after a user process has sent it a SIGKILL (maybe a monitor
         found the process exceeding its memory limit and tried to kill it).
         In the old implementation, the SIGKILL wasn't handled until
         get_user_pages() returned.

      2.  change the return value to ERESTARTSYS.  It makes no sense to
         return ENOMEM if get_user_pages() bailed out because a SIGKILL was
         received.  The general convention for a system call interrupted by
         a signal is ERESTARTSYS, so the new return value is consistent with
         that (see the sketch after this list).
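
      A hedged sketch of the check inside __get_user_pages()'s loop
      (fatal_signal_pending() is the real helper; the placement is an
      assumption):

              if (fatal_signal_pending(current))
                      return i ? i : -ERESTARTSYS;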
      
      Lee:
      
      An unfortunate side effect of "make-get_user_pages-interruptible" is that
      it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked,
      resulting in freeing of mlocked pages.  Freeing of mlocked pages, in
      itself, is not so bad.  We just count them now--altho' I had hoped to
      remove this stat and add PG_MLOCKED to the free pages flags check.
      
      However, consider pages in shared libraries mapped by more than one task
      that a task mlocked--e.g., via mlockall().  If the task that mlocked the
      pages exits via SIGKILL, these pages would be left mlocked and
      unevictable.
      
      Proposed fix:
      
      Add another GUP flag to ignore sigkill when calling get_user_pages from
      munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS' flag for
      the same purpose.  We are not actually allocating memory in this case,
      which "make-get_user_pages-interruptible" intends to avoid.  We're just
      munlocking pages that are already resident and mapped, and we're reusing
      get_user_pages() to access those pages.
      
      ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS' and '_IGNORE_SIGKILL'
      into a single flag: GUP_FLAGS_MUNLOCK ???
      
      [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
      Signed-off-by: Paul Menage <menage@google.com>
      Signed-off-by: Ying Han <yinghan@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 20 Dec, 2008 1 commit
  27. 16 Nov, 2008 1 commit
  28. 13 Nov, 2008 1 commit
      mm: remove lru_add_drain_all() from the munlock path · 8891d6da
      KOSAKI Motohiro authored
      
      lockdep warns with the following message at boot time on one of my
      test machines: schedule_on_each_cpu() shouldn't be called while the
      task holds mmap_sem.

      Actually, lru_add_drain_all() exists to prevent unevictable pages
      from staying on a reclaimable LRU list.  But the current unevictable
      code can rescue unevictable pages even while they sit on a
      reclaimable list, so removing the call is better.

      In addition, this patch adds lru_add_drain_all() to sys_mlock() and
      sys_mlockall(), before mmap_sem is taken.  It isn't required, but it
      reduces the chance of pages failing to move to the unevictable list;
      such failures can be rescued by vmscan later, but reducing them is
      better.

      Note that if the above rescuing happens, the Mlocked and Unevictable
      fields in /proc/meminfo can mismatch, but that doesn't cause any
      real trouble.
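
      A hedged sketch of the new call placement (pre-SYSCALL_DEFINE
      syscall style of this era; the body is illustrative):

              asmlinkage long sys_mlock(unsigned long start, size_t len)
              {
                      long error;

                      lru_add_drain_all();    /* flush the per-cpu pagevecs
                                               * before mmap_sem is taken */

                      down_write(&current->mm->mmap_sem);
                      /* ... do_mlock() under mmap_sem, as before ... */
                      up_write(&current->mm->mmap_sem);
                      return error;
              }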
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.28-rc2-mm1 #2
      -------------------------------------------------------
      lvm/1103 is trying to acquire lock:
       (&cpu_hotplug.lock){--..}, at: [<c0130789>] get_online_cpus+0x29/0x50
      
      but task is already holding lock:
       (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #3 (&mm->mmap_sem){----}:
             [<c0153da2>] check_noncircular+0x82/0x110
             [<c0185e6a>] might_fault+0x4a/0xa0
             [<c0156161>] validate_chain+0xb11/0x1070
             [<c0185e6a>] might_fault+0x4a/0xa0
             [<c0156923>] __lock_acquire+0x263/0xa10
             [<c015714c>] lock_acquire+0x7c/0xb0			(*) grab mmap_sem
             [<c0185e6a>] might_fault+0x4a/0xa0
             [<c0185e9b>] might_fault+0x7b/0xa0
             [<c0185e6a>] might_fault+0x4a/0xa0
             [<c0294dd0>] copy_to_user+0x30/0x60
             [<c01ae3ec>] filldir+0x7c/0xd0
             [<c01e3a6a>] sysfs_readdir+0x11a/0x1f0			(*) grab sysfs_mutex
             [<c01ae370>] filldir+0x0/0xd0
             [<c01ae370>] filldir+0x0/0xd0
             [<c01ae4c6>] vfs_readdir+0x86/0xa0			(*) grab i_mutex
             [<c01ae75b>] sys_getdents+0x6b/0xc0
             [<c010355a>] syscall_call+0x7/0xb
             [<ffffffff>] 0xffffffff
      
      -> #2 (sysfs_mutex){--..}:
             [<c0153da2>] check_noncircular+0x82/0x110
             [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
             [<c0156161>] validate_chain+0xb11/0x1070
             [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
             [<c0156923>] __lock_acquire+0x263/0xa10
             [<c015714c>] lock_acquire+0x7c/0xb0			(*) grab sysfs_mutex
             [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
             [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0
             [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
             [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
             [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
             [<c01e422f>] create_dir+0x3f/0x90
             [<c01e42a9>] sysfs_create_dir+0x29/0x50
             [<c04faaf5>] _spin_unlock+0x25/0x40
             [<c028f21d>] kobject_add_internal+0xcd/0x1a0
             [<c028f37a>] kobject_set_name_vargs+0x3a/0x50
             [<c028f41d>] kobject_init_and_add+0x2d/0x40
             [<c019d4d2>] sysfs_slab_add+0xd2/0x180
             [<c019d580>] sysfs_add_func+0x0/0x70
             [<c019d5dc>] sysfs_add_func+0x5c/0x70			(*) grab slub_lock
             [<c01400f2>] run_workqueue+0x172/0x200
             [<c014008f>] run_workqueue+0x10f/0x200
             [<c0140bd0>] worker_thread+0x0/0xf0
             [<c0140c6c>] worker_thread+0x9c/0xf0
             [<c0143c80>] autoremove_wake_function+0x0/0x50
             [<c0140bd0>] worker_thread+0x0/0xf0
             [<c0143972>] kthread+0x42/0x70
             [<c0143930>] kthread+0x0/0x70
             [<c01042db>] kernel_thread_helper+0x7/0x1c
             [<ffffffff>] 0xffffffff
      
      -> #1 (slub_lock){----}:
             [<c0153d2d>] check_noncircular+0xd/0x110
             [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
             [<c0156161>] validate_chain+0xb11/0x1070
             [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
             [<c015433d>] mark_lock+0x35d/0xd00
             [<c0156923>] __lock_acquire+0x263/0xa10
             [<c015714c>] lock_acquire+0x7c/0xb0
             [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
             [<c04f93a3>] down_read+0x43/0x80
             [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0		(*) grab slub_lock
             [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
             [<c04fd9ac>] notifier_call_chain+0x3c/0x70
             [<c04f5454>] _cpu_up+0x84/0x110
             [<c04f552b>] cpu_up+0x4b/0x70				(*) grab cpu_hotplug.lock
             [<c06d1530>] kernel_init+0x0/0x170
             [<c06d15e5>] kernel_init+0xb5/0x170
             [<c06d1530>] kernel_init+0x0/0x170
             [<c01042db>] kernel_thread_helper+0x7/0x1c
             [<ffffffff>] 0xffffffff
      
      -> #0 (&cpu_hotplug.lock){--..}:
             [<c0155bff>] validate_chain+0x5af/0x1070
             [<c040f7e0>] dev_status+0x0/0x50
             [<c0156923>] __lock_acquire+0x263/0xa10
             [<c015714c>] lock_acquire+0x7c/0xb0
             [<c0130789>] get_online_cpus+0x29/0x50
             [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0
             [<c0130789>] get_online_cpus+0x29/0x50
             [<c0130789>] get_online_cpus+0x29/0x50
             [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10
             [<c0130789>] get_online_cpus+0x29/0x50			(*) grab cpu_hotplug.lock
             [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0
             [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0
             [<c0156945>] __lock_acquire+0x285/0xa10
             [<c0188f09>] vma_merge+0xa9/0x1d0
             [<c0187450>] mlock_fixup+0x180/0x200
             [<c0187548>] do_mlockall+0x78/0x90			(*) grab mmap_sem
             [<c01878e1>] sys_mlockall+0x81/0xb0
             [<c010355a>] syscall_call+0x7/0xb
             [<ffffffff>] 0xffffffff
      
      other info that might help us debug this:
      
      1 lock held by lvm/1103:
       #0:  (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0
      
      stack backtrace:
      Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2
      Call Trace:
       [<c01555fc>] print_circular_bug_tail+0x7c/0xd0
       [<c0155bff>] validate_chain+0x5af/0x1070
       [<c040f7e0>] dev_status+0x0/0x50
       [<c0156923>] __lock_acquire+0x263/0xa10
       [<c015714c>] lock_acquire+0x7c/0xb0
       [<c0130789>] get_online_cpus+0x29/0x50
       [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0
       [<c0130789>] get_online_cpus+0x29/0x50
       [<c0130789>] get_online_cpus+0x29/0x50
       [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10
       [<c0130789>] get_online_cpus+0x29/0x50
       [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0
       [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0
       [<c0156945>] __lock_acquire+0x285/0xa10
       [<c0188f09>] vma_merge+0xa9/0x1d0
       [<c0187450>] mlock_fixup+0x180/0x200
       [<c0187548>] do_mlockall+0x78/0x90
       [<c01878e1>] sys_mlockall+0x81/0xb0
       [<c010355a>] syscall_call+0x7/0xb
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. 20 Oct, 2008 2 commits