1. 17 Dec, 2009 1 commit
    • Oleg Nesterov's avatar
      do_wait() optimization: do not place sub-threads on task_struct->children list · 9cd80bbb
      Oleg Nesterov authored
      
      Thanks to Roland who pointed out de_thread() issues.
      
      Currently we add sub-threads to ->real_parent->children list.  This buys
      nothing but slows down do_wait().
      
      With this patch ->children contains only main threads (group leaders).
      The only complication is that forget_original_parent() should iterate over
      sub-threads by hand, and de_thread() needs another list_replace() when it
      changes ->group_leader.
      
      Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
      tasks, we can remove this check (and we can unify do_wait_thread() and
      ptrace_do_wait()).
      
      This change can confuse the optimistic search in mm_update_next_owner(),
      but this is fixable and minor.
      
      Perhaps badness() and oom_kill_process() should be updated, but they
      should be fixed in any case.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9cd80bbb
  2. 15 Dec, 2009 1 commit
    • john stultz's avatar
      procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm · 4614a696
      john stultz authored
      
      Setting a thread's comm to be something unique is a very useful ability
      and is helpful for debugging complicated threaded applications.  However
      currently the only way to set a thread name is for the thread to name
      itself via the PR_SET_NAME prctl.
      
      However, there may be situations where it would be advantageous for a
      thread dispatcher to be naming the threads its managing, rather then
      having the threads self-describe themselves.  This sort of behavior is
      available on other systems via the pthread_setname_np() interface.
      
      This patch exports a task's comm via proc/pid/comm and
      proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
      these values.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: default avatarJohn Stultz <johnstul@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Mike Fulton <fultonm@ca.ibm.com>
      Cc: Sean Foley <Sean_Foley@ca.ibm.com>
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4614a696
  3. 12 Nov, 2009 1 commit
  4. 25 Oct, 2009 1 commit
  5. 24 Sep, 2009 5 commits
  6. 23 Sep, 2009 2 commits
    • Stefani Seibold's avatar
      procfs: provide stack information for threads · d899bf7b
      Stefani Seibold authored
      
      A patch to give a better overview of the userland application stack usage,
      especially for embedded linux.
      
      Currently you are only able to dump the main process/thread stack usage
      which is showed in /proc/pid/status by the "VmStk" Value.  But you get no
      information about the consumed stack memory of the the threads.
      
      There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
      the vm mapping where the thread stack pointer reside with "[thread stack
      xxxxxxxx]".  xxxxxxxx is the maximum size of stack.  This is a value
      information, because libpthread doesn't set the start of the stack to the
      top of the mapped area, depending of the pthread usage.
      
      A sample output of /proc/<pid>/task/<tid>/maps looks like:
      
      08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
      08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
      0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
      a7d12000-a7d13000 ---p 00000000 00:00 0
      a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
      a7f13000-a7f14000 ---p 00000000 00:00 0
      a7f14000-a7f36000 rw-p 00000000 00:00 0
      a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
      a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
      a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
      a806c000-a806f000 rw-p 00000000 00:00 0
      a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
      a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
      a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
      a8085000-a8088000 rw-p 00000000 00:00 0
      a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
      a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
      a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
      afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
      ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
      
      Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
      which will you give the current stack usage in kb.
      
      A sample output of /proc/self/status looks like:
      
      Name:	cat
      State:	R (running)
      Tgid:	507
      Pid:	507
      .
      .
      .
      CapBnd:	fffffffffffffeff
      voluntary_ctxt_switches:	0
      nonvoluntary_ctxt_switches:	0
      Stack usage:	12 kB
      
      I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
      address of the associated thread stack and not the one of the main
      process.  This makes more sense.
      
      [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
      Signed-off-by: default avatarStefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d899bf7b
    • Jiri Pirko's avatar
      getrusage: fill ru_maxrss value · 1f10206c
      Jiri Pirko authored
      
      Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
      mark.  This struct is filled as a parameter to getrusage syscall.
      ->ru_maxrss value is set to KBs which is the way it is done in BSD
      systems.  /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
      which seems to be incorrect behavior.  Maintainer of this util was
      notified by me with the patch which corrects it and cc'ed.
      
      To make this happen we extend struct signal_struct by two fields.  The
      first one is ->maxrss which we use to store rss hiwater of the task.  The
      second one is ->cmaxrss which we use to store highest rss hiwater of all
      task childs.  These values are used in k_getrusage() to actually fill
      ->ru_maxrss.  k_getrusage() uses current rss hiwater value directly if mm
      struct exists.
      
      Note:
      exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss.
      it is intetionally behavior. *BSD getrusage have exec() inheriting.
      
      test programs
      ========================================================
      
      getrusage.c
      ===========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
      
       #define err(str) perror(str), exit(1)
      
      int main(int argc, char** argv)
      {
      	int status;
      
      	printf("allocate 100MB\n");
      	consume(100);
      
      	printf("testcase1: fork inherit? \n");
      	printf("  expect: initial.self ~= child.self\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase2: fork inherit? (cont.) \n");
      	printf("  expect: initial.children ~= 100MB, but child.children = 0\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase3: fork + malloc \n");
      	printf("  expect: child.self ~= initial.self + 50MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		printf("allocate +50MB\n");
      		consume(50);
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase4: grandchild maxrss\n");
      	printf("  expect: post_wait.children ~= 300MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 0 -g 300");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase5: zombie\n");
      	printf("  expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
      	printf("          post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
      	show_rusage("initial");
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("pre_wait");
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 400");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase6: SIG_IGN\n");
      	printf("  expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
      	show_rusage("initial");
      	signal(SIGCHLD, SIG_IGN);
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("after_zombie");
      	} else {
      		system("./child -n 500");
      		_exit(0);
      	}
      	printf("\n");
      	signal(SIGCHLD, SIG_DFL);
      
      	printf("testcase7: exec (without fork) \n");
      	printf("  expect: initial ~= exec \n");
      	show_rusage("initial");
      	execl("./child", "child", "-v", NULL);
      
      	return 0;
      }
      
      child.c
      =======
       #include <sys/types.h>
       #include <unistd.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
      
       #include "common.h"
      
      int main(int argc, char** argv)
      {
      	int status;
      	int c;
      	long consume_size = 0;
      	long grandchild_consume_size = 0;
      	int show = 0;
      
      	while ((c = getopt(argc, argv, "n:g:v")) != -1) {
      		switch (c) {
      		case 'n':
      			consume_size = atol(optarg);
      			break;
      		case 'v':
      			show = 1;
      			break;
      		case 'g':
      
      			grandchild_consume_size = atol(optarg);
      			break;
      		default:
      			break;
      		}
      	}
      
      	if (show)
      		show_rusage("exec");
      
      	if (consume_size) {
      		printf("child alloc %ldMB\n", consume_size);
      		consume(consume_size);
      	}
      
      	if (grandchild_consume_size) {
      		if (fork()) {
      			wait(&status);
      		} else {
      			printf("grandchild alloc %ldMB\n", grandchild_consume_size);
      			consume(grandchild_consume_size);
      
      			exit(0);
      		}
      	}
      
      	return 0;
      }
      
      common.c
      ========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
       #define err(str) perror(str), exit(1)
      
      void show_rusage(char *prefix)
      {
          	int err, err2;
          	struct rusage rusage_self;
          	struct rusage rusage_children;
      
          	printf("%s: ", prefix);
          	err = getrusage(RUSAGE_SELF, &rusage_self);
          	if (!err)
          		printf("self %ld ", rusage_self.ru_maxrss);
          	err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
          	if (!err2)
          		printf("children %ld ", rusage_children.ru_maxrss);
      
          	printf("\n");
      }
      
      /* Some buggy OS need this worthless CPU waste. */
      void make_pagefault(void)
      {
      	void *addr;
      	int size = getpagesize();
      	int i;
      
      	for (i=0; i<1000; i++) {
      		addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
      		if (addr == MAP_FAILED)
      			err("make_pagefault");
      		memset(addr, 0, size);
      		munmap(addr, size);
      	}
      }
      
      void consume(int mega)
      {
          	size_t sz = mega * 1024 * 1024;
          	void *ptr;
      
          	ptr = malloc(sz);
          	memset(ptr, 0, sz);
      	make_pagefault();
      }
      
      pid_t __fork(void)
      {
      	pid_t pid;
      
      	pid = fork();
      	make_pagefault();
      
      	return pid;
      }
      
      common.h
      ========
      void show_rusage(char *prefix);
      void make_pagefault(void);
      void consume(int mega);
      pid_t __fork(void);
      
      FreeBSD result (expected result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 103492 children 0
      fork child: self 103540 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 103540 children 103540
      child: self 103564 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 103564 children 103564
      allocate +50MB
      fork child: self 154860 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 103564 children 154860
      grandchild alloc 300MB
      post_wait: self 103564 children 308720
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 103564 children 308720
      child alloc 400MB
      pre_wait: self 103564 children 308720
      post_wait: self 103564 children 411312
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 103564 children 411312
      child alloc 500MB
      after_zombie: self 103624 children 411312
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 103624 children 411312
      exec: self 103624 children 411312
      
      Linux result (actual test result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 102848 children 0
      fork child: self 102572 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 102876 children 102644
      child: self 102572 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 102876 children 102644
      allocate +50MB
      fork child: self 153804 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 102876 children 153864
      grandchild alloc 300MB
      post_wait: self 102876 children 307536
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 102876 children 307536
      child alloc 400MB
      pre_wait: self 102876 children 307536
      post_wait: self 102876 children 410076
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 102876 children 410076
      child alloc 500MB
      after_zombie: self 102880 children 410076
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 102880 children 410076
      exec: self 102880 children 410076
      Signed-off-by: default avatarJiri Pirko <jpirko@redhat.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f10206c
  7. 21 Sep, 2009 1 commit
    • Ingo Molnar's avatar
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar authored
      
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: default avatarStephane Eranian <eranian@google.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarPaul Mackerras <paulus@samba.org>
      Reviewed-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      cdd6c482
  8. 05 Sep, 2009 1 commit
    • Oleg Nesterov's avatar
      exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c
      Oleg Nesterov authored
      
      Tom Horsley reports that his debugger hangs when it tries to read
      /proc/pid_of_tracee/maps, this happens since
      
      	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
      	04b836cbf19e885f8366bccb2e4b0474346c02d
      
      commit in 2.6.31.
      
      But the root of the problem lies in the fact that do_execve() path calls
      tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
      
      The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
      remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
      another task doing PTRACE_ATTACH should not hang until it is killed or the
      tracee resumes.
      
      With this patch do_execve() does not use ->cred_guard_mutex directly and
      we do not hold it throughout, instead:
      
      	- introduce prepare_bprm_creds() helper, it locks the mutex
      	  and calls prepare_exec_creds() to initialize bprm->cred.
      
      	- install_exec_creds() drops the mutex after commit_creds(),
      	  and thus before tracehook_report_exec()->ptrace_stop().
      
      	  or, if exec fails,
      
      	  free_bprm() drops this mutex when bprm->cred != NULL which
      	  indicates install_exec_creds() was not called.
      Reported-by: default avatarTom Horsley <tom.horsley@att.net>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a2a8474c
  9. 24 Aug, 2009 1 commit
  10. 06 Jul, 2009 1 commit
  11. 21 May, 2009 1 commit
  12. 10 May, 2009 1 commit
  13. 09 May, 2009 2 commits
  14. 02 May, 2009 1 commit
    • Ivan Kokshaysky's avatar
      alpha: binfmt_aout fix · 74641f58
      Ivan Kokshaysky authored
      This fixes the problem introduced by commit 3bfacef4
      
       (get rid of
      special-casing the /sbin/loader on alpha): osf/1 ecoff binary segfaults
      when binfmt_aout built as module.  That happens because aout binary
      handler gets on the top of the binfmt list due to late registration, and
      kernel attempts to execute the binary without preparatory work that must
      be done by binfmt_loader.
      
      Fixed by changing the registration order of the default binfmt handlers
      using list_add_tail() and introducing insert_binfmt() function which
      places new handler on the top of the binfmt list.  This might be generally
      useful for installing arch-specific frontends for default handlers or just
      for overriding them.
      Signed-off-by: default avatarIvan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Richard Henderson <rth@twiddle.net
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74641f58
  15. 24 Apr, 2009 2 commits
    • Oleg Nesterov's avatar
      check_unsafe_exec: s/lock_task_sighand/rcu_read_lock/ · 437f7fdb
      Oleg Nesterov authored
      
      write_lock(&current->fs->lock) guarantees we can't wrongly miss
      LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock()
      instead of ->siglock to iterate over the sub-threads. We must see
      all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it
      takes fs->lock too.
      
      With or without this patch we can miss the freshly cloned thread
      and set LSM_UNSAFE_SHARE, we don't care.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarRoland McGrath <roland@redhat.com>
      [ Fixed lock/unlock typo  - Hugh ]
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      437f7fdb
    • Oleg Nesterov's avatar
      do_execve() must not clear fs->in_exec if it was set by another thread · 8c652f96
      Oleg Nesterov authored
      
      If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
      unconditionally. This is wrong if we race with our sub-thread which
      also does do_execve:
      
      	Two threads T1 and T2 and another process P, all share the same
      	->fs.
      
      	T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
      	->fs is shared, we set LSM_UNSAFE but not ->in_exec.
      
      	P exits and decrements fs->users.
      
      	T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
      	shared, we set fs->in_exec.
      
      	T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
      	return to the user-space.
      
      	T1 does clone(CLONE_FS /* without CLONE_THREAD */).
      
      	T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
      	another process.
      
      Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
      do_execve() to clear ->in_exec depending on res.
      
      When do_execve() suceeds, it is safe to clear ->in_exec unconditionally.
      It can be set only if we don't share ->fs with another process, and since
      we already killed all sub-threads either ->in_exec == 0 or we are the
      only user of this ->fs.
      
      Also, we do not need fs->lock to clear fs->in_exec.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarRoland McGrath <roland@redhat.com>
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c652f96
  16. 08 Apr, 2009 1 commit
  17. 01 Apr, 2009 3 commits
  18. 29 Mar, 2009 1 commit
    • Hugh Dickins's avatar
      fix setuid sometimes doesn't · e426b64c
      Hugh Dickins authored
      
      Joe Malicki reports that setuid sometimes doesn't: very rarely,
      a setuid root program does not get root euid; and, by the way,
      they have a health check running lsof every few minutes.
      
      Right, check_unsafe_exec() notes whether the files_struct is being
      shared by more threads than will get killed by the exec, and if so
      sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
      But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
      use of get_files_struct(), which also raises that sharing count.
      
      There's a rather simple fix for this: exec's check on files->count
      has been redundant ever since 2.6.1 made it unshare_files() (except
      while compat_do_execve() omitted to do so) - just remove that check.
      
      [Note to -stable: this patch will not apply before 2.6.29: earlier
      releases should just remove the files->count line from unsafe_exec().]
      Reported-by: default avatarJoe Malicki <jmalicki@metacarta.com>
      Narrowed-down-by: default avatarMichael Itz <mitz@metacarta.com>
      Tested-by: default avatarJoe Malicki <jmalicki@metacarta.com>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e426b64c
  19. 12 Feb, 2009 1 commit
  20. 06 Feb, 2009 1 commit
    • David Howells's avatar
      CRED: Fix SUID exec regression · 0bf2f3ae
      David Howells authored
      The patch:
      
      	commit a6f76f23
      
      
      	CRED: Make execve() take advantage of copy-on-write credentials
      
      moved the place in which the 'safeness' of a SUID/SGID exec was performed to
      before de_thread() was called.  This means that LSM_UNSAFE_SHARE is now
      calculated incorrectly.  This flag is set if any of the usage counts for
      fs_struct, files_struct and sighand_struct are greater than 1 at the time the
      determination is made.  All of which are true for threads created by the
      pthread library.
      
      However, since we wish to make the security calculation before irrevocably
      damaging the process so that we can return it an error code in the case where
      we decide we want to reject the exec request on this basis, we have to make the
      determination before calling de_thread().
      
      So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
      our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
      (CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
      so can be discounted by check_unsafe_exec().
      
      We do have to be careful because CLONE_THREAD does not imply FS or FILES.
      
      We _assume_ that there will be no extra references to these structs held by the
      threads we're going to kill.
      
      This can be tested with the attached pair of programs.  Build the two programs
      using the Makefile supplied, and run ./test1 as a non-root user.  If
      successful, you should see something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=0 suid=0
      	SUCCESS - Correct effective user ID
      
      and if unsuccessful, something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=4043 suid=4043
      	ERROR - Incorrect effective user ID!
      
      The non-root user ID you see will depend on the user you run as.
      
      [test1.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <pthread.h>
      
      static void *thread_func(void *arg)
      {
      	while (1) {}
      }
      
      int main(int argc, char **argv)
      {
      	pthread_t tid;
      	uid_t uid, euid, suid;
      
      	printf("--TEST1--\n");
      	getresuid(&uid, &euid, &suid);
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
      	if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
      		perror("pthread_create");
      		exit(1);
      	}
      
      	printf("exec ./test2\n");
      	execlp("./test2", "test2", NULL);
      	perror("./test2");
      	_exit(1);
      }
      
      [test2.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      int main(int argc, char **argv)
      {
      	uid_t uid, euid, suid;
      
      	getresuid(&uid, &euid, &suid);
      	printf("--TEST2--\n");
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
      	if (euid != 0) {
      		fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
      		exit(1);
      	}
      	printf("SUCCESS - Correct effective user ID\n");
      	exit(0);
      }
      
      [Makefile]
      CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
      all: test1 test2
      
      test1: test1.c
      	gcc $(CFLAGS) -o test1 test1.c -lpthread
      
      test2: test2.c
      	gcc $(CFLAGS) -o test2 test2.c
      	sudo chown root.root test2
      	sudo chmod +s test2
      Reported-by: default avatarDavid Smith <dsmith@redhat.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Acked-by: default avatarDavid Smith <dsmith@redhat.com>
      Signed-off-by: default avatarJames Morris <jmorris@namei.org>
      0bf2f3ae
  21. 05 Feb, 2009 1 commit
  22. 14 Jan, 2009 1 commit
  23. 06 Jan, 2009 3 commits
  24. 05 Jan, 2009 1 commit
  25. 03 Jan, 2009 1 commit
  26. 31 Dec, 2008 1 commit
  27. 16 Dec, 2008 1 commit
  28. 12 Dec, 2008 1 commit
  29. 10 Dec, 2008 1 commit
    • Roland McGrath's avatar
      tracehook: exec double-reporting fix · 85f33466
      Roland McGrath authored
      The patch 6341c393
      
       "tracehook: exec" introduced a small regression in
      2.6.27 regarding binfmt_misc exec event reporting.  Since the reporting
      is now done in the common search_binary_handler() function, an exec
      of a misc binary will result in two (or possibly multiple) exec events
      being reported, instead of just a single one, because the misc handler
      contains a recursive call to search_binary_handler.
      
      To add to the confusion, if PTRACE_O_TRACEEXEC is not active, the multiple
      SIGTRAP signals will in fact cause only a single ptrace intercept, as the
      signals are not queued.  However, if PTRACE_O_TRACEEXEC is on, the debugger
      will actually see multiple ptrace intercepts (PTRACE_EVENT_EXEC).
      
      The test program included below demonstrates the problem.
      
      This change fixes the bug by calling tracehook_report_exec() only in the
      outermost search_binary_handler() call (bprm->recursion_depth == 0).
      
      The additional change to restore bprm->recursion_depth after each binfmt
      load_binary call is actually superfluous for this bug, since we test the
      value saved on entry to search_binary_handler().  But it keeps the use of
      of the depth count to its most obvious expected meaning.  Depending on what
      binfmt handlers do in certain cases, there could have been false-positive
      tests for recursion limits before this change.
      
          /* Test program using PTRACE_O_TRACEEXEC.
             This forks and exec's the first argument with the rest of the arguments,
             while ptrace'ing.  It expects to see one PTRACE_EVENT_EXEC stop and
             then a successful exit, with no other signals or events in between.
      
             Test for kernel doing two PTRACE_EVENT_EXEC stops for a binfmt_misc exec:
      
             $ gcc -g traceexec.c -o traceexec
             $ sudo sh -c 'echo :test:M::foobar::/bin/cat: > /proc/sys/fs/binfmt_misc/register'
             $ echo 'foobar test' > ./foobar
             $ chmod +x ./foobar
             $ ./traceexec ./foobar; echo $?
             ==> good <==
             foobar test
             0
             $
             ==> bad <==
             foobar test
             unexpected status 0x4057f != 0
             3
             $
      
          */
      
          #include <stdio.h>
          #include <sys/types.h>
          #include <sys/wait.h>
          #include <sys/ptrace.h>
          #include <unistd.h>
          #include <signal.h>
          #include <stdlib.h>
      
          static void
          wait_for (pid_t child, int expect)
          {
            int status;
            pid_t p = wait (&status);
            if (p != child)
      	{
      	  perror ("wait");
      	  exit (2);
      	}
            if (status != expect)
      	{
      	  fprintf (stderr, "unexpected status %#x != %#x\n", status, expect);
      	  exit (3);
      	}
          }
      
          int
          main (int argc, char **argv)
          {
            pid_t child = fork ();
      
            if (child < 0)
      	{
      	  perror ("fork");
      	  return 127;
      	}
            else if (child == 0)
      	{
      	  ptrace (PTRACE_TRACEME);
      	  raise (SIGUSR1);
      	  execv (argv[1], &argv[1]);
      	  perror ("execve");
      	  _exit (127);
      	}
      
            wait_for (child, W_STOPCODE (SIGUSR1));
      
            if (ptrace (PTRACE_SETOPTIONS, child,
      		  0L, (void *) (long) PTRACE_O_TRACEEXEC) != 0)
      	{
      	  perror ("PTRACE_SETOPTIONS");
      	  return 4;
      	}
      
            if (ptrace (PTRACE_CONT, child, 0L, 0L) != 0)
      	{
      	  perror ("PTRACE_CONT");
      	  return 5;
      	}
      
            wait_for (child, W_STOPCODE (SIGTRAP | (PTRACE_EVENT_EXEC << 8)));
      
            if (ptrace (PTRACE_CONT, child, 0L, 0L) != 0)
      	{
      	  perror ("PTRACE_CONT");
      	  return 6;
      	}
      
            wait_for (child, W_EXITCODE (0, 0));
      
            return 0;
          }
      Reported-by: default avatarArnd Bergmann <arnd@arndb.de>
      CC: Ulrich Weigand <ulrich.weigand@de.ibm.com>
      Signed-off-by: default avatarRoland McGrath <roland@redhat.com>
      85f33466