1. 29 Aug, 2009 1 commit
    • Xiao Guangrong's avatar
      itimers: Add tracepoints for itimer · 3f0a525e
      Xiao Guangrong authored
      
      Add tracepoints for all itimer variants: ITIMER_REAL, ITIMER_VIRTUAL
      and ITIMER_PROF.
      
      [ tglx: Fixed comments and made the output more readable, parseable
        	and consistent. Replaced pid_vnr by pid_nr because the hrtimer
        	callback can happen in any namespace ]
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Zhaolei <zhaolei@cn.fujitsu.com>
      LKML-Reference: <4A7F8B6E.2010109@cn.fujitsu.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      3f0a525e
  2. 03 Aug, 2009 3 commits
  3. 05 Feb, 2009 1 commit
    • Peter Zijlstra's avatar
      timers: split process wide cpu clocks/timers · 4cd4c1b4
      Peter Zijlstra authored
      
      Change the process wide cpu timers/clocks so that we:
      
       1) don't mess up the kernel with too many threads,
       2) don't have a per-cpu allocation for each process,
       3) have no impact when not used.
      
      In order to accomplish this we're going to split it into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
      
       - timers; which need constant time sampling but since they're
                 explicity used, the user can pay the overhead.
      
      The clock readout will go back to a full sum of the thread group, while the
      timers will run of a global 'clock' that only runs when needed, so only
      programs that make use of the facility pay the price.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      4cd4c1b4
  4. 14 Jan, 2009 2 commits
  5. 14 Sep, 2008 1 commit
    • Frank Mayhar's avatar
      timers: fix itimer/many thread hang · f06febc9
      Frank Mayhar authored
      
      Overview
      
      This patch reworks the handling of POSIX CPU timers, including the
      ITIMER_PROF, ITIMER_VIRT timers and rlimit handling.  It was put together
      with the help of Roland McGrath, the owner and original writer of this code.
      
      The problem we ran into, and the reason for this rework, has to do with using
      a profiling timer in a process with a large number of threads.  It appears
      that the performance of the old implementation of run_posix_cpu_timers() was
      at least O(n*3) (where "n" is the number of threads in a process) or worse.
      Everything is fine with an increasing number of threads until the time taken
      for that routine to run becomes the same as or greater than the tick time, at
      which point things degrade rather quickly.
      
      This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."
      
      Code Changes
      
      This rework corrects the implementation of run_posix_cpu_timers() to make it
      run in constant time for a particular machine.  (Performance may vary between
      one machine and another depending upon whether the kernel is built as single-
      or multiprocessor and, in the latter case, depending upon the number of
      running processors.)  To do this, at each tick we now update fields in
      signal_struct as well as task_struct.  The run_posix_cpu_timers() function
      uses those fields to make its decisions.
      
      We define a new structure, "task_cputime," to contain user, system and
      scheduler times and use these in appropriate places:
      
      struct task_cputime {
      	cputime_t utime;
      	cputime_t stime;
      	unsigned long long sum_exec_runtime;
      };
      
      This is included in the structure "thread_group_cputime," which is a new
      substructure of signal_struct and which varies for uniprocessor versus
      multiprocessor kernels.  For uniprocessor kernels, it uses "task_cputime" as
      a simple substructure, while for multiprocessor kernels it is a pointer:
      
      struct thread_group_cputime {
      	struct task_cputime totals;
      };
      
      struct thread_group_cputime {
      	struct task_cputime *totals;
      };
      
      We also add a new task_cputime substructure directly to signal_struct, to
      cache the earliest expiration of process-wide timers, and task_cputime also
      replaces the it_*_expires fields of task_struct (used for earliest expiration
      of thread timers).  The "thread_group_cputime" structure contains process-wide
      timers that are updated via account_user_time() and friends.  In the non-SMP
      case the structure is a simple aggregator; unfortunately in the SMP case that
      simplicity was not achievable due to cache-line contention between CPUs (in
      one measured case performance was actually _worse_ on a 16-cpu system than
      the same test on a 4-cpu system, due to this contention).  For SMP, the
      thread_group_cputime counters are maintained as a per-cpu structure allocated
      using alloc_percpu().  The timer functions update only the timer field in
      the structure corresponding to the running CPU, obtained using per_cpu_ptr().
      
      We define a set of inline functions in sched.h that we use to maintain the
      thread_group_cputime structure and hide the differences between UP and SMP
      implementations from the rest of the kernel.  The thread_group_cputime_init()
      function initializes the thread_group_cputime structure for the given task.
      The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
      out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
      in the per-cpu structures and fields.  The thread_group_cputime_free()
      function, also a no-op for UP, in SMP frees the per-cpu structures.  The
      thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
      thread_group_cputime_alloc() if the per-cpu structures haven't yet been
      allocated.  The thread_group_cputime() function fills the task_cputime
      structure it is passed with the contents of the thread_group_cputime fields;
      in UP it's that simple but in SMP it must also safely check that tsk->signal
      is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
      if so, sums the per-cpu values for each online CPU.  Finally, the three
      functions account_group_user_time(), account_group_system_time() and
      account_group_exec_runtime() are used by timer functions to update the
      respective fields of the thread_group_cputime structure.
      
      Non-SMP operation is trivial and will not be mentioned further.
      
      The per-cpu structure is always allocated when a task creates its first new
      thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
      It is freed at process exit via a call to thread_group_cputime_free() from
      cleanup_signal().
      
      All functions that formerly summed utime/stime/sum_sched_runtime values from
      from all threads in the thread group now use thread_group_cputime() to
      snapshot the values in the thread_group_cputime structure or the values in
      the task structure itself if the per-cpu structure hasn't been allocated.
      
      Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
      The run_posix_cpu_timers() function has been split into a fast path and a
      slow path; the former safely checks whether there are any expired thread
      timers and, if not, just returns, while the slow path does the heavy lifting.
      With the dedicated thread group fields, timers are no longer "rebalanced" and
      the process_timer_rebalance() function and related code has gone away.  All
      summing loops are gone and all code that used them now uses the
      thread_group_cputime() inline.  When process-wide timers are set, the new
      task_cputime structure in signal_struct is used to cache the earliest
      expiration; this is checked in the fast path.
      
      Performance
      
      The fix appears not to add significant overhead to existing operations.  It
      generally performs the same as the current code except in two cases, one in
      which it performs slightly worse (Case 5 below) and one in which it performs
      very significantly better (Case 2 below).  Overall it's a wash except in those
      two cases.
      
      I've since done somewhat more involved testing on a dual-core Opteron system.
      
      Case 1: With no itimer running, for a test with 100,000 threads, the fixed
      	kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
      	all of which was spent in the system.  There were twice as many
      	voluntary context switches with the fix as without it.
      
      Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
      	an unmodified kernel can handle), the fixed kernel ran the test in
      	eight percent of the time (5.8 seconds as opposed to 70 seconds) and
      	had better tick accuracy (.012 seconds per tick as opposed to .023
      	seconds per tick).
      
      Case 3: A 4000-thread test with an initial timer tick of .01 second and an
      	interval of 10,000 seconds (i.e. a timer that ticks only once) had
      	very nearly the same performance in both cases:  6.3 seconds elapsed
      	for the fixed kernel versus 5.5 seconds for the unfixed kernel.
      
      With fewer threads (eight in these tests), the Case 1 test ran in essentially
      the same time on both the modified and unmodified kernels (5.2 seconds versus
      5.8 seconds).  The Case 2 test ran in about the same time as well, 5.9 seconds
      versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
      tick versus .025 seconds per tick for the unmodified kernel.
      
      Since the fix affected the rlimit code, I also tested soft and hard CPU limits.
      
      Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
      	running), the modified kernel was very slightly favored in that while
      	it killed the process in 19.997 seconds of CPU time (5.002 seconds of
      	wall time), only .003 seconds of that was system time, the rest was
      	user time.  The unmodified kernel killed the process in 20.001 seconds
      	of CPU (5.014 seconds of wall time) of which .016 seconds was system
      	time.  Really, though, the results were too close to call.  The results
      	were essentially the same with no itimer running.
      
      Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
      	(where the hard limit would never be reached) and an itimer running,
      	the modified kernel exhibited worse tick accuracy than the unmodified
      	kernel: .050 seconds/tick versus .028 seconds/tick.  Otherwise,
      	performance was almost indistinguishable.  With no itimer running this
      	test exhibited virtually identical behavior and times in both cases.
      
      In times past I did some limited performance testing.  those results are below.
      
      On a four-cpu Opteron system without this fix, a sixteen-thread test executed
      in 3569.991 seconds, of which user was 3568.435s and system was 1.556s.  On
      the same system with the fix, user and elapsed time were about the same, but
      system time dropped to 0.007 seconds.  Performance with eight, four and one
      thread were comparable.  Interestingly, the timer ticks with the fix seemed
      more accurate:  The sixteen-thread test with the fix received 149543 ticks
      for 0.024 seconds per tick, while the same test without the fix received 58720
      for 0.061 seconds per tick.  Both cases were configured for an interval of
      0.01 seconds.  Again, the other tests were comparable.  Each thread in this
      test computed the primes up to 25,000,000.
      
      I also did a test with a large number of threads, 100,000 threads, which is
      impossible without the fix.  In this case each thread computed the primes only
      up to 10,000 (to make the runtime manageable).  System time dominated, at
      1546.968 seconds out of a total 2176.906 seconds (giving a user time of
      629.938s).  It received 147651 ticks for 0.015 seconds per tick, still quite
      accurate.  There is obviously no comparable test without the fix.
      Signed-off-by: default avatarFrank Mayhar <fmayhar@google.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      f06febc9
  6. 08 Feb, 2008 1 commit
  7. 18 Oct, 2007 1 commit
  8. 08 May, 2007 2 commits
  9. 16 Feb, 2007 3 commits
  10. 26 Mar, 2006 2 commits
  11. 25 Mar, 2006 2 commits
    • Thomas Gleixner's avatar
      [PATCH] Validate and sanitze itimer timeval from userspace · 7d99b7d6
      Thomas Gleixner authored
      
      According to the specification the timevals must be validated and an
      errorcode -EINVAL returned in case the timevals are not in canonical form.
      This check was never done in Linux.
      
      The pre 2.6.16 code converted invalid timevals silently.  Negative timeouts
      were converted by the timeval_to_jiffies conversion to the maximum timeout.
      
      hrtimers and the ktime_t operations expect timevals in canonical form.
      Otherwise random results might happen on 32 bits machines due to the
      optimized ktime_add/sub operations.  Negative timeouts are treated as
      already expired.  This might break applications which work on pre 2.6.16.
      
      To prevent random behaviour and API breakage the timevals are checked and
      invalid timevals sanitized in a simliar way as the pre 2.6.16 code did.
      
      Invalid timevals are reported with a per boot limited number of kernel
      messages so applications which use this misfeature can be corrected.
      
      After a grace period of one year the sanitizing should be replaced by a
      correct validation check.  This is also documented in
      Documentation/feature-removal-schedule.txt
      
      The validation and sanitizing is done inside do_setitimer so all callers
      (sys_setitimer, compat_sys_setitimer, osf_setitimer) are catched.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      7d99b7d6
    • Thomas Gleixner's avatar
      [PATCH] sys_alarm() unsigned signed conversion fixup · c08b8a49
      Thomas Gleixner authored
      
      alarm() calls the kernel with an unsigend int timeout in seconds.  The
      value is stored in the tv_sec field of a struct timeval to setup the
      itimer.  The tv_sec field of struct timeval is of type long, which causes
      the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX.
      
      Before the hrtimer merge (pre 2.6.16) such a negative value was converted
      to the maximum jiffies timeout by the timeval_to_jiffies conversion.  It's
      not clear whether this was intended or just happened to be done by the
      timeval_to_jiffies code.
      
      hrtimers expect a timeval in canonical form and treat a negative timeout as
      already expired.  This breaks the legitimate usage of alarm() with a
      timeout value > INT_MAX seconds.
      
      For 32 bit machines it is therefor necessary to limit the internal seconds
      value to avoid API breakage.  Instead of doing this in all implementations
      of sys_alarm the duplicated sys_alarm code is moved into a common function
      in itimer.c
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c08b8a49
  12. 01 Feb, 2006 2 commits
  13. 10 Jan, 2006 1 commit
  14. 27 Jul, 2005 1 commit
  15. 29 Jun, 2005 1 commit
  16. 05 May, 2005 1 commit
    • Paulo Marques's avatar
      [PATCH] setitimer timer expires too early · b7e4e853
      Paulo Marques authored
      
      It seems that the code responsible for this is in kernel/itimer.c:126:
      
      	p->signal->real_timer.expires = jiffies + interval;
      	add_timer(&p->signal->real_timer);
      
      If you request an interval of, lets say 900 usecs, the interval given by
      timeval_to_jiffies will be 1.
      
      If you request this when we are half-way between two timer ticks, the
      interval will only give 400 usecs.
      
      If we want to guarantee that we never ever give intervals less than
      requested, the simple solution would be to change that to:
      
      	p->signal->real_timer.expires = jiffies + interval + 1;
      
      This however will produce pathological cases, like having a idle system
      being requested 1 ms timeouts will give systematically 2 ms timeouts,
      whereas currently it simply gives a few usecs less than 1 ms.
      
      The complex (and more computationally expensive) solution would be to
      check the gettimeofday time, and compute the correct number of jiffies.
      This way, if we request a 300 usecs timer 200 usecs inside the timer
      tick, we can wait just one tick, but not if we are 800 usecs inside the
      tick. This would also mean that we would have to lock preemption during
      these computations to avoid races, etc.
      
      I've searched the archives but couldn't find this particular issue being
      discussed before.
      
      Attached is a patch to do the simple solution, in case anybody thinks
      that it should be used.
      Signed-Off-By: default avatarPaulo Marques <pmarques@grupopie.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b7e4e853
  17. 16 Apr, 2005 1 commit
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4