Damn Technology: April 2010

A soft lockup is the symptom of a task or kernel thread using and not releasing a CPU for a period of time (the softlockup_thresh setting ). User space processes should not be able to soft lockup a CPU.

How is a soft lockup detected ?

The Linux kernel creates a watchdog process for each CPU in the system. This should be visible using in the standard "ps" command, this is shown as [watchdog/N], where N is the number of the logical CPU.

This watchdog process/thread wakes up once a second, gets the current time stamp for the CPU it is running on and saves it into a per-CPU structure.

The softlockup_tick() function that gets called from the timer interrupt(). This function gets the current time stamp for the specified CPU and compares it to the saved one in the per-CPU structure.

If the current time stamp is more than softlockup_thresh seconds later than the saved time stamp then it's because the watchdog thread has not been run recently and a soft lockup message is generated on the console. As the timer counter on each CPU can be slightly slow or fast the counter is compared to the previous tick.

Why was the process/task allowed to hog the CPU ?

Not all code that runs on the CPU is considered to be a process. Tasklets / Interrupt handlers and blah are kernel functions that do not show up in the standard process listing (with the ps command).

The kernel still allocates them CPU time, but trusts them to relinquish control of the CPU.

Common causes of not relinquishing control.

Software bugs can cause the process/code to not relinquish the control of the CPU. The code could be waiting on a lock, or may be running code which continues in an infinite loop.

The other problem may be that the scheduler has ignored the process and the process has not been removed from the CPU for some timeframe. There are a number of kernel options that can isolate a cpu and run a single process on it. As the process "hogs" the CPU it may be seen as causing a soft lockup.

Example of a user space "soft lockup"

Pid: 2642, comm: pausetest Not tainted 2.6.18-164.el5 #1
RIP: 0033:[<00000000004005c2>] [<00000000004005c2>]
RSP: 002b:00007fff8a5d7028 EFLAGS: 00000246
RAX: 0000000093338370 RBX: 0000000000000028 RCX: 0000000093338340
RDX: 0000000000000140 RSI: 00002b56f6317000 RDI: 0000000000000001
RBP: 00002b56f6317000 R08: 00000000ffffffff R09: 000000302f919840
R10: 0000000000000000 R11: 0000000000000246 R12: 000000302fb51780
R13: 0000000000000034 R14: 00000000004007da R15: 0000000000000001
FS: 00002b56f6329210(0000) GS:ffff81001fc147c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000302f8c5f70 CR3: 000000000db79000 CR4: 00000000000006e0

Call Trace:

Note the lack of values beneath "Call Trace:" above. If a userspace process is the process that has not been scheduled/removed from the cpu no backtrace will be shown, only the registers and the process name.

Example of a kernel space "lockup".

BUG: soft lockup detected on CPU#0!

Call Trace:
[] show_trace+0x34/0x47
[] dump_stack+0x12/0x17
[] softlockup_tick+0xdb/0xf6
[] update_process_times+0x42/0x68
[] smp_local_timer_interrupt+0x23/0x47
[] smp_apic_timer_interrupt+0x41/0x47
[] apic_timer_interrupt+0x66/0x6c
DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c
Leftover inexact backtrace:
[] :ext3:htree_dirblock_to_tree+0xa8/0xc9
[] :ext3:htree_dirblock_to_tree+0x72/0xc9
[] :ext3:ext3_htree_fill_tree+0x7c/0x1cb
[] :ext3:ext3_readdir+0x1a9/0x4da
[] filldir+0x0/0xb7
[] file_has_perm+0x94/0xa3
[] filldir+0x0/0xb7
[] vfs_readdir+0x77/0xa9
[] sys_getdents+0x75/0xbd
[] tracesys+0x71/0xdc
[] tracesys+0xd1/0xdc

If a kernel thread is responsible the backtrace for the kernel thread will be shown in the soft lockup message, however the backtrace includes code that shows the stack message. The DWARF2 undwinder attempts to show the actual stack trace. (shown above in bold).

Sometimes the call trace may only show the symptom of the soft lockup, and other kernel tasks will need to be investigated to show the cause.

In some circumstances when attempting to debug a vmcore from Linux, you may have only been able to get part of the vmcore, either due a technical issue or the machine being forcebly rebooted before completion.

When loading the vmcore in crash you may find something similar to the message below:

WARNING: yourvmcorefilename: may be truncated or incomplete
         PT_LOAD p_offset: 3358304032
                 p_filesz: 13690204160
           bytes required: 17048508192
            dumpfile size: 5412478976

crash: read error: kernel virtual address: ffff81042ff0eac0 type: "cpu_pda entry"
crash: read error: kernel virtual address: ffff810230005458 type: "pglist node_id"

A workaround to get some data in this case is to use a recent crash and use the --minimal parameter.

crash --minimal vmcorefilehere vmlinux

--snip --

NOTE: minimal mode commands: log, dis, rd, sym, eval and exit

This will present you with a limited set of commands that you can use that will allow you to get basic data from the vmcore.

It is important to note that backtraces are not entirely accurate and that when you inspect memory it may not be available. Crash usually throws an error when this is the case.

This is not an april fools joke.

Damn Technology

Tuesday, April 6, 2010

Linux Crash debug tips - I have a soft lockup, what is causing it ?

Thursday, April 1, 2010

Linux Crash debug tips - I have a partial or "incomplete" vmcore, what can I find out ?

About us