Environment information:
WARNING: kernel version inconsistency between vmlinux and dumpfile
KERNEL: vmlinux-47.90
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Wed Nov 14 11:08:24 2018
UPTIME: 05:08:36
LOAD AVERAGE: 484.39, 481.11, 385.18
TASKS: 5287
NODENAME: ycby25-3kh_2
RELEASE: 3.0.101-0.47.90-default
VERSION: #1 SMP Wed Oct 19 14:11:00 UTC 2016 (56c73f1)
MACHINE: x86_64 (2600 Mhz)
MEMORY: 255.6 GB
PANIC: "[18477.566692] Kernel panic - not syncing: hung_task: blocked tasks"
PID: 144
COMMAND: "khungtaskd"
TASK: ffff881fc096e080 [THREAD_INFO: ffff881fc0970000]
CPU: 7
STATE: TASK_RUNNING (PANIC)
The last dmesg log messages:
[17013.334105] show_signal_msg: 30 callbacks suppressed
[17013.334110] CMoniterThread[26144]: segfault at 0 ip 00007f13a100c699 sp 00007f138371fc30 error 6 in libCommonUtilitiesLib.so[7f13a0fdd000+4d000]
[18477.566473] INFO: task dev_rdwriter:23683 blocked for more than 1200 seconds.
[18477.566475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[18477.566477] dev_rdwriter D 0000000000800000 0 23683 18968 0x00000000
[18477.566479] ffff88173e599d70 0000000000000082 ffff88173e598010 0000000000010900
[18477.566483] 0000000000010900 0000000000010900 0000000000010900 ffff88173e599fd8
[18477.566486] ffff88173e599fd8 0000000000010900 ffff88173e5964c0 ffff881fc3378580
[18477.566489] Call Trace:
[18477.566499] [<ffffffff81467485>] rwsem_down_failed_common+0xb5/0x160
[18477.566505] [<ffffffff81264d13>] call_rwsem_down_write_failed+0x13/0x20
[18477.566509] [<ffffffff8146679c>] down_write+0x1c/0x20
[18477.566541] [<ffffffffa05ddb6c>] xfs_ilock+0xec/0x100 [xfs]
[18477.566629] [<ffffffffa0604e47>] xfs_file_fallocate+0xc7/0x190 [xfs]
[18477.566665] [<ffffffff8115d629>] do_fallocate+0x129/0x130
[18477.566669] [<ffffffff8115d676>] sys_fallocate+0x46/0x70
[18477.566673] [<ffffffff8146f5f2>] system_call_fastpath+0x16/0x1b
[18477.566690] [<00007f344662b010>] 0x7f344662b00f
[18477.566692] Kernel panic - not syncing: hung_task: blocked tasks
[18477.566698] Pid: 144, comm: khungtaskd Tainted: G ENX 3.0.101-0.47.90-default #1
[18477.566701] Call Trace:
[18477.566707] [<ffffffff81004b95>] dump_trace+0x75/0x300
[18477.566712] [<ffffffff81464663>] dump_stack+0x69/0x6f
[18477.566717] [<ffffffff8146471f>] panic+0xb6/0x224
[18477.566722] [<ffffffff810c8731>] check_hung_uninterruptible_tasks+0x1e1/0x1f0
[18477.566726] [<ffffffff810c8787>] watchdog+0x47/0x50
[18477.566730] [<ffffffff810845f6>] kthread+0x96/0xa0
[18477.566735] [<ffffffff81470764>] kernel_thread_helper+0x4/0x10
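The panic itself comes from khungtaskd: when a task stays in TASK_UNINTERRUPTIBLE for longer than hung_task_timeout_secs without ever being scheduled, and hung_task_panic is set, the watchdog panics the machine. A simplified sketch of that check, modeled on mainline kernel/hung_task.c of this era (not the exact SUSE 3.0.101 source):

/* Simplified: khungtaskd periodically scans D-state tasks and compares their
 * context-switch count against the value seen in the previous scan. */
static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
        unsigned long switch_count = t->nvcsw + t->nivcsw;

        /* The task has been scheduled since the last scan: not hung. */
        if (switch_count != t->last_switch_count) {
                t->last_switch_count = switch_count;
                return;
        }

        /* Still TASK_UNINTERRUPTIBLE and never scheduled for a whole timeout
         * interval: report it, and panic if hung_task_panic is configured. */
        printk(KERN_ERR "INFO: task %s:%d blocked for more than %lu seconds.\n",
               t->comm, t->pid, timeout);
        sched_show_task(t);

        if (sysctl_hung_task_panic)
                panic("hung_task: blocked tasks");
}

This matches the two dmesg lines above: the INFO line for dev_rdwriter:23683 and the final "hung_task: blocked tasks" panic.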
PID 23683 had been blocked for a long time; the timeout we had configured was 1200 s. Let's analyze why it was blocked:
crash> bt 23683
PID: 23683 TASK: ffff88173e5964c0 CPU: 3 COMMAND: "dev_rdwriter"
#0 [ffff88173e599c30] schedule at ffffffff814652b9
#1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
#2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
#3 [ffff88173e599e18] down_write at ffffffff8146679c
#4 [ffff88173e599e20] xfs_ilock at ffffffffa05ddb6c [xfs]
#5 [ffff88173e599e50] xfs_file_fallocate at ffffffffa0604e47 [xfs]
#6 [ffff88173e599f20] do_fallocate at ffffffff8115d629
#7 [ffff88173e599f50] sys_fallocate at ffffffff8115d676
#8 [ffff88173e599f80] system_call_fastpath at ffffffff8146f5f2
RIP: 00007f344662b010 RSP: 00007f2db9d0a2a0 RFLAGS: 00003287
RAX: 000000000000011d RBX: ffffffff8146f5f2 RCX: 0000000000000000
RDX: 0000000000800000 RSI: 0000000000000001 RDI: 0000000000000050
RBP: 0000000000880000 R8: 00007f340a0ecb60 R9: 0000000000005c83
R10: 0000000000100000 R11: 0000000000003246 R12: 0000000000880000
R13: 0000000009b20a20 R14: 00007f3408e01328 R15: 0000000000004800
ORIG_RAX: 000000000000011d CS: 0033 SS: 002b
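The saved user registers already tell us what this task was doing. On x86_64 the syscall number is in ORIG_RAX and the arguments arrive in rdi/rsi/rdx/r10; 0x11d (285) is __NR_fallocate, so the blocked call was roughly the following (interpreting mode = 1 as FALLOC_FL_KEEP_SIZE assumes the standard uapi value):

/* Userspace reconstruction of the blocked syscall from the saved registers:
 * RDI=0x50 (fd 80), RSI=1 (mode), RDX=0x800000 (offset, 8 MiB),
 * R10=0x100000 (len, 1 MiB). */
#define _GNU_SOURCE
#include <fcntl.h>

int reconstructed_call(void)
{
        return fallocate(0x50, FALLOC_FL_KEEP_SIZE, 0x800000, 0x100000);
}

So dev_rdwriter was pre-allocating 1 MiB at an 8 MiB offset of some file when it blocked on the inode's iolock.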
The stack shows the task is looping, waiting for a semaphore:
static struct rw_semaphore __sched *
rwsem_down_failed_common(struct rw_semaphore *sem,
                         unsigned int flags, signed long adjustment)
{
        struct rwsem_waiter waiter;
        struct task_struct *tsk = current;
        signed long count;

        set_task_state(tsk, TASK_UNINTERRUPTIBLE);

        /* set up my own style of waitqueue */
        spin_lock_irq(&sem->wait_lock);
        waiter.task = tsk;
        waiter.flags = flags;
        get_task_struct(tsk);

        if (list_empty(&sem->wait_list))
                adjustment += RWSEM_WAITING_BIAS;
        list_add_tail(&waiter.list, &sem->wait_list);

        /* we're now waiting on the lock, but no longer actively locking */
        count = rwsem_atomic_update(adjustment, sem);

        /* If there are no active locks, wake the front queued process(es) up.
         *
         * Alternatively, if we're called from a failed down_write(), there
         * were already threads queued before us and there are no active
         * writers, the lock must be read owned; so we try to wake any read
         * locks that were queued ahead of us. */
        if (count == RWSEM_WAITING_BIAS)
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
        else if (count > RWSEM_WAITING_BIAS &&
                 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);

        spin_unlock_irq(&sem->wait_lock);

        /* wait to be given the lock */
        for (;;) {--------------------the task entered this loop and never left; the only exit condition is waiter.task becoming NULL
                if (!waiter.task)
                        break;
                schedule();
                set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        }

        tsk->state = TASK_RUNNING;

        return sem;
}
Because the task stayed in the uninterruptible state for longer than the threshold, the hung-task detector ultimately triggered the crash. The reason the loop checks whether waiter.task is NULL lies in how the rwsem is released: when the lock is released, the releaser looks at the wait queue, and if there are waiters it takes the waiter off the queue, sets waiter->task = NULL, and then wakes the waiting task; the woken task checks whether its own waiter.task has become NULL to decide whether it still needs to wait.
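The release side looks roughly like the following. This is a condensed sketch of the reader-wakeup part of __rwsem_do_wake in lib/rwsem.c of this era (the helper name rwsem_wake_one_waiter is mine, for illustration), not the exact SUSE source:

/* Condensed sketch: it is the releaser, not the waiter, that dequeues the
 * rwsem_waiter, clears waiter->task and wakes the task. Until the releaser
 * does this, the waiter keeps looping in schedule(). */
static void rwsem_wake_one_waiter(struct rw_semaphore *sem)
{
        struct rwsem_waiter *waiter;
        struct task_struct *tsk;

        waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
        list_del(&waiter->list);

        tsk = waiter->task;
        smp_mb();               /* pairs with the waiter's read of waiter.task */
        waiter->task = NULL;    /* this is what lets the for (;;) loop exit */
        wake_up_process(tsk);
        put_task_struct(tsk);   /* drop the reference taken in rwsem_down_failed_common */
}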
Let's see which semaphore it is waiting on:
crash> bt -f 23683
PID: 23683 TASK: ffff88173e5964c0 CPU: 3 COMMAND: "dev_rdwriter"
#0 [ffff88173e599c30] schedule at ffffffff814652b9
ffff88173e599c38: 0000000000000082 ffff88173e598010
ffff88173e599c48: 0000000000010900 0000000000010900
ffff88173e599c58: 0000000000010900 0000000000010900
ffff88173e599c68: ffff88173e599fd8 ffff88173e599fd8
ffff88173e599c78: 0000000000010900 ffff88173e5964c0
ffff88173e599c88: ffff881fc3378580 ffff881f22341000
ffff88173e599c98: 00000000ffffff9c ffffffff81169f93
ffff88173e599ca8: ffff88173e599d58 ffff88173e599de8
ffff88173e599cb8: ffff881a8a7b21c0 ffff881a8a7b21c0
ffff88173e599cc8: ffff88173d8aa0c0 ffff88173d8afd90
ffff88173e599cd8: 0000000400000000 ffff88173e599dd8
ffff88173e599ce8: ffff8805b87fa240 ffffffff811695d8
ffff88173e599cf8: ffff88173e599d48 0000000000000000
ffff88173e599d08: ffff88173e599dd8 ffffffff8116db50
ffff88173e599d18: ffff88173e599e58 ffffffff8117661e
ffff88173e599d28: ffff88173e599de8 ffff88173e599e58
ffff88173e599d38: ffff8805b848f990 ffff88173e599e58
ffff88173e599d48: 0000000000000002 ffff8805b848f8a8
ffff88173e599d58: ffff8805b848f8b0 ffffffffffffffff
ffff88173e599d68: 0000000000800000 ffff88173e5964c0
ffff88173e599d78: ffffffff81467485
#1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
ffff88173e599d80: ffff881706047ca8 ffff8805b848f8b8---------------------the rwsem_waiter sits here on the stack; these two words are its list.next (ffff881706047ca8, the next waiter) and list.prev (ffff8805b848f8b8)
ffff88173e599d90: ffff88173e5964c0 ffff881f00000002
ffff88173e599da0: ffff881f22341000 0000000000000000
ffff88173e599db0: ffffffffffffffa1 0000000000000001
ffff88173e599dc0: ffff8805b848f800 ffff8805b848f990
ffff88173e599dd0: 0000000000100000 ffffffff81264d13
#2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
ffff88173e599de0: 0000000000003246 0000000000100000
ffff88173e599df0: ffff88171291ff20 ffff881ae2c64bc0
ffff88173e599e00: 0000000000100000 0000000000000001
ffff88173e599e10: ffff8805b848f8a8 ffffffff8146679c------------------ffff8805b848f8a8 is the rw_semaphore
#3 [ffff88173e599e18] down_write at ffffffff8146679c
ffff88173e599e20: ffffffffa05ddb6c
According to the disassembly:
0xffffffff81467476 <rwsem_down_failed_common+166>: cmpq $0x0,0x10(%rsp)
0xffffffff8146747c <rwsem_down_failed_common+172>: je 0xffffffff814674a9 <rwsem_down_failed_common+217>
0xffffffff8146747e <rwsem_down_failed_common+174>: xchg %ax,%ax
/usr/src/linux-3.0.101-0.47.90/lib/rwsem.c: 213
0xffffffff81467480 <rwsem_down_failed_common+176>: callq 0xffffffff81465600 <schedule>
So the waiter lives on this task's stack at %rsp, i.e. at ffff88173e599d80: waiter.task sits at %rsp + 0x10 (ffff88173e599d90, which indeed holds ffff88173e5964c0), and the value ffff881706047ca8 at the top of the frame is waiter.list.next, pointing at the next waiter in the queue.
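For reference, the 0x10 offset comes from the layout of the on-stack waiter as defined in lib/rwsem.c of this era:

/* Layout of the on-stack waiter (lib/rwsem.c, 3.0 era). On x86_64 the list
 * head occupies 0x00-0x0f and task sits at offset 0x10, which is exactly the
 * 0x10(%rsp) tested above. The word at +0x18 in the dump ends in 0x2, i.e.
 * RWSEM_WAITING_FOR_WRITE, matching a blocked down_write(). */
struct rwsem_waiter {
        struct list_head list;          /* 0x00: next/prev, 16 bytes */
        struct task_struct *task;       /* 0x10: cleared by the releaser */
        unsigned int flags;             /* 0x18: RWSEM_WAITING_FOR_READ/WRITE */
};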
Since call_rwsem_down_write_failed is not C (it is an assembly thunk), we keep walking up the stack to xfs_ilock, where we find the xfs_inode_t at 0xffff8805b848f800 and its read/write semaphore at 0xffff8805b848f8a8:
crash> struct -x rw_semaphore 0xffff8805b848f8a8
struct rw_semaphore {
count = 0xffffffff00000001,
wait_lock = {
{
rlock = {
raw_lock = {
slock = 0x70007
}
}
}
},
wait_list = {
next = 0xffff88173e599d80,
prev = 0xffff881ec528bd80
}
}
From the layout of struct rw_semaphore, the wait_list head is at ffff8805b848f8a8 + 0x10, i.e. ffff8805b848f8b8.
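Both the 0x10 offset and the count value can be read off the structure layout and the bias constants. A standalone sketch, assuming the usual 3.0-era x86_64 definitions (lockdep disabled; the mock_* types are stand-ins for the kernel ones):

/* Mirrors the x86_64 rw_semaphore layout: count at 0x00, a 4-byte ticket
 * spinlock (plus padding) at 0x08, wait_list at 0x10 -- hence the list head
 * at ffff8805b848f8a8 + 0x10 = ffff8805b848f8b8. */
#include <stdio.h>
#include <stddef.h>

struct mock_list_head { void *next, *prev; };

struct mock_rw_semaphore {
        long                  count;            /* 0x00 */
        unsigned int          wait_lock;        /* 0x08: slock = 0x70007 */
        struct mock_list_head wait_list;        /* 0x10 */
};

#define RWSEM_ACTIVE_BIAS       1L
#define RWSEM_WAITING_BIAS      (-0x100000000L)

int main(void)
{
        long count = (long)0xffffffff00000001UL;        /* value from the dump */

        printf("wait_list offset = %#zx\n",
               offsetof(struct mock_rw_semaphore, wait_list));

        /* Since wait_list is non-empty, RWSEM_WAITING_BIAS is included in
         * count; the remaining +1 means exactly one active holder and no
         * active writer (an active writer would contribute another
         * WAITING_BIAS). In other words, one reader still holds the lock. */
        printf("waiters queued + one active reader: %d\n",
               count == RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS);
        return 0;
}

This already hints at the root cause: with the queue full of blocked tasks, the lock is pinned by a single shared (read) holder rather than a stuck writer. Walking the list headed at ffff8805b848f8b8 gives the queued waiters: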
crash> list rwsem_waiter.list -H ffff8805b848f8b8 -s rwsem_waiter.task
ffff88173e599d80
task = 0xffff88173e5964c0
ffff881706047ca8
task = 0xffff8817060442c0
ffff881706057ca8
task = 0xffff8817060543c0
ffff88170605bca8
task = 0xffff881706058400
ffff883ec2437d80
task = 0xffff883e23c2c380
ffff883e54a0fd80
task = 0xffff883998b36200
ffff881ec528bd80
task = 0xffff881cd766e300
Because a new waiter is appended to the tail of the wait queue:
if (list_empty(&sem->wait_list))
adjustment += RWSEM_WAITING_BIAS;
list_add_tail(&waiter.list, &sem->wait_list);
the first (oldest) waiter is the one the list head's next points to, i.e. ffff88173e599d80, whose task is:
crash> task 0xffff88173e5964c0
PID: 23683 TASK: ffff88173e5964c0 CPU: 3 COMMAND: "dev_rdwriter"
struct task_struct {
state = 2,-----------------------------TASK_UNINTERRUPTIBLE
stack = 0xffff88173e598000,
usage = {
counter = 3
},
And its corresponding stack is:
crash> bt 23683
PID: 23683 TASK: ffff88173e5964c0 CPU: 3 COMMAND: "dev_rdwriter"
#0 [ffff88173e599c30] schedule at ffffffff814652b9
#1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
#2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
#3 [ffff88173e599e18] down_write at ffffffff8146679c
#4 [ffff88173e599e20] xfs_ilock at ffffffffa05ddb6c [xfs]
#5 [ffff88173e599e50] xfs_file_fallocate at ffffffffa0604e47 [xfs]
#6 [ffff88173e599f20] do_fallocate at ffffffff8115d629
#7 [ffff88173e599f50] sys_fallocate at ffffffff8115d676
#8 [ffff88173e599f80] system_call_fastpath at ffffffff8146f5f2
RIP: 00007f344662b010 RSP: 00007f2db9d0a2a0 RFLAGS: 00003287
RAX: 000000000000011d RBX: ffffffff8146f5f2 RCX: 0000000000000000
RDX: 0000000000800000 RSI: 0000000000000001 RDI: 0000000000000050
RBP: 0000000000880000 R8: 00007f340a0ecb60 R9: 0000000000005c83
R10: 0000000000100000 R11: 0000000000003246 R12: 0000000000880000
R13: 0000000009b20a20 R14: 00007f3408e01328 R15: 0000000000004800
ORIG_RAX: 000000000000011d CS: 0033 SS: 002b
Clearly, this first waiter is trying to acquire the write lock. A write lock is exclusive, and regardless of whether the rwsem favors readers or writers, under normal circumstances no task should have to wait this long. The only remaining possibility is that someone acquired the lock and never released it.
While walking the other waiting tasks, we found the following:
crash> task 0xffff8817060442c0
PID: 26637 TASK: ffff8817060442c0 CPU: 6 COMMAND: "21-IFileSender"
struct task_struct {
state = 2,
stack = 0xffff881706046000,
usage = {
counter = 3
},
crash> bt 26637
PID: 26637 TASK: ffff8817060442c0 CPU: 6 COMMAND: "21-IFileSender"
#0 [ffff881706047b58] schedule at ffffffff814652b9
#1 [ffff881706047ca0] rwsem_down_failed_common at ffffffff81467485
#2 [ffff881706047d00] call_rwsem_down_read_failed at ffffffff81264ce4
#3 [ffff881706047d48] down_read at ffffffff8146677e
#4 [ffff881706047d50] xfs_ilock at ffffffffa05ddb3c [xfs]
#5 [ffff881706047d80] caq_xfs_file_splice_read at ffffffffa06a1850 [pagecachelimit]--------this is code I added myself
#6 [ffff881706047dc0] splice_direct_to_actor at ffffffff81188c6c
#7 [ffff881706047e30] do_sendfile at ffffffffa07119ab [witdriver]
#8 [ffff881706047ec0] my_sendfile at ffffffffa071db3a [witdriver]
#9 [ffff881706047f80] system_call_fastpath at ffffffff8146f5f2
RIP: 00007f13992bf3c9 RSP: 00007f136ab5f3e8 RFLAGS: 00000203
RAX: 00000000000000b5 RBX: ffffffff8146f5f2 RCX: 00007f136ab5f42c
RDX: 0000000000000048 RSI: 00007f136ab5e730 RDI: 0000000000000001
RBP: 00007f136ab5f800 R8: 00007f136ab5f800 R9: 0000000000000000
R10: 0000000000049fff R11: 0000000000000246 R12: 0000000000000524
R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000042e28
ORIG_RAX: 00000000000000b5 CS: 0033 SS: 002b
One of the waiting paths goes through code I added myself, so I immediately reviewed that code and found the following:
static ssize_t
caq_xfs_file_splice_read(
        struct file             *infilp,
        loff_t                  *ppos,
        struct pipe_inode_info  *pipe,
        size_t                  count,
        unsigned int            flags)
{
        loff_t          isize, left;
        int             ret;
        struct xfs_inode *ip = XFS_I(infilp->f_mapping->host);

        xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);------------------------------the lock is taken here

        isize = i_size_read(infilp->f_mapping->host);
        if (unlikely(*ppos >= isize))//the read start offset is beyond the file size
                return 0;------------------------------------------------------here is the bug: it returns without releasing the lock

        left = isize - *ppos;
        if (unlikely(left < count))//make sure the read stops at end of file
                count = left;

        ret = caq___generic_file_splice_read(infilp, ppos, pipe, count,
                                             flags);
        if (ret > 0) {
                *ppos += ret;//advance the current file offset
        }

        xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
        return ret;
}
The bug is obvious: on an exceptional branch, the function returns right after taking the read lock, without releasing it. The business scenario that triggers it is multiple threads writing a file while other threads read it; when a read is issued with a starting offset beyond the current file size, the early-return path is hit.
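The fix is simply to release the shared lock on that early-return path; a minimal sketch of the corrected branch (only this branch changes):

        /* Corrected early-return path: drop the shared iolock before
         * returning, so a read that starts beyond EOF no longer leaks
         * XFS_IOLOCK_SHARED and starves every subsequent down_write(). */
        isize = i_size_read(infilp->f_mapping->host);
        if (unlikely(*ppos >= isize)) {
                xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
                return 0;
        }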