
Server crash

Posted: 11 Sep 2013, 12:27
redraw13
Hello,
As it happens, I have inherited a set of servers that I need to administer. I can handle minor problems, but now I have run into one that I can't solve, and I can't even pin down its cause.

There is a server with 16 GB of RAM, 8 cores and several terabytes of disk. It stores backups and runs several VPSes on OpenVZ. Recently we have noticed it hanging.
In its logs we see the following:
Log:
[spoiler]Sep 10 02:48:05 ns23 kernel: CT#202: nfs: server 213.186.121.67 not responding, still trying
Sep 10 02:48:05 ns23 last message repeated 7 times
Sep 10 02:50:17 ns23 nagios: SERVICE ALERT: localhost;Total Processes;CRITICAL;HARD;4;CRITICAL - Plugin timed out after 10 seconds
Sep 10 02:50:17 ns23 nagios: SERVICE NOTIFICATION: nagiosadmin;localhost;Total Processes;CRITICAL;notify-service-by-email;CRITICAL - Plugin timed out after 10 seconds
Sep 10 02:52:48 ns23 kernel: INFO: task kswapd0:400 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: kswapd0 D ffff81042d4ecbe0 0 400 161 401 397 (L-TLB)
Sep 10 02:52:48 ns23 kernel: ffff81042c6379e0 0000000000000046 ffff81009d2a7040 ffff81009d2a7000
Sep 10 02:52:48 ns23 kernel: ffff81042d4ecbe0 ffff81042e3e6120 00020bbba16d4aea 0004c5a5558d363c
Sep 10 02:52:48 ns23 kernel: ffff81042d4ecde8 0000000310cb4540 0000000000000000 ffff81042e3e2000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006fcad>] do_gettimeofday+0x40/0x8f
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064d84>] io_schedule+0x59/0x8a
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821442>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064fb9>] __wait_on_bit+0x40/0x6e
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065053>] out_of_line_wait_on_bit+0x6c/0x78
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a32b8>] wake_bit_function+0x0/0x23
Sep 10 02:52:48 ns23 kernel: [<ffffffff88824a3e>] :nfs:nfs_wait_on_requests_locked+0x70/0xca
Sep 10 02:52:48 ns23 kernel: [<ffffffff88825bdc>] :nfs:nfs_sync_inode_wait+0x60/0x1db
Sep 10 02:52:48 ns23 kernel: [<ffffffff8881b725>] :nfs:nfs_release_page+0x25/0x3e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d7339>] shrink_inactive_list+0x558/0x91e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d6cbe>] shrink_active_list+0x59f/0x5b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80012f50>] shrink_zone+0x16e/0x1d8
Sep 10 02:52:48 ns23 kernel: [<ffffffff8005abe7>] kswapd+0x35e/0x4c8
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a328a>] autoremove_wake_function+0x0/0x2e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a3072>] keventd_create_kthread+0x0/0xc4
Sep 10 02:52:48 ns23 kernel: [<ffffffff8005a889>] kswapd+0x0/0x4c8
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a3072>] keventd_create_kthread+0x0/0xc4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80033d25>] kthread+0xfe/0x132
Sep 10 02:52:48 ns23 kernel: [<ffffffff80061001>] child_rip+0xa/0x11
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a3072>] keventd_create_kthread+0x0/0xc4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80033c27>] kthread+0x0/0x132
Sep 10 02:52:48 ns23 kernel: [<ffffffff80060ff7>] child_rip+0x0/0x11

Sep 10 02:52:48 ns23 kernel: INFO: task httpd:53296 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: httpd D ffff8103544b0ca0 0 53296 43493 616857 53306 52001 (NOTLB)
Sep 10 02:52:48 ns23 kernel: ffff8101a8fed7b8 0000000000000086 ffff8100a3156490 ffff8101d96be728
Sep 10 02:52:48 ns23 kernel: ffff8103544b0ca0 ffff81042dc761a0 00020bd2c2e7b55a 0004c5db49d67e6c
Sep 10 02:52:48 ns23 kernel: ffff8103544b0ea8 000000058804edf7 0000000000000000 ffff81042dc72000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006fcad>] do_gettimeofday+0x40/0x8f
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064d84>] io_schedule+0x59/0x8a
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821442>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064fb9>] __wait_on_bit+0x40/0x6e
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065053>] out_of_line_wait_on_bit+0x6c/0x78
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a32b8>] wake_bit_function+0x0/0x23
Sep 10 02:52:48 ns23 kernel: [<ffffffff88824a3e>] :nfs:nfs_wait_on_requests_locked+0x70/0xca
Sep 10 02:52:48 ns23 kernel: [<ffffffff88825bdc>] :nfs:nfs_sync_inode_wait+0x60/0x1db
Sep 10 02:52:48 ns23 kernel: [<ffffffff8881b725>] :nfs:nfs_release_page+0x25/0x3e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d7339>] shrink_inactive_list+0x558/0x91e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d66f8>] isolate_lru_pages+0x98/0xbf
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d6cbe>] shrink_active_list+0x59f/0x5b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff800644fd>] thread_return+0x6a/0x177
Sep 10 02:52:48 ns23 kernel: [<ffffffff80012f50>] shrink_zone+0x16e/0x1d8
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d7e9f>] try_to_free_pages+0x1da/0x3f0
Sep 10 02:52:48 ns23 kernel: [<ffffffff8000f0a8>] __alloc_pages+0x1e9/0x34d
Sep 10 02:52:48 ns23 kernel: [<ffffffff800df238>] __vmalloc_area_node+0xb7/0x15d
Sep 10 02:52:48 ns23 kernel: [<ffffffff800fc857>] expand_fdtable+0xf9/0x297
Sep 10 02:52:48 ns23 kernel: [<ffffffff80031f3e>] expand_files+0x2c/0x3c
Sep 10 02:52:48 ns23 kernel: [<ffffffff8002304f>] dup_fd+0x133/0x281
Sep 10 02:52:48 ns23 kernel: [<ffffffff8004bb89>] copy_files+0x48/0x80
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001fc87>] copy_process+0x648/0x1818
Sep 10 02:52:48 ns23 kernel: [<ffffffff80091304>] do_fork_pid+0x71/0x20c
Sep 10 02:52:48 ns23 kernel: [<ffffffff800c3571>] audit_syscall_entry+0x1a4/0x1cf
Sep 10 02:52:48 ns23 kernel: [<ffffffff8003270c>] do_fork+0xb/0x10
Sep 10 02:52:48 ns23 kernel: [<ffffffff800628e9>] ia32_ptregs_common+0x25/0x4c

Sep 10 02:52:48 ns23 kernel: INFO: task sh:55587 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: sh D ffff810192d5cc60 0 55587 55520 651926 (NOTLB)
Sep 10 02:52:48 ns23 kernel: ffff8101b2ff78d8 0000000000000082 0000000000011210 ffff81019fd46ac0
Sep 10 02:52:48 ns23 kernel: ffff810192d5cc60 ffff81042dcac1e0 00020bbded21a540 0004c5aab06a783f
Sep 10 02:52:48 ns23 kernel: ffff810192d5ce68 0000000606d38f95 0000000000000000 ffff81042dca8000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006fcad>] do_gettimeofday+0x40/0x8f
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064d84>] io_schedule+0x59/0x8a
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821442>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064fb9>] __wait_on_bit+0x40/0x6e
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065053>] out_of_line_wait_on_bit+0x6c/0x78
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a32b8>] wake_bit_function+0x0/0x23
Sep 10 02:52:48 ns23 kernel: [<ffffffff88824a3e>] :nfs:nfs_wait_on_requests_locked+0x70/0xca
Sep 10 02:52:48 ns23 kernel: [<ffffffff88825bdc>] :nfs:nfs_sync_inode_wait+0x60/0x1db
Sep 10 02:52:48 ns23 kernel: [<ffffffff8881b725>] :nfs:nfs_release_page+0x25/0x3e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d7339>] shrink_inactive_list+0x558/0x91e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d66f8>] isolate_lru_pages+0x98/0xbf
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d6cbe>] shrink_active_list+0x59f/0x5b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80012f50>] shrink_zone+0x16e/0x1d8
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d7e9f>] try_to_free_pages+0x1da/0x3f0
Sep 10 02:52:48 ns23 kernel: [<ffffffff8000f0a8>] __alloc_pages+0x1e9/0x34d
Sep 10 02:52:48 ns23 kernel: [<ffffffff8003e710>] __get_free_pages+0xe/0x63
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001f71c>] copy_process+0xdd/0x1818
Sep 10 02:52:48 ns23 kernel: [<ffffffff80091304>] do_fork_pid+0x71/0x20c
Sep 10 02:52:48 ns23 kernel: [<ffffffff800c3571>] audit_syscall_entry+0x1a4/0x1cf
Sep 10 02:52:48 ns23 kernel: [<ffffffff8003270c>] do_fork+0xb/0x10
Sep 10 02:52:48 ns23 kernel: [<ffffffff800628e9>] ia32_ptregs_common+0x25/0x4c

Sep 10 02:52:48 ns23 kernel: INFO: task nfsd:1002336 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: nfsd D ffff8101b4bbf660 0 1002336 1 1002337 1002335 (L-TLB)
Sep 10 02:52:48 ns23 kernel: ffff8101abd21b60 0000000000000046 0000000000000000 ffffffff00000001
Sep 10 02:52:48 ns23 kernel: ffff8101b4bbf660 ffff81042bed1260 00020bbbbdf7a6b8 0004c5a5981fd749
Sep 10 02:52:48 ns23 kernel: ffff8101b4bbf868 ffff8101b4bbf660 0000000000000000 ffff81042dc1e000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff8003d17c>] __d_find_alias+0x78/0xd0
Sep 10 02:52:48 ns23 kernel: [<ffffffff8002bd96>] iput+0x4b/0x84
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065212>] __mutex_lock_slowpath+0x60/0x9b
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006525c>] .text.lock.mutex+0xf/0x14
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d252f>] generic_file_writev+0x37/0xa3
Sep 10 02:52:48 ns23 kernel: [<ffffffff800eef03>] do_readv_writev+0xc6/0x1ca
Sep 10 02:52:48 ns23 kernel: [<ffffffff80018312>] do_sync_write+0x0/0x115
Sep 10 02:52:48 ns23 kernel: [<ffffffff886cd075>] :nfsd:exp_find_key+0xa1/0xb4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80226d82>] kernel_recvmsg+0x3b/0x4d
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f34a9>] :sunrpc:svc_recvfrom+0xd2/0x113
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c9950>] :nfsd:nfsd_vfs_write+0xf2/0x30f
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001efac>] __dentry_open+0x112/0x1ed
Sep 10 02:52:48 ns23 kernel: [<ffffffff886ca380>] :nfsd:nfsd_write+0xb5/0xd5
Sep 10 02:52:48 ns23 kernel: [<ffffffff886d105a>] :nfsd:nfsd3_proc_write+0xea/0x109
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c61ab>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f2ca8>] :sunrpc:svc_process+0x43e/0x705
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065bef>] __down_read+0x12/0x92
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c68fe>] :nfsd:nfsd+0x223/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80061001>] child_rip+0xa/0x11
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80060ff7>] child_rip+0x0/0x11

Sep 10 02:52:48 ns23 kernel: INFO: task nfsd:1002337 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: nfsd D ffff81042bed09e0 0 1002337 1 1002338 1002336 (L-TLB)
Sep 10 02:52:48 ns23 kernel: ffff8102f57efb60 0000000000000046 0000000000000000 ffffffff00000001
Sep 10 02:52:48 ns23 kernel: ffff81042bed09e0 ffff810292bd34e0 00020bbbbdf678e9 0004c5a5981d16cd
Sep 10 02:52:48 ns23 kernel: ffff81042bed0be8 000000002bed09e0 ffffffff8031fb80 ffffffff8049df00
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065212>] __mutex_lock_slowpath+0x60/0x9b
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006525c>] .text.lock.mutex+0xf/0x14
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d252f>] generic_file_writev+0x37/0xa3
Sep 10 02:52:48 ns23 kernel: [<ffffffff800eef03>] do_readv_writev+0xc6/0x1ca
Sep 10 02:52:48 ns23 kernel: [<ffffffff80018312>] do_sync_write+0x0/0x115
Sep 10 02:52:48 ns23 kernel: [<ffffffff886cd075>] :nfsd:exp_find_key+0xa1/0xb4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80088989>] __wake_up_common+0x3e/0x68
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c9950>] :nfsd:nfsd_vfs_write+0xf2/0x30f
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001efac>] __dentry_open+0x112/0x1ed
Sep 10 02:52:48 ns23 kernel: [<ffffffff886ca380>] :nfsd:nfsd_write+0xb5/0xd5
Sep 10 02:52:48 ns23 kernel: [<ffffffff886d105a>] :nfsd:nfsd3_proc_write+0xea/0x109
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c61ab>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f2ca8>] :sunrpc:svc_process+0x43e/0x705
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065bef>] __down_read+0x12/0x92
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c68fe>] :nfsd:nfsd+0x223/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80061001>] child_rip+0xa/0x11
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80060ff7>] child_rip+0x0/0x11

Sep 10 02:52:48 ns23 kernel: INFO: task nfsd:1002338 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: nfsd D ffff8104250f51a0 0 1002338 1 1002339 1002337 (L-TLB)
Sep 10 02:52:48 ns23 kernel: ffff810395525b60 0000000000000046 0000000000000000 ffffffff00000001
Sep 10 02:52:48 ns23 kernel: ffff8104250f51a0 ffff81042e3e6120 00020bbbbde4c07a 0004c5a597f3c0cc
Sep 10 02:52:48 ns23 kernel: ffff8104250f53a8 00000003250f51a0 0000000000000000 ffff81042e3e2000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065212>] __mutex_lock_slowpath+0x60/0x9b
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006525c>] .text.lock.mutex+0xf/0x14
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d252f>] generic_file_writev+0x37/0xa3
Sep 10 02:52:48 ns23 kernel: [<ffffffff800eef03>] do_readv_writev+0xc6/0x1ca
Sep 10 02:52:48 ns23 kernel: [<ffffffff80018312>] do_sync_write+0x0/0x115
Sep 10 02:52:48 ns23 kernel: [<ffffffff886cd075>] :nfsd:exp_find_key+0xa1/0xb4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80226d82>] kernel_recvmsg+0x3b/0x4d
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f34a9>] :sunrpc:svc_recvfrom+0xd2/0x113
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c9950>] :nfsd:nfsd_vfs_write+0xf2/0x30f
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001efac>] __dentry_open+0x112/0x1ed
Sep 10 02:52:48 ns23 kernel: [<ffffffff886ca380>] :nfsd:nfsd_write+0xb5/0xd5
Sep 10 02:52:48 ns23 kernel: [<ffffffff886d105a>] :nfsd:nfsd3_proc_write+0xea/0x109
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c61ab>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f2ca8>] :sunrpc:svc_process+0x43e/0x705
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065bef>] __down_read+0x12/0x92
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c68fe>] :nfsd:nfsd+0x223/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80061001>] child_rip+0xa/0x11
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80060ff7>] child_rip+0x0/0x11

Sep 10 02:52:48 ns23 kernel: INFO: task nfsd:1002339 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: nfsd D ffff8104255fc5a0 0 1002339 1 1002340 1002338 (L-TLB)
Sep 10 02:52:48 ns23 kernel: ffff8103b8ba3b60 0000000000000046 0000000000000000 ffffffff00000001
Sep 10 02:52:48 ns23 kernel: ffff8104255fc5a0 ffff8101b4bbf660 00020bbbbdf61845 0004c5a5981c35bc
Sep 10 02:52:48 ns23 kernel: ffff8104255fc7a8 00000004255fc5a0 ffffffff8031fb80 ffff81042dc1e000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065212>] __mutex_lock_slowpath+0x60/0x9b
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006525c>] .text.lock.mutex+0xf/0x14
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d252f>] generic_file_writev+0x37/0xa3
Sep 10 02:52:48 ns23 kernel: [<ffffffff800eef03>] do_readv_writev+0xc6/0x1ca
Sep 10 02:52:48 ns23 kernel: [<ffffffff80018312>] do_sync_write+0x0/0x115
Sep 10 02:52:48 ns23 kernel: [<ffffffff886cd075>] :nfsd:exp_find_key+0xa1/0xb4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80088989>] __wake_up_common+0x3e/0x68
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c9950>] :nfsd:nfsd_vfs_write+0xf2/0x30f
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001efac>] __dentry_open+0x112/0x1ed
Sep 10 02:52:48 ns23 kernel: [<ffffffff886ca380>] :nfsd:nfsd_write+0xb5/0xd5
Sep 10 02:52:48 ns23 kernel: [<ffffffff886d105a>] :nfsd:nfsd3_proc_write+0xea/0x109
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c61ab>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f2ca8>] :sunrpc:svc_process+0x43e/0x705
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065bef>] __down_read+0x12/0x92
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c68fe>] :nfsd:nfsd+0x223/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80061001>] child_rip+0xa/0x11
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80060ff7>] child_rip+0x0/0x11

Sep 10 02:52:48 ns23 kernel: INFO: task nfsd:1002340 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: nfsd D ffff81042c71aba0 0 1002340 1 1002341 1002339 (L-TLB)
Sep 10 02:52:48 ns23 kernel: ffff8101bac89340 0000000000000046 ffff810001015280 0000000000000100
Sep 10 02:52:48 ns23 kernel: ffff81042c71aba0 ffff8104255fc5a0 00020bbbbdd25144 0004c5a597c8bd0f
Sep 10 02:52:48 ns23 kernel: ffff81042c71ada8 0000000425876140 ffffffff8031fb80 ffff81042dc74000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006fcad>] do_gettimeofday+0x40/0x8f
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064d84>] io_schedule+0x59/0x8a
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821442>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80064fb9>] __wait_on_bit+0x40/0x6e
Sep 10 02:52:48 ns23 kernel: [<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065053>] out_of_line_wait_on_bit+0x6c/0x78
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a32b8>] wake_bit_function+0x0/0x23
Sep 10 02:52:48 ns23 kernel: [<ffffffff88824a3e>] :nfs:nfs_wait_on_requests_locked+0x70/0xca
Sep 10 02:52:48 ns23 kernel: [<ffffffff88825bdc>] :nfs:nfs_sync_inode_wait+0x60/0x1db
Sep 10 02:52:48 ns23 kernel: [<ffffffff8881b725>] :nfs:nfs_release_page+0x25/0x3e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d7339>] shrink_inactive_list+0x558/0x91e
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d66f8>] isolate_lru_pages+0x98/0xbf
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d6cbe>] shrink_active_list+0x59f/0x5b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80012f50>] shrink_zone+0x16e/0x1d8
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d7e9f>] try_to_free_pages+0x1da/0x3f0
Sep 10 02:52:48 ns23 kernel: [<ffffffff8003c723>] unmap_underlying_metadata+0x1d/0x5b
Sep 10 02:52:48 ns23 kernel: [<ffffffff8000f0a8>] __alloc_pages+0x1e9/0x34d
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d1d55>] grab_cache_page_write_begin+0x4a/0x89
Sep 10 02:52:48 ns23 kernel: [<ffffffff880514d1>] :ext3:ext3_write_begin+0x70/0x1cc
Sep 10 02:52:48 ns23 kernel: [<ffffffff8000fa09>] generic_file_buffered_write+0x166/0x673
Sep 10 02:52:48 ns23 kernel: [<ffffffff800660d9>] _spin_lock_bh+0x9/0x14
Sep 10 02:52:48 ns23 kernel: [<ffffffff800168d6>] __generic_file_aio_write_nolock+0x369/0x3b6
Sep 10 02:52:48 ns23 kernel: [<ffffffff80033025>] sock_common_recvmsg+0x2d/0x43
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d24df>] __generic_file_write_nolock+0x9a/0xb3
Sep 10 02:52:48 ns23 kernel: [<ffffffff800a328a>] autoremove_wake_function+0x0/0x2e
Sep 10 02:52:48 ns23 kernel: [<ffffffff8805791b>] :ext3:ext3_get_dentry+0x89/0xbb
Sep 10 02:52:48 ns23 kernel: [<ffffffff886bb3ce>] :exportfs:find_exported_dentry+0x85/0x4e5
Sep 10 02:52:48 ns23 kernel: [<ffffffff800650a8>] mutex_lock+0xd/0x1d
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d2540>] generic_file_writev+0x48/0xa3
Sep 10 02:52:48 ns23 kernel: [<ffffffff800eef03>] do_readv_writev+0xc6/0x1ca
Sep 10 02:52:48 ns23 kernel: [<ffffffff80018312>] do_sync_write+0x0/0x115
Sep 10 02:52:48 ns23 kernel: [<ffffffff886cd075>] :nfsd:exp_find_key+0xa1/0xb4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80088989>] __wake_up_common+0x3e/0x68
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c9950>] :nfsd:nfsd_vfs_write+0xf2/0x30f
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001efac>] __dentry_open+0x112/0x1ed
Sep 10 02:52:48 ns23 kernel: [<ffffffff886ca380>] :nfsd:nfsd_write+0xb5/0xd5
Sep 10 02:52:48 ns23 kernel: [<ffffffff886d105a>] :nfsd:nfsd3_proc_write+0xea/0x109
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c61ab>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f2ca8>] :sunrpc:svc_process+0x43e/0x705
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065bef>] __down_read+0x12/0x92
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c68fe>] :nfsd:nfsd+0x223/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80061001>] child_rip+0xa/0x11
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80060ff7>] child_rip+0x0/0x11

Sep 10 02:52:48 ns23 kernel: INFO: task nfsd:1002341 blocked for more than 300 seconds.
Sep 10 02:52:48 ns23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 10 02:52:48 ns23 kernel: nfsd D ffff81042d078960 0 1002341 1 1002342 1002340 (L-TLB)
Sep 10 02:52:48 ns23 kernel: ffff8102f326fb60 0000000000000046 0000000000000000 ffffffff00000001
Sep 10 02:52:48 ns23 kernel: ffff81042d078960 ffff81042e3b00e0 00020bbbbde30a34 0004c5a597efc0f0
Sep 10 02:52:48 ns23 kernel: ffff81042d078b68 000000022d078960 0000000000000000 ffff81042e3ac000
Sep 10 02:52:48 ns23 kernel: Call Trace:
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065212>] __mutex_lock_slowpath+0x60/0x9b
Sep 10 02:52:48 ns23 kernel: [<ffffffff8006525c>] .text.lock.mutex+0xf/0x14
Sep 10 02:52:48 ns23 kernel: [<ffffffff800d252f>] generic_file_writev+0x37/0xa3
Sep 10 02:52:48 ns23 kernel: [<ffffffff800eef03>] do_readv_writev+0xc6/0x1ca
Sep 10 02:52:48 ns23 kernel: [<ffffffff80018312>] do_sync_write+0x0/0x115
Sep 10 02:52:48 ns23 kernel: [<ffffffff886cd075>] :nfsd:exp_find_key+0xa1/0xb4
Sep 10 02:52:48 ns23 kernel: [<ffffffff80088989>] __wake_up_common+0x3e/0x68
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c9950>] :nfsd:nfsd_vfs_write+0xf2/0x30f
Sep 10 02:52:48 ns23 kernel: [<ffffffff8001efac>] __dentry_open+0x112/0x1ed
Sep 10 02:52:48 ns23 kernel: [<ffffffff886ca380>] :nfsd:nfsd_write+0xb5/0xd5
Sep 10 02:52:48 ns23 kernel: [<ffffffff886d105a>] :nfsd:nfsd3_proc_write+0xea/0x109
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c61ab>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Sep 10 02:52:48 ns23 kernel: [<ffffffff885f2ca8>] :sunrpc:svc_process+0x43e/0x705
Sep 10 02:52:48 ns23 kernel: [<ffffffff80065bef>] __down_read+0x12/0x92
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c68fe>] :nfsd:nfsd+0x223/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80061001>] child_rip+0xa/0x11
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff886c66db>] :nfsd:nfsd+0x0/0x3b1
Sep 10 02:52:48 ns23 kernel: [<ffffffff80060ff7>] child_rip+0x0/0x11
Sep 10 02:52:48 ns23 kernel:
Sep 10 02:54:06 ns23 kernel: CT#202: nfs: server 213.186.121.67 not responding, still trying
Sep 10 02:54:06 ns23 last message repeated 7 times
Sep 10 02:55:27 ns23 nagios: SERVICE ALERT: localhost;Current Users;CRITICAL;SOFT;1;(Service Check Timed Out)
Sep 10 02:55:37 ns23 nagios: SERVICE ALERT: localhost;Current Users;OK;SOFT;2;USERS OK - 0 users currently logged in
Sep 10 02:57:17 ns23 nagios: SERVICE ALERT: localhost;Swap Usage;CRITICAL;SOFT;1;(Service Check Timed Out)[/spoiler]
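To get an overview of a long hung-task dump like the one above, here is a minimal Python sketch. The function names are hypothetical, and it assumes the kernel messages have been saved to a text file. It groups the call-trace symbols under each "INFO: task NAME:PID blocked" line and applies a rough heuristic label. On this dump it separates the tasks stuck in nfs_wait_bit_uninterruptible (kswapd0, httpd, sh and nfsd:1002340) from the nfsd threads queued in __mutex_lock_slowpath.

```python
import re

# "INFO: task nfsd:1002336 blocked for more than 300 seconds."
TASK_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than \d+ seconds")
# "[<ffffffff88821439>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd" -> symbol name
FRAME_RE = re.compile(r"\] (\S+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_hung_tasks(lines):
    """Map (task_name, pid) to the list of call-trace symbols printed after it."""
    traces = {}
    current = None
    for line in lines:
        task = TASK_RE.search(line)
        if task:
            current = (task.group(1), int(task.group(2)))
            traces[current] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            traces[current].append(frame.group(1))
    return traces


def classify(symbols):
    """Heuristic label for where a blocked task is waiting."""
    if any(s.endswith("nfs_wait_bit_uninterruptible") for s in symbols):
        return "waiting on NFS client I/O"
    if "__mutex_lock_slowpath" in symbols:
        return "waiting for a mutex"
    return "other"
```

For example, iterate over parse_hung_tasks(open("/var/log/messages")).items() and print each task with classify(symbols).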


After that the log on the hardware node stops, and it only resumes from the moment the whole server was rebooted.
What is strange is that the VPSes hosted on it kept logging, and there is nothing suspicious in their logs.
I'd like to understand what is happening to it, because it sits in the same chassis as three other servers, and rebooting it means rebooting the others as well.

Thanks for any help.

Re: Server crash

Posted: 11 Sep 2013, 15:55
Raven
It looks like, on top of everything else, you also have the nfsd daemon running there, and apparently it eats up all the RAM, so the server starts swapping hard, and swap, as we know, isn't unlimited either. My advice: to start with, run atop as a daemon and let it collect statistics. After the next such incident, run atop -r and step through the snapshots with t / Shift+T to see what is eating the memory. By the way, it would help if you told us the OS version and the kernel you are running.
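If atop cannot be installed right away, a much cruder stand-in is to log /proc/meminfo periodically. This is a minimal sketch, not atop itself, assuming a Python 3 interpreter is available on the node; the log path /var/log/meminfo-sampler.log and the 60-second interval are arbitrary choices. It records free RAM, page cache and swap in use, which is enough to check whether memory really runs out before a hang.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def snapshot_line(info, now=None):
    """One compact log line with free RAM, page cache and swap in use, in MB."""
    if now is None:
        now = time.strftime("%Y-%m-%d %H:%M:%S")
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return "%s free=%dMB cached=%dMB swap_used=%dMB" % (
        now, info.get("MemFree", 0) // 1024, info.get("Cached", 0) // 1024, swap_used // 1024)


def sample_forever(path="/var/log/meminfo-sampler.log", interval=60):
    """Append one snapshot every `interval` seconds until killed."""
    while True:
        with open("/proc/meminfo") as src:
            line = snapshot_line(parse_meminfo(src.read()))
        with open(path, "a") as out:
            out.write(line + "\n")
            out.flush()
            # fsync makes it likelier that the last samples survive a hard reset,
            # but it can itself block if disk I/O is already stuck.
            os.fsync(out.fileno())
        time.sleep(interval)
```

Started in the background (for example under nohup), sample_forever() leaves a timeline to compare against the moment of the next hang.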

Re: Server crash

Posted: 12 Sep 2013, 15:03
redraw13
Thanks for the advice, but we can't afford to wait for another crash; it's too painful. The server goes down every 7 days. It runs CentOS 5.9 with kernel 2.6.18-238.19.1.el5.028stab092.2.
nfsd runs there because some servers connect to it over this protocol and drop their backups onto it.

Re: Server crash

Posted: 12 Sep 2013, 15:25
Raven
redraw13 wrote: nfsd runs there because some servers connect to it over this protocol and drop their backups onto it.
Aren't they doing it in sync? I mean, all at the same time. The periodicity is interesting, you know; I suspect it goes down because everyone rushes to dump their backups at once. You could try spreading them over different days.
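Spreading the clients out can be planned mechanically. Below is a minimal sketch with hypothetical helper names; the client names are taken from the /etc/exports listing in this thread, and the backup script path /usr/local/bin/backup.sh is a made-up placeholder. It assigns clients to start hours round-robin and renders daily crontab entries; the same stagger() works if the slots are weekdays instead.

```python
def stagger(clients, slots):
    """Spread clients over slots round-robin; sorting keeps the plan stable between runs."""
    if not slots:
        raise ValueError("need at least one slot")
    plan = {slot: [] for slot in slots}
    for i, client in enumerate(sorted(clients)):
        plan[slots[i % len(slots)]].append(client)
    return plan


def cron_line(hour, minute=0, command="/usr/local/bin/backup.sh"):
    """Daily crontab entry; the default script path is a hypothetical placeholder."""
    return "%d %d * * * %s" % (minute, hour, command)
```

For example, stagger() over the seven clients (server1 to server5, btl, asterisk) with slots [1, 3, 5] puts at most three of them in any one start hour.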

Alternatively, I'd suggest dropping NFS altogether; for backups I use rsync over ssh, or scp. But I gather you're not really inclined toward drastic changes right now.
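For the rsync-over-ssh route, here is a minimal sketch of how a client-side script might assemble the command. The function names are hypothetical, the user, host and paths are placeholders, and key-based ssh authentication is assumed to be set up already. --partial keeps partially transferred files so an interrupted run can resume, and the optional --bwlimit caps the transfer rate so that one client cannot saturate the backup server.

```python
import subprocess


def rsync_over_ssh_cmd(src, user, host, dest, ssh_port=22, bwlimit=None):
    """Build an argv that pushes `src` to user@host:dest with rsync over ssh."""
    cmd = ["rsync", "-a", "--partial", "-e", "ssh -p %d" % ssh_port]
    if bwlimit:
        # --bwlimit is in units of 1024 bytes per second unless a suffix is given.
        cmd.append("--bwlimit=%d" % bwlimit)
    cmd += [src, "%s@%s:%s" % (user, host, dest)]
    return cmd


def push(cmd):
    """Run the transfer; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(cmd, check=True)
```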
OK, please show the contents of /etc/exports and the line from /etc/fstab on the client machine that mounts the share. Let's see how it's set up.

Also, you should check the state of the disks: if one of them starts lagging, all I/O to it gets queued, the queue keeps growing, and, as you can imagine, that's no fun.
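One way to spot a lagging disk without extra tools is to diff two snapshots of /proc/diskstats. This is a minimal sketch with hypothetical helper names. It derives the average milliseconds per completed I/O over the interval, roughly what iostat -x reports as await, plus the number of I/Os currently in flight. A disk whose figures are far above its neighbours is the first one to inspect more closely.

```python
import time


def parse_diskstats(text):
    """Map device -> (I/Os completed, ms spent on them, I/Os in flight) from /proc/diskstats."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            # Partition lines on older kernels carry only a few counters; skip them.
            continue
        reads, ms_reading = int(f[3]), int(f[6])
        writes, ms_writing = int(f[7]), int(f[10])
        stats[f[2]] = (reads + writes, ms_reading + ms_writing, int(f[11]))
    return stats


def await_ms(before, after):
    """Average ms per I/O completed between two snapshots (roughly iostat's await)."""
    result = {}
    for dev, (ios1, ms1, _) in after.items():
        if dev in before:
            ios0, ms0, _ = before[dev]
            done = ios1 - ios0
            result[dev] = (ms1 - ms0) / done if done else 0.0
    return result


def sample(interval=10):
    """Take two snapshots `interval` seconds apart; return (await per device, in flight now)."""
    with open("/proc/diskstats") as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval)
    with open("/proc/diskstats") as fh:
        after = parse_diskstats(fh.read())
    return await_ms(before, after), {dev: v[2] for dev, v in after.items()}
```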

Re: Server crash

Posted: 12 Sep 2013, 17:29
redraw13
/etc/exports

/disk4/backup <IP>(rw,no_root_squash)
/disk2/server1 <IP>(rw,no_root_squash)
/disk4/backup/ <IP>(rw,no_root_squash)
/disk2/server2 <IP>(rw,no_root_squash)
/disk2/btl <IP>(rw,no_root_squash)
/disk2/server3 <IP>(rw,no_root_squash)
/disk2/server4 <IP>(rw,no_root_squash)
/disk2/server5 <IP>(rw,no_root_squash)
/disk4/backup/asterisk <IP>(rw,no_root_squash)

There are no mount lines in /etc/fstab. A script mounts the share right before the backup starts:
mount -t nfs <IP>:/disk4/backup /backup
Backups are written there every day, and the large volume of data has never caused problems.
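Since the share is mounted by a script, it is worth making sure the script always unmounts it, even when the backup step fails. Here is a minimal Python sketch of such a wrapper, assuming the backup step is an external command. The function name is hypothetical, and the runner parameter only exists so the logic can be tested without touching real mounts.

```python
import subprocess


def backup_with_nfs(share, mountpoint, backup_cmd, options=None, runner=subprocess.run):
    """Mount `share` (e.g. "<IP>:/disk4/backup"), run the backup, then always unmount."""
    mount_cmd = ["mount", "-t", "nfs"]
    if options:
        mount_cmd += ["-o", options]
    runner(mount_cmd + [share, mountpoint], check=True)
    try:
        runner(backup_cmd, check=True)
    finally:
        # check=False: a failing umount should not mask an error from the backup step.
        runner(["umount", mountpoint], check=False)
```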

Re: Server crash

Posted: 13 Sep 2013, 11:42
Raven
In /etc/exports I would add a couple more options to the shares: async,no_subtree_check.
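The option change can be applied to every share in one pass. Below is a minimal sketch with a hypothetical function name, assuming paths without spaces and the host(options) syntax shown above. It appends async and no_subtree_check to each client spec, but leaves an option out where its opposite (sync or subtree_check) is already set explicitly. Keep in mind that async lets the server acknowledge writes before they reach disk, so a server crash can lose data the client believes was written.

```python
OPPOSITE = {"async": "sync", "sync": "async",
            "no_subtree_check": "subtree_check", "subtree_check": "no_subtree_check"}


def add_export_options(line, extra=("async", "no_subtree_check")):
    """Add `extra` options to every client spec on one /etc/exports line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line.rstrip("\n")  # blank lines and comments are left as they are
    path, *clients = stripped.split()
    specs = []
    for client in clients:
        host, paren, opts = client.partition("(")
        current = [o for o in opts.rstrip(")").split(",") if o] if paren else []
        for opt in extra:
            # Do not override an explicit choice such as "sync" or "subtree_check".
            if opt not in current and OPPOSITE.get(opt) not in current:
                current.append(opt)
        specs.append("%s(%s)" % (host, ",".join(current)))
    return " ".join([path] + specs)
```

After rewriting the file, exportfs -ra re-reads /etc/exports and applies the new options.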

You could also experiment with the mount options on the clients: tune the buffer sizes and so on.

Code:

defaults,nosuid,nolock,noatime,bg,vers=3,rsize=32768,wsize=32768,hard,intr
This is how the NFS client is mounted on my last server that still uses NFS.
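To experiment with those options (for example larger rsize/wsize), it helps to treat the option string as data. This is a minimal sketch with hypothetical helper names: it parses a comma-separated option string, applies overrides and renders an /etc/fstab line. The server address in the test values is a placeholder.

```python
def parse_mount_options(opts):
    """Split "a,b=1,c" into an ordered dict {"a": None, "b": "1", "c": None}."""
    result = {}
    for item in opts.split(","):
        if item:
            key, eq, value = item.partition("=")
            result[key] = value if eq else None
    return result


def render_options(options):
    """Inverse of parse_mount_options: join back into "a,b=1,c"."""
    return ",".join(k if v is None else "%s=%s" % (k, v) for k, v in options.items())


def tune(opts, **overrides):
    """Return `opts` with some options replaced or appended, e.g. tune(s, rsize="65536")."""
    parsed = parse_mount_options(opts)
    parsed.update(overrides)
    return render_options(parsed)


def fstab_line(share, mountpoint, opts):
    """An /etc/fstab entry for an NFS share, with dump and fsck pass both set to 0."""
    return "%s %s nfs %s 0 0" % (share, mountpoint, opts)
```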