Kernel panic and non-working remote management
Miroslav Lachman
000.fbsd at quip.cz
Thu Apr 14 04:42:24 CEST 2016
On a fairly old Sun Fire x2100 M2 machine I run FreeBSD 10.3, and over
the weekend it hit a kernel panic:
kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff809668cf
stack pointer = 0x28:0xfffffe0175963810
frame pointer = 0x28:0xfffffe01759638a0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 12 (irq23: atapci1)
trap number = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff8098e390 at kdb_backtrace+0x60
#1 0xffffffff80951066 at vpanic+0x126
#2 0xffffffff80950f33 at panic+0x43
#3 0xffffffff80d55f7b at trap_fatal+0x36b
#4 0xffffffff80d5627d at trap_pfault+0x2ed
#5 0xffffffff80d558fa at trap+0x47a
#6 0xffffffff80d3b8d2 at calltrap+0x8
#7 0xffffffff80e3f7de at handleevents+0x18e
#8 0xffffffff80e40118 at timercb+0x318
#9 0xffffffff80e794fc at lapic_handle_timer+0x9c
#10 0xffffffff80d3c42c at Xtimerint+0x8c
#11 0xffffffff803f68e5 at ata_interrupt+0x45
#12 0xffffffff803fdcfe at ata_generic_intr+0x1e
#13 0xffffffff8091c99b at intr_event_execute_handlers+0xab
#14 0xffffffff8091cde6 at ithread_loop+0x96
#15 0xffffffff8091a4ea at fork_exit+0x9a
The server hung in this state and I had to restart it through the IP
power sockets, because even the embedded remote management (eLOM) was
unreachable.
After the server came back up, it ran for about half an hour and then
went completely silent again. This time not even a kernel panic message
was left in messages. Remote management was unreachable again, so I
suspected failing HW.
I restarted the server through the IP power sockets again and watched
over KVM what was happening. It booted terribly slowly, but I attribute
that partly to the gmirror synchronization that was in progress, with
bgfsck starting on top of that (although that probably happens only
after all services have started).
Then another panic occurred:
panic: ufs_dirbad: /vol0: bad dir ino 25130646 at offset 512: mangled entry
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff8098e390 at kdb_backtrace+0x60
#1 0xffffffff80951066 at vpanic+0x126
#2 0xffffffff80950f33 at panic+0x43
#3 0xffffffff80bb7a60 at ufs_lookup_ino+0xea0
#4 0xffffffff80e807b1 at VOP_CACHEDLOOKUP_AVP+0xa1
#5 0xffffffff809e4486 at vfs_cache_lookup+0xd6
#6 0xffffffff80e806a1 at VOP_LOOKUP_APV+0xa1
#7 0xffffffff809ecba1 at lookup+0x5a1
#8 0xffffffff809ec304 at namei+0x4d4
#9 0xffffffff80a05a0d at vn_open_cred+0x24d
#10 0xffffffff809fecef at kern_openat+0x26f
#11 0xffffffff80d5694f at amd64_syscall+0x40f
#12 0xffffffff80d3bbbb at Xfast_syscall+0xfb
Uptime: 25m33s
Then a colleague in the server room moved the disks into a spare
machine. I booted it only into single-user mode and ran fsck on all
partitions. There were a lot of lost files and other errors.
After the repair the server booted normally, but after running for a
while this message appeared:
free inode /vol0/25059517 had 4 blocks
So I expected another panic, but nothing else happened and the machine
has been running normally since Saturday.
I put different disks into the original HW and tried stressing the
server in various ways, upgrading the OS and reinstalling packages, and
everything runs normally.
So my original guess that the HW is failing was not really confirmed.
Even the fact that remote management stops working after a kernel panic
could perhaps be explained by these servers sharing the BMC network
interface with the system, so that a panic may lead to some faulty
reinitialization of the NIC and management stops working. (Is this
guess correct?)
Can anything interesting be read from the panics above?
Right now I really have no idea why it died the first time and why it
then kept crashing again and again, or whether the disk problems
preceded the first panic or were only its consequence.
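
In case it helps anyone comparing the two traces: below is a minimal
sketch in Python of how the "KDB: stack backtrace" lines could be
parsed to find the first frame that is not part of the panic/trap
machinery. The function names (parse_backtrace, first_non_panic_frame)
and the list of skipped symbols are my own assumptions for
illustration; they are not part of any FreeBSD tool, and the result is
only a hint where to start looking, not a diagnosis.

```python
import re

# Matches KDB backtrace lines such as
#   "#3 0xffffffff80d55f7b at trap_fatal+0x36b"
FRAME_RE = re.compile(
    r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\w+)\+(0x[0-9a-fA-F]+)$"
)

# Assumption: these symbols belong to the generic panic/trap path and
# say little about the actual cause, so they are skipped.
PANIC_MACHINERY = {
    "kdb_backtrace", "vpanic", "panic",
    "trap_fatal", "trap_pfault", "trap", "calltrap",
}


def parse_backtrace(text):
    """Return a list of (frame_no, address, symbol, offset) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((
                int(m.group(1)),        # frame number
                int(m.group(2), 16),    # return address
                m.group(3),             # symbol name
                int(m.group(4), 16),    # offset into the symbol
            ))
    return frames


def first_non_panic_frame(frames):
    """Return the first frame outside PANIC_MACHINERY, or None."""
    for frame in frames:
        if frame[2] not in PANIC_MACHINERY:
            return frame
    return None


if __name__ == "__main__":
    sample = (
        "#0 0xffffffff8098e390 at kdb_backtrace+0x60\n"
        "#1 0xffffffff80951066 at vpanic+0x126\n"
        "#2 0xffffffff80950f33 at panic+0x43\n"
        "#3 0xffffffff80bb7a60 at ufs_lookup_ino+0xea0\n"
    )
    frame = first_non_panic_frame(parse_backtrace(sample))
    print(f"{frame[2]}+{frame[3]:#x}")  # prints "ufs_lookup_ino+0xea0"
```

Run against the first trace, the same heuristic would stop at
handleevents, i.e. in the timer path entered from the ata interrupt
thread; for the second one it stops in the UFS directory lookup.
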
Mirek