System panics/freezes randomly

adgenet · May 31, 2016

So I've been running into an incredibly annoying problem with my new-to-me setup.
Please take a quick look at my thread here on the nas4free forums for the entire timeline: System panics randomly and reboots - Forums

Every day (or every two to three days), my nas4free system will panic (or freeze outright).
Now, initially, I had my two PERC H310s passed through with proxmox as the host. Thinking this was the problem, I tried nas4free on bare metal, but still have the same problem.

Long story short, I've tried all of the following with no luck: swapping CPUs, swapping motherboards, swapping RAM (many different sticks), swapping PCI slots for the HBAs, swapping disk locations on the backplane, swapping power supplies, adding cooling to the IOH, adding cooling to the HBAs, using only one HBA, changing from a VM to a bare-metal install, and swapping to an entirely different CPU/mobo combo.

When running in the VM, I get a panic and some output:

Code:

May 21 18:23:34 nas syslogd: kernel boot file is /boot/kernel/kernel
May 21 18:23:34 nas kernel:
May 21 18:23:34 nas kernel:
May 21 18:23:34 nas kernel: Fatal trap 12: page fault while in kernel mode
May 21 18:23:34 nas kernel: cpuid = 1; apic id = 01
May 21 18:23:34 nas kernel: fault virtual address   = 0x10
May 21 18:23:34 nas kernel: fault code      = supervisor read data, page not present
May 21 18:23:34 nas kernel: instruction pointer   = 0x20:0xffffffff80a12e05
May 21 18:23:34 nas kernel: stack pointer           = 0x28:0xfffffe0237e76980
May 21 18:23:34 nas kernel: frame pointer           = 0x28:0xfffffe0237e769e0
May 21 18:23:34 nas kernel: code segment      = base 0x0, limit 0xfffff, type 0x1b
May 21 18:23:34 nas kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
May 21 18:23:34 nas kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
May 21 18:23:34 nas kernel: current process      = 2232 (transmission-daemon)
May 21 18:23:34 nas kernel: trap number      = 12
May 21 18:23:34 nas kernel: panic: page fault
May 21 18:23:34 nas kernel: cpuid = 1
May 21 18:23:34 nas kernel: KDB: stack backtrace:
May 21 18:23:34 nas kernel: #0 0xffffffff80a909d0 at kdb_backtrace+0x60
May 21 18:23:34 nas kernel: #1 0xffffffff80a531f6 at vpanic+0x126
May 21 18:23:34 nas kernel: #2 0xffffffff80a530c3 at panic+0x43
May 21 18:23:34 nas kernel: #3 0xffffffff80ed75fb at trap_fatal+0x36b
May 21 18:23:34 nas kernel: #4 0xffffffff80ed78fd at trap_pfault+0x2ed
May 21 18:23:34 nas kernel: #5 0xffffffff80ed6f7a at trap+0x47a
May 21 18:23:34 nas kernel: #6 0xffffffff80ebcfd2 at calltrap+0x8
May 21 18:23:34 nas kernel: #7 0xffffffff80a07869 at _fdrop+0x29
May 21 18:23:34 nas kernel: #8 0xffffffff80a0a10e at closef+0x21e
May 21 18:23:34 nas kernel: #9 0xffffffff80a07c18 at closefp+0x98
May 21 18:23:34 nas kernel: #10 0xffffffff80ed7fcf at amd64_syscall+0x40f
May 21 18:23:34 nas kernel: #11 0xffffffff80ebd2bb at Xfast_syscall+0xfb
May 21 18:23:34 nas kernel: Uptime: 18h54m55s
May 21 18:23:34 nas kernel: (da3:mps1:0:1:0): Synchronize cache failed
May 21 18:23:34 nas kernel: (da4:mps1:0:4:0): Synchronize cache failed
May 21 18:23:34 nas kernel: Automatic reboot in 15 seconds - press a key on the console to abort
May 21 18:23:34 nas kernel: Rebooting...
May 21 18:23:34 nas kernel: cpu_reset: Restarting BSP
May 21 18:23:34 nas kernel: cpu_reset_proxy: Stopped CPU 1s

When running on bare metal, the whole screen freezes, and no logs are produced.
Either way, the symptoms are the same: the system runs for hours to days without issue under a very light workload (smbd for files, AFP on another dataset for Time Machine, and transmission; 2 users total), then randomly freezes or panics, with a high CPU temperature (due to the freeze, I assume).

Somebody on the nas4free forum suggested using only one HBA, but that made no difference.

At this point, all I can think of is some sort of issue with the HBAs: either a problem specific to flashed H310s, or the firmware I'm running (P19; I also tried the latest P20 release 7).

If anybody has any suggestions at all, please let me know.
I'm losing sleep over this thing.

Terry Kennedy · May 31, 2016

adgenet said:
So I've been running into an incredibly annoying problem with my new-to-me setup.
May 21 18:23:34 nas kernel: KDB: stack backtrace:
May 21 18:23:34 nas kernel: #0 0xffffffff80a909d0 at kdb_backtrace+0x60
May 21 18:23:34 nas kernel: #1 0xffffffff80a531f6 at vpanic+0x126
May 21 18:23:34 nas kernel: #2 0xffffffff80a530c3 at panic+0x43
May 21 18:23:34 nas kernel: #3 0xffffffff80ed75fb at trap_fatal+0x36b
May 21 18:23:34 nas kernel: #4 0xffffffff80ed78fd at trap_pfault+0x2ed
May 21 18:23:34 nas kernel: #5 0xffffffff80ed6f7a at trap+0x47a
May 21 18:23:34 nas kernel: #6 0xffffffff80ebcfd2 at calltrap+0x8
May 21 18:23:34 nas kernel: #7 0xffffffff80a07869 at _fdrop+0x29
May 21 18:23:34 nas kernel: #8 0xffffffff80a0a10e at closef+0x21e
May 21 18:23:34 nas kernel: #9 0xffffffff80a07c18 at closefp+0x98
May 21 18:23:34 nas kernel: #10 0xffffffff80ed7fcf at amd64_syscall+0x40f
May 21 18:23:34 nas kernel: #11 0xffffffff80ebd2bb at Xfast_syscall+0xfb

If anybody has any suggestions at all, please let me know.
I'm losing sleep over this thing.

If the backtrace is the same (the symbolic parts - foo+0xbar) on each crash, it is highly unlikely that this is a hardware problem. In particular, your panic is coming when a file is being closed (rather than in an I/O routine) so it is unlikely to be your disk controller / drives.
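To make the "same symbolic parts" check concrete, here is a minimal Python sketch that pulls the function+offset part out of each backtrace frame and compares two panic logs. The helper names (symbolic_frames, same_backtrace) and the second crash's timestamp are made up for illustration; the frames themselves are copied from the log above.

```python
import re

# Matches one KDB backtrace frame and captures its symbolic part,
# e.g. "#7 0xffffffff80a07869 at _fdrop+0x29" -> "_fdrop+0x29".
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-fA-F]+\s+at\s+(\S+\+0x[0-9a-fA-F]+)")


def symbolic_frames(log_text):
    """Return the 'function+offset' part of every backtrace frame, in order."""
    return FRAME_RE.findall(log_text)


def same_backtrace(log_a, log_b):
    """True if both logs contain the same non-empty sequence of symbolic frames."""
    frames_a = symbolic_frames(log_a)
    return bool(frames_a) and frames_a == symbolic_frames(log_b)


# Three frames taken from the panic log posted above.
crash_1 = """\
May 21 18:23:34 nas kernel: #7 0xffffffff80a07869 at _fdrop+0x29
May 21 18:23:34 nas kernel: #8 0xffffffff80a0a10e at closef+0x21e
May 21 18:23:34 nas kernel: #9 0xffffffff80a07c18 at closefp+0x98
"""

# HYPOTHETICAL second crash: invented timestamp, same frames.
crash_2 = crash_1.replace("May 21 18:23:34", "May 23 02:10:11")

print(symbolic_frames(crash_1))
print(same_backtrace(crash_1, crash_2))
```

Timestamps are ignored on purpose, since only the sequence of symbolic frames needs to match from one crash to the next.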

It looks like _fdrop() is being called with an invalid struct file *fp pointer (source code reference). You're probably going to need a kernel with full debugging info available (to get source code in the tracebacks and examine variables) and somebody with the time to slog through it. Unless there is a "smoking gun" in that traceback, it is going to be time-consuming to find. Some bugs are an obvious "we forgot to take a lock out here" type thing. This one is almost certainly a bad parameter being passed into good code.
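As a rough illustration of why a bad pointer fits this log: a fault address as small as 0x10 is what you would typically see when code reads a structure member at a small offset through a null pointer. The sketch below uses Python's ctypes with a made-up three-pointer structure. DemoFile and its field names are hypothetical and are not the real FreeBSD struct file layout; the only point is that base address plus member offset gives the faulting address.

```python
import ctypes


class DemoFile(ctypes.Structure):
    # HYPOTHETICAL layout for illustration only; this is NOT the real
    # FreeBSD struct file, just three pointer-sized members in a row.
    _fields_ = [
        ("first", ctypes.c_void_p),
        ("second", ctypes.c_void_p),
        ("third", ctypes.c_void_p),
    ]


def fault_address(base, field):
    """Address that would be touched when reading `field` through pointer `base`."""
    return base + getattr(DemoFile, field).offset


# Reading the third pointer-sized member through a null (0) pointer.
# With 8-byte pointers this lands at offset 16, i.e. 0x10.
print(hex(fault_address(0, "third")))
```

Working out which real member sits at the faulting offset needs the actual kernel sources and a debug kernel, as described above.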

Feel free to quote me in the NAS4Free thread if it will help you attract a developer.

adgenet · Jun 1, 2016

Thanks so much for the reply.
I will post your comments on the NAS4Free forum.

The panic backtrace was always the same when it occurred running as a VM - bare metal just freezes so I can't seem to get any useful info, although I would guess it's probably the same since it fails in pretty much the same way.
It's encouraging to hear that it isn't a hardware problem in that case.

Is it likely that this is a bug with the drivers for the LSI SAS2008 chipset or a more deeply rooted problem with freebsd?
With the SAS2008 chipset as popular as it is, I'd expect to see more people complaining about this if the driver was the cause...

Is there any merit to trying with my drives hooked up to the motherboard ICH10 SATA ports?

Terry Kennedy · Jun 1, 2016

adgenet said:
Is it likely that this is a bug with the drivers for the LSI SAS2008 chipset or a more deeply rooted problem with freebsd?

It never got to the driver level since it panics in a worker routine for close(). And since it is consistent, it isn't a memory corruption issue as you'd have panics in several places, not always in _fdrop().

Is there any merit to trying with my drives hooked up to the motherboard ICH10 SATA ports?

If it works, it is just due to a quirk of timing and it may pop up again in the future.

If you can't get a NAS4Free developer to look at it, you might try emailing freebsd-stable@freebsd.org. If you go that route, I'd suggest:

Providing the actual FreeBSD kernel version string, not the NAS4Free version
Saying you've already asked for help elsewhere
Including part of my diagnosis of the problem (where I tracked it down to a bad *fp argument to _fdrop() and suggested further debugging steps)
