TrueNAS VM bad page state errors


Railgun

This is a copy-paste from elsewhere, so ignore any timing-related inconsistencies.

I've been having general performance issues as of late that I'd not really bothered to investigate until now.

I'm running TrueNAS 22.02.04 as a VM on an ESXi 7 host. 16 CPUs (EPYC Rome) and 128GB ECC RAM, of which I've locked 118GB to ARC. Disks are 7 vdevs of 6TB WD Reds in RAIDZ1 plus a spare, on a Broadcom HBA passed through to the VM. 10Gb between the host and client (Win11). This particular dataset has a 1M record size (all media).
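For anyone wanting to sanity-check how much RAM is left outside ARC with a setup like this, here is a minimal sketch. It assumes a Linux-based TrueNAS where the ARC cap is exposed at /sys/module/zfs/parameters/zfs_arc_max (in bytes, 0 meaning "use the default") and total memory is read from /proc/meminfo. The function names are just illustrative.

```python
def mem_total_bytes(meminfo_text):
    """Return MemTotal from /proc/meminfo contents, converted from kB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def arc_headroom_gib(meminfo_text, arc_max_text):
    """GiB of RAM left outside the ARC cap, or None if no explicit cap is set."""
    arc_max = int(arc_max_text.strip())
    if arc_max == 0:  # 0 means the module default is in effect
        return None
    return (mem_total_bytes(meminfo_text) - arc_max) / 2**30


# Usage on the box itself (paths are assumptions, see above):
#   meminfo = open("/proc/meminfo").read()
#   arc_max = open("/sys/module/zfs/parameters/zfs_arc_max").read()
#   print(arc_headroom_gib(meminfo, arc_max))
```

With 128GB visible to the VM and 118GB locked to ARC, this would report roughly 10GB for everything else (kernel, smbd, and so on).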

My issue is on writes, which peak at maybe 300MB/s. While writing, I see the following:

2022 Oct 4 17:53:12 truenas BUG: Bad page state in process smbd pfn:35d65b
2022 Oct 4 17:54:23 truenas BUG: Bad page state in process ksoftirqd/2 pfn:9e6fb3
2022 Oct 4 17:54:38 truenas BUG: Bad page state in process swapper/2 pfn:16125a6
2022 Oct 4 17:54:40 truenas BUG: Bad page state in process spl_kmem_cache pfn:13bd9e3
2022 Oct 4 17:54:43 truenas BUG: Bad page state in process smbd pfn:bdaee6
2022 Oct 4 17:54:48 truenas BUG: Bad page state in process smbd pfn:19884ac
2022 Oct 4 17:54:48 truenas BUG: Bad page state in process smbd pfn:ebbdb2
2022 Oct 4 17:54:48 truenas BUG: Bad page state in process smbd pfn:17b048
2022 Oct 4 17:54:52 truenas BUG: Bad page state in process smbd pfn:155ca7e
2022 Oct 4 17:54:56 truenas BUG: Bad page state in process smbd pfn:15d5ca4
2022 Oct 4 17:55:00 truenas BUG: Bad page state in process smbd pfn:19337fc
2022 Oct 4 17:55:26 truenas BUG: Bad page state in process smbd pfn:210117
2022 Oct 4 17:55:27 truenas BUG: Bad page state in process smbd pfn:1a43198
2022 Oct 4 17:55:42 truenas BUG: Bad page state in process smbd pfn:cd0452
2022 Oct 4 17:55:42 truenas BUG: Bad page state in process swapper/1 pfn:170d917
2022 Oct 4 17:55:54 truenas BUG: Bad page state in process smbd pfn:284eda
2022 Oct 4 17:56:10 truenas BUG: Bad page state in process smbd pfn:12c09c8
2022 Oct 4 17:56:14 truenas BUG: Bad page state in process smbd pfn:b34976
2022 Oct 4 17:56:15 truenas BUG: Bad page state in process smbd pfn:2a783a
2022 Oct 4 17:56:15 truenas BUG: Bad page state in process smbd pfn:cbb60a
2022 Oct 4 17:56:16 truenas BUG: Bad page state in process smbd pfn:1c6043f
2022 Oct 4 17:56:36 truenas BUG: Bad page state in process smbd pfn:a8b0ab
2022 Oct 4 17:56:38 truenas BUG: Bad page state in process smbd pfn:145275a
2022 Oct 4 17:56:46 truenas BUG: Bad page state in process smbd pfn:188ecb5
2022 Oct 4 17:56:46 truenas BUG: Bad page state in process smbd pfn:12a93b
2022 Oct 4 17:56:47 truenas BUG: Bad page state in process smbd pfn:38d01b
2022 Oct 4 18:01:35 truenas BUG: Bad page state in process smbd pfn:1dd2d75
2022 Oct 4 18:01:35 truenas BUG: Bad page state in process kworker/4:1 pfn:b8af25
2022 Oct 4 18:01:51 truenas BUG: Bad page state in process smbd pfn:1a9b20e
2022 Oct 4 18:01:51 truenas BUG: Bad page state in process kworker/11:1 pfn:5afb3f
2022 Oct 4 18:01:54 truenas BUG: Bad page state in process spl_kmem_cache pfn:5afb3f
2022 Oct 4 18:01:54 truenas BUG: Bad page state in process spl_kmem_cache pfn:b2f9f5
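To make a log like the one above easier to reason about, here is a rough sketch that tallies the "Bad page state" lines by process and flags any page frame number (pfn) that shows up more than once. The regex is written against the exact line format pasted here, and the function name is just illustrative.

```python
import re
from collections import Counter

# Matches e.g. "BUG: Bad page state in process smbd pfn:35d65b"
LINE_RE = re.compile(
    r"BUG: Bad page state in process (?P<proc>\S+)\s+pfn:(?P<pfn>[0-9a-fA-F]+)"
)


def tally_bad_pages(lines):
    """Count bad-page events per process and collect pfns seen more than once."""
    by_process = Counter()
    by_pfn = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip unrelated log lines
        by_process[m.group("proc")] += 1
        by_pfn[m.group("pfn").lower()] += 1
    repeated = {pfn: n for pfn, n in by_pfn.items() if n > 1}
    return by_process, repeated


# Usage: feed it the saved console/log output, one line per entry.
#   with open("badpage.log") as f:   # hypothetical file name
#       by_process, repeated = tally_bad_pages(f)
#   print(by_process.most_common())
#   print(repeated)
```

On the excerpt above, almost all hits are in smbd, and the only pfn reported twice is 5afb3f (once under kworker/11:1, once under spl_kmem_cache). The rest are all different addresses.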

I'd not actively been on the console during any writes before, so this is the first time I've seen it. I've also discovered that some kind of crash occurred: my share disappeared in the middle of a write and my session to the VM died, so something bounced. There's a file in /var/crash named kdump_lock, timestamped at the moment my session dropped, but it's empty (0 bytes).

"Page state" would tell me this is memory related. I've not run any tests yet to confirm, but something isn't happy in that respect. Reads (400-450MB/s, which is also less than I'd expect) don't produce any log messages.

Happy to provide anything else I can here if it's not obvious that it IS a memory issue.

I run an OPNsense firewall, my network controller and other things on the same host and haven't noticed any particular issues with those, again assuming this is memory.

One round of memtest found nothing. I'll run another when I can take the host down for an extended period.