LSI SAS2008, IR mode, RAID 1 with two S3700; poor performance for a while after reboot


logan893

I have two LSI SAS2008 9212-4i4e cards in my server, both with firmware 20.00.04.00. One is in IR mode and used for an ESXi datastore; the other is in IT mode for passthrough to a VM (FreeNAS).

The LSI card in IR mode has two Intel SSD DC S3700 400GB drives in a RAID 1 array.

I did a secure erase of the Intel drives on a separate PC, then constructed the array from MSM (MegaRAID Storage Manager) in Windows. When initialization was completed, I moved the LSI card and the SSDs to the server.

After the server booted into ESXi, the array showed up as Enabled/Optimal with no operation ongoing. Creating the VMFS partition took longer than expected, perhaps a minute or two. Looking at the latency of the storage array, it is abysmal: average read and write latency is in the range of 80-200 ms!

Code:
$ /opt/lsi/bin/sas2ircu 0 status
LSI Corporation SAS2 IR Configuration Utility.
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved.

Background command progress status for controller 0...
IR Volume 1
  Volume ID                               : 286
  Current operation                       : None
  Volume status                           : Enabled
  Volume state                            : Optimal
  Volume wwid                             : 084ba55b27b4b25b
  Physical disk I/Os                      : Not quiesced
SAS2IRCU: Command STATUS Completed Successfully.
SAS2IRCU: Utility Completed Successfully.
Using Microsoft's diskspd benchmarking tool, I often got no more than 300-800 IOPS at 4 kB random IO; roughly 75% of operations completed below 1-2 ms, while the remaining 25% took 180+ ms.
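For anyone who wants to reproduce the test, a diskspd run of this kind looks roughly like the sketch below (the target file path and size are placeholders): -b4K sets the block size, -r random IO, -w30 a 30% write mix, -o32 and -t4 the queue depth and thread count, -L captures latency statistics, and -Sh disables caching.
Code:
> diskspd.exe -b4K -d60 -r -w30 -o32 -t4 -L -Sh -c10G D:\iotest.dat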

This continued for several hours, after which I left the array alone with only a single Windows 10 VM for a day and a half. After letting it sit, the array had stabilized and I now had reasonable latency, with averages of 1-3 ms. Worst-case latency during 4 and 8 kB random read/write benchmarking with diskspd was 20 ms, as expected from S3700 drives.

Performance was great for several days, until I restarted the server yesterday. After a warm restart, the array is once again performing terribly, even though the status still shows Optimal. The performance is comparable to what I saw during background initialization.

Code:
$ /opt/lsi/bin/sas2ircu 0 display
LSI Corporation SAS2 IR Configuration Utility.
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved.

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS2008
  BIOS version                            : 7.39.00.00
  Firmware version                        : 20.00.04.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 255
  Concurrent commands supported           : 1720
  Slot                                    : 4
  Segment                                 : 0
  Bus                                     : 1
  Device                                  : 0
  Function                                : 0
  RAID Support                            : Yes
------------------------------------------------------------------------
IR Volume information
------------------------------------------------------------------------
IR volume 1
  Volume ID                               : 286
  Volume Name                             : Intel_RAID1
  Status of volume                        : Okay (OKY)
  Volume wwid                             : 084ba55b27b4b25b
  RAID level                              : RAID1
  Size (in MB)                            : 380516
  Physical hard disks                     :
  PHY[0] Enclosure#/Slot#                 : 1:4
  PHY[1] Enclosure#/Slot#                 : 1:5
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 4
  SAS Address                             : 4433221-1-0400-0000
  State                                   : Optimal (OPT)
  Size (in MB)/(in sectors)               : 381554/781422767
  Manufacturer                            : ATA
  Model Number                            : INTEL SSDSC1NA40
  Firmware Revision                       : 2270
  Serial No                               : BTTV3172006Y400BGN
  GUID                                    : 50015178f3610b3b
  Protocol                                : SATA
  Drive Type                              : SATA_SSD

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 5
  SAS Address                             : 4433221-1-0500-0000
  State                                   : Optimal (OPT)
  Size (in MB)/(in sectors)               : 381554/781422767
  Manufacturer                            : ATA
  Model Number                            : INTEL SSDSC1NA40
  Firmware Revision                       : 2270
  Serial No                               : BTTV31720202400BGN
  GUID                                    : 50015178f3611312
  Protocol                                : SATA
  Drive Type                              : SATA_SSD
------------------------------------------------------------------------
Enclosure information
------------------------------------------------------------------------
  Enclosure#                              : 1
  Logical ID                              : 500605b0:04640ee0
  Numslots                                : 8
  StartSlot                               : 0
------------------------------------------------------------------------
SAS2IRCU: Command DISPLAY Completed Successfully.
SAS2IRCU: Utility Completed Successfully.
Both Intel SSD DC S3700 drives have the latest firmware, 5DV12270.

The SAS-2 Integrated RAID Solution User Guide lists a number of automatic background tasks.
SAS-2 Integrated RAID Solution User Guide (975 KB)

2.4.5 Media Verification
The Integrated RAID firmware supports a background media verification feature that runs at regular intervals when the mirrored volume is in the Optimal state. If the verification command fails for any reason, the firmware reads the other disk’s data for this segment and writes it to the failing disk in an attempt to refresh the data. The firmware periodically writes the current media verification logical block address to nonvolatile memory so the media verification can continue from where it stopped prior to a power cycle.

2.4.10 Make Data Consistent
If it is enabled in the Integrated RAID firmware, the make data consistent (MDC) process starts automatically and runs in the background when you move a redundant volume from one LSI SAS-2 controller to another LSI SAS-2 controller. MDC compares the data on the primary and secondary disks. If MDC finds inconsistencies, it copies data from the primary disk to the secondary disk.
Perhaps I should not have, but after the most recent reboot and the return of the high latency, I waited a few minutes and then manually triggered a consistency check. It started, but it runs incredibly slowly: a mere 2% progress per hour, or just over 2 MB/s. Amazing! The array is almost unused by ESXi according to esxtop, with perhaps 5-10 operations per second on average (more like bursts of 20-100 every 3-5 seconds). Latency averages have increased 2-3x, to 200-800 ms.
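In case anyone wants to check this on their own setup, a consistency check can be started and then monitored from sas2ircu roughly as below. The volume ID comes from the display output, and the CONSTCHK syntax is from my reading of the utility's help, so verify it with "sas2ircu help" first.
Code:
$ /opt/lsi/bin/sas2ircu 0 constchk 286
$ /opt/lsi/bin/sas2ircu 0 status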

I've tried two different VMware drivers, and both seem to behave the same.
Code:
$ esxcli software vib update -v /vmfs/volumes/vmstore0_ssd0/scsi-mpt2sas-20.00.01.00-1OEM.550.0.0.1331820.x86_64.vib
Installation Result
  Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
   Reboot Required: true
   VIBs Installed: Avago_bootbank_scsi-mpt2sas_20.00.01.00-1OEM.550.0.0.1331820
   VIBs Removed: Avago_bootbank_scsi-mpt2sas_20.00.00.00.1vmw-1OEM.550.0.0.1331820
   VIBs Skipped:
When the array is working fine, I can reach close to 100k IOPS for 100% 4kB read, and 50k IOPS for 100% 4kB write.
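One way to see whether the latency is coming from the device itself or from further up the ESXi storage stack is esxtop's disk adapter view:
Code:
$ esxtop
# press 'd' for the disk adapter view
# DAVG/cmd = device latency, KAVG/cmd = kernel latency, GAVG/cmd = latency as seen by the guest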

Would these background tasks be triggered after a warm reboot without showing up as ongoing in the status reported by sas2ircu? For example, I don't believe I have ever seen background initialization reported in the status output after creating an array from sas2ircu.

Since I was not booting from them, I had not enabled the Option ROM for the LSI cards in the BIOS. Could this be a possible culprit?

Is there any way to configure this array for better performance, especially with SSDs?
I'm leaning towards the conclusion that the LSI card is only useful in IT mode, and that perhaps I should just run the drives without RAID mirroring, or go for an all-in-one setup with the FreeNAS VM supplying the storage via iSCSI.
 

logan893

Another idea... Could the invisible background activity, or whatever is degrading performance after reboot, be taxing the SAS2008 chip sufficiently to cause high heat and severe throttling?
Is there a way to verify the chip temperature without opening up the case?

Edit:
Nope, it does not appear to be temperature related. I opened the case; the heat sink on the LSI card isn't too hot to touch. Putting a fan next to it makes no difference to the speed of the consistency check, nor to the high latency.
 

logan893

vmhba0 is a Samsung 850 Pro
vmhba1 is the 2xIntel SSD DC S3700 in RAID1 via LSI SAS2008 9212-4i4e

 

logan893

The mptbios readme states:
These messages may appear during the boot process:

2. "Adapter configuration may have changed, reconfiguration is suggested!"
appears if none of the information in the NVRAM is valid.
I do see this message while booting. I read on an IBM site ( Warning message with multiple LSI controllers installed - IBM BladeCenter HS12 (Type 8028) ) that it is also shown when multiple controllers are present and the boot order has not been explicitly configured in the boot configuration utility.

An Oracle instruction ( LSI Firmware Upgrade Procedure ) states this error message can be safely ignored.

I'll update the boot order and see if the message goes away.
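For what it's worth, sas2ircu itself also appears to have BOOTIR/BOOTENCL commands for setting the preferred boot volume/device on the controller. I haven't double-checked the exact arguments, so treat the second line below as a guess and confirm it against the built-in help:
Code:
$ /opt/lsi/bin/sas2ircu help
$ /opt/lsi/bin/sas2ircu 0 bootir 286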
 

cptbjorn

Do you have the LSIprovider/SMI-S provider VIB installed? Installing it completely ruined latency on an ESXi 6 machine of mine that has an IR SSD RAID 1 datastore, so I uninstalled it. That was around 18 months ago though, so maybe they've fixed it since.
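If you're not sure whether it's there, listing the installed VIBs is a quick way to check (the exact VIB name may vary by package):
Code:
$ esxcli software vib list | grep -i lsi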
 

logan893

Yes, LSIprovider is installed, and it was seemingly working fine without latency issues before the reboot caused the latency to spike once more.

It also showed the same info as sas2ircu, i.e. optimal volume status and no issues with the drives. vmkernel.log did not show anything amiss either, only the occasional failed SCSI command (sense data reporting it as unsupported).

I rebooted, added the missing boot order information for the LSI card in IR mode, and also configured the card for passthrough. The LSI BIOS showed no issues with the volume. I slotted it into a Windows VM, and MegaRAID Storage Manager reported an unrecoverable medium error and started a rebuild. WTF?

It's halfway through the rebuild. Soon I'll know whether I need to trash and restore the whole array again...

Next on the list of things to try:
Remove LSIprovider (rough command sketch below)
Update firmware to 20.00.07.00
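For the removal, something along the lines of the sketch below should work. The VIB name "lsiprovider" is my guess; confirm it from the vib list output first, and a reboot is needed afterwards.
Code:
$ esxcli software vib list | grep -i provider
$ esxcli software vib remove -n lsiprovider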

All in all it doesn't feel very reliable.
 

cptbjorn

Yeah, I'm not a fan of running SSD RAID arrays on these cards as backing for VMware datastores, and I don't plan to do it again unless there's a firmware patch or something that gives a night-and-day improvement.

I'm not too happy with the performance even with LSIprovider removed, but it doesn't affect the workload much and the machine is colo'd far away, so I've decided not to touch it.
 

logan893

Preliminary verdict: great success!

Array remains intact after reboot.
Removed LSIprovider.
Latency and performance appear to be back to normal.

4 kB at QD32:
Reads: 95k IOPS (370 MB/s)
Writes: 51k IOPS (200 MB/s)

I'll do a shutdown and cold boot, to see if it remains stable.

Edit: Yep, it still works well, with low latency. Thanks for the LSIprovider information! I wouldn't have suspected this monitoring software at first, as I had it running fine for a while.
 

logan893

Now that I know what to look for, I've found more people with similar experiences.

Running a Dell PERC with high latency? Check the LSI SMI-S vib! - VirtualLifestyle.nl
Update on SMI-S provider causing latency issues - VirtualLifestyle.nl

The author of those posts writes:

I have asked LSI and some guys inside VMware if they have any more information on this, but it's hard to uncover any more information. LSI Support did get back to me, stating:

According to LSI Engineering department, this latency is caused by a bug in the hypervisor. The bug should be fixed in vSphere 5.1 Update 3 and 5.5 Update 2.

It seems this issue will be fixed in an upcoming release of vSphere, so I guess we need to use the work-around until then and hope the fix will actually make the 5.5 Update 2 release. I'm wondering if this issue is LSI-specific, or a bug more widely affecting other SMI-S providers, too.

Comments on his post mention running 5.5 U2 as well as 6.0 U1 and still having latency issues with the LSI SMI-S provider installed.

Also, my LSI board seem to be IBM branded according to MSM, if this makes any difference in this matter.