LGA3647 ESXi build to host my Oracle Apps/Databases

BennyT

Active Member
Dec 1, 2018
146
39
28
Successfully installed the four 4TB SATA SSDs (Crucial MX500) and the SAS9300-8i today. It was simple plug-and-play.

The X11DPI-NT motherboard BIOS has a setting for each PCIe slot: either "LEGACY" or "UEFI". I used the default of LEGACY for the slot I put the HBA into.

I'm unsure, but I think UEFI is only needed for a PCIe slot if I wanted to pass the HBA device through to a VM. I don't need passthrough, so I left it as LEGACY.

The added storage was much needed, and being SSDs, Oracle RMAN DB duplications across my VMs are so much faster now. I'd love to switch to all-flash storage eventually.
When the NVMe SSDs and PCIe x16 M.2 card get here, that'll be fun.
 

BennyT

Successfully installed and configured our PCIe M.2 card in slot #6, with CPU2 IOU1 set to x4x4x4x4 bifurcation. See my attached diagram.

I may experiment with moving it to slot #2 (IOU1 for CPU1), but since I've already allocated the two M.2 drives to datastores, I'm unsure whether moving the card to a different PCIe slot will affect that in ESXi.
2021-11-29_10-23-20.png
 

BennyT

Hello, I'm back, having found some weird problems in my ESXi system.


Sometimes I get good performance from the SATA SSDs in ESXi guest VMs:
2021-11-30_19-14-30.png


and sometimes it's not good. I wonder if writing high volumes (the 4GB write test in the above screenshot) caused it to bog down on subsequent tests:
2021-11-30_19-16-53.png

I'm not doing anything fancy: a single SATA SSD connected to the LSI 9300 (legacy PCIe in the BIOS), with no ESXi passthrough to the guest VMs.

The SSD makes up a single datastore by itself; no other storage devices are in that datastore.

The Linux VM BRTAD18 shown in the above screenshot has been allocated a single virtual disk:

/dev/sda (with three partitions sda1, sda2, sda3)

If I reboot the Linux VM and re-run my write speed tests, I get acceptable performance again:
2021-11-30_19-28-23.png

It's really weird. I first noticed this not while running speed tests but during an Oracle cloning process: I was seeing write speeds top out at 30-40 MBps in Linux iotop and wondered if that was normal, though I was expecting better. That's when I began running these speed tests, to see if I had an ESXi driver problem or something weird like that.
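For reference, the write tests I keep mentioning are just dd runs inside the guest; a minimal sketch (the file name and sizes here are mine, nothing special). conv=fsync forces a flush before dd reports its rate, so the page cache doesn't inflate the number:

```shell
# Write a ~65MB test file in 64k blocks; fsync so the reported MB/s
# reflects what actually reached the disk, then verify the file size.
dd if=/dev/zero of=ddtest.bin bs=64k count=1000 conv=fsync
stat -c %s ddtest.bin   # prints 65536000
rm -f ddtest.bin
```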


I've not built anything onto my NVMe SSDs yet, but it would be a major bummer if I just spent money on new SSD storage and it tops out at 40 MBps most of the time.

Any advice or diagnosis ideas from the ESXi experts? Does ESXi 6.7 act unpredictably with SATA SSDs? That would be bad, and I can't believe it's normal. There must be a fix or an explanation.

Do I need to change something in the Supermicro BIOS, or maybe in the LSI card's BIOS?

Here is what my boot screen looks like when it sees the LSI card as I reboot the ESXi host server (not sure it shows anything helpful to you): 2021-11-24_12-56-44.png

Maybe this is where I need to consider learning how to pass the SSD devices through to a virtual NAS, instead of using direct-attached storage in ESXi? I don't really know what I'm talking about; I've never built a SAN or NAS.
 

BennyT

It really might help if you'd provide details on the SSD(s) being used;)
Crucial MX500 SATA 4TB. The product # is in that last LSI BIOS screenshot. The Crucial SSD firmware is M3CR044 (the latest version, I believe), shown as 044 in that screenshot.

I'm going to try some tests from the ESXi shell command line (via SSH into the host) and run a similar dd test to see if I get better results, trying to narrow down the location of the problem by testing outside a VM.

*Edit: I may need to examine more closely how I configured the storage controllers for the virtual disks when I created the VM (there are a few different VMware SCSI controller types to choose from).

Also, I had created this VM by cloning an earlier Linux VM. I will create an all-new VM to avoid the mistaken presumption that the VM I cloned was good. I will also set the disk to thick eager zeroed, in case the issue comes from provisioning space into the virtual disk.

It's about 5:20am now, and this is when I think best, as I'm drinking coffee, lol.
 

uldise

Active Member
Jul 2, 2020
165
51
28
What about TRIM on these SSDs? As far as I know (well, I haven't used ESXi for a while), ESXi won't support TRIM the way you see it in Linux. Or has this changed in the latest versions?
 

BennyT

What about TRIM on these SSDs? As far as I know (well, I haven't used ESXi for a while), ESXi won't support TRIM the way you see it in Linux. Or has this changed in the latest versions?
TRIM? I don't think its absence is causing my current problems, but it might be. I don't think I've accumulated enough garbage on the SSD to affect performance like this, and the SSD does perform well sometimes, above 300-400 MBps writes. Reads are even faster, above 500 MBps, and never a problem. It seems just the writes suffer sometimes, falling below 40 MBps and staying there until I reboot the VM, which leads me to believe it's a problem with how I configured the VM.

The Crucial SSD's firmware has automatic garbage collection, but only when the SSD is idle. What does the firmware consider "idle"? There is still activity happening in an OS even when no users are logged in. Crucial puts that garbage collection in their firmware for OSes that do not support TRIM. I don't know whether it is actually operating, though.

Inside a Linux VM, if I run "lsblk -D" (or "lsblk --discard") it shows DISC-GRAN and DISC-MAX values. If they come back as 0B for a device, the Linux VM doesn't think the device supports TRIM.
2021-12-01_10-00-58.png
So even though the storage device is marked as Flash inside ESXi, the guest OS doesn't recognize that the storage device supports TRIM.

But there is UNMAP (for SCSI devices) and TRIM (for SATA) enabled by the "Space Reclamation" feature in vCenter 6.7U1 and in vSAN 6.7U1.

From VMware:

VMware vSAN 6.7U1 introduces automated space reclamation support with TRIM and SCSI UNMAP support. SCSI UNMAP and the ATA TRIM command enable the guest OS or file system to notify the back-end storage that a block is no longer in use and may be reclaimed.

I can see Space Reclamation in vCenter under Datastore --> Configure --> General:
2021-12-01_10-10-38.png
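As I understand it, that setting handles reclamation automatically, but it can also be kicked off by hand from the ESXi host shell. A sketch (the datastore label here is a hypothetical one of mine):

```shell
# Run on the ESXi host, not inside a guest. Reclaims unused VMFS blocks
# on the named datastore; "ssd-datastore1" is a placeholder label.
esxcli storage vmfs unmap -l ssd-datastore1
```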

I'm still experimenting. My next step is to create a small Linux VM, try different virtual SCSI controllers, and also provision the disk as thick eager zeroed. I'm wondering if that was my problem: without eager zeroing, ESXi was probably zeroing the disk space on first write, just before allowing my writes to occur, which seems like it would be very bad for performance.
 

BennyT

I believe the slow SSD write speeds came from "thick lazy zeroed" (and also thin-provisioned) virtual disks. ESXi was trying to grow/expand the virtual disk on demand, which basically kills SSD write performance. And once that growing/expansion began, even after it completed, SSD writes would still not improve above 40 MBps until I rebooted the VM. For example, if I ran a write test on a 5 or 6GB file (64k block size, count=10k) on a lazy-zeroed virtual disk, it would cause ESXi to grow the virtual disk, and that destroyed performance. Afterwards, even small write tests (4k block size, count=1k) would still be under 50 MBps.

Provisioning a new Linux VM with "thick eager zeroed" disks fixed the problem. The disk doesn't have to grow/expand, and I was getting 450-500 MBps write speeds, no problem. I tested numerous times with varying block sizes and counts; even writing a 7+GB sample data file, I had no problems.
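For anyone following along, the same provisioning choice can also be made from the ESXi shell when creating a virtual disk by hand; a hedged sketch, with a hypothetical datastore path:

```shell
# Run on the ESXi host. Creates a 10GB eager-zeroed thick VMDK;
# -d zeroedthick would be lazy zeroed, -d thin would be thin provisioned.
vmkfstools -c 10g -d eagerzeroedthick /vmfs/volumes/ssd-datastore1/test/test.vmdk
```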

I'm so thankful!
2021-12-01_14-07-32.png



Next up is testing the NVMe drives. Those should really fly. I think I'll need to build those VMs with the VMware NVMe controller instead of the VMware Paravirtual SCSI controller.

Thank you
 

BennyT

Finally upgraded vSphere ESXi to 7.0u3c (build 19193900).

Upgrade overview:
- Upgraded vCenter Server Appliance from 7.0.2.00100 to the latest 7.0.3.0030 (VCSA 7.0u3c)
- Upgraded Veeam from 11.0.0.837 to 11.0.1.1261, which supports the latest vSphere 7.0u3
- Upgraded ESXi from 6.7u1ep06 to 7.0u3c...
The ESXi upgrade was harder than I thought it would be. I had to research and document my plan for recovery in case the upgrade failed; I'd never performed a recovery of a failed ESXi host before. I ended up retrying over and over, and each time I'd recover and restore the config back to ESXi v6.7u1, probably a million times, before finally succeeding with ESXi v7.0u3c.

I did this in my off hours over the course of about two weeks. Before each day was over I'd recover back to v6.7u1, just in case I couldn't make time to get back to it again for a while. I've written myself a nice book/doc with detailed notes and screenshots of the entire ordeal, including the failures, diagnosis, Supermicro support communication, and the final success of the upgrade. If anyone is interested, I can post that doc here later. I'll just summarize for now.

- First attempt: use the VCSA (vCenter Server Appliance) Lifecycle Manager (Update Manager in older versions). I found out I cannot use that method: remediation said I cannot use the VCSA to perform the ESXi upgrade when that VCSA is running as a VM on the ESXi host being upgraded.

- Second attempt: use the ESXi shell command line. That looked a lot better, but then...

After rebooting, ESXi v7.0u3c started up (as viewed from the IPMI HTML5 console) and showed the latest version on the ESXi yellow console. But then I noticed, on the back of the server, that the onboard Intel X722 10GBASE-T ports' LEDs were off and the links were down. I could not get to the ESXi GUI login page, vCenter, or any VMs.

Using IPMI, which lets me interact with the ESXi yellow console: the management network showed as down, and the two NICs showed as deactivated.

I shelled into the ESXi command line from the ESXi console.

Below are two screenshots: one of the old v6.7u1 with working adapters, the other showing the adapters after the v7.0u3c upgrade.
2022-03-25_19-47-37.png
2022-03-25_19-48-34.png
Checking the VMware compatibility guide, I thought firmware version v3.33 (the firmware, not the driver) might be too low.
2022-03-17_10-57-10.png
We can see in the above chart that the v7.0u3 VMware inbox driver is 1.11.1.31, which is also what shows as my i40en driver version after the upgrade. Notice the firmware version for that driver in the chart is N/A (not applicable). But I was worried that the reason for the link being down might be firmware 3.33 being too low, so I opened a ticket with Supermicro, and they sent me an Intel NVM firmware update for the X722 that would bring it to firmware v4.11.

The utility Supermicro sent is a small .zip file containing UEFI shell scripts. I was instructed to format a FAT32 USB thumb drive and copy the scripts to it (it doesn't have to be bootable), then reboot into the BIOS' UEFI shell and run the scripts per their instructions...
2022-03-24_10-52-34.png
But that didn't solve the problem.
2022-03-24_12-25-01.png
I then removed the i40en v1.11.1.31 driver and downgraded to driver version 1.10.9.0... then to 1.10.6... all the way down to 1.8.6. None of the drivers worked with firmware 4.11 (or the old 3.33 firmware); the link status remained down. In other words, the adapters didn't think they had cables connected. Really weird.
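For reference, the driver swaps were done from the ESXi shell with esxcli; roughly like this (the offline-bundle path is a placeholder, and each change needs a host reboot to take effect):

```shell
# Run on the ESXi host. Show the installed i40en driver VIB, remove it,
# then install a specific version from an offline bundle; reboot after.
esxcli software vib list | grep i40en
esxcli software vib remove -n i40en
esxcli software vib install -d /vmfs/volumes/datastore1/i40en-offline-bundle.zip
```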

Then, about ready to recover back to v6.7 again, I figured: why don't I try a fresh install instead of an upgrade? Perhaps it's not a driver or firmware issue. Perhaps it's something else in my v6.7 configuration that the v7 vmnic0 and vmnic1 physical adapter settings didn't like.

Instead of doing a fresh install, the fastest way to get back to vanilla ESXi default settings and configuration is to use the yellow ESXi console via IPMI HTML5:
2022-03-25_20-30-17.png
That doesn't mess up the VMs or datastores or anything; it just clears out customized configuration, such as a screwed-up network.

That fixed the link status!!! From there I decided to restore back to ESXi v6.7u1ep06 and take screen captures of ALL the ESXi GUI screens, especially the vNetworks, vSwitches, and the physical adapter configs for vmnic0 and vmnic1 (the two onboard X722 10GBASE-T ethernet ports).

Then I did a fresh install again. The network adapters stayed up. I began to rebuild my ESXi configuration from the screenshots, and this is what I found.

NETWORK DIAGNOSIS AND SOLUTION:

In the old ESXi v6.7 configuration, there were two link speeds I could choose from for the 10GBASE-T adapters: 1000Mbps and 10000Mbps.
2022-03-25_19-47-37.png


I had set the vmnic0 and vmnic1 link speed to 10000Mbps (10Gbps). But I only have a 1000Mbps (1Gbps) physical switch. v6.7 was okay with that selection and automatically auto-negotiated down to 1000Mbps, even though I had selected a 10000Mbps link speed for those ports.

*The reason I had set it to 10000Mbps in v6.7 was for when I eventually get a switch that can handle 10Gbps. Note: v6.7 did not have a specific "Auto-negotiate" option.

2022-03-25_20-56-20.png
In ESXi v7 there is a specific Auto-negotiate option for link speed. In v6.7 there wasn't, and I had the speed set to 10000Mbps. If you upgrade from v6.7 to v7, that configuration is carried over: v7 sets the physical NIC to a link speed of 10000Mbps. The i40en driver will NOT auto-negotiate down to 1Gbps, because the configuration isn't explicitly set to Auto-negotiate; it is pinned at 10000Mbps. And since that speed cannot be achieved through a 1Gb hardware switch, the driver sets the link status to DOWN.
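The same setting can also be flipped from the ESXi shell, which is handy when the GUI is unreachable because the links are down; a sketch, assuming vmnic0 is the affected port:

```shell
# Run on the ESXi host. -a returns the NIC to auto-negotiation instead
# of a forced speed/duplex; then list the NICs to verify link and speed.
esxcli network nic set -n vmnic0 -a
esxcli network nic list
```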

Once that was figured out, we were at a fully working v7.0u3c. Honestly, I don't notice any difference between v6.7 and v7 yet. Was it worth it? :) Of course!
 


BennyT

*UPDATE: This was a NON-ISSUE. My stress testing of CPU all-core turbo used mprime, which uses AVX-512, and I didn't know at the time that the max turbo when using AVX-512 is lower (2.4GHz max for my CPU) than the usual 2.7GHz all-core turbo frequency.

*Keeping my original notes here, though, even though this was a non-issue.

In ESXi v6.7, our two 16-core CPUs (64 logical cores total) could easily turbo a Linux VM's cores to 2.7GHz each, and sometimes over 3GHz.

In ESXi v7 the CPU cores won't turbo past 2.1GHz, which is the base clock speed. We have plenty of idle cores in the host during those tests.

I'm seeing these CPU frequencies in the vCenter CPU MHz reports for this Linux VM as I run the mprime torture test (that's Prime95, but for Linux).

2022-04-08_20-37-14.png
The ESXi v7 power management policy is set to "Balanced", the same as it was in v6.7. I tried setting it to "High Performance" but it didn't make a difference in testing. Maybe I need to reboot the host for that change to take effect (I'll test that after a reboot tomorrow). ESXi power management is only relevant if the BIOS' Power Performance Tuning option is set to "OS Controls".

I had not changed BIOS settings since 2019.
2022-04-08_21-11-29.png

Advanced Power Management Configuration
- Power Technology = [Custom] -- other options are "Energy Efficient" and "Disable"
- Power Performance Tuning = [OS Controls EPB] -- the other option is "BIOS Controls EPB"
- Energy Performance BIAS Configuration = [Performance] -- other options are "Max Performance", "Balanced Performance", "Balanced Power", and "Power"
*Note: the Energy Performance BIAS (EPB) configuration setting is greyed out and cannot be adjusted when Power Performance Tuning is set to "OS Controls EPB". I believe that is because the EPB config is irrelevant if the OS controls the CPU cores' frequencies.

This weekend I'm going to experiment in the BIOS and change the above settings. I did this BIOS experiment in 2019 and thought I had it figured out, but that was with ESXi v6.7.

I'm going to first try:
- Power Technology = [Custom] *same as it was before
- Power Performance Tuning = [BIOS Controls EPB] -- *instead of "OS Controls EPB"
- Energy Performance BIAS Configuration (EPB) = [Max Performance] -- *this setting should allow adjusting now that we changed to "BIOS Controls"

Then I'll reboot and test the Linux VM using mprime, logging the findings to see if core frequencies are higher.

Then I'll return to the BIOS and change:
- Power Technology = [Disable] *instead of "Custom"
- Power Performance Tuning = [BIOS Controls EPB] *this will probably be greyed out and not adjustable because of Disabled Power Tech.
- Energy Performance BIAS Configuration (EPB) = [Max Performance] -- *I suspect this setting will also be greyed out

I may also experiment with adjusting BIOS C- and P-states, though I don't want to go crazy with BIOS changes. I may also consider a BIOS upgrade, but that is a last resort. My server's performance isn't suffering at all in real-world conditions for what I use it for, but I would like to see those higher core frequencies in testing.

I'll take screenshots of my test results and BIOS settings this weekend because a picture is worth a thousand words.
 

BennyT

My problem was that I didn't understand AVX-512; mprime uses it in its stress tests.

The Xeon Gold 6130 base clock is 2.1GHz. Max turbo on all cores is 2.7GHz. Max turbo with AVX-512 is 2.4GHz. Max turbo on a single core is 3.7GHz.

Using esxtop from the ESXi command line, per VMware KB 80610, to test the max turbo of a single core:

cat /dev/urandom | md5sum

esxtop showed a couple of the hyperthreads reaching 160+%, and one thread at 175.2%.

Multiply the nominal clock speed of the Xeon 6130 by the esxtop-reported performance percentage: 2.1GHz * 1.752 = 3.6792GHz.

That is the correct max single-core turbo frequency for the 6130: 3.7GHz.
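That conversion is just the nominal clock times the esxtop utilization percentage; as a one-liner with the numbers from my run:

```shell
# Effective core frequency = nominal clock (GHz) * (esxtop % / 100).
awk 'BEGIN { printf "%.4f GHz\n", 2.1 * 1.752 }'   # prints 3.6792 GHz
```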


I started up a single VM with 8 threads and ran mprime in that Linux VM while also watching the CPU thread percentages in a separate ESXi SSH window.
Most of the esxtop threads were around 47% (0.98GHz). The higher ones, which I presumed to be those assigned to the VM, showed 108%-110% (2.26GHz-2.31GHz), with one CPU thread consistently hovering around 143-146% (3.06GHz). I was expecting those to be more around 2.7GHz (all-core turbo), but I THINK mprime tests the CPUs with AVX-512, and that is why those allocated vCPUs didn't go above 2.4GHz.
1649532226084.png 2022-04-09_13-05-31.png

I shut down that VM and brought up the vCenter VM. The vCenter appliance uses a LOT of CPU all on its own for about 10 minutes as it boots. It showed more of what I'd expect: between 100% and 168% on the utilized cores.

They were in the 130-150% range (2.73GHz, 3GHz, 3.2GHz). That's a good result and close to what I expected. Apparently the vCenter VM doesn't use the CPU's AVX-512 features, and that is why it could turbo higher. As the VM settled down after about 10 minutes, most of its allocated cores went down to under 1GHz, with one thread at 109% (2.28GHz).
2022-04-09_13-33-06.png

Starting up five VMs all at once, not using mprime to test, just letting them power up normally:

Excellent use of the CPU frequencies, and it required no changes to my BIOS or to the ESXi power management profile (still set to Balanced). Actually, this is better than I expected.
2022-04-09_14-38-51.png
 

Rand__

Well-Known Member
Mar 6, 2014
6,193
1,544
113
Glad you were able to sort it out (and could confirm that it's not ESXi 7-related :))
 

BennyT

Glad you were able to sort it out (and could confirm that it's not ESXi 7-related :))
Thank you. Yes, I'm glad too. I thought I had found a problem with how ESXi v7 requested C-states from the hardware, but it was fine; the mprime version I was using had added AVX-512 to its torture testing. I think the older version of Prime95 I tested with years ago didn't test using AVX-512.

For a while I was thinking about recovering back to v6.7, which isn't hard to do. I wouldn't actually have done that, though: I had already upgraded my guest VMs to machine compatibility (hardware version) 19, and the machine compatibility of guest VMs isn't reversible. I would've had to restore ALL of my guest VMs from backups taken when they were still at machine compatibility 14 or 15. That would've been a pain in the butt.
 

nabsltd

Active Member
Jan 26, 2022
158
88
28
machine compatibility of guest VMs isn't reversible. I would've had to restore ALL of my guest VMs from backups when they were still at machine compatibility 14 or 15. That would've been a pain in the butt.
Some important things to remember:
  1. VM hardware version is saved as part of the snapshot process (because the .VMX file is part of the snapshot), so snapshot your VMs before you upgrade the hardware and a rollback is easy.
  2. If you run into the situation where you have upgraded the VM hardware and don't have a snapshot or backup and need to revert, just take a screenshot of the config, remove the VM from inventory, go into the datastore and delete everything but the VMDK files, and then create a new VM with older hardware version but using the existing disks. Unless you have an OS that is really broken, the slight change to the hardware when you downgrade should have no effect. The advantage to this is that you don't have to spend time restoring disks that don't really need to be restored.
  3. Since the disks don't need to be restored, if you do have a backup and can just restore the .VMX file, it should be enough, assuming you haven't created snapshots of the VM in the meantime. If so, you should be able to consolidate and then restore just the .VMX file.
 

BennyT

Thanks @nabsltd, that's a good idea. I'll remember that solution if I ever need to go back to an older machine compatibility #.
 

BennyT

I'm researching a UPS for the rack. I'm leaning toward the Eaton 5PX Gen2 with 15-amp input/output receptacles and plug, specifically the 5PX1500RTG2.

I'm also considering purchasing it with the Eaton network card, "NETWORK-M2".

Does anyone know if the 5PX or the M2 card comes with an IPM license, or a free version of Eaton IPM? I need the capability to safely shut down my ESXi v7 host, vCenter, and its VMs once the UPS switches to battery.

Is the network card necessary to do all that VM shutdown stuff, or can I get by with just the RS-232 serial or USB connection that comes with the UPS?

Thanks!

*Update: I ordered the Eaton 5PX1500RTNG2 (the N in the product name means it comes with the Network-M2 card). I decided to finally spend the money on this after a storm took out our modem and cable box; a year or so ago, a sudden power outage also corrupted the vCenter DB. I should've done this a long time ago, but it's nice to finally do it.

I believe Eaton has two branches of their IPM (Intelligent Power Manager) software:
- IPM versions 1.7.xx are free and still receive updates (the latest release notes for IPM v1.7 are from Dec 2021; EOL for v1 is scheduled for Dec 31, 2023)
- IPM versions 2.x are not free

Here is the Eaton link describing IPM v1 and v2. https://www.eaton.com/us/en-us/cata...r-manager-frequently-asked-questions-faq.html

I'm going to try the free version, IPM v1, first. Hopefully it works okay with ESXi v7 and vCenter 7.

I was able to find the link to v1.7 thanks to another user on this forum having posted it.


After the UPS gets here next week I'll test it and see how it goes.
 