MELTDOWN


dragonme

Active Member
Apr 12, 2016
282
25
28
ok people... it's about to get real.. real technical.. in a way that I am not professionally prepared for, but perhaps it will spark some enlightened debate and possibly flush out some best practices

without posting tons of performance graphs and esxtop results.. let's just say that I have seen shifts in ESXi / napp-it / ZFS workload performance since updating ESXi on this older S5520HC / L5640 based server from ESXi 6.0 to 6.0U3.

assumptions..

1) that the ESXi base code has evolved to software-patch Spectre / Meltdown
2) that not having updated my motherboard BIOS/microcode since this all came out is not helping (or maybe it is)


here is an excerpt from an article I found as I started researching WHY my box of late has been showing CPU and, moreover, interrupt numbers that are literally off the chart compared to historical numbers for the same server a year or more ago. It's not uncommon, and actually routine now, to see the napp-it VM not just spike but hold over 30,000 interrupts per second while doing 1GbE wired file transfers via SMB into napp-it... you heard that right.. 30k. Additionally, napp-it has 4 vCPUs on a box that only subscribes 16 vCPUs out of the 24 available, yet during these transfers the napp-it VM sits at a solid 50% CPU use, i.e. it is monopolizing 1 core at 100%, both threads. Pool wait and busy numbers are way below 50%, so it's not disk / pool related...
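As a sanity check on whether 30k interrupts/s is even plausible for a single 1GbE SMB stream, here is a rough back-of-envelope calculation. The payload rate, frame size, and coalescing ratios below are assumptions for illustration, not measurements from this box:

```python
# Rough back-of-envelope numbers for a 1GbE SMB transfer into the napp-it VM.
# All inputs are assumptions for illustration, not measurements.

goodput_bytes_s = 112_000_000     # ~112 MB/s of payload near 1GbE wire speed (assumed)
mtu_payload = 1460                # TCP payload per 1500-byte MTU frame (no jumbo frames)

packets_per_s = goodput_bytes_s / mtu_payload
print(f"~{packets_per_s:,.0f} packets/s at full 1GbE rate")   # roughly 77k pps

# With no interrupt coalescing, every packet could raise an interrupt in the guest.
# Even with modest coalescing (2-3 packets per interrupt, an assumption), a virtual
# NIC can still post tens of thousands of interrupts per second:
for pkts_per_intr in (1, 2, 3):
    print(f"{pkts_per_intr} pkt/interrupt -> ~{packets_per_s / pkts_per_intr:,.0f} interrupts/s")

# 4 vCPUs at a reported 50% total VM utilisation is the equivalent of 2 vCPUs
# fully busy, i.e. one physical core (both hyperthreads) saturated on this box.
vcpus, vm_cpu_pct = 4, 50
print(f"busy vCPU equivalent: {vcpus * vm_cpu_pct / 100:.1f}")
```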

I believe the issue is that the patches for these 2 BUGS fundamentally re-mapped how system calls, context switching, interrupt handling, and memory page swapping between user space and kernel space are dealt with.. changed to mitigate the exposure to these security issues..

what is unclear to me.. at a technical level.. is whether ESXi 6.0U3 running the Spectre patches on a board/CPU whose microcode has not been updated via BIOS is making it worse.. or whether, if the microcode were also updated, performance would take a further hit.

here is the excerpt - also in the article, a user operating a ZFS - lz4 - database environment saw performance hits between 20 and 40%


Remediating Meltdown – which is present in modern Intel processors – involves enforcing complete separation between user processes' virtual memory spaces and the kernel's virtual memory areas. Rather than map the kernel into the top portion of every process's virtual memory space where it remains invisible unless required to handle an interrupt or system call, the kernel is moved to a separate virtual address space and context. This fix prevents malware from exploiting the Meltdown CPU bug to read kernel memory from user mode, and is referred to as Kernel Page Table Isolation.

Switching back and forth between these contexts – from the user process context to the kernel context and back to the user process – involves reloading page tables, one set describing the user process and another describing the kernel. These tables map the process or kernel's virtual memory to physical blocks of RAM or swap space.

These context switches from user process to kernel and back not only take time, they also flush any cached virtual-to-physical memory translations, all in all causing a performance hit, particularly on workloads that involve a lot of IO or system calls. But with PCID, there's no need to flush the entire translation lookaside buffer (TLB) cache on every context switch, as selected TLB entries can be retained in the processor.
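A crude way to see why syscall- and interrupt-heavy workloads are hit hardest: model each user/kernel transition as paying a fixed KPTI cost (a page-table swap, plus a TLB refill when PCID is unavailable). The cycle counts below are placeholder assumptions purely to illustrate the shape of the math, not measured figures:

```python
# Toy model of KPTI overhead per user<->kernel transition.
# Cycle costs are illustrative assumptions, not benchmarks.

cpu_hz = 2_260_000_000            # L5640 base clock, ~2.26 GHz

kpti_swap_cycles = 300            # assumed cost of swapping page tables per transition
tlb_refill_cycles = 1200          # assumed extra cost of repopulating a flushed TLB
transitions_per_s = 60_000        # e.g. ~30k interrupts/s, each entering and leaving the kernel

def overhead_pct(per_transition_cycles: int) -> float:
    """Fraction of one core consumed just by the mitigation, as a percentage."""
    return 100 * per_transition_cycles * transitions_per_s / cpu_hz

# Without PCID the TLB translations are lost on every context switch; with PCID,
# tagged entries can survive, so only the page-table swap is paid.
print(f"no PCID  : ~{overhead_pct(kpti_swap_cycles + tlb_refill_cycles):.1f}% of a core")
print(f"with PCID: ~{overhead_pct(kpti_swap_cycles):.1f}% of a core")
```

The absolute numbers are made up; the point is that the overhead scales linearly with the transition rate, which is exactly what a storage VM doing heavy network and disk IO generates.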



OK ... hash it out!!
 

Evan

Well-Known Member
Jan 6, 2016
3,346
598
113
Very casual observations...

On older gear (notebooks running Windows), it was the BIOS update with the fixes that had a huge impact.

On the other hand, no appreciable impact on ESX servers (although E5s of all generations, so not that old) from either BIOS or OS updates; but there was no storage workload of significance, i.e. no storage services running on ESX.

Looks like you hit exactly the case with heavy IO, which is by far the most impacted; in the end, newer equipment will probably be the only answer to get that performance back.
 

dragonme

Active Member
Apr 12, 2016
282
25
28
well, given that a majority of napp-it users are running it as virtualized storage in ESXi, I think it warrants a technical look by folks far more versed than I am.

after poking around a lot more today trying out different settings, it's clear that latency in the napp-it VM, specifically around hardware interrupts and context switching, has a HUGE role in throughput ..

setting the VM to High latency sensitivity cut interrupts and context switching almost in half. Interestingly, the host overhead for processing the NIC and other hardware didn't appear to change materially, but the other ESXi schedulers didn't have to get involved as much, and of course it cut latency for that hardware in half.
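For reference, the latency sensitivity setting can also be flipped programmatically. This is a minimal sketch with pyVmomi, assuming a lab host and a VM named "napp-it" (host name, credentials and VM name are all placeholders); note that the "high" setting also expects a full memory reservation on the VM:

```python
# Minimal pyVmomi sketch: set a VM's Latency Sensitivity to "high".
# Host, credentials, and the VM name "napp-it" are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()            # lab host with a self-signed cert
si = SmartConnect(host="esxi.lab.local", user="root", pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

# Walk the inventory for the VM by name.
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "napp-it")

# Reconfigure only the latency sensitivity; "high" also wants 100% memory reservation.
spec = vim.vm.ConfigSpec()
spec.latencySensitivity = vim.LatencySensitivity(level=vim.LatencySensitivity.SensitivityLevel.high)
task = vm.ReconfigVM_Task(spec)
# In real code, wait for the task to complete before disconnecting.
Disconnect(si)
```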

noticeable throughput improvement: 1GbE wired into and out of the box is pretty much at wire limits now.. even from a Mac.

one thing of note: going from e1000 to vmxnet3, once latency sensitivity was set to High, made little to no difference.. actually on throughput the e1000 was better, and total CPU usage was about the same for the VM including host overhead. One thing I did notice is that the recommended perf tuning for vmxnet3 is to turn off TSO.. while that might work well for internal VM-to-VM traffic, it didn't seem to be a good fit for throughput into the physical NIC, as you have to pay for it somewhere.. I even noticed CPU usage increase on the Mac. So I think TSO/LRO etc. are settings that need to be right, and I am not technically proficient enough to determine the optimal mixture, but it would appear that if these settings are changed in one place they likely need to be changed in others.. not unlike MTU settings..

more to follow, I hope
 

dragonme

Active Member
Apr 12, 2016
282
25
28
more testing today

with a couple of VMs running and napp-it managing 2 pools.. one that the VMs are running on and a data pool used for SMB media files and VM security DVR footage being recorded.. i.e. some contention

I have 3 NICs active on the napp-it VM for testing: the original e1000 and two vmxnet3s.

the e1000 is on the physical / management network
vmxnet3s0 is on the internal-to-ESXi virtual 10GbE network .. the vswitch is set to MTU 9000 and the napp-it tuning turns off LSO and sets 9000 in the guest

vmxnet3s1 is another address on the physical / management network


testing during contention showed the e1000 and vmxnet3 virtually the same.. about 80 MB/s uploading into napp-it and 120 MB/s or so downloading from napp-it to the OS X desktop over wired GbE

host cpu use was about the same ..

a slight edge to the vmxnet3s but statistically about the same...

I decided to turn offloading back on for vmxnet3s1, which is obviously set for 1500, not 9000..

offloading seemed to help marginally.

I still think that 2 things are in play here:

the later versions of ESXi have Meltdown and Spectre mitigation code that requires far more context switching, and hardware interrupts are way up.. even sitting mostly idle... and the separation of memory spaces and other 'fixes' really hurts storage VMs, where there is a lot of hardware, storage, and network involvement that forces constant crossings between user space and kernel space...

I give napp-it 4 vCPUs, as it uses 80% of those compute resources during send/recv between the data pool and a backup pool, which runs at well over 350 MB/s, all intra-box

during 100 MB/s file uploads and downloads between ESXi/napp-it and physical machines .. CPU usage hits 60+% .. and most of it is networking related..

VM wait and ready numbers in esxtop are good .. below 1% .. but it's a lot higher than in similar testing done more than a year ago..

I think that TSO / LSO / LRO has improved.. and while internal virtual networking might work better with it off and jumbo frames on.. I think that anything that reduces packet rate on the physical adapters, and hence reduces context switches/interrupts between napp-it and the hardware, gives a marked performance improvement. Consideration should be given to keeping these offloads on, so that the work is offloaded to the NIC and larger blocks of data can be sent. With proper TSO, segments larger than the MTU can be handed down, and the physical NIC / host compute is used rather than VM user-space compute
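To put rough numbers on the "reduce the packet rate" argument: with TSO the guest stack can hand the vNIC/physical NIC segments far larger than the wire MTU and let the NIC do the slicing, so the per-chunk work the VM pays for drops sharply. The chunk sizes below are assumptions for illustration:

```python
# How many chunks the guest-side stack has to process for a 1 GB transfer,
# for a few assumed per-chunk sizes. Illustrative only.

transfer_bytes = 1_000_000_000

scenarios = {
    "MTU 1500, no TSO (per-packet work in the VM)": 1460,
    "MTU 9000 jumbo, no TSO": 8960,
    "TSO on, ~64 KB segments handed to the NIC": 64 * 1024,
}

for label, chunk_bytes in scenarios.items():
    chunks = transfer_bytes / chunk_bytes
    print(f"{label:46s} -> ~{chunks:,.0f} chunks")

# The wire still carries MTU-sized frames in every case; the difference is where
# the segmentation work happens (guest CPU vs NIC / host offload), which is why
# TSO-on at MTU 1500 can end up beating TSO-off at MTU 9000.
```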

studies at Solaris supposedly show that TSO on with 1500 MTU performs better than TSO off with 9000 jumbo frames... the article I read didn't specify whether that was virtual or over wired 10/40GbE.
 
Reactions: liv3010m