Mellanox MHGH28/29 and ESX

Chuckleb

Moderator
Mar 5, 2013
1,017
330
83
Minnesota
I had a fun night working with RimBlock on sorting out this odd issue with the MHGH28/29 series of cards and ESXi. I don't see a thread on here for it, but we were both having issues with the card not working properly. Specifically, it would report mkey errors, time out, and not link up in ESXi 5.1. This was tested with the latest firmware, different cards, different cables, and different machines, to no avail. The same setup with CentOS 6.4 had no issues and paired up at full link speed, but if you rebuilt the machine as ESX, it went back to not working.

The solution to this problem is to flash the cards back to firmware 2.7.000; they will then link up immediately. The card is still supported by the latest Mellanox driver (even though the driver says 2.9 or higher).
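For anyone who needs it, the downgrade with mstflint looks roughly like this. This is a sketch only: the PCI address and image filename are examples, and you need the correct 2.7.000 image for your exact board from Mellanox.

```shell
lspci | grep -i mellanox                    # find the card's PCI address
mstflint -d 04:00.0 query                   # check the current firmware level
mstflint -d 04:00.0 -i fw-2_7_000.bin burn  # burn the 2.7.000 image (example filename)
# reboot (or reload the driver) so the new firmware takes effect
```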

RimBlock is updating this all over (man he is prolific!) but I didn't see it here so dibs! ;) **sigh** 4 hours...
 

analog_

New Member
Apr 28, 2013
9
0
0
I'm experiencing exactly the same thing; 2.8.600 or 2.9, it does not matter. I'm in the process of contacting Mellanox. I don't expect much, but who knows.

Also, I can't get any performance out of this thing, ramdisk or SSD volume; it's stuck at 200MB/sec for some reason. I'm starting to think zfs-on-linux with Debian Wheezy wasn't such a good idea after all. The problem is that I don't have an IB switch and can't get opensm to compile on OI. Hardware-wise, think Intel i5s, E3s and 16 to 32GB RAM. What are you guys using as a fileserver OS, and do you have an IB switch with a subnet manager?

5 disk raidz
ramdisk
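For scale, 200MB/sec is only about a tenth of what a 4x DDR link can carry. A quick back-of-envelope, assuming the usual 8b/10b encoding on DDR InfiniBand:

```shell
# 4x DDR InfiniBand: 4 lanes x 5 Gbit/s signalling, 8b/10b encoding.
signal_gbps=$((4 * 5))               # 20 Gbit/s on the wire
data_gbps=$((signal_gbps * 8 / 10))  # 16 Gbit/s of actual data
data_mbps=$((data_gbps * 1000 / 8))  # theoretical ceiling in MB/s
echo "$data_mbps MB/s"               # prints "2000 MB/s"
```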
 

RimBlock

Member
Sep 18, 2011
782
6
18
Singapore
I had a fun night working with RimBlock on sorting out this odd issue with the MHGH28/29 series of cards and ESXi. I don't see a thread on here for it, but we were both having issues with the card not working properly. Specifically, it would report mkey errors, time out, and not link up in ESXi 5.1. This was tested with the latest firmware, different cards, different cables, and different machines, to no avail. The same setup with CentOS 6.4 had no issues and paired up at full link speed, but if you rebuilt the machine as ESX, it went back to not working.

The solution to this problem is to flash the cards back to firmware 2.7.000; they will then link up immediately. The card is still supported by the latest Mellanox driver (even though the driver says 2.9 or higher).

RimBlock is updating this all over (man he is prolific!) but I didn't see it here so dibs! ;) **sigh** 4 hours...
Haha,

I updated in my own home setup thread but didn't start a separate thread.

Thanks for starting one, Chuckleb; it saves people having to trawl through my thread to get the info :).

After putting up info in a few places about setting up a cheap Infiniband network and then finding an issue, I had to run around and add the caveat everywhere. I am just very glad you were able to find that an older firmware works.

Do we know what changes from 2.8 and 2.9 we lose by going back to 2.7? I have not had a chance to look yet.

Oh, and have you tried the 2.7 flash with VT-d (passthrough)? I could not get the VM to start with the Infiniband card passed through to it.

I'm experiencing exactly the same thing; 2.8.600 or 2.9, it does not matter. I'm in the process of contacting Mellanox. I don't expect much, but who knows.

Also, I can't get any performance out of this thing, ramdisk or SSD volume; it's stuck at 200MB/sec for some reason. I'm starting to think zfs-on-linux with Debian Wheezy wasn't such a good idea after all. The problem is that I don't have an IB switch and can't get opensm to compile on OI. Hardware-wise, think Intel i5s, E3s and 16 to 32GB RAM. What are you guys using as a fileserver OS, and do you have an IB switch with a subnet manager?
I am using a switch without a subnet manager, but have a Linux machine connected to the Infiniband network for the express purpose of subnet manager duties. It is a pain, and I may look at a low-cost entry-level build to take over these duties rather than using one of my C6100 nodes, as that seems a colossal waste of resources.

I am using Solaris 11.1 (so simple to set up Infiniband with SRP in COMSTAR) and am sharing to a fully patched ESXi 5.1.
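For anyone curious, the Solaris side really is only a handful of commands. A rough sketch (the pool/volume names are made up, and the LU GUID comes from the create-lu output, not from my actual setup):

```shell
pkg install group/feature/storage-server       # COMSTAR packages
svcadm enable -r stmf                          # STMF framework
svcadm enable -r ibsrp/target                  # SRP target service
zfs create -V 100G tank/vm-srp                 # backing zvol (example name)
stmfadm create-lu /dev/zvol/rdsk/tank/vm-srp   # register it as a LUN
stmfadm add-view 600144F0...                   # GUID printed by create-lu
```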

RB
 

Chuckleb

Moderator
Mar 5, 2013
1,017
330
83
Minnesota
I too am using a software subnet manager. I think the easiest one is a simple CentOS system and a "yum install opensm".
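That whole subnet-manager box boils down to something like this on CentOS 6 (sketch; package and service names as shipped in the distro repos):

```shell
yum install -y opensm     # subnet manager from the distro repos
service opensm start      # start it on the IB-connected node
chkconfig opensm on       # survive reboots
ibstat | grep State       # fabric ports should now show "Active"
```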

analog_: are you exporting this over NFS or as a raw volume? I can't seem to get good speeds to ESX over NFS, but RimBlock gets good numbers via raw. I'm hoping I'll have some free time in the next couple of weeks to really dig through all of the IB possibilities. I have enough machines and cards that I don't have to risk my main system and can beat up on a dev environment.

RimBlock: Haven't tried VT-d to a VM yet. Can I do IB to a VM as well as to the host machine for storage and network? If so, a Linux VM running OpenSM could work ;)

Full changelog for 2.7 to 2.9 is here: http://www.mellanox.com/pdf/firmware/ConnectX-FW-2_9_1000-release_notes.pdf
 

RimBlock

Member
Sep 18, 2011
782
6
18
Singapore
RimBlock: Haven't tried VT-d to a VM yet. Can I do IB to a VM as well as to the host machine for storage and network? If so, a Linux VM running OpenSM could work ;)
That was part of my thinking, but also to test out the various protocols on different Windows versions (2008 R2 / SBS 2011 / 2012) without having to reinstall each OS on bare metal.

I finally got my 29C through the post and it was recognised by ESXi, along with the SRP targets, but copying from an SRP target datastore to a local disk went very badly. I was going to look at reinstalling the Mellanox vib (again!) but then decided to try bare metal with SBS 2011 Essentials and see if I can get SRP working with it. Still have to find time for the install though. Unfortunately I suspect it may take a little while, as I have just bought Dead Island Riptide for both me and my eldest son. No firearms allowed here, so we have to make do with a little virtual zombie hacking for frustration relief :).

Thanks, will take a look.

RB
 

analog_

New Member
Apr 28, 2013
9
0
0
I'm using SRP; a little bit of documentation for you guys:
Code:
root@troy:~# uname -a
Linux troy 3.2.0-4-amd64 #1 SMP Debian 3.2.41-2 x86_64 GNU/Linux
root@troy:~# cat /etc/scst.conf
# Automatically generated by SCST Configurator v3.0.0-pre2.


HANDLER vdisk_blockio {
        DEVICE VMMLC {
                filename /dev/zvol/mlc/vm-mlc
        }
        DEVICE VMSLC {
                filename /dev/zvol/slc/vm-slc
        }
}

TARGET_DRIVER ib_srpt {
        TARGET ib_srpt_target_0 {
                enabled 1
                rel_tgt_id 1
                GROUP britta {
                        LUN 0 VMMLC
                        LUN 1 VMSLC

                        INITIATOR *
                }
        }
}
root@troy:~# zpool status
  pool: mlc
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 0h7m with 0 errors on Sat Apr 27 01:57:58 2013
config:

        NAME                                         STATE     READ WRITE CKSUM
        mlc                                          ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            ata-M4-CT128M4SSD2_000000001204032BA7A6  ONLINE       0     0     0
            ata-M4-CT128M4SSD2_000000001133031803BB  ONLINE       0     0     0
            ata-M4-CT128M4SSD2_000000001236091570A1  ONLINE       0     0     0
            ata-M4-CT128M4SSD2_0000000012360915696B  ONLINE       0     0     0

errors: No known data errors

  pool: slc
 state: ONLINE
  scan: scrub repaired 2.42M in 0h1m with 0 errors on Sat Apr 27 02:02:01 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        slc                                               ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM001000H2032HGN  ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM9485007C032HGN  ONLINE       0     0     0
          mirror-1                                        ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM001000RE032HGN  ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM952200WN032HGN  ONLINE       0     0     0
          mirror-2                                        ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM001000RH032HGN  ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM95260087032HGN  ONLINE       0     0     0
          mirror-3                                        ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM0010018H032HGN  ONLINE       0     0     0
            ata-SSDSA2SH032G1GN_INTEL_CVEM952600JY032HGN  ONLINE       0     0     0

errors: No known data errors
root@troy:~# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
mlc          153G   198G  43.4K  /volumes/mlc
mlc/vm-mlc   153G   198G   148G  -
slc         36.1G  81.0G    30K  /slc
slc/vm-slc  36.1G  81.0G  36.1G  -
Did you know that using FUSE you can mount the raw ESXi VMFS volume read-only? It requires you to -f it, but it works; maybe handy for some sort of manual administrative manoeuvre when you're in the trenches and things are getting dirty.
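For reference, that read-only mount comes from the vmfs-tools package; roughly like this (the device path is an example):

```shell
apt-get install vmfs-tools       # Debian package with the FUSE VMFS driver
mkdir -p /mnt/vmfs
vmfs-fuse /dev/sdb1 /mnt/vmfs    # mount the VMFS partition read-only
ls /mnt/vmfs                     # browse the datastore contents
```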

Also probably unrelated, but if I replicate the volume with ZFS send/receive and share it back to ESXi, it only appears on one side (the one that came up first, I believe). Volumes unique to each side do show up. I can't get multipathing to work; how am I supposed to do this HA thing?

PS. I contacted Mellanox about this firmware thing. I don't know if I mentioned it earlier, but I have four Voltaire 500Ex-D cards, three flashed as MHGH28-XTC and one as MHGH28-XSC; why, I don't know, and I don't know the difference between XTC and XSC either.
 
Last edited:

Smalldog

Member
Mar 18, 2013
62
2
8
Goodyear, AZ
PS. I contacted Mellanox about this firmware thing. I don't know if I mentioned it earlier, but I have four Voltaire 500Ex-D cards, three flashed as MHGH28-XTC and one as MHGH28-XSC; why, I don't know, and I don't know the difference between XTC and XSC either.
I thought -XTC indicated tall bracket, and -XSC indicated short bracket?
 

Chuckleb

Moderator
Mar 5, 2013
1,017
330
83
Minnesota
You are correct, that's what the suffixes mean. In case anyone needs 'em, I have an order going in for the CX4 short brackets ;)

I thought -XTC indicated tall bracket, and -XSC indicated short bracket?
Yeah, I'll fire up some SRP tests and maybe ZFS test this weekend. Sigh.
 

analog_

New Member
Apr 28, 2013
9
0
0
Mellanox replied with: we give zero shits about you (paraphrased, with my intonation). Basically it isn't supported and never will be. I'm considering buying an IB switch, the MTEK43132, because it's cheap. Would that work well?

RimBlock/Chuckleb: about that post of yours, is there a performance reason for running 2.9 instead of 2.7? I don't use VT-d, but knowing I can if I want to would be nice. The changelog only mentions RoCE performance items.

edit: Apparently you guys are using the ISR9024D. My wallet won't be happy about the diet with no results; in your estimation, how big a chance is there that this purchase would 'fix' my 200MB/sec issue? It would effectively mirror your setups, AFAIK.
 
Last edited:

RimBlock

Member
Sep 18, 2011
782
6
18
Singapore
Mellanox replied with: we give zero shits about you (paraphrased, with my intonation). Basically it isn't supported and never will be. I'm considering buying an IB switch, the MTEK43132, because it's cheap. Would that work well?

RimBlock/Chuckleb: about that post of yours, is there a performance reason for running 2.9 instead of 2.7? I don't use VT-d, but knowing I can if I want to would be nice. The changelog only mentions RoCE performance items.

edit: Apparently you guys are using the ISR9024D. My wallet won't be happy about the diet with no results; in your estimation, how big a chance is there that this purchase would 'fix' my 200MB/sec issue? It would effectively mirror your setups, AFAIK.
I am actually using a Flextronics F-X430046 (DDR) for my own setup, but also have a Voltaire 4036 sitting here for the Hadoop cluster I am working on building. My Flextronics is pretty quiet, but the Voltaire is loud; as loud as, if not louder than, an unmodified Dell C6100.

I have not found a reliable way to benchmark between my Solaris SAN and my ESXi server, so I cannot accurately say what sort of speeds I am currently getting. I am, however, playing with mounting the SRP targets as RDMs. The plan is to keep the actual VM files on local storage and the virtual machines' disks (including their boot disks) as RDM mappings to my ZFS SRP targets. My secondary drives work fine as RDMs; I have not quite got round to moving the boot drive data over yet.
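For anyone following along, creating the RDM pointer file on the ESXi host looks roughly like this (the device identifier and datastore paths are examples, not from my setup):

```shell
# List the SAN-backed devices ESXi can see
ls /vmfs/devices/disks/
# Create a physical-mode RDM (-z) pointing at the SRP LUN
vmkfstools -z /vmfs/devices/disks/naa.600144f0xxxxxxxx \
    /vmfs/volumes/local-ds/vm1/vm1-rdm.vmdk
```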

RB
 

Chuckleb

Moderator
Mar 5, 2013
1,017
330
83
Minnesota
I am only running 2.7 on the ESX host, 2.9 everywhere else. I just freed up some time, so I am back in testing mode again. I think most of my bottleneck is IPoIB or CentOS NFS, but I will know for sure once I get my SRP configs done and can see if there is a difference.

I don't think there is a performance difference.
 

analog_

New Member
Apr 28, 2013
9
0
0
OK, thank you. I'm going back into testing mode as well. I'll try to generate some Debian-to-Debian numbers (both have ZoL).
FYI: I tried IPoIB before and got stuck around the same 200MB/sec; I vaguely remember the same with Debian-to-Windows iSCSI using a ramdisk. Been a while :/
 

analog_

New Member
Apr 28, 2013
9
0
0
OK, tried SRP between two Debian boxes. Didn't get it working, because apparently the mlx4 driver from the Debian package doesn't do some magic with libibverbs. This means rdma_bench_lat and _bw won't work, and srp_daemon won't work properly. Strangely, there are no problems using scst with ESXi as the client. Ping works, but latency is quite high, around 200us.
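For completeness, the perftest package ships equivalent verbs-level tools that run as a server/client pair; a typical invocation (the device name and the hostname "troy" are examples) would be:

```shell
# on the server side, just start it listening:
ib_send_bw -d mlx4_0
# on the client, point it at the server's hostname:
ib_send_bw -d mlx4_0 troy
```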

Did some more disk benchmarking with fio on Debian. I'm only getting 30/50MB/sec random read/write at 4K using a ZFS raid10-style layout on 8 Intel X25-E drives. More and more I'm thinking ZFS-on-Linux is quite slow on SSD volumes.
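A fio job along the lines of that 4K test might look like this (the zvol path is an example, and the parameters are one plausible choice, not the exact ones used):

```shell
fio --name=randrw4k --filename=/dev/zvol/slc/vm-slc \
    --direct=1 --rw=randrw --rwmixread=50 --bs=4k \
    --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based --group_reporting
```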
 
Last edited: