CX3 iSER


efschu2

Member
Feb 14, 2019
I would like to use scst (actually I would like to use LIO...), but I haven't found a good OCF heartbeat script that works with Ubuntu out of the box without constant debugging, so I'm left with tgt, I guess...
Maybe I'll write one myself, but I don't know when I'll have time.
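For what it's worth, the bare-bones shape such an agent could take, assuming scst runs as a systemd service - everything below is an illustrative sketch, not a tested agent:

    #!/bin/sh
    # Illustrative OCF resource agent skeleton for scst (not a tested agent).
    : ${OCF_ROOT:=/usr/lib/ocf}
    . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

    scst_start()  { systemctl start scst && return $OCF_SUCCESS; return $OCF_ERR_GENERIC; }
    scst_stop()   { systemctl stop scst  && return $OCF_SUCCESS; return $OCF_ERR_GENERIC; }
    scst_monitor() {
        # "Service active" is a crude health check; a real agent would also
        # verify that the target is actually exported.
        systemctl is-active --quiet scst && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }

    case $__OCF_ACTION in
        start)     scst_start ;;
        stop)      scst_stop ;;
        monitor)   scst_monitor ;;
        meta-data) exit $OCF_SUCCESS ;;   # XML metadata omitted in this sketch
        *)         exit $OCF_ERR_UNIMPLEMENTED ;;
    esac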
 

zxv

The more I C, the less I see.
Sep 10, 2017
Wow, nice.

I agree about netplan. Jumbo frames in particular are difficult to configure with it.
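For reference, recent netplan does accept an mtu key; a minimal sketch, with the file name, interface name, and address purely illustrative:

    # /etc/netplan/60-storage.yaml (illustrative name and values)
    network:
      version: 2
      ethernets:
        enp3s0f0:
          mtu: 9000
          addresses: [10.10.10.1/24]

    # apply and verify the MTU actually changed:
    sudo netplan apply
    ip link show enp3s0f0

Verifying with ip link is worthwhile, since netplan apply can succeed while the MTU stays unchanged.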

The bandwidth tests show very smooth curves, which looks very encouraging to me. I take it as an indication that flow control is working very consistently.
 

zxv

The more I C, the less I see.
Sep 10, 2017
I wonder, what is the most cost-effective kind of zil/slog device for high availability failover?

I'm not yet doing HA, partly because with kexec, reboots take around 20 seconds, which seems to be relatively trouble-free for both iSCSI and NFS ESXi clients. I'm interested in pursuing this, and how to implement the zil/slog is the one thing I'm struggling with.
 

mpogr

Active Member
Jul 14, 2016
@efschu2 , regarding the bench results at the end: you have 3 ATTO graphs; what's the difference between them?
Also, why do you insist on sync=always? It totally destroys write performance for me. Below are my graphs with sync=standard and sync=always for comparison.
 

Attachments

efschu2

Member
Feb 14, 2019
Single runs with qd1, qd10, qd256, and a parallel run of two VMs with qd256.
If your data is important you MUST use sync=always, otherwise you lose whatever data has not yet been written to disk in case of a failure. Officially, ESXi and scst pass sync requests through, but not every piece of software handles critical data correctly with sync/cache flushes - so if you know your software does that correctly you can go for sync=standard, and if you don't care about data integrity you can go for sync=disabled.
To get good performance with sync=always you need consistently high write IOPS, e.g. a pool of enterprise SSDs, or add a good slog device like an Optane.

Btw:
If you see increased IOPS with sync=standard in an ATTO bench with direct I/O, this indicates your sync requests are not being honored, so it acts like sync=disabled. Read carefully about the following settings (a configuration sketch follows the list) to be aware of what is going on with your data:
scst:
write_through 1 vs 0
nv_cache 1 vs 0
fileio vs blockio
zfs:
sync always vs standard vs disabled
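A minimal sketch of where these knobs live - the pool/dataset name, device name, and zvol path are illustrative, and the scst.conf fragment follows the scstadmin config format:

    # zfs: treat every write as synchronous on the backing dataset
    zfs set sync=always tank/vmstore          # illustrative pool/dataset

    # /etc/scst.conf fragment (illustrative device name and zvol path):
    HANDLER vdisk_fileio {
        DEVICE vmstore {
            filename /dev/zvol/tank/vmstore
            write_through 1    # 1 = write through the page cache
            nv_cache 0         # 0 = honor cache-flush requests from the initiator;
                               # 1 = declare the cache non-volatile, syncs are ignored
        }
    }
    # vdisk_blockio does direct block I/O and bypasses the page cache,
    # so write_through does not apply to it.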

Btw2:
in my case sync=always is absolutely necessary, because if node1 dies for some reason, node2 takes over the pool and exports the target/LUN, so my ESXi VMs don't even recognize that their storage was offline for a short time and that the data cached in RAM was never written to persistent storage. A running process (a database, for example) "thinks" all its data has reached disk, but it hasn't - that database would probably end up corrupted.

if you don't plan HA and your storage node dies, then your VMs just crash - but that would not corrupt your data; your data loss would "only" be the last txg sync commit. This is exactly what I do in my homelab - but for sure it's a no-go for production
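For illustration, the rough shape of that takeover on node2 - pool name and config path are illustrative, and a real HA agent would fence the dead node and run checks first:

    zpool import -f tank                 # forcibly take over the pool
    scstadmin -config /etc/scst.conf     # re-export the target/LUN to the initiators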
 
Last edited:

ASBai

New Member
Sep 23, 2019
efschu2 said: (quoted in full above)
According to the ZFS doc I have seen:
sync=standard
This is the default option. Synchronous file system transactions
(fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
and then secondly all devices written are flushed to ensure
the data is stable (not cached by device controllers).


sync=always
For the ultra-cautious, every file system transaction is
written and flushed to stable storage by a system call return.
This obviously has a big performance penalty.

sync=disabled
Synchronous requests are disabled. File system transactions
only commit to stable storage on the next DMU transaction group
commit which can be many seconds. This option gives the
highest performance. However, it is very dangerous as ZFS
is ignoring the synchronous transaction demands of
applications such as databases or NFS.
Setting sync=disabled on the currently active root or /var
file system may result in out-of-spec behavior, application data
loss and increased vulnerability to replay attacks.
This option does *NOT* affect ZFS on-disk consistency.
Administrators should only use this when these risks are understood.
The standard behavior is enough for ANY serious software, such as nearly all modern databases and file systems, because any serious software will assume data has been written to the disk *ONLY* after the corresponding fsync (POSIX) / FlushFileBuffers (Windows) call returns.

If a power failure or other failure occurs before fsync returns, the software will ensure data consistency through binlog rollback and similar techniques the next time it starts (whether or not that is on the same node).

This also applies to NFS: NFS's async mount also provides reliability guarantees for calls such as fsync:

The sync mount option

The NFS client treats the sync mount option differently than some other file systems (refer to mount(8) for a description of the generic sync and async mount options). If neither sync nor async is specified (or if the async option is specified), the NFS client delays sending application writes to the server until any of these events occur:
Memory pressure forces reclamation of system memory resources.
An application flushes file data explicitly with sync(2), msync(2), or fsync(3).
An application closes a file with close(2).
The file is locked/unlocked via fcntl(2).

In other words, under normal circumstances, data written by an application may not immediately appear on the server that hosts the file.
If the sync option is specified on a mount point, any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space. This provides greater data cache coherence among clients, but at a significant performance cost.

Applications can use the O_SYNC open flag to force application writes to individual files to go to the server immediately without the use of the sync mount option.
So ZFS's sync=standard and NFS's async options provide exactly the same consistency guarantees as the file systems you use on your local disk (local disk writes also sit only in a memory buffer until fsync is called).

Of course, you may lose the last transaction that had not yet been successfully committed (fsynced) on the database or file system in the event of a power loss, but there will be no data corruption. This is by design and cannot be avoided even if sync mode is turned on. And no doubt, the database or file system does not assume that a transaction has committed before the fsync call returns (they will not notify their users that the operation completed successfully before fsync returns).

Therefore, setting sync=always is effectively the same as completely disabling the write buffer of a local disk, which is completely unnecessary for almost all serious modern software.
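One way to sanity-check this end to end (an illustrative probe, assuming fio on a client with the datastore mounted; the path is made up): run 4k random writes once with an fsync after every write and once without. A large gap is expected when flushes are honored; nearly identical numbers suggest syncs are being swallowed somewhere in the stack, as described earlier in the thread.

    # fsync after every write (illustrative path):
    fio --name=sync-probe --filename=/mnt/datastore/probe.bin --size=256m \
        --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --fsync=1
    # same workload without fsync:
    fio --name=async-probe --filename=/mnt/datastore/probe.bin --size=256m \
        --rw=randwrite --bs=4k --ioengine=libaio --direct=1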
 
Last edited:
Apr 21, 2016
Hi there. Since bricking my SX6015 (unmanaged IB), which I was using for zfs + zvol + scst + srpt -> ESXi SRP with the 1.8.2.5 driver on CX3, I've had to move to an Ethernet setup - a friend lent me an Arista with 100G ports - couldn't refuse :). I've had countless bad hours because of the CX3 iSER hang bug. I couldn't find what triggers the hang, so here are some specific questions that I think would be helpful for me and others going this route:

1. Does anyone have a clear set of conditions that trigger the iSER hang bug with CX3 on ESXi?

2. Does anyone have insights on MLNX OFED driver versions that work with CentOS 7/8? Insights on using the inbox drivers would also be useful.
3. The environment I've ended up with is:

nodes - initiators - ESXi 6.7 (latest update) with the MLNX OFED 3.17.70.1-1OEM.670.0.0.8169922 driver, iSER on a single port with no other service running alongside, CX3 Pro adapter with 2.42.5700 firmware - HP FLR adapter.

storage - target - CentOS 7 running the latest scst from git, on zfs - 3.10 CentOS kernel with inbox drivers - CX3 adapters running the 2.42.5000 bin firmware from Mellanox (no Pros here).

This setup hangs...

What is different from the documented ESXi bug is that a restart of the scst service (which removes/re-inserts the modules and loads the config) restores the iSER connection on the vSphere nodes.
Has anyone made sense of this? It seems to me like the bug has "moved" into the target part of the storage.
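For reference, the workaround described above amounts to the following, assuming the packaged scst service and the usual config path:

    systemctl restart scst               # unloads/reloads modules, re-applies config
    # or re-apply the saved configuration by hand:
    scstadmin -config /etc/scst.conf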