Best 300TB storage for $100K challenge

memphizz · Apr 29, 2022

Hi, I work for a research hospital. I would appreciate any comments or advice for file system and hardware design for a high-end storage solution. We have a few microscopes (running windows 10), each can produce 3-4GB/s of image data that needs to written to the shared storage and to be processed by small number of linux servers (say 5) near real-time. The servers also write processed files back to the storage. Minimum write bandwidth is close to 40GB/s when all the microscopes write to the storage at the same time and processing is happening. All the windows workstations and linux servers are on infiniband HDR. The file sizes are 64KB and larger.

Here is a rough list of design factors:
-lowest latency file system (FS)
-high performance with low queue depth
-high small file performance, don’t expect lots of 4KB RW
-near identical performance on windows 10 (not windows server) and linux, using RDMA on windows and linux possible?
-easy to add more drives (and servers if it can’t be avoided) without losing performance or drastic change in FS
-performance and capacity is more important than Security, we can take some risk but we don’t want to deal with failures every couple of months

Should I consider other factors?

For hardware, would it make sense to go with single storage server loaded with NVME (20x15TB) and software raid? What would be the best FS to go with?

dandanio · Apr 29, 2022

Use Silk/Kaminario. You will thank me later.
(not affiliated, just a customer).

MBastian · Apr 29, 2022

First thing that comes to my mind. Do you really need one shared storage? It is not nice to have multiple read and write I/O streams hammering on one storage backend. As far as I can tell from your description it should be possible to have one dedicated volume per (one or two?) microscope(s) for the raw data and one more dedicated volume the linux server can write their processed data to. Depending on your shared storage solution you probably could host multiple or even all needed volumes(as in: Separate controllers and drives) on one physical box.

memphizz · Apr 29, 2022

dandanio said:
Use Silk/Kaminario. You will thank me later.
(not affiliated, just a customer).

Are you using Silk with on prem NVME? Would be great if you could share your application.

memphizz · Apr 29, 2022

MBastian said:
First thing that comes to my mind. Do you really need one shared storage? It is not nice to have multiple read and write I/O streams hammering on one storage backend. As far as I can tell from your description it should be possible to have one dedicated volume per (one or two?) microscope(s) for the raw data and one more dedicated volume the linux server can write their processed data to. Depending on your shared storage solution you probably could host multiple or even all needed volumes(as in: Separate controllers and drives) on one physical box.

Valid point, the streams are independent, so separate servers and volumes is a better option. Any files system that you would recommend?

Sean Ho · Apr 29, 2022

Seems like your workload might not really need the features of a full POSIX fs; consider object storage? Or perhaps NVMeoF?

Is the image processing pipeline stream-oriented? It seems a bit inefficient for the microscope controllers to write a bunch of data to remote flash storage only for the compute nodes to immediately read them, rather than streaming directly to RAM on the compute nodes. If archives are needed, that can be done from the compute nodes.

memphizz · Apr 29, 2022

Sean Ho said:
Seems like your workload might not really need the features of a full POSIX fs; consider object storage? Or perhaps NVMeoF?

Is the image processing pipeline stream-oriented? It seems a bit inefficient for the microscope controllers to write a bunch of data to remote flash storage only for the compute nodes to immediately read them, rather than streaming directly to RAM on the compute nodes. If archives are needed, that can be done from the compute nodes.

Sean, thanks for the suggestions. Streaming directly to RAM makes sense, but I don't have any experience. Do you know of any road blocks if the image processing pipeline is in python? is it possible to stream directly to RAM on multiple nodes to parallel process the data?

Sean Ho · Apr 29, 2022

It'd really need the support of the software devs. If the image processing has a hard real-time requirement, then the flash storage is only being used as a buffer anyway, so might as well use, e.g., RDMA to stream directly from RAM to RAM (python-rdma). Or TCP if the ease of dev work is worth the additional host overhead.

It depends on what the image processing is. If it's, e.g., stacking of 2D confocal microscopy, that's super simple computationally, just need enough RAM to hold two frames' worth of data, using an online/streaming median algorithm.

memphizz · Apr 29, 2022

Sean Ho said:
It'd really need the support of the software devs. If the image processing has a hard real-time requirement, then the flash storage is only being used as a buffer anyway, so might as well use, e.g., RDMA to stream directly from RAM to RAM (python-rdma). Or TCP if the ease of dev work is worth the additional host overhead.

It depends on what the image processing is. If it's, e.g., stacking of 2D confocal microscopy, that's super simple computationally, just need enough RAM to hold two frames' worth of data, using an online/streaming median algorithm.

Sean, I just realized that none of the detector vendors for the microscopes support RDMA. The processing is mainly for SIM, lattice light sheet and super res imaging. Next time I meet with them I suggest this feature.

MBastian · Apr 30, 2022

As @Sean Ho wrote: RDMA is just one option. A feature request would probably take months or even years to implement.
Am I guessing right, that these SIM microscopes are pushing out images and not a video stream? If yes, there is really no need to write the raw images to flash.
A simple-stupid "solution" would be to export a sufficiently sized RAM Disk on your Linux computation nodes as an NFS share and let the microscope(s) dump their frames there. Thus the software can fetch, compute, push result to storage and delete the raw images in one go.
Ok, that's a really simple and probably really stupid setup I'd only seriously consider for a legacy software environment. There are much nicer solutions available depending on your image processing softwares capabilities.

Sean Ho · Apr 30, 2022

ramdisk is really not a bad idea! you might still be able to fit rdma in there with SMB Direct or rpcrdma (nfs), dunno.

memphizz · May 1, 2022

MBastian said:
As @Sean Ho wrote: RDMA is just one option. A feature request would probably take months or even years to implement.
Am I guessing right, that these SIM microscopes are pushing out images and not a video stream? If yes, there is really no need to write the raw images to flash.
A simple-stupid "solution" would be to export a sufficiently sized RAM Disk on your Linux computation nodes as an NFS share and let the microscope(s) dump their frames there. Thus the software can fetch, compute, push result to storage and delete the raw images in one go.
Ok, that's a really simple and probably really stupid setup I'd only seriously consider for a legacy software environment. There are much nicer solutions available depending on your image processing softwares capabilities.

I like the idea of ramdisk, especially sharing that with other compute nodes should be efficient too. I think SMB direct is supported by windows 10.

memphizz · May 4, 2022

MBastian said:
As @Sean Ho wrote: RDMA is just one option. A feature request would probably take months or even years to implement.
Am I guessing right, that these SIM microscopes are pushing out images and not a video stream? If yes, there is really no need to write the raw images to flash.
A simple-stupid "solution" would be to export a sufficiently sized RAM Disk on your Linux computation nodes as an NFS share and let the microscope(s) dump their frames there. Thus the software can fetch, compute, push result to storage and delete the raw images in one go.
Ok, that's a really simple and probably really stupid setup I'd only seriously consider for a legacy software environment. There are much nicer solutions available depending on your image processing softwares capabilities.

I Found rram-linux, sounds very close to what you are suggesting. How would you do it?

Sean Ho · May 4, 2022

Yes, from a cursory glance that looks like a set of scripts outlining what you could do manually to set it up: create ramdisk, configure for NFS export, enable NFS over RDMA (rpcrdma has superseded svcrdma/xprtrdma), mount from client. The bits with loopback block dev and mdadm are optional.

dandanio · May 7, 2022

memphizz said:
Are you using Silk with on prem NVME? Would be great if you could share your application.

I use Kaminario on prem in 2 DC's and a similar product although rebranded Silk in GCP in multiple regions. It scales both horizontally and vertically and by increasing the number of controllers you increase the IOPS. Flexible, a true workhorse. It easily can handle your required IOPS. We use it as an Oracle backend for some sizable databases for a real time data lookups. If you are serious about exploring more, hit me up in privately, no need to derail the conversation.

Search

Best 300TB storage for $100K challenge

memphizz

New Member

dandanio

Active Member

MBastian

Active Member

memphizz

New Member

memphizz

New Member

Sean Ho

seanho.com

memphizz

New Member

Sean Ho

seanho.com

memphizz

New Member

MBastian

Active Member

Sean Ho

seanho.com

memphizz

New Member

memphizz

New Member

Sean Ho

seanho.com

dandanio

Active Member