Best 300TB storage for $100K challenge


memphizz

New Member
Apr 29, 2022
7
0
1
Hi, I work for a research hospital. I would appreciate any comments or advice on file system and hardware design for a high-end storage solution. We have a few microscopes (running Windows 10), each of which can produce 3-4GB/s of image data that needs to be written to shared storage and processed by a small number of Linux servers (say 5) in near real time. The servers also write processed files back to the storage. The minimum write bandwidth is close to 40GB/s when all the microscopes write to the storage at the same time and processing is happening. All the Windows workstations and Linux servers are on InfiniBand HDR. File sizes are 64KB and larger.

Here is a rough list of design factors:
-lowest-latency file system (FS)
-high performance at low queue depth
-high small-file performance; we don’t expect lots of 4KB reads/writes
-near-identical performance on Windows 10 (not Windows Server) and Linux; is RDMA possible on both?
-easy to add more drives (and servers, if that can’t be avoided) without losing performance or drastically changing the FS
-performance and capacity are more important than security; we can take some risk, but we don’t want to deal with failures every couple of months

Should I consider other factors?

For hardware, would it make sense to go with a single storage server loaded with NVMe drives (20x15TB) and software RAID? What would be the best FS to go with?
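For a rough sense of scale, here is a back-of-the-envelope sketch using only the figures quoted above (20 drives of 15TB, 40GB/s aggregate writes). The one outside assumption is that an HDR InfiniBand port signals at 200 Gb/s; encoding, protocol, and RAID overhead are ignored, so treat the results as lower bounds rather than a sizing recommendation.

```python
import math

# Figures quoted in the post.
DRIVES = 20
DRIVE_TB = 15        # decimal terabytes per drive
TARGET_GB_S = 40     # aggregate write rate in GB/s

# Assumption (not from the post): one 4x HDR port signals at 200 Gb/s.
HDR_PORT_GBIT = 200


def raw_capacity_tb(drives=DRIVES, drive_tb=DRIVE_TB):
    """Raw capacity before any RAID or file system overhead."""
    return drives * drive_tb


def seconds_to_fill(capacity_tb, rate_gb_s):
    """Time to fill the pool at a sustained write rate (1 TB = 1000 GB)."""
    return capacity_tb * 1000 / rate_gb_s


def min_hdr_ports(rate_gb_s, port_gbit=HDR_PORT_GBIT):
    """Fewest HDR ports whose combined line rate covers the write rate."""
    return math.ceil(rate_gb_s * 8 / port_gbit)


def per_drive_write_gb_s(rate_gb_s, drives=DRIVES):
    """Sustained write rate each drive must absorb if load is spread evenly."""
    return rate_gb_s / drives


capacity = raw_capacity_tb()
print(capacity, "TB raw")
print(seconds_to_fill(capacity, TARGET_GB_S) / 3600, "hours to fill at full rate")
print(min_hdr_ports(TARGET_GB_S), "HDR ports at minimum")
print(per_drive_write_gb_s(TARGET_GB_S), "GB/s sustained per drive")
```

Under these assumptions a single server would need more than one HDR port just to ingest 40GB/s, and the pool would fill in roughly two hours of continuous peak writing, so how quickly raw data can be deleted or moved off matters as much as the raw capacity.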
 

MBastian

Active Member
Jul 17, 2016
221
69
28
Düsseldorf, Germany
The first thing that comes to mind: do you really need one shared storage system? It is not ideal to have multiple read and write I/O streams hammering on one storage backend. As far as I can tell from your description, it should be possible to have one dedicated volume per microscope (or per pair of microscopes?) for the raw data and one more dedicated volume that the Linux servers can write their processed data to. Depending on your shared storage solution, you could probably host several or even all of the needed volumes (as in: separate controllers and drives) on one physical box.
 

memphizz

New Member
Apr 29, 2022
7
0
1
Valid point. The streams are independent, so separate servers and volumes are a better option. Is there any file system you would recommend?
 

Sean Ho

seanho.com
Nov 19, 2019
822
384
63
Vancouver, BC
seanho.com
Seems like your workload might not really need the features of a full POSIX fs; consider object storage? Or perhaps NVMeoF?

Is the image processing pipeline stream-oriented? It seems a bit inefficient for the microscope controllers to write a bunch of data to remote flash storage only for the compute nodes to immediately read them, rather than streaming directly to RAM on the compute nodes. If archives are needed, that can be done from the compute nodes.
 

memphizz

New Member
Apr 29, 2022
7
0
1
Sean, thanks for the suggestions. Streaming directly to RAM makes sense, but I don't have any experience with it. Do you know of any roadblocks if the image processing pipeline is in Python? Is it possible to stream directly to RAM on multiple nodes to process the data in parallel?
 

Sean Ho

seanho.com
Nov 19, 2019
822
384
63
Vancouver, BC
seanho.com
It'd really need the support of the software devs. If the image processing has a hard real-time requirement, then the flash storage is only being used as a buffer anyway, so might as well use, e.g., RDMA to stream directly from RAM to RAM (python-rdma). Or TCP if the ease of dev work is worth the additional host overhead.
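As a minimal sketch of the TCP option, assuming each image can be sent as one length-prefixed message: the 8-byte header format and the function names `send_frame` and `recv_frame` are illustrative choices, not part of any vendor software or existing pipeline.

```python
import socket
import struct

# Hypothetical wire format: 8-byte big-endian length, then the raw frame bytes.
HEADER = struct.Struct("!Q")


def send_frame(sock, payload: bytes) -> None:
    """Send one frame as a length prefix followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(sock) -> bytes:
    """Receive one frame straight into memory on the compute node."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return _recv_exact(sock, length)
```

The receiver ends up with each frame in RAM without touching a file system. Whether plain Python sockets can keep up with 3-4GB/s per microscope is a separate question that would need measuring.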

It depends on what the image processing is. If it's, e.g., stacking of 2D confocal microscopy, that's super simple computationally, just need enough RAM to hold two frames' worth of data, using an online/streaming median algorithm.
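To illustrate the "two frames' worth of RAM" idea, here is a sketch that swaps in an approximate "frugal" streaming estimator rather than an exact median: it keeps one running estimate per pixel and nudges it toward each new frame. Frames are plain lists of numbers here to keep the example dependency-free; a real pipeline would presumably use array operations. The function names and the `step` parameter are illustrative.

```python
def update_median(estimate, frame, step=1.0):
    """Nudge each pixel of `estimate` toward `frame` by at most `step`, in place."""
    for i, value in enumerate(frame):
        if value > estimate[i]:
            estimate[i] += step
        elif value < estimate[i]:
            estimate[i] -= step
    return estimate


def streaming_median(frames, step=1.0):
    """Approximate the per-pixel median of an iterable of equally sized frames.

    Only the running estimate and the current frame are in memory at once.
    """
    it = iter(frames)
    estimate = [float(v) for v in next(it)]  # seed with the first frame
    for frame in it:
        update_median(estimate, frame, step)
    return estimate
```

The estimate only approaches the true median over many frames and its accuracy depends on `step`, so this fits cases where an approximation is acceptable.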
 

memphizz

New Member
Apr 29, 2022
7
0
1
Sean, I just realized that none of the detector vendors for the microscopes support RDMA. The processing is mainly for SIM, lattice light-sheet, and super-resolution imaging. Next time I meet with them, I will suggest this feature.
 

MBastian

Active Member
Jul 17, 2016
221
69
28
Düsseldorf, Germany
As @Sean Ho wrote: RDMA is just one option. A feature request would probably take months or even years to implement.
Am I right in guessing that these SIM microscopes push out individual images rather than a video stream? If so, there is really no need to write the raw images to flash.
A simple-stupid "solution" would be to export a sufficiently sized RAM disk on your Linux compute nodes as an NFS share and let the microscope(s) dump their frames there. The software can then fetch, compute, push the result to storage, and delete the raw images in one go.
OK, that's a really simple and probably really stupid setup that I'd only seriously consider for a legacy software environment. There are much nicer solutions available, depending on your image processing software's capabilities.
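A minimal sketch of that fetch, compute, push, delete loop, assuming raw frames land as files in a RAM-disk directory and results go to a directory on persistent storage. The directory arguments, the `.part` suffix for files still being written, and the `compute` callback are all hypothetical; the microscope software would have to cooperate with whatever "file is complete" convention is chosen.

```python
import os
from pathlib import Path


def process_once(raw_dir, out_dir, compute):
    """Process every complete raw file in raw_dir and return the result paths.

    `compute` is a placeholder: any callable that maps raw bytes to result bytes.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    written = []
    for raw in sorted(raw_dir.iterdir()):
        # Assumed convention: writers use a ".part" suffix until a file is complete.
        if not raw.is_file() or raw.name.endswith(".part"):
            continue
        result = compute(raw.read_bytes())       # fetch + compute
        tmp = out_dir / (raw.name + ".part")
        tmp.write_bytes(result)                  # push result to storage
        final = out_dir / raw.name
        os.replace(tmp, final)                   # rename so readers never see partial output
        raw.unlink()                             # free RAM-disk space only after the result is stored
        written.append(final)
    return written
```

Calling `process_once` in a loop with a short sleep gives a crude polling worker. Deleting the raw frame last means a crash mid-compute leaves the input in place for a retry.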
 

memphizz

New Member
Apr 29, 2022
7
0
1
I like the idea of a RAM disk; sharing it with the other compute nodes should be efficient too. I think SMB Direct is supported by Windows 10.
 

memphizz

New Member
Apr 29, 2022
7
0
1
I found rram-linux, which sounds very close to what you are suggesting. How would you do it?
 

Sean Ho

seanho.com
Nov 19, 2019
822
384
63
Vancouver, BC
seanho.com
Yes, from a cursory glance that looks like a set of scripts outlining what you could do manually to set it up: create ramdisk, configure for NFS export, enable NFS over RDMA (rpcrdma has superseded svcrdma/xprtrdma), mount from client. The bits with loopback block dev and mdadm are optional.
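The manual steps can be sketched as the shell commands below, generated from Python so the parameters are explicit. Nothing is executed; the functions only return strings to review and run as root. The paths, RAM-disk size, client subnet, and export options are placeholder assumptions, the NFS server must already be running before the `portlist` write, and this covers Linux clients only. Port 20049 is the registered NFS-over-RDMA port.

```python
def server_commands(ramdisk="/mnt/ramdisk", size="64G",
                    clients="10.0.0.0/24", port=20049):
    """Commands to create a tmpfs RAM disk and export it over NFS/RDMA."""
    # tmpfs has no stable device ID, so the export needs an explicit fsid.
    export = f"{ramdisk} {clients}(rw,async,insecure,no_root_squash,fsid=1)"
    return [
        f"mkdir -p {ramdisk}",
        f"mount -t tmpfs -o size={size} tmpfs {ramdisk}",
        f"echo '{export}' >> /etc/exports",
        "exportfs -ra",
        "modprobe rpcrdma",
        # Tell the running NFS server to listen for RDMA connections too.
        f"echo 'rdma {port}' > /proc/fs/nfsd/portlist",
    ]


def client_commands(server, ramdisk="/mnt/ramdisk",
                    mountpoint="/mnt/raw", port=20049):
    """Commands to mount the export from a Linux client over RDMA."""
    return [
        "modprobe rpcrdma",
        f"mkdir -p {mountpoint}",
        f"mount -t nfs -o rdma,port={port} {server}:{ramdisk} {mountpoint}",
    ]


if __name__ == "__main__":
    for cmd in server_commands() + client_commands("storage01"):
        print(cmd)
```

The hostname `storage01` is made up for the example. The optional loopback block device and mdadm steps are left out.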
 

dandanio

Active Member
Oct 10, 2017
199
78
28
Are you using Silk with on-prem NVMe? It would be great if you could share your application.
I use Kaminario on-prem in two DCs and a similar product, rebranded as Silk, in GCP in multiple regions. It scales both horizontally and vertically, and increasing the number of controllers increases the IOPS. It is flexible, a true workhorse, and can easily handle your required IOPS. We use it as an Oracle backend for some sizable databases doing real-time data lookups. If you are serious about exploring further, message me privately; no need to derail the conversation.