How to store a lot of small files?


LL0rd

New Member
Feb 25, 2020
Hey guys, I need a little bit of help. I need to store a lot of small files. The files are binary (images, for example) or text files, and both types are around 100 KB each. I need to be able to retrieve a file by a path or a unique ID: a normal path like in a Unix/Windows OS, a key string like in S3 (or any relational database), or a hash like with IPFS. My problem is that there are going to be a lot of files. Millions? Billions? Just as an example: say you wanted to keep a copy of GitHub. How do they store their hosted files?

About 10 years ago I had to code a similar project. It was a website where customers could buy photos that were printed and framed, ready to hang on the wall. For that project I used a MySQL database to store the data. It was fine in the beginning. Then DSLRs got more popular, file sizes grew, and the number of files grew. At some point MySQL crashed, and its own recovery process took hours to get the database online again.

On the other hand, a friend had a problem with his project this year. He got a lot of traffic and ran out of inodes on the filesystem because of billions of small files.

That's why I currently have no idea how to handle this number of files.

BTW, it's a non-profit project, so there isn't really a budget.
 

BoredSysadmin

Not affiliated with Maxell
Mar 2, 2019
ZFS will handle lots of small files nicely (2^48 entries, approx. 281 trillion, per directory, and effectively unlimited per file system).
You could also go with a scale-out file system like Gluster or MooseFS, but that would add significant complexity, in my humble opinion.
 

BackupProphet

Well-Known Member
Jul 2, 2014
Stavanger, Norway
olavgg.com
There are a lot of options today: ZFS, SeaweedFS if you're into self-hosted solutions, SQLite/PostgreSQL. For more advanced and distributed setups, Ceph can be used too.
I would recommend starting with ZFS; it's the simplest and can scale to several petabytes with one machine and multiple JBODs.
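
Since SQLite is on that list, here is a minimal sketch of keeping ~100 KB blobs in a single SQLite table keyed by an ID. The file, table, and function names are mine, not from the thread, and whether this holds up at billions of rows is something you'd have to benchmark.

Code:
# Minimal sketch: small files stored as BLOBs in SQLite, keyed by an ID.
# All names here are illustrative, not from the thread.
import sqlite3

conn = sqlite3.connect("files.db")
conn.execute("PRAGMA journal_mode=WAL")  # better concurrency for many small writes
conn.execute(
    "CREATE TABLE IF NOT EXISTS files ("
    "  id TEXT PRIMARY KEY,"
    "  data BLOB NOT NULL"
    ")"
)

def put_file(file_id: str, payload: bytes) -> None:
    # ~100 KB blobs fit comfortably in a single row
    conn.execute("INSERT OR REPLACE INTO files (id, data) VALUES (?, ?)", (file_id, payload))
    conn.commit()

def get_file(file_id: str) -> bytes | None:
    row = conn.execute("SELECT data FROM files WHERE id = ?", (file_id,)).fetchone()
    return row[0] if row else None

put_file("invoice-0001", b"\x89PNG...example binary payload...")
print(len(get_file("invoice-0001") or b""))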
 

piranha32

Active Member
Mar 4, 2023
If you're not afraid to store files in a database, take a look at MongoDB. I used it for storing data obtained from social media. The data was spread across several databases, each one with >20B records, each record several kB in size. Databases can easily be partitioned between several hosts, and the performance of such storage depends on how much hardware you throw at it.
Mongo also offers a filesystem backed by the database (GridFS), but they recommend it for files bigger than 16 MB; smaller files are better stored directly in documents. IIRC 16 MB is the size limit of a single document, but that might have changed since then.
Either way, this is one of the constraints you should consider.
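
To make the two approaches concrete, here is a hedged pymongo sketch; the database, collection, and field names are placeholders, not anything from my project.

Code:
# Sketch of both paths: small files embedded in documents, larger ones in GridFS.
# Database/collection/field names are placeholders.
import gridfs
from bson.binary import Binary
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["filestore"]

# Small files (well under the 16 MB BSON document limit): embed the bytes
# directly in a document keyed by your own ID.
def put_small(file_id: str, payload: bytes) -> None:
    db.files.replace_one(
        {"_id": file_id},
        {"_id": file_id, "data": Binary(payload)},
        upsert=True,
    )

# Larger files: GridFS splits them into chunks across two collections.
fs = gridfs.GridFS(db)

def put_large(name: str, payload: bytes):
    return fs.put(payload, filename=name)  # returns the GridFS file _id

put_small("img-000001", b"...binary image data...")
big_id = put_large("scan.tiff", b"..." * 8_000_000)  # ~24 MB, over the document limit
print(db.files.find_one({"_id": "img-000001"})["data"][:3], fs.get(big_id).length)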
 

LL0rd

New Member
Feb 25, 2020
piranha32 said:
If you're not afraid to store files in a database, take a look at MongoDB. [...]
Hmm... the performance would be the interesting part, because I had issues with MySQL about 10 years ago. Everything started out fine when only a few users were using the website. But then the website and the database grew, and at around 100 GB I really had performance problems. Queries against the regular data also seemed to slow down. Granted, the machine at the time had only 32 GB of RAM; today most servers I use have 128 or 256 GB. So maybe MongoDB is a good idea.

One thing I didn't mention (or, to be honest, deleted from my question) is replication, or geo-replication. I think I can do that more efficiently (and cheaply) with a database than with a filesystem.
 

piranha32

Active Member
Mar 4, 2023
In my first attempt to build the database I used Postgres. It has huge analytic capabilities, but it crashed and burned at slightly over 100M rows in the main table. Mongo on the same hardware ran circles around Postgres, but at the cost of reduced capabilities.
What makes or breaks the performance of a Mongo database is the size of the indices. As long as the ones you use most fit in memory, you should be good. My databases were split between several physical shards, and the data was distributed across many collections to keep the size of the indices in check and make them quick to reload. The performance of the storage medium also had a huge impact on the performance of the entire database: what ran smoothly on 15k (or 10k, I don't remember) SAS drives became sluggish after moving to 7.2k SATA drives.
Storing the data on flash (or at least moving the indices there) should give a huge boost in performance.
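
If it helps, a rough way to check whether the hot indices fit in RAM is to read the collection statistics. A small pymongo sketch, with placeholder database and collection names:

Code:
# Rough check of index sizes versus available RAM; names are placeholders.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["filestore"]

stats = db.command("collStats", "files")
total_index_bytes = stats["totalIndexSize"]
per_index = stats["indexSizes"]          # bytes per individual index

print(f"total index size: {total_index_bytes / 2**30:.2f} GiB")
for name, size in per_index.items():
    print(f"  {name}: {size / 2**20:.1f} MiB")
# Compare the hot indices against available RAM (the WiredTiger cache is
# roughly half of system memory by default) to judge whether they stay resident.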

As for replication, it comes almost for free in MongoDB. Split the database between shards, make each shard a replica set, and you get local redundancy plus a big performance boost (replicas can serve queries in parallel). Geo-replication can be achieved by replicating the entire database.
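
A hedged sketch of that shard-plus-replica-set layout, issued through a mongos router with pymongo; all names are placeholders, and each shard is assumed to already be configured as its own replica set:

Code:
# Sharding a collection of file documents across replica-set shards.
# Host, database, and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # connect to the mongos router

client.admin.command("enableSharding", "filestore")
client.admin.command(
    "shardCollection",
    "filestore.files",
    key={"_id": "hashed"},   # a hashed key spreads random IDs evenly across shards
)
# Reads can then be served from secondaries if slightly stale data is acceptable:
# MongoClient(..., readPreference="secondaryPreferred")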

One thing I've learned over the years is that the sets "this should scale well" and "this scales well" have surprisingly little overlap. Get ready to experiment a lot with the structure of the database, especially if you plan anything more complicated than a plain data store with a single index.
 


DavidWJohnston

Active Member
Sep 30, 2020
I've seen something like this before. The filesystem was ZFS; configure it as you need. Each set of files lived in a folder identified by a random UUID-4. The filenames and UUIDs were stored in a DB, but the tree could also be browsed read-only.

If this was the UUID: fe2b1ff5-a442-4b76-a5a1-0d85429ae77f

Then the storage in the ZFS folder structure was like this, with the first few chars of the UUID creating a subfolder tree:

/jobdata/f/e/2/b/1/fe2b1ff5-a442-4b76-a5a1-0d85429ae77f/somefile.jpg

About 75% of all UUID folders held only one file, as each UUID represented one "job" but it was possible to have more.

The subfolders based on the first few characters of the UUID keep the per-directory entry count low, so the folders can be browsed and listed with ls without hanging. We calculated how many single-character levels we needed, based on the randomness of the type-4 UUID, to hit a target subfolder count.

It's also easy to machine-parse, because the folder tree is known from the UUID alone, and the tree depth is constant.

UUIDs are also nice because you can have a whole farm of "loading machines" firehosing data into the store without needing a central machine to generate unique IDs.

Depending on your use case, this can be modified: for example, instead of the original filenames you could store just a package.zip in each UUID subfolder to avoid special-character issues (e.g. ":" works in Linux filenames but not on Windows). If you control the filenames, you remove these potential issues and maybe improve reliability.
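
A rough Python sketch of that layout; the base directory and the depth of five single-character levels follow the example path above, everything else is illustrative:

Code:
# Sketch of the UUID fan-out scheme described above. Base directory and depth
# match the example path; all other names are illustrative.
import os
import uuid

BASE = "/jobdata"
DEPTH = 5  # one single-character directory per level, as in /f/e/2/b/1/

def job_dir(job_uuid: str, depth: int = DEPTH) -> str:
    hex_chars = job_uuid.replace("-", "")[:depth]
    return os.path.join(BASE, *hex_chars, job_uuid)

u = "fe2b1ff5-a442-4b76-a5a1-0d85429ae77f"
print(job_dir(u))  # /jobdata/f/e/2/b/1/fe2b1ff5-a442-4b76-a5a1-0d85429ae77f

# Rough fan-out arithmetic: hex characters give 16**DEPTH leaf directories,
# so N jobs leave about N / 16**DEPTH UUID folders per leaf.
N = 1_000_000_000
print(f"~{N / 16**DEPTH:,.0f} UUID folders per leaf at depth {DEPTH}")

def store(job_uuid: str, filename: str, payload: bytes) -> str:
    path = job_dir(job_uuid)
    os.makedirs(path, exist_ok=True)
    target = os.path.join(path, filename)
    with open(target, "wb") as fh:
        fh.write(payload)
    return target

store(str(uuid.uuid4()), "package.zip", b"...")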

Good Luck!
 

LL0rd

New Member
Feb 25, 2020
I know that ZFS has a replication feature. But as far as I know (and the site you linked doesn't say otherwise), the replication only works one way. So if I replicate dataset A to another location, where it becomes dataset B, and then write to both A and B (and I'm only talking about new files, not modifications of existing ones), there is no way to merge the datasets back together. rsync isn't an option either because of the huge number of files; it would take days to merge the datasets.

Well, to be honest, I even tried Syncthing and Resilio to sync the data (in another project), and the result was bad: some files were wiped, the load was high, some files were never synced at all, and so on. Just wanted to mention that.
 

oneplane

Well-Known Member
Jul 23, 2021
Depending on how you intend to access them, an object storage system might be better than a filesystem. Even something like MinIO or Mongo GridFS would work fine to give you fast file lookups without having to maintain a filesystem-to-database mapping.
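
For illustration, a hedged sketch of the object-storage route using boto3 against an S3-compatible endpoint such as MinIO; the endpoint, credentials, and bucket name are placeholders:

Code:
# Storing and fetching small files as objects keyed by a UUID.
# Endpoint, credentials, and bucket name are placeholders.
import uuid
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.local:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

BUCKET = "smallfiles"

def put_object(payload: bytes, content_type: str = "application/octet-stream") -> str:
    key = str(uuid.uuid4())                 # the key becomes the lookup handle
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload, ContentType=content_type)
    return key

def get_object(key: str) -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

k = put_object(b"...image bytes...", "image/jpeg")
print(len(get_object(k)))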
 

BoredSysadmin

Not affiliated with Maxell
Mar 2, 2019
LL0rd said:
I know that ZFS has a replication feature. But as far as I know, the replication only works one way. [...]
Hmm, since you've already tried all the usual suspects, I'm at a loss to suggest something free here. The only other thing I can think of (and it's not free) is Nasuni's solution. I've seen and worked with Pure's ActiveCluster before, and it does a great job with block-level bi-directional sync, but it's probably irrelevant for a non-profit budget.
 

oneplane

Well-Known Member
Jul 23, 2021
Do you really need block-level sync or is it about the files? And what is the I/O scenario on those files?