Zswap vs Zram

Discussion in 'Linux Admins, Storage and Virtualization' started by arglebargle, Aug 14, 2018.

  1. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    634
    Likes Received:
    203
    Has anyone here used zswap on their servers? I've been hunting for information comparing zswap performance to zram (which I've used heavily) but I can't seem to find much beyond high level comparisons between the two.

    I'm interested because zswap avoids the LRU inversion problem that zram seems to suffer from: it evicts least-recently-used pages to backing storage instead of hoarding them in compressed RAM while writing new pages to disk. In theory this should yield better performance once the compressed swap becomes full, and I'd like to start over-committing some of my machines that host guest VMs and containers without absolutely tanking performance during peak mem usage.

    If anyone has experience here I'd love to hear it. I'm also willing to run comparisons between the two myself, but I'm not sure exactly how to design the testing scenario or what metrics to collect to make the comparison beyond, say, screenshots of netdata during testing.
     
    #1
  2. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    966
    Likes Received:
    324
    Don't have any deep experience myself other than some idle testing and also looking for people who've done Real Testing :)

    From what I've read on the matter the key point is whether you think your data set is going to fit inside your zswap allocation (which sits in RAM, don't forget), since when this fills up, zswap will decompress pages and write them out to regular swap. Long story short, it's more useful for short-lived pages, as long-lived pages will be continually swapped in and out from RAM to zswap to swap and back again (along with the added CPU overhead and additional latency). This sort of behaviour might well be especially ruinous for the access patterns of databases or sparse files.
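    For anyone who wants to poke at it, the knobs live under /sys/module/zswap/parameters (names per the kernel docs; needs root, and zswap only caches in front of an existing real swap device). A rough sketch:

```shell
# zswap is a compressed cache in front of existing disc swap, so a real
# swap device must already be active for it to do anything useful.
echo 1   > /sys/module/zswap/parameters/enabled
echo lz4 > /sys/module/zswap/parameters/compressor
echo 20  > /sys/module/zswap/parameters/max_pool_percent  # cap pool at 20% of RAM
grep -R . /sys/module/zswap/parameters/                   # show current settings
```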

    If you twisted my arm I'd try zram first on whatever your workload is, as despite the name it behaves more like a straightforward compressed swap device. Personally I've not found any clever solutions for overcommitment other than a) more RAM or b) crazy-fast flash storage used for swap [and that still suffers from craptacular performance as soon as you run into any significant memory IO; unless you have it lying around or have a non-upgradeable system it's usually best to spend the moolah on the RAM].

    Personally if I was in that situation I'd prefer to try running a regular swap partition (thus eliminating all of these somewhat confusing semi-cache/writeback devices) on a proper compressed block device. I think this is in the pipeline for btrfs (although you could instead use a swap file on a compressed btrfs device) and it's already doable on ZFS/ZoL using a compressed zvol. This way there are no quirks to deal with in the way the virtual memory stack is organised, and you only pay the compression hit when the pages actually hit the disc.
     
    #2
  3. Stephan

    Stephan IT Professional

    Joined:
    Apr 21, 2017
    Messages:
    87
    Likes Received:
    29
    zram user here. Both are hacks to avoid an OOM kill by the kernel, but I like zram a lot more because I can do away with any backing device. For single, loaded VM servers I use a plain swap file these days, sized at 100% of RAM, scaling from 8 GB up to around 32 GB. Just to keep the OOM killer away from the VM processes. Anything else, like a VM, I give zram 25-50% of RAM, split up into #vCPU chunks. If that isn't enough, I either lower the resident set size (RSS) requirements or give the server/VM more RAM, and if that is gone, I buy more RAM or recommend that the client buy more RAM.
     
    #3
  4. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    634
    Likes Received:
    203
    @EffrafaxOfWug @Stephan

    Thanks for the replies, I guess there isn't a lot of data out there on performance. Can either of you think of test scenarios I could use to compare the two? I have a machine I was planning to use as a VM host arriving on Monday; I was thinking I'd strip it down to 4GB of RAM and fire up a couple of instances of ELK in Docker to make some comparisons (each should consume around 4GB). That should at least give some indication of how the two behave when they're pretty deep into swap. I'm not sure how well the ELK working set will compress, but I should be able to fit two instances in that if I allocate ~3GB to a zram swap bucket.

    @Stephan Have you written your own zram scripts or are you using a premade script? I found a pretty hilarious logic flaw in the two or three that have been copy-pasted around everywhere for the last few years and I'd love to point it out. It's truly :facepalm: worthy, I think you'll appreciate it.
     
    #4
  5. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    966
    Likes Received:
    324
    Are VMs running ELK the situation you're looking to test though, or is the workload more mixed than that...? I don't have much experience with ELK myself, but IIRC it'd be a databasey type load not a million miles away from OLTP and so might not gel well with swap even under normal scenarios. I've only really tested LAMP/LAPP stuff in an overcommitment scenario (relatively easy to bench repeatably using scripted httrack from a set of client machines) but the consensus from that was that DB performance cratered as soon as it had less memory to play with than it thought it did.

    P.S. Don't know if you've seen it already, but this guy did some basic "testing" of a number of combinations of zram and zswap (basically just running firefox in an extremely memory-limited environment and seeing what happened);
    Prevent zram LRU inversion with zswap and max_pool_percent = 100
     
    #5
  6. Stephan

    Stephan IT Professional

    Joined:
    Apr 21, 2017
    Messages:
    87
    Likes Received:
    29
    I use a premade script via a package called "systemd-swap" on Arch Linux. See
    Nefelim4ag/systemd-swap for the sources. I just set "zswap_enabled=0 zram_enabled=1" in the config file and leave everything else at the defaults, i.e. the LZ4 compressor, up to 1/4 of RAM for zram, and #CPU streams, i.e. a total of 8 streams on a 4-core CPU with 2 threads per core. Assuming a 2:1 compression ratio for LZ4, a 64GB RAM machine will thus reach its OOM condition 16GB later, at 80GB total resident set size.
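    For reference, the relevant bit of my config is just the two toggles below; everything else stays at the defaults (the path and the commented default are from memory, so treat them as approximate):

```shell
# /etc/systemd/swap.conf (path from memory, may differ between versions)
zswap_enabled=0   # no zswap layer in front
zram_enabled=1    # zram-backed swap only
# default left untouched, shown here for clarity:
# zram_size=$(( RAM_SIZE / 4 ))
```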
     
    #6
    StuartIanNaylor likes this.
  7. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    634
    Likes Received:
    203
    Yeah, I read through that while researching last week. Firing off multiple ELK instances isn't my intended workload, but it should give me an idea of how things behave in the worst-case scenario.

    I think that script makes the same mistake I saw in the others, can you run the following on one of your boxes that uses it? I'll explain what's going on after, it's completely unbelievable that no one has caught this in all the years these scripts have been up on github.
    Code:
    free -h;echo; cat /proc/swaps;echo; zramctl
     
    #7
  8. Stephan

    Stephan IT Professional

    Joined:
    Apr 21, 2017
    Messages:
    87
    Likes Received:
    29
    Code:
    # free -h;echo; cat /proc/swaps;echo; zramctl
                  total        used        free      shared  buff/cache   available
    Mem:           62Gi       9.6Gi       461Mi        98Mi        52Gi        52Gi
    Swap:          15Gi       126Mi        15Gi
    
    Filename                                Type            Size    Used    Priority
    /dev/zram0                              partition       16482368        129536  32767
    
    NAME       ALGORITHM DISKSIZE   DATA COMPR TOTAL STREAMS MOUNTPOINT
    /dev/zram0 lz4          15.7G 125.5M   55M 62.3M       8 [SWAP]
    
     
    #8
  9. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    634
    Likes Received:
    203
    Ok yeah, same problem as the others.

    So the assumption these scripts are making is that they'll allocate some fixed chunk of RAM to ZRAM, in this case 1/4 of ram, and fill it up with data until that chunk is full -- but that's not what they're doing.

    What they're actually doing is allocating a ZRAM volume that can hold 1/4 of ram (16G here) of uncompressed data, which is then compressed and stored in RAM. How much actual ram that data occupies depends on the compression ratio lz4 achieves at the time -- it could be 4:1, 3:1, 2:1 or worst case scenario totally incompressible. Unfortunately ZRAM has no way to know ahead of time what the compression ratio will be, nor does it have any way to resize volumes or resize swap spaces. This means that to accurately allocate a fixed size chunk of ram you need to know ahead of time approximately what your compression ratio will be and multiply your zram device size accordingly.

    It's a really simple logic error that's been passed around between literally every ZRAM setup script I saw while researching. I think someone made the mistake 7 or 8 years ago and no one else noticed; they just copied the same algorithm. Google are the only ones I've seen use ZRAM in the way the scripts intended: every Chromebook ships with ZRAM enabled, with the assumption that the compressor will achieve a 3:1 compression ratio and fill half of total RAM, so they've allocated a ZRAM device sized at (RAM/2)*3.

    I've been a bit more conservative on my machines, I usually size my devices at RAM size with the assumption that the worst case compression result will be 2:1 and the most memory ZRAM will actually occupy will be RAM/2. Most of the time the compression ratio is closer to 3:1.
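    In shell arithmetic, my sizing rule and what it actually costs in RAM looks like this (a standalone sketch; the 16 GiB figure is just an example machine):

```shell
#!/bin/sh
# Worked example of the sizing logic above: disksize is UNCOMPRESSED
# capacity, so actual RAM consumed = disksize / compression ratio.
RAM_KB=16777216                       # example: 16 GiB machine
DISKSIZE_KB=$RAM_KB                   # my rule: disksize = RAM size

# RAM actually consumed when the device is completely full:
WORST_CASE_KB=$(( DISKSIZE_KB / 2 ))  # 2:1 -> device full at RAM/2 used
TYPICAL_KB=$(( DISKSIZE_KB / 3 ))     # 3:1 -> closer to RAM/3 in practice

echo "disksize:   ${DISKSIZE_KB} KiB"
echo "worst case: ${WORST_CASE_KB} KiB of RAM"
echo "typical:    ${TYPICAL_KB} KiB of RAM"
```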

    If you want to actually allocate 1/4 of available RAM on your machines you could edit the script and change the zram device size calculation, but you'll have to estimate your compression ratio. A conservative assumption is 2:1, so the change to the setup script would be:
    Code:
    [ -z "$zram_size" ] && zram_size=$((2*RAM_SIZE/4))
    A 3:1 compression assumption would be:
    Code:
    [ -z "$zram_size" ] && zram_size=$((3*RAM_SIZE/4))
    Honestly, I'm blown away that no one has caught this. I had a laugh about it when I realized what the error was.

    I had the idea earlier to make a number of smaller zram devices of fixed size and run a watchdog script every x minutes/seconds to add/remove devices depending on the current compression ratio. That's about the only way I can think of to dynamically adapt ZRAM to a target ram consumption.
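    A very rough sketch of what that watchdog's core would look like (the ratio helper is pure arithmetic; the commented mm_stat read is per the kernel's zram docs and obviously needs a live zram device):

```shell
#!/bin/sh
# Hypothetical watchdog core: compute the live compression ratio, which
# would drive the decision to add or remove fixed-size zram devices.

# ratio x100 in integer math, from original vs. resident (compressed) bytes
ratio_x100() {
    orig=$1; used=$2
    [ "$used" -gt 0 ] || { echo 0; return; }
    echo $(( orig * 100 / used ))
}

# On a real box the inputs come from /sys/block/zram0/mm_stat, where
# field 1 is orig_data_size and field 3 is mem_used_total:
#   set -- $(cat /sys/block/zram0/mm_stat)
#   r=$(ratio_x100 "$1" "$3")
# Here, example values only (the DATA/TOTAL figures from the output above):
r=$(ratio_x100 125500000 62300000)
echo "ratio x100: $r"
```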

    edit: Here's what the default zram configuration looks like on a chromebook, note the size of total ram and the size of the zram device (1/2 * 3):
    Code:
    localhost ~ # free -h; echo; cat /proc/swaps; echo; zramctl
                 total        used        free      shared  buff/cache   available
    Mem:          7.7Gi       3.5Gi       2.0Gi       536Mi       2.2Gi       3.2Gi
    Swap:          11Gi       5.0Mi        11Gi
    
    Filename                                Type            Size    Used    Priority
    /dev/zram0                              partition       11808196        5188    -1
    
    NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
    /dev/zram0 lzo          11.3G  4.3M  2.1M  2.7M       4 [SWAP]
     
    #9
    Last edited: Aug 19, 2018
  10. StuartIanNaylor

    StuartIanNaylor New Member

    Joined:
    Apr 14, 2019
    Messages:
    3
    Likes Received:
    4
    @arglebargle I have to totally agree, it's mystifying how the Ubuntu zram-config script has been emulated and copied so many times whilst in many respects it's broken.

    Google have it about right with 3:1, but the choice of algorithm is yours, as zstd garners much better compression for a higher CPU hit.
    Disk size in zram is a strange virtual size that has no control over actual RAM usage: already-compressed input can achieve much lower compression, so RAM usage can spiral, or else you shrink the disk size and limit its usefulness.
    For sys-admins, zram does have a mem_limit directive, which most scripts miss, where you can define the maximum actual RAM usage. Just creating large disksizes that will never be used isn't a good idea either: even when empty they carry overhead, as the kernel docs put it, "zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful".
    Scripts also often ignore that zram has been multi-stream since kernel 3.15 and pointlessly create multiple smaller zram devices.
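    For anyone who's never seen the knob, setting it by hand per zram.txt looks roughly like this (run as root; a sketch, not what any particular script does):

```shell
# Manual zram setup with a hard RAM cap, per the kernel's zram docs.
modprobe zram num_devices=1
echo lz4 > /sys/block/zram0/comp_algorithm   # must be set before disksize
echo 4G  > /sys/block/zram0/disksize         # virtual (uncompressed) capacity
echo 1G  > /sys/block/zram0/mem_limit        # hard cap on actual RAM consumed
mkswap /dev/zram0
swapon -p 100 /dev/zram0
```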

    To say zswap avoids the LRU inversion problem is a bit like comparing apples and pears, because as far as I am aware there isn't a single script that employs the writeback policy of zram.
    I think I am right in saying that Google tweak swappiness and page-cluster: swappiness is set much higher, as the swap isn't disk based but near-memcpy-speed compressed memory, and page-cluster is set to zero so single pages are read rather than the default cluster of 8.
    This makes sense as swap is no longer disk based, it's RAM based, and it has very different working parameters.

    I use zram because swap on an SD card just doesn't make sense, and in my application I would never use the writeback cache with zram. I actually did read https://www.kernel.org/doc/Documentation/blockdev/zram.txt and implemented many of its methods, but the writeback cache just didn't make sense; it seems to be an afterthought, poorly implemented and not needed.
    StuartIanNaylor/zram-config does far more than just swap though, as in certain applications high-speed zram-backed drives can increase performance or mitigate NAND block wear.
    In the embedded space, a zram-backed log for SD-based systems with high write counts can extend SD lifetime considerably.
    You can even use zram as the upper RW mount of an OverlayFS and create an ephemeral kiosk device that always reverts on reboot with zero writes to NAND, or a zdir where sync back to persistent storage only happens on stop.

    zram & zswap are apples & pears, and in real usage there is no LRU inversion problem because practically no one employs the writeback cache.

    I am still interested in zswap, as a smaller zswap cache in front of a file-based swap on f2fs could be interesting on flash faster than SD cards, causing much less block wear to the NAND and extending its usable life.
    But for many, with lzo/lz4 and the upcoming lzo-rle, extremely fast near-memcpy swap can be achieved with a 300% gain on the RAM it uses.
    zstd, as said, could push that to 400%, and with directories containing text, much higher ratios can be expected.

    Code:
    pi@raspberrypi:~ $ zramctl
    NAME       ALGORITHM DISKSIZE  DATA  COMPR TOTAL STREAMS MOUNTPOINT
    /dev/zram0 lz4           1.2G    4K    76B    4K       4 [SWAP]
    /dev/zram1 lz4           150M 16.3M  25.1K  208K       4 /opt/zram/zram1
    /dev/zram2 lz4            60M  7.5M   1.2M  1.7M       4 /opt/zram/zram2


    The above is from my lowly Pi 3, but zram-config or any of the utils such as StuartIanNaylor/zramdrive or StuartIanNaylor/log2zram could be used on any architecture.

    zram and zswap are not like for like and are highly unlikely to be employed for the same working parameters. One isn't better than the other, but disk-based swap of any sort is generally a last resort rather than a solution, as when pushed things generally start to grind to a halt.
    With high levels of concurrency and latent pages, maybe swap is the better option, whilst in a performance scenario, non-disk-based zram could be the better solution.

    The scripts that are generally available for zram make very little sense if you actually take the time to read https://www.kernel.org/doc/Documentation/blockdev/zram.txt
     
    #10
    Last edited: Apr 14, 2019
    zxv and arglebargle like this.
  11. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    634
    Likes Received:
    203
    @StuartIanNaylor I'm glad I'm not the only one who read the documentation and spent time scratching their head while reading all of the scripts that copy/paste values from zram-config.

    I have to agree RE: LRU inversion, without a backing device it simply isn't an issue. I think I mistakenly assumed that using a backing device in conjunction with zram was actually a thing people did, but in practice I'm just not seeing it.

    Those are some nice scripts you've written, have you thought about packaging them as debs and making releases available for download? I'm sure you'd get some traffic from the raspberry pi crowd if they could `wget foo && dpkg -i foo` and reduce sd card wear by a significant amount.

    I spent some time writing my own scripts for my Armbian SBCs but ended up retiring them when Armbian included zram support for /var/log and /tmp in the distro by default. Their default values make the usual size and extra device mistakes but work perfectly well after a config file change and a reboot. I raised the default values issue while their scripts were in development but some of the testers managed to break their systems while testing so they stuck with the zram-config defaults. I'm still not sure how they managed that TBH.
     
    #11
  12. StuartIanNaylor

    StuartIanNaylor New Member

    Joined:
    Apr 14, 2019
    Messages:
    3
    Likes Received:
    4
    To be honest, my scripts started out of exasperation, as a "please stop copying and emulating those original 3.14-era scripts" plea to the Ubuntu and Linux communities.
    I just wish Armbian shipped their work as a freer deb repo offering rather than what is a sort of locked-in script foundation, as generally what they produce is of much value to the ARM community.
    I am not really a scripter; that is about as much as I have done since 20 years ago, when MS turned up in my career and slightly derailed things :)
    There are deliberately multiple services with different zram offerings, as it's amazingly lazy and bad coding not to even bother checking for existing zram devices before overwriting them.
    Also, mem_limit, hugely important to sys-admins, is very much part of the workings, and the scripts deliberately use the hot_add and hot_remove methods to show how several services can co-exist, creating and deleting devices without affecting the others.

    My next soapbox is OverlayFS, as there seems to be a complete lack of accompanying tools in the whole Linux ecosphere. The only one is the excellent offering by kmxz/overlayfs-tools, whose author I am trying to arm-twist into adding support for OverlayFS's newer redirect_dir feature.

    I just updated, and I am currently going through each script to tidy up and see if any changes need to be made.
    StuartIanNaylor/log2zram got the OverlayFS treatment last night, and with the merge tool from kmxz/overlayfs-tools, on stop the volatile zram upper is pushed down to the persistent lower.
    Give it a whirl if you would, grab a branch, and feel free to tidy any of my shoddy hacking.

    StuartIanNaylor/zramdrive is going to get the same OverlayFS treatment, as the copy-up CoW of OverlayFS plus zram can provide extremely large and extremely fast directory structures with an extremely small RAM allocation, since writes are often focused on the same small subset of the overall directory.
    Then the writable zram upper is merged down to the persistent lower. But again, it's as much a soapbox project: why are there no official offline tools that merely do what OverlayFS does online?

    Both work fine and are stable but need testing, as it's the service boot and shutdown timing that is always a pain, and I am sure there will be something.

    StuartIanNaylor/zram-swap-config just needs a tidy and it's ready to be wrapped in a deb.
    Please feel free to branch, copy, hack or submit, as really the big distros should be feeding correct methods down rather than the likes of me trying to feed them up.
    If you want to package any of them please feel free; the licence with copyright is only there by request and I would gladly transfer it.
     
    #12
    arglebargle likes this.
  13. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    634
    Likes Received:
    203
    I had exactly the same plan to package and release a new set of zram scripts, but then I got sidetracked by 18 different things and forgot about finishing them :(. By the time I picked it up again Armbian had already bundled functional scripts into their release.

    To be honest Armbian's use and structuring of tools is a nightmare sometimes. Things that should be simple, like "I want to build for this board target, but get my kernel source from this branch/tag" are ridiculously complicated and require reversing their build system to do. They also don't bother to make tags or releases when they build a kernel for release, so rewinding through the repo to get an old version is a treasure hunt through their package release logs to figure out when the kernel was built, and then a best-guess as to where in the kernel commit history to rewind to.

    I shouldn't complain about the work they put in for free, it's just frustrating to jump in with an idea for an improvement and burn out just figuring out the structure and tooling of the project. I like simple, it makes contribution easier for everyone.

    I've been writing scripts since the early 90's when batch files were a thing, and I still wouldn't call myself a "scripter". Your code looks great, FYI.

    Yeah, basically every zram script I've seen makes huge assumptions about what already exists (ie: nothing) and just blasts away with their commands assuming things like /dev/zram# are going to line up with their expectations. That's a big part of why I just retired my own setup/teardown scripts and jumped on the distro bandwagon, working alongside the distro scripts was a pain unless I made sure they ran first on boot and mine stopped first at shutdown so I didn't get in the way of their device number assumptions.

    Fork it and make the changes :D

    I'll check them out, the OverlayFS stuff sounds like it could be really useful combined with zram and I'm not at all familiar with any of the changes in that code in the last few years. I'm kinda deep in a distributed filesystem project right now but I'll test and play around with the scripts when I have time.

    Figuring out Systemd unit start/stop order is a big pain, I had exactly the same problem when I was hacking my own systemd zram solution together. I don't think I saved my copies but rest assured that they were as terrible as you'd expect. At least it's only a pain once though, once you've learned it that won't be an issue :)

    Could you break down the benefits of using zram with overlayFS versus the current periodic rsync method that everyone uses? I think I understand but I've got too much in my head from other projects to read all of the docs and really understand it at the moment.
     
    #13
    Last edited: Apr 19, 2019
  14. StuartIanNaylor

    StuartIanNaylor New Member

    Joined:
    Apr 14, 2019
    Messages:
    3
    Likes Received:
    4
    OverlayFS is a copy-up CoW (copy-on-write) filesystem, so there is no need to copy anything at all on start. On any write, OverlayFS copies the file up to the upper layer, which is zram.
    With OverlayFS there are zero redundant files, and the only memory used is for files that are actually written to.

    Also, the periodic rsync is logically pointless when it comes to logs: if you crash, you have lost everything up to the last sync, which with Log2Ram can be anything up to an hour, so all the critical info is likely to be gone.
    Full system crashes are unlikely, so you can sync on stop and not have to bother copying every file from /var/log, be it live or redundant, as Log2Ram does, pointlessly filling precious memory space.
    Log2Ram writes out complete logs every hour irrespective of size or how many blocks they cover, even if only a single log line was added; when logs are large it's actually debatable whether the writes are any fewer, and they may even be more.
    It depends on log write frequency, but with OverlayFS the copy-up CoW of the filesystem ensures that only live writes are in zram.

    Using OverlayFS so that only writes are in memory vastly reduces memory needs, or greatly extends the amount of log storage for the same memory.
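    The mount itself is just a standard overlay with the upper (writable) layer on a zram-backed filesystem; roughly like this sketch (run as root, and the directory names are illustrative, not what zram-config actually uses):

```shell
# Writes to /var/log land in zram; reads fall through to the flash copy.
mkfs.ext4 /dev/zram1                  # assumes zram1 exists with a disksize set
mkdir -p /opt/zram/zram1
mount /dev/zram1 /opt/zram/zram1
mkdir -p /opt/zram/zram1/upper /opt/zram/zram1/work

mount -t overlay overlay \
    -o lowerdir=/var/log.lower,upperdir=/opt/zram/zram1/upper,workdir=/opt/zram/zram1/work \
    /var/log
# On stop: merge the upper back into lowerdir (e.g. with overlayfs-tools)
# and unmount; only files actually written ever occupied RAM.
```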

    Armbian are brilliant with what they do but yeah I really don't get the framework and manner of deploy.
     
    #14
    Last edited: Apr 18, 2019
    arglebargle likes this.
  15. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    634
    Likes Received:
    203
    Yeah, that actually sounds super useful on an SBC. The current log2ram scripts are a little flakey; I've had journald get confused by the rsync from flash to zram and think it has a lot more space to store history than it actually does. I think it takes the size of the actual storage device and uses 10% of that as its maximum storage size, rather than 10% of the zram device that /var/log is rsync'd to, which causes a mess when you're stuffing incompressible data into a small zram volume.
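    The workaround I've settled on is capping journald explicitly rather than letting it derive 10% of the apparent filesystem size (SystemMaxUse is a standard journald.conf setting; the 40M figure is just an example sized for a small zram /var/log):

```shell
# Pin the journal cap instead of trusting journald's 10%-of-fs guess.
mkdir -p /etc/systemd/journald.conf.d
cat > /etc/systemd/journald.conf.d/size.conf <<'EOF'
[Journal]
SystemMaxUse=40M
EOF
systemctl restart systemd-journald
```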

    I think I'll be trying your scripts sooner than I planned, I just looked at what the Armbian log2zram scripts are doing with /var/log after a few months of uptime and it's pretty bad: I'm sitting at like 90+ MB of compressed journal history in memory for no reason.

    Edit: PR headed your way :)
     
    #15
