Zswap vs Zram

Discussion in 'Linux Admins, Storage and Virtualization' started by arglebargle, Aug 14, 2018.

  1. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    244
    Likes Received:
    75
    Has anyone here used zswap on their servers? I've been hunting for information comparing zswap performance to zram (which I've used heavily), but I can't seem to find much beyond high-level comparisons between the two.

    I'm interested because zswap avoids the LRU inversion problem that zram seems to suffer from: it pages least-recently-used pages out to backing storage instead of hoarding them in compressed RAM while writing new pages to disk. In theory this should yield better performance once the compressed swap becomes full, and I'd like to start over-committing some of my machines that host guest VMs and containers without absolutely tanking performance during peak memory usage.
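
    (For anyone following along, enabling zswap for a test is just a couple of module parameters, assuming the kernel was built with CONFIG_ZSWAP -- a minimal sketch:)
    Code:
    # flip zswap on at runtime via its module parameters
    echo lz4 > /sys/module/zswap/parameters/compressor        # or lzo if lz4 isn't available
    echo 20  > /sys/module/zswap/parameters/max_pool_percent  # pool ceiling, % of RAM
    echo 1   > /sys/module/zswap/parameters/enabled
    # or persistently on the kernel command line:
    #   zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20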

    If anyone has experience here I'd love to hear it. I'm also willing to run comparisons between the two myself, but I'm not sure exactly how to design the testing scenario or what metrics to collect to make the comparison beyond, say, screenshots of netdata during testing.
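
    The best I've managed so far is a dumb sampling loop running alongside whatever load I throw at it -- better ideas welcome (interval and sample count picked arbitrarily):
    Code:
    #!/bin/bash
    # sample swap state every 5 seconds while the test load runs
    for i in $(seq 1 120); do
        date +%s
        grep -E 'MemFree|SwapTotal|SwapFree' /proc/meminfo
        cat /proc/swaps
        zramctl 2>/dev/null
        vmstat 1 2 | tail -1    # si/so columns = memory swapped in/out per second
        sleep 5
    done > swap-samples.log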
     
    #1
  2. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    669
    Likes Received:
    233
    Don't have any deep experience myself other than some idle testing and also looking for people who've done Real Testing :)

    From what I've read on the matter, the key point is whether your data set is going to fit inside your zswap allocation (which sits in RAM, don't forget) -- once that pool fills up, zswap decompresses the oldest pages and writes them out to regular swap. Long story short, it's more useful for short-lived pages, since long-lived pages will be continually shuttled from RAM to zswap to swap and back again (along with the added CPU overhead and extra latency). This sort of behaviour might well be especially ruinous for the access patterns of databases or sparse files.
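
    You can actually watch that writeback happening if debugfs is mounted -- zswap keeps its counters there, and written_back_pages climbing is exactly the pool-full behaviour I mean:
    Code:
    cat /sys/module/zswap/parameters/max_pool_percent   # pool ceiling as a % of RAM
    grep . /sys/kernel/debug/zswap/*                    # pool_total_size, stored_pages, written_back_pages...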

    If you twisted my arm I'd try zram first on whatever your workload is, since despite the name it behaves more like a straightforward compressed swap device than zswap does. Personally I've not found any clever solutions for overcommitment other than a) more RAM or b) crazy-fast flash storage used for swap [and even that suffers from craptacular performance as soon as you run into any significant memory IO; unless you have it lying around or have a non-upgradeable system, it's usually best to spend the moolah on the RAM].
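
    Setting zram up by hand for a test is only a handful of lines anyway (a sketch, assuming the zram module and lz4 support are available; the 8G size is a placeholder):
    Code:
    modprobe zram num_devices=1
    echo lz4 > /sys/block/zram0/comp_algorithm   # must be set before disksize
    echo 8G  > /sys/block/zram0/disksize         # uncompressed capacity, not RAM consumed
    mkswap /dev/zram0
    swapon -p 32767 /dev/zram0                   # high priority so it's used before disk swap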

    Personally, if I were in that situation I'd prefer to try running a regular swap partition (thus eliminating all of these somewhat confusing semi-cache/writeback devices) on a proper compressed block device. I think this is in the pipeline for btrfs (although you could instead use a swap file on a compressed btrfs filesystem) and it's already doable on ZFS/ZoL using a compressed zvol. That way there are no quirks to deal with in the way the virtual memory stack is organised, and you only take the compression hit when the pages actually hit the disc.
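
    On ZFS/ZoL that looks roughly like this (a sketch -- the pool name and 8G size are placeholders, and the volblocksize/sync/logbias settings are the usual swap-on-zvol recommendations):
    Code:
    # compressed zvol used as a swap device
    zfs create -V 8G -b $(getconf PAGESIZE) \
        -o compression=lz4 \
        -o primarycache=metadata \
        -o sync=always \
        -o logbias=throughput \
        tank/swap
    mkswap /dev/zvol/tank/swap
    swapon /dev/zvol/tank/swap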
     
    #2
  3. Stephan

    Stephan IT Professional

    Joined:
    Apr 21, 2017
    Messages:
    70
    Likes Received:
    25
    zram user here. Both are hacks to avoid an OOM kill by the kernel, but I like zram a lot more because I can do away with any backing device. For single, loaded VM servers I use a plain swap file these days, sized at 100% of RAM, scaling from 8 GB up to a cap of around 32 GB -- just to keep the OOM killer away from the VM processes. Anything else, like a VM itself, gets zram at 25-50% of RAM, split into #vCPU chunks. If that isn't enough, I either lower the resident set size (RSS) requirements or give the server/VM more RAM, and if that headroom is gone, I buy more RAM or recommend the client buy more RAM.
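
    Nothing fancy behind the swap file, for the record (a sketch; the 16G count is just a stand-in for whatever matches the RAM in the box):
    Code:
    dd if=/dev/zero of=/swapfile bs=1M count=16384   # 16G, sized to match RAM
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo '/swapfile none swap defaults 0 0' >> /etc/fstab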
     
    #3
  4. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    244
    Likes Received:
    75
    @EffrafaxOfWug @Stephan

    Thanks for the replies, I guess there isn't a lot of data out there on performance. Can either of you think of test scenarios I could use to compare the two? I have a machine I was planning to use as a VM host arriving on Monday; I was thinking I'd strip it down to 4GB of RAM and fire up a couple of instances of ELK in Docker to make some comparisons (each should consume around 4GB). That should at least give some indication of how the two behave when they're pretty deep into swap. I'm not sure how well the ELK working set will compress, but I should be able to fit two instances in that if I allocate ~3GB to a zram swap bucket.
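
    Roughly what I have in mind, in case anyone wants to poke holes in it (a sketch -- mem= on the kernel command line to strip the host down without pulling DIMMs, and the sebp/elk all-in-one image is just an assumption, I may end up running the pieces separately):
    Code:
    # host: cap usable memory at 4G via the kernel command line
    #   mem=4G

    # two ELK stacks, enough combined working set to push well into swap
    docker run -d --name elk1 sebp/elk
    docker run -d --name elk2 sebp/elk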

    @Stephan Have you written your own zram scripts or are you using a premade one? I found a pretty hilarious logic flaw in the two or three that have been copy-pasted around everywhere for the last few years and I'd love to point it out. It's truly :facepalm: worthy -- I think you'll appreciate it.
     
    #4
  5. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    669
    Likes Received:
    233
    Are VMs running ELK the situation you're looking to test, though, or is the workload more mixed than that...? I don't have much experience with ELK myself, but IIRC it'd be a databasey type of load not a million miles away from OLTP, and so might not gel well with swap even under normal circumstances. I've only really tested LAMP/LAPP stuff in an overcommitment scenario (relatively easy to bench repeatably using scripted httrack from a set of client machines), but the takeaway from that was that DB performance cratered as soon as it had less memory to play with than it thought it did.
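
    The httrack side was nothing clever, roughly this from each client box (a sketch; the URL, mirror path and run count are placeholders):
    Code:
    #!/bin/bash
    # crude repeatable load: mirror the same site a few times and record wall-clock time
    site="http://lamp-test.example.com/"
    for run in 1 2 3; do
        rm -rf /tmp/mirror
        /usr/bin/time -f "run $run: %e s elapsed, %M KB max RSS" \
            httrack "$site" -O /tmp/mirror
    done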

    P.S. Don't know if you've seen it already, but this guy did some basic "testing" of a number of combinations of zram and zswap (basically just running Firefox in an extremely memory-limited environment and seeing what happened):
    Prevent zram LRU inversion with zswap and max_pool_percent = 100
     
    #5
  6. Stephan

    Stephan IT Professional

    Joined:
    Apr 21, 2017
    Messages:
    70
    Likes Received:
    25
    I use a premade script via a package called "systemd-swap" on Arch Linux. See
    Nefelim4ag/systemd-swap for some sources. I just set "zswap_enabled=0 zram_enabled=1" in the config file and leave everything else at the defaults, i.e. the LZ4 compressor, up to 1/4 of RAM for zram, and #CPU streams -- on a 4-core CPU with 2 threads per core, that's 8 streams in total. Assuming a 2:1 compression ratio from LZ4, a 64GB RAM machine will thus reach its OOM condition 16GB later, at 80GB of total resident set size.
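
    For reference, the whole change is two lines in the config file (a sketch; the path should be /etc/systemd/swap.conf if memory serves, everything else left at the shipped defaults):
    Code:
    # /etc/systemd/swap.conf (excerpt)
    zswap_enabled=0
    zram_enabled=1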
     
    #6
  7. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    244
    Likes Received:
    75
    Yeah, I read through that while researching last week. Firing off multiple ELK instances isn't my intended workload, but it should give me an idea of how things behave in a worst-case scenario.

    I think that script makes the same mistake I saw in the others; can you run the following on one of your boxes that uses it? I'll explain what's going on after. It's completely unbelievable that no one has caught this in all the years these scripts have been up on GitHub.
    Code:
    free -h;echo; cat /proc/swaps;echo; zramctl
     
    #7
  8. Stephan

    Stephan IT Professional

    Joined:
    Apr 21, 2017
    Messages:
    70
    Likes Received:
    25
    Code:
    # free -h;echo; cat /proc/swaps;echo; zramctl
                  total        used        free      shared  buff/cache   available
    Mem:           62Gi       9.6Gi       461Mi        98Mi        52Gi        52Gi
    Swap:          15Gi       126Mi        15Gi
    
    Filename                                Type            Size    Used    Priority
    /dev/zram0                              partition       16482368        129536  32767
    
    NAME       ALGORITHM DISKSIZE   DATA COMPR TOTAL STREAMS MOUNTPOINT
    /dev/zram0 lz4          15.7G 125.5M   55M 62.3M       8 [SWAP]
    
     
    #8
  9. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    244
    Likes Received:
    75
    Ok yeah, same problem as the others.

    So the assumption these scripts are making is that they'll allocate some fixed chunk of RAM to ZRAM, in this case 1/4 of RAM, and fill it up with data until that chunk is full -- but that's not what they're doing.

    What they're actually doing is allocating a ZRAM volume that can hold 1/4 of RAM's worth (16G here) of uncompressed data, which is then compressed and stored in RAM. How much actual RAM that data occupies depends on the compression ratio lz4 achieves at the time -- it could be 4:1, 3:1, 2:1, or, in the worst case, totally incompressible. Unfortunately ZRAM has no way to know ahead of time what the compression ratio will be, nor can it resize a volume or swap space once it's in use. This means that to accurately allocate a fixed-size chunk of RAM you need to know roughly what your compression ratio will be ahead of time and multiply your zram device size accordingly.
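
    In other words the sizing has to run backwards from a RAM budget and a guessed ratio, something like this (numbers match the 64G example above; the device has to be unused/reset before disksize can be written):
    Code:
    target_ram_mb=16384                         # actual RAM you're willing to hand to ZRAM (1/4 of 64G)
    ratio=2                                     # assumed compression ratio (2 for 2:1, 3 for 3:1)
    disksize_mb=$(( target_ram_mb * ratio ))    # 32768M of uncompressed capacity
    echo "${disksize_mb}M" > /sys/block/zram0/disksize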

    It's a really simple logic error that's been passed around between literally every ZRAM setup script I saw while researching. I think someone made the mistake 7 or 8 years ago and no one else noticed; they just copied the same algorithm. Google are the only ones I've seen use ZRAM the way the scripts intended: every Chromebook ships with ZRAM enabled, with the assumption that the compressor will achieve a 3:1 ratio and fill half of total RAM, so they've allocated a ZRAM device sized at (RAM/2)*3.

    I've been a bit more conservative on my machines: I usually size my devices at RAM size, on the assumption that the worst-case compression result will be 2:1 and the most memory ZRAM will actually occupy will be RAM/2. Most of the time the compression ratio is closer to 3:1.

    If you want to actually allocate 1/4 of available RAM on your machines you could edit the script and change the zram device size calculation, but you'll have to estimate your compression ratio. A conservative assumption is 2:1, so the change to the setup script would be:
    Code:
    [ -z "$zram_size" ] && zram_size=$((2*RAM_SIZE/4))
    A 3:1 compression assumption would be:
    Code:
    [ -z "$zram_size" ] && zram_size=$((3*RAM_SIZE/4))
    Honestly, I'm blown away that no one has caught this. I had a laugh about it when I realized what the error was.

    I had the idea earlier to make a number of smaller zram devices of fixed size and run a watchdog script every X minutes or seconds to add or remove devices depending on the current compression ratio. That's about the only way I can think of to dynamically adapt ZRAM to a target RAM consumption.
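
    Something like this is what I'm picturing -- read the real RAM usage out of /sys/block/zram*/mm_stat and only hot-add another fixed-size device while the total is still under a budget (very much a sketch; hot_add/hot_remove need a reasonably recent kernel, and the budget/chunk numbers are placeholders):
    Code:
    #!/bin/bash
    budget=$(( 16 * 1024 * 1024 * 1024 ))   # let zram consume at most 16G of actual RAM
    chunk=4G                                # uncompressed disksize of each extra device

    # mm_stat column 3 = mem_used_total (bytes of RAM the device really occupies)
    used=$(awk '{ sum += $3 } END { print sum + 0 }' /sys/block/zram*/mm_stat 2>/dev/null)

    if [ "$used" -lt "$budget" ]; then
        id=$(cat /sys/class/zram-control/hot_add)   # allocates /dev/zram$id
        echo lz4      > "/sys/block/zram$id/comp_algorithm"
        echo "$chunk" > "/sys/block/zram$id/disksize"
        mkswap "/dev/zram$id" && swapon -p 32767 "/dev/zram$id"
    fi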

    edit: Here's what the default zram configuration looks like on a Chromebook; note the size of total RAM and the size of the zram device (RAM/2 * 3):
    Code:
    localhost ~ # free -h; echo; cat /proc/swaps; echo; zramctl
                 total        used        free      shared  buff/cache   available
    Mem:          7.7Gi       3.5Gi       2.0Gi       536Mi       2.2Gi       3.2Gi
    Swap:          11Gi       5.0Mi        11Gi
    
    Filename                                Type            Size    Used    Priority
    /dev/zram0                              partition       11808196        5188    -1
    
    NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
    /dev/zram0 lzo          11.3G  4.3M  2.1M  2.7M       4 [SWAP]
     
    #9
    Last edited: Aug 19, 2018
