Ceph performance/stability issues in Proxmox


dassic · New Member · Mar 1, 2024
Hi!

So I wonder if anyone here has an idea about the root cause of what I'm seeing lately with an 11-node Proxmox cluster.
  • It's running Proxmox 8.1.4 with Ceph Reef (18.2.1).
  • Servers are Dell R730s, each with 2 x 2630 CPUs and 128 GB or more of RAM.
  • 9 of the 11 servers have 8 x 2 TB SATA drives configured as JBOD (non-RAID) plus a 4 TB PCIe NVMe drive; the remaining 2 have only boot drives. RAID is disabled on the PERC controller, and patrol read is also disabled.
  • Network is 2 x 10 Gbps (LACP bond with layer 3+4 hashing) across 2 Arista switches in MLAG, plus 2 x 1 Gbps (bonded in balance-alb) across a Cisco switch.
  • Both networks are configured in Proxmox, with the 10 Gbps bond as primary and the 1 Gbps bond as backup (for corosync etc.).
  • The NVMe drive is configured as the DB device for the OSDs to speed up the spinning-disk performance (a creation sketch follows this list). Everything else related to Ceph is left at default parameters, and PGs are set to autoscale.
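
For reference, a minimal sketch of how an OSD with a separate NVMe DB device is typically created on Proxmox. The device paths below are placeholders, and the exact option names should be double-checked against pveceph osd create --help on your version:

Code:
# Hypothetical device names: /dev/sdb = one of the 2 TB SATA disks, /dev/nvme0n1 = the shared NVMe.
# Create the OSD with its RocksDB (block.db) carved out of the NVMe device.
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1

# Roughly equivalent plain Ceph form, using a pre-created partition/LV for the DB:
# ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1

# Verify where each OSD's block.db actually ended up:
ceph-volume lvm list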

The cluster runs about 120 VMs, mostly Linux. VM disks are configured as RBD images.

Generally this setup runs OK, but certain tasks such as the daily backup (to PBS) or other larger I/O jobs will on occasion trigger storage timeouts in some VMs (causing EXT4 filesystem panics), timeouts on one or more OSDs, and in some cases even a timeout on a whole node. Ceph also reports OSD slow ops and similar warnings.
When this happens we're never near any limit on the network, and even disk I/O seems fairly modest for a cluster with this many spindles and NVMes.
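
For reference, these are the kinds of checks that can show where the slow ops are actually piling up while it happens; the OSD ID below is just an example:

Code:
# Overall health with the specific slow-op / inactive-PG details
ceph health detail

# Per-OSD commit/apply latency snapshot
ceph osd perf

# On the node hosting a slow OSD (example: osd.12): ops currently stuck in flight
ceph daemon osd.12 dump_ops_in_flight

# Recently completed slow ops with their per-stage timing
ceph daemon osd.12 dump_historic_slow_ops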

If I move the VMs off Ceph onto external NFS storage (backed by ZFS, not Ceph), everything runs fine, even during backups, so the issue appears to be centered on the Ceph setup.

Any suggestions?

Below is a bit of info (captured while recovering from one of these incidents, so some OSDs are down / resyncing).

Code:
root@proxmox-02:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE   DATA      OMAP     META     AVAIL     %USE   VAR   PGS  STATUS
48    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
49    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
50    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
51    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
52    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
53    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
54    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
55    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
 0    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   948 GiB   25 KiB  3.3 GiB   915 GiB  55.37  1.11   36      up
 1    hdd  2.00130   1.00000  2.0 TiB   1.2 TiB  1016 GiB   24 KiB  3.1 GiB   847 GiB  58.67  1.18   40      up
 2    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   908 GiB   26 KiB  3.2 GiB   955 GiB  53.40  1.07   34      up
 3    hdd  2.00130   1.00000  2.0 TiB   901 GiB   715 GiB   19 KiB  3.1 GiB   1.1 TiB  43.99  0.89   30      up
 4    hdd  2.00130   1.00000  2.0 TiB  1012 GiB   826 GiB   13 KiB  2.5 GiB   1.0 TiB  49.38  0.99   32      up
 5    hdd  2.00130   1.00000  2.0 TiB   946 GiB   760 GiB   12 KiB  2.4 GiB   1.1 TiB  46.15  0.93   31      up
 6    hdd  2.00130   1.00000  2.0 TiB   951 GiB   765 GiB   26 KiB  2.8 GiB   1.1 TiB  46.41  0.93   34      up
 7    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   964 GiB   27 KiB  2.6 GiB   899 GiB  56.14  1.13   39      up
 8    hdd  2.00130   1.00000  2.0 TiB  1011 GiB   824 GiB   29 KiB  3.2 GiB   1.0 TiB  49.32  0.99   33      up
 9    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   855 GiB   14 KiB  3.1 GiB  1008 GiB  50.82  1.02   36      up
10    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   933 GiB   24 KiB  3.1 GiB   930 GiB  54.61  1.10   38      up
11    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   906 GiB   25 KiB  3.3 GiB   957 GiB  53.31  1.07   33      up
12    hdd  2.00130   1.00000  2.0 TiB   984 GiB   798 GiB   22 KiB  3.2 GiB   1.0 TiB  48.02  0.97   33      up
13    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   845 GiB   21 KiB  2.6 GiB  1018 GiB  50.33  1.01   36      up
14    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   857 GiB   21 KiB  2.4 GiB  1006 GiB  50.89  1.02   36      up
15    hdd  2.00130   1.00000  2.0 TiB   939 GiB   753 GiB   27 KiB  1.9 GiB   1.1 TiB  45.83  0.92   31      up
16    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   893 GiB   14 KiB  3.3 GiB   970 GiB  52.69  1.06   39      up
17    hdd  2.00130   1.00000  2.0 TiB  1014 GiB   828 GiB   22 KiB  3.3 GiB   1.0 TiB  49.48  1.00   32      up
18    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   932 GiB   30 KiB  3.2 GiB   931 GiB  54.55  1.10   36      up
19    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   903 GiB   40 KiB  3.4 GiB   960 GiB  53.15  1.07   36      up
20    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   967 GiB   18 KiB  3.4 GiB   896 GiB  56.28  1.13   39      up
21    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   859 GiB   27 KiB  2.8 GiB  1004 GiB  51.01  1.03   33      up
22    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   918 GiB   17 KiB  3.5 GiB   945 GiB  53.90  1.09   36      up
23    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   936 GiB   17 KiB  3.2 GiB   927 GiB  54.77  1.10   39      up
24    hdd  2.00130   1.00000  2.0 TiB   968 GiB   781 GiB   29 KiB  3.4 GiB   1.1 TiB  47.22  0.95   35      up
25    hdd  2.00130   1.00000  2.0 TiB   893 GiB   706 GiB   25 KiB  3.0 GiB   1.1 TiB  43.57  0.88   29      up
26    hdd  2.00130   1.00000  2.0 TiB   847 GiB   660 GiB   24 KiB  2.9 GiB   1.2 TiB  41.32  0.83   25      up
27    hdd  2.00130   1.00000  2.0 TiB  1015 GiB   829 GiB   20 KiB  3.1 GiB   1.0 TiB  49.53  1.00   36      up
28    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   881 GiB   26 KiB  3.1 GiB   982 GiB  52.09  1.05   32      up
29    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   847 GiB   25 KiB  3.3 GiB  1016 GiB  50.42  1.02   37      up
30    hdd  2.00130   1.00000  2.0 TiB   1.1 TiB   909 GiB   12 KiB  3.1 GiB   954 GiB  53.46  1.08   37      up
56    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0   17      up
57    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    8      up
58    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    5      up
59    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    8      up
60    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0   11      up
61    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
62    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0   15      up
63    hdd  2.00130         0      0 B       0 B       0 B      0 B      0 B       0 B      0     0    0    down
31    hdd  2.00130   1.00000  2.0 TiB   963 GiB   777 GiB   28 KiB  2.7 GiB   1.1 TiB  47.00  0.95   31      up
32    hdd  2.00130   1.00000  2.0 TiB  1011 GiB   825 GiB   11 KiB  3.2 GiB   1.0 TiB  49.35  0.99   33      up
33    hdd  2.00130   1.00000  2.0 TiB   928 GiB   742 GiB   22 KiB  3.1 GiB   1.1 TiB  45.29  0.91   29      up
34    hdd  2.00130   1.00000  2.0 TiB   911 GiB   725 GiB   28 KiB  3.0 GiB   1.1 TiB  44.46  0.90   30      up
35    hdd  2.00130   1.00000  2.0 TiB  1009 GiB   822 GiB   15 KiB  2.7 GiB   1.0 TiB  49.21  0.99   34      up
36    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   881 GiB   25 KiB  3.2 GiB   982 GiB  52.06  1.05   38      up
37    hdd  2.00130   1.00000  2.0 TiB   969 GiB   783 GiB   19 KiB  3.1 GiB   1.1 TiB  47.30  0.95   35      up
38    hdd  2.00130   1.00000  2.0 TiB   996 GiB   809 GiB   28 KiB  2.6 GiB   1.0 TiB  48.58  0.98   33      up
40    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   842 GiB   19 KiB  3.3 GiB  1021 GiB  50.15  1.01   33      up
41    hdd  2.00130   1.00000  2.0 TiB   908 GiB   722 GiB   17 KiB  2.7 GiB   1.1 TiB  44.32  0.89   29      up
42    hdd  2.00130   1.00000  2.0 TiB   985 GiB   798 GiB   15 KiB  3.0 GiB   1.0 TiB  48.05  0.97   31      up
43    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   855 GiB   18 KiB  3.3 GiB  1008 GiB  50.80  1.02   39      up
44    hdd  2.00130   1.00000  2.0 TiB   828 GiB   641 GiB   28 KiB  2.9 GiB   1.2 TiB  40.39  0.81   25      up
45    hdd  2.00130   1.00000  2.0 TiB   853 GiB   666 GiB   23 KiB  2.8 GiB   1.2 TiB  41.60  0.84   25      up
46    hdd  2.00130   1.00000  2.0 TiB   1.0 TiB   880 GiB   18 KiB  3.4 GiB   983 GiB  52.05  1.05   34      up
47    hdd  2.00130   1.00000  2.0 TiB   983 GiB   796 GiB   22 KiB  3.2 GiB   1.0 TiB  47.95  0.97   31      up
                       TOTAL   94 TiB    47 TiB    38 TiB  1.0 MiB  142 GiB    47 TiB  49.67                   
MIN/MAX VAR: 0.81/1.18  STDDEV: 4.15


root@proxmox-02:~# dstat -cldn
--total-cpu-usage-- ---load-avg--- -dsk/total- -net/total-
usr sys idl wai stl| 1m   5m  15m | read  writ| recv  send
  6   2  92   0   0|4.72 4.00 3.40|  22M   14M|   0     0
  4   2  94   0   0|4.42 3.95 3.38|  84M   80M| 180M  146M
  5   2  93   0   0|4.42 3.95 3.38|  85M   75M| 148M  160M
  4   2  94   0   0|4.42 3.95 3.38| 103M   59M| 155M  192M
  5   2  92   1   0|4.42 3.95 3.38| 104M   88M| 141M  213M
  5   2  92   0   0|4.42 3.95 3.38| 105M   82M| 168M  210M
  6   2  92   0   0|4.31 3.93 3.38| 107M   71M| 158M  213M
  5   2  92   0   0|4.31 3.93 3.38| 101M   75M| 157M  223M
  4   1  94   0   0|4.31 3.93 3.38| 121M   77M| 145M  219M
  4   2  94   0   0|4.31 3.93 3.38| 110M   76M| 143M  245M
  5   2  93   1   0|4.31 3.93 3.38|  92M   78M| 154M  198M
  5   1  94   0   0|4.12 3.90 3.37| 129M   76M| 160M  236M
  4   2  94   0   0|4.12 3.90 3.37| 118M   82M| 164M  259M
  4   1  94   0   0|4.12 3.90 3.37| 109M   83M| 163M  229M
  5   2  93   0   0|4.12 3.90 3.37| 123M   77M| 162M  244M
  6   2  91   1   0|4.12 3.90 3.37| 123M   86M| 162M  258M
  6   2  92   0   0|4.19 3.92 3.38| 121M   82M|  81M  116M
  4   2  94   0   0|4.19 3.92 3.38| 121M   72M| 157M  243M
  5   2  93   0   0|4.19 3.92 3.38| 116M   86M| 169M  246M
  4   2  94   0   0|4.19 3.92 3.38| 126M   77M| 162M  260M
  5   2  92   1   0|4.19 3.92 3.38|  87M   82M| 159M  210M
  4   2  94   0   0|4.10 3.90 3.38| 108M   81M| 160M  234M
  5   1  94   0   0|4.10 3.90 3.38| 122M   81M| 161M  208M
  5   1  94   0   0|4.10 3.90 3.38| 102M   79M| 156M  236M
  5   2  93   0   0|4.10 3.90 3.38| 110M   84M| 171M  228M
  6   2  91   0   0|4.10 3.90 3.38| 107M   78M| 152M  217M
  6   2  91   1   0|3.85 3.86 3.37| 110M   91M| 176M  227M
  5   1  94   0   0|3.85 3.86 3.37|  97M   85M| 166M  205
 
  root@proxmox-02:~# pveceph status
  cluster:
    id:     f6706837-39f2-4e5d-adde-176269859e22
    health: HEALTH_WARN
            Reduced data availability: 34 pgs inactive, 34 pgs peering
            Degraded data redundancy: 556816/11821971 objects degraded (4.710%), 93 pgs degraded, 93 pgs undersized
            282 slow ops, oldest one blocked for 726 sec, daemons [osd.0,osd.11,osd.12,osd.19,osd.2,osd.21,osd.22,osd.3,osd.31,osd.33]... have slow ops.
 
  services:
    mon: 8 daemons, quorum proxmox-02,proxmox-03,proxmox-04,proxmox-05,proxmox-08,proxmox-09,proxmox-01,proxmox-07 (age 4h)
    mgr: proxmox-03(active, since 28h), standbys: proxmox-09, proxmox-07, proxmox-05, proxmox-04, proxmox-08, proxmox-02, proxmox-01
    mds: 1/1 daemons up, 6 standby
    osd: 63 osds: 51 up (since 87s), 47 in (since 18m); 143 remapped pgs
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 577 pgs
    objects: 3.94M objects, 15 TiB
    usage:   47 TiB used, 47 TiB / 94 TiB avail
    pgs:     6.239% pgs not active
             556816/11821971 objects degraded (4.710%)
             116363/11821971 objects misplaced (0.984%)
             429 active+clean
             70  active+undersized+degraded+remapped+backfill_wait
             32  remapped+peering
             23  active+undersized+degraded+remapped+backfilling
             18  active+remapped+backfill_wait
             4   peering
             1   active+clean+scrubbing
 
  io:
    client:   0 B/s rd, 145 KiB/s wr, 0 op/s rd, 19 op/s wr
    recovery: 662 MiB/s, 181 objects/s
 

ano · Well-Known Member · Nov 7, 2022
You're using enterprise 3.84 TB drives, not 4 TB consumer ones? Just checking.
4 TB consumer = your life will be sad.
 

mrpasc · Well-Known Member · Jan 8, 2022 · Munich, Germany
dassic said:
RAID is disabled on the PERC controller. Patrol read is also disabled.
Do yourself a favour and replace those PERCs with real HBAs; go for any LSI SAS3008-based one. PERCs (I assume those are H330/H730/H730P?) are very limited in I/O because of a crippled queue depth when in non-RAID mode.
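
A quick way to check whether the controller is the choke point is to compare the per-disk queue depth the kernel actually sees with what a proper HBA exposes; the device name below is an example:

Code:
# Identify the storage controller in use
lspci | grep -i -E 'raid|sas'

# Queue depth the kernel was given for one of the OSD disks.
# PERCs in non-RAID mode often report a very small value here,
# while SAS3008-based HBAs in IT mode typically expose a much larger one.
cat /sys/block/sdb/device/queue_depth

# Current I/O scheduler and request queue size for the same disk
cat /sys/block/sdb/queue/scheduler /sys/block/sdb/queue/nr_requests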
 

rtech · Active Member · Jun 2, 2021
CAPACITORS!
One important thing to note is that all writes in Ceph are transactional, even ones that aren't specifically requested to be. It means that write operations do not complete until they are written into all OSD journals and fsync()'ed to disks. This is to prevent RAID-write-hole-like situations.

To make it more clear this means that Ceph does not use any drive write buffers. It does quite the opposite — it clears all buffers after each write. It doesn’t mean that there’s no write buffering at all — there is some on the client side (RBD cache, Linux page cache inside VMs). But internal disk write buffers aren’t used.
source:

So, are those consumer SSDs?
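
A quick, non-invasive way to check what is actually behind each OSD is to ask Ceph for the device models it tracks, then cross-check one drive with smartctl on the node that owns it (device path is an example):

Code:
# List devices known to Ceph, with model names and the daemons using them
ceph device ls

# Cross-check a specific drive's model and firmware
smartctl -i /dev/nvme0n1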
 

dassic · New Member · Mar 1, 2024
Thanks for the replies.
So it turns out that some, but not all, of the SSDs were consumer drives without PLP, causing rather unpredictable behavior. The task will be to get that aligned so that all of them are enterprise drives with PLP.
As for the PERC controllers, yes, they're kinda slow. That may be the next thing to look at: replacing them with something non-RAID with better performance for pure JBOD.
 

iGene · Member · Jun 15, 2014 · Taiwan
PLP is very important for Ceph as it uses fsync when writing. Without PLP the write goes directly to NAND; with a consumer drive that is something like 600-800 IOPS for 4k writes (the fio sketch below shows one way to measure this).

The read/write speeds quoted in consumer drive datasheets are usually measured with caching.
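
For anyone who wants to measure this on their own drives, here is a minimal fio sketch for synced 4k write IOPS, which is roughly the pattern Ceph's DB/WAL path generates. It writes to a test file (path is an example) rather than the raw device, so it is non-destructive, at the cost of being only an approximation:

Code:
fio --name=plp-check \
    --filename=/root/fio-testfile --size=1G \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --direct=1 --fsync=1 \
    --runtime=60 --time_based --group_reporting

Drives with PLP typically report thousands of IOPS or more in this test, while consumer drives without it tend to land in the few-hundred range mentioned above.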
 