Hi!
So I wonder if anyone here has an idea about the root cause of what I'm seeing lately with an 11-node Proxmox cluster.
The cluster runs about 120 VMs, mostly Linux. VM disks are configured as RBD images.
Generally this setup runs fine, but certain tasks such as the daily backup (to PBS) or other large I/O jobs will occasionally trigger storage timeouts on some VMs (causing EXT4 filesystem panics), timeouts on one or more OSDs, and in some cases even a timeout on an entire node. Ceph also reports OSD slow ops and similar warnings.
When this happens the network is nowhere near any limit, and even disk I/O seems fairly modest for a cluster with this many spindles and NVMes.
If I move the VMs off Ceph to external NFS storage (backed by ZFS, not Ceph), everything runs fine, including during backup, so the issue appears to be centered on the Ceph setup.
Any suggestions?
Below is a bit of info (captured while recovering from one of these incidents, so some OSDs are down / resyncing).
- It's running Proxmox VE 8.1.4 with Ceph Reef (18.2.1).
- Servers are Dell R730, 2 x 2630 CPUs and 128 GB or more RAM each.
- 9 of the 11 servers have 8 x 2 TB SATA drives configured as JBOD (non-RAID), plus a 4 TB PCIe NVMe drive each. The remaining 2 have only boot drives. RAID is disabled on the PERC controller; patrol read is also disabled.
- Network is 2 x 10 Gbps (bonded with LACP, layer 3+4 hash policy) across 2 Arista switches in MLAG, and 2 x 1 Gbps (bonded in balance-alb) across a Cisco switch.
- Both networks are configured in Proxmox, with the 10 Gbps bond as primary and the 1 Gbps bond as backup (for corosync etc.).
- The NVMe drive is configured as the DB device for the OSDs to speed up spinning-disk performance. Everything else in Ceph is left at default parameters; PGs are set to autoscale.
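One thing I've been wondering about, since everything is at defaults: the status output below shows recovery running at ~662 MiB/s while client I/O is nearly zero, so maybe recovery/backfill is starving client traffic on the HDD OSDs. A sketch of what I'm considering trying in ceph.conf (illustrative values, not my current config, which is all defaults):

```
# Hypothetical tuning sketch, NOT what's currently set.
# Reef uses the mClock scheduler by default; the profile below biases
# scheduling toward client I/O over recovery/backfill:
[osd]
osd_mclock_profile = high_client_ops
# Classic throttles (as I understand it, mClock may ignore these in Reef
# unless recovery overrides are explicitly enabled):
osd_max_backfills = 1
osd_recovery_max_active_hdd = 1
```

If anyone knows whether these still apply under mClock in Reef, or whether the defaults should already prevent recovery from stalling client I/O like this, that would be helpful.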
Code:
root@proxmox-02:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
48 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
49 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
50 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
51 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
52 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
53 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
54 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
55 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
0 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 948 GiB 25 KiB 3.3 GiB 915 GiB 55.37 1.11 36 up
1 hdd 2.00130 1.00000 2.0 TiB 1.2 TiB 1016 GiB 24 KiB 3.1 GiB 847 GiB 58.67 1.18 40 up
2 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 908 GiB 26 KiB 3.2 GiB 955 GiB 53.40 1.07 34 up
3 hdd 2.00130 1.00000 2.0 TiB 901 GiB 715 GiB 19 KiB 3.1 GiB 1.1 TiB 43.99 0.89 30 up
4 hdd 2.00130 1.00000 2.0 TiB 1012 GiB 826 GiB 13 KiB 2.5 GiB 1.0 TiB 49.38 0.99 32 up
5 hdd 2.00130 1.00000 2.0 TiB 946 GiB 760 GiB 12 KiB 2.4 GiB 1.1 TiB 46.15 0.93 31 up
6 hdd 2.00130 1.00000 2.0 TiB 951 GiB 765 GiB 26 KiB 2.8 GiB 1.1 TiB 46.41 0.93 34 up
7 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 964 GiB 27 KiB 2.6 GiB 899 GiB 56.14 1.13 39 up
8 hdd 2.00130 1.00000 2.0 TiB 1011 GiB 824 GiB 29 KiB 3.2 GiB 1.0 TiB 49.32 0.99 33 up
9 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 855 GiB 14 KiB 3.1 GiB 1008 GiB 50.82 1.02 36 up
10 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 933 GiB 24 KiB 3.1 GiB 930 GiB 54.61 1.10 38 up
11 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 906 GiB 25 KiB 3.3 GiB 957 GiB 53.31 1.07 33 up
12 hdd 2.00130 1.00000 2.0 TiB 984 GiB 798 GiB 22 KiB 3.2 GiB 1.0 TiB 48.02 0.97 33 up
13 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 845 GiB 21 KiB 2.6 GiB 1018 GiB 50.33 1.01 36 up
14 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 857 GiB 21 KiB 2.4 GiB 1006 GiB 50.89 1.02 36 up
15 hdd 2.00130 1.00000 2.0 TiB 939 GiB 753 GiB 27 KiB 1.9 GiB 1.1 TiB 45.83 0.92 31 up
16 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 893 GiB 14 KiB 3.3 GiB 970 GiB 52.69 1.06 39 up
17 hdd 2.00130 1.00000 2.0 TiB 1014 GiB 828 GiB 22 KiB 3.3 GiB 1.0 TiB 49.48 1.00 32 up
18 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 932 GiB 30 KiB 3.2 GiB 931 GiB 54.55 1.10 36 up
19 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 903 GiB 40 KiB 3.4 GiB 960 GiB 53.15 1.07 36 up
20 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 967 GiB 18 KiB 3.4 GiB 896 GiB 56.28 1.13 39 up
21 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 859 GiB 27 KiB 2.8 GiB 1004 GiB 51.01 1.03 33 up
22 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 918 GiB 17 KiB 3.5 GiB 945 GiB 53.90 1.09 36 up
23 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 936 GiB 17 KiB 3.2 GiB 927 GiB 54.77 1.10 39 up
24 hdd 2.00130 1.00000 2.0 TiB 968 GiB 781 GiB 29 KiB 3.4 GiB 1.1 TiB 47.22 0.95 35 up
25 hdd 2.00130 1.00000 2.0 TiB 893 GiB 706 GiB 25 KiB 3.0 GiB 1.1 TiB 43.57 0.88 29 up
26 hdd 2.00130 1.00000 2.0 TiB 847 GiB 660 GiB 24 KiB 2.9 GiB 1.2 TiB 41.32 0.83 25 up
27 hdd 2.00130 1.00000 2.0 TiB 1015 GiB 829 GiB 20 KiB 3.1 GiB 1.0 TiB 49.53 1.00 36 up
28 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 881 GiB 26 KiB 3.1 GiB 982 GiB 52.09 1.05 32 up
29 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 847 GiB 25 KiB 3.3 GiB 1016 GiB 50.42 1.02 37 up
30 hdd 2.00130 1.00000 2.0 TiB 1.1 TiB 909 GiB 12 KiB 3.1 GiB 954 GiB 53.46 1.08 37 up
56 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 17 up
57 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 8 up
58 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 5 up
59 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 8 up
60 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 11 up
61 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
62 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 15 up
63 hdd 2.00130 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
31 hdd 2.00130 1.00000 2.0 TiB 963 GiB 777 GiB 28 KiB 2.7 GiB 1.1 TiB 47.00 0.95 31 up
32 hdd 2.00130 1.00000 2.0 TiB 1011 GiB 825 GiB 11 KiB 3.2 GiB 1.0 TiB 49.35 0.99 33 up
33 hdd 2.00130 1.00000 2.0 TiB 928 GiB 742 GiB 22 KiB 3.1 GiB 1.1 TiB 45.29 0.91 29 up
34 hdd 2.00130 1.00000 2.0 TiB 911 GiB 725 GiB 28 KiB 3.0 GiB 1.1 TiB 44.46 0.90 30 up
35 hdd 2.00130 1.00000 2.0 TiB 1009 GiB 822 GiB 15 KiB 2.7 GiB 1.0 TiB 49.21 0.99 34 up
36 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 881 GiB 25 KiB 3.2 GiB 982 GiB 52.06 1.05 38 up
37 hdd 2.00130 1.00000 2.0 TiB 969 GiB 783 GiB 19 KiB 3.1 GiB 1.1 TiB 47.30 0.95 35 up
38 hdd 2.00130 1.00000 2.0 TiB 996 GiB 809 GiB 28 KiB 2.6 GiB 1.0 TiB 48.58 0.98 33 up
40 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 842 GiB 19 KiB 3.3 GiB 1021 GiB 50.15 1.01 33 up
41 hdd 2.00130 1.00000 2.0 TiB 908 GiB 722 GiB 17 KiB 2.7 GiB 1.1 TiB 44.32 0.89 29 up
42 hdd 2.00130 1.00000 2.0 TiB 985 GiB 798 GiB 15 KiB 3.0 GiB 1.0 TiB 48.05 0.97 31 up
43 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 855 GiB 18 KiB 3.3 GiB 1008 GiB 50.80 1.02 39 up
44 hdd 2.00130 1.00000 2.0 TiB 828 GiB 641 GiB 28 KiB 2.9 GiB 1.2 TiB 40.39 0.81 25 up
45 hdd 2.00130 1.00000 2.0 TiB 853 GiB 666 GiB 23 KiB 2.8 GiB 1.2 TiB 41.60 0.84 25 up
46 hdd 2.00130 1.00000 2.0 TiB 1.0 TiB 880 GiB 18 KiB 3.4 GiB 983 GiB 52.05 1.05 34 up
47 hdd 2.00130 1.00000 2.0 TiB 983 GiB 796 GiB 22 KiB 3.2 GiB 1.0 TiB 47.95 0.97 31 up
TOTAL 94 TiB 47 TiB 38 TiB 1.0 MiB 142 GiB 47 TiB 49.67
MIN/MAX VAR: 0.81/1.18 STDDEV: 4.15
root@proxmox-02:~# dstat -cldn
--total-cpu-usage-- ---load-avg--- -dsk/total- -net/total-
usr sys idl wai stl| 1m 5m 15m | read writ| recv send
6 2 92 0 0|4.72 4.00 3.40| 22M 14M| 0 0
4 2 94 0 0|4.42 3.95 3.38| 84M 80M| 180M 146M
5 2 93 0 0|4.42 3.95 3.38| 85M 75M| 148M 160M
4 2 94 0 0|4.42 3.95 3.38| 103M 59M| 155M 192M
5 2 92 1 0|4.42 3.95 3.38| 104M 88M| 141M 213M
5 2 92 0 0|4.42 3.95 3.38| 105M 82M| 168M 210M
6 2 92 0 0|4.31 3.93 3.38| 107M 71M| 158M 213M
5 2 92 0 0|4.31 3.93 3.38| 101M 75M| 157M 223M
4 1 94 0 0|4.31 3.93 3.38| 121M 77M| 145M 219M
4 2 94 0 0|4.31 3.93 3.38| 110M 76M| 143M 245M
5 2 93 1 0|4.31 3.93 3.38| 92M 78M| 154M 198M
5 1 94 0 0|4.12 3.90 3.37| 129M 76M| 160M 236M
4 2 94 0 0|4.12 3.90 3.37| 118M 82M| 164M 259M
4 1 94 0 0|4.12 3.90 3.37| 109M 83M| 163M 229M
5 2 93 0 0|4.12 3.90 3.37| 123M 77M| 162M 244M
6 2 91 1 0|4.12 3.90 3.37| 123M 86M| 162M 258M
6 2 92 0 0|4.19 3.92 3.38| 121M 82M| 81M 116M
4 2 94 0 0|4.19 3.92 3.38| 121M 72M| 157M 243M
5 2 93 0 0|4.19 3.92 3.38| 116M 86M| 169M 246M
4 2 94 0 0|4.19 3.92 3.38| 126M 77M| 162M 260M
5 2 92 1 0|4.19 3.92 3.38| 87M 82M| 159M 210M
4 2 94 0 0|4.10 3.90 3.38| 108M 81M| 160M 234M
5 1 94 0 0|4.10 3.90 3.38| 122M 81M| 161M 208M
5 1 94 0 0|4.10 3.90 3.38| 102M 79M| 156M 236M
5 2 93 0 0|4.10 3.90 3.38| 110M 84M| 171M 228M
6 2 91 0 0|4.10 3.90 3.38| 107M 78M| 152M 217M
6 2 91 1 0|3.85 3.86 3.37| 110M 91M| 176M 227M
5 1 94 0 0|3.85 3.86 3.37| 97M 85M| 166M 205
root@proxmox-02:~# pveceph status
cluster:
id: f6706837-39f2-4e5d-adde-176269859e22
health: HEALTH_WARN
Reduced data availability: 34 pgs inactive, 34 pgs peering
Degraded data redundancy: 556816/11821971 objects degraded (4.710%), 93 pgs degraded, 93 pgs undersized
282 slow ops, oldest one blocked for 726 sec, daemons [osd.0,osd.11,osd.12,osd.19,osd.2,osd.21,osd.22,osd.3,osd.31,osd.33]... have slow ops.
services:
mon: 8 daemons, quorum proxmox-02,proxmox-03,proxmox-04,proxmox-05,proxmox-08,proxmox-09,proxmox-01,proxmox-07 (age 4h)
mgr: proxmox-03(active, since 28h), standbys: proxmox-09, proxmox-07, proxmox-05, proxmox-04, proxmox-08, proxmox-02, proxmox-01
mds: 1/1 daemons up, 6 standby
osd: 63 osds: 51 up (since 87s), 47 in (since 18m); 143 remapped pgs
data:
volumes: 1/1 healthy
pools: 4 pools, 577 pgs
objects: 3.94M objects, 15 TiB
usage: 47 TiB used, 47 TiB / 94 TiB avail
pgs: 6.239% pgs not active
556816/11821971 objects degraded (4.710%)
116363/11821971 objects misplaced (0.984%)
429 active+clean
70 active+undersized+degraded+remapped+backfill_wait
32 remapped+peering
23 active+undersized+degraded+remapped+backfilling
18 active+remapped+backfill_wait
4 peering
1 active+clean+scrubbing
io:
client: 0 B/s rd, 145 KiB/s wr, 0 op/s rd, 19 op/s wr
recovery: 662 MiB/s, 181 objects/s