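vSAN Physical Disk Troubleshooting
Enable SSH on the ESXi host, then run the following commands to troubleshoot the problem.
Check the vSAN physical disk status
Check the "IsPDL" (permanent device loss) field. A value of 1 means the disk has been lost.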
vdq -qH
Example:
DiskResults:
DiskResult[0]:
Name: naa.5000039c181a6de9
VSANUUID: 527b32db-a6c2-d457-5132-e4c2a2241368
State: In-use for VSAN
Reason: None
StoragePoolState: Ineligible for use by Storage Pool
StoragePoolReason:Disk in use by disk group
IsSSD?: 0
IsCapacityFlash?: 0
IsPDL?: 0 //a value of 1 means the disk has been lost
Size(MB): 2289272
FormatType: 512e
IsVsanDirectDisk?: 0
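When a host has many disks, reading the full listing by eye is error-prone. The following is a minimal sketch that filters the output for lost disks. It assumes the "Name:" and "IsPDL?:" labels appear exactly as in the example above; the function name find_pdl_disks is made up for illustration and is not part of ESXi.

```shell
# Reads `vdq -qH` output on stdin and prints the name of every disk
# whose IsPDL? field equals 1. Labels are assumed to match the sample output.
find_pdl_disks() {
    awk '
        $1 == "Name:"              { name = $2 }      # remember the current disk
        $1 == "IsPDL?:" && $2 == 1 { print name }     # report it if PDL is flagged
    '
}

# Usage on the host (prints nothing when no disk is in PDL):
#   vdq -qH | find_pdl_disks
```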
Check whether any disks are missing from the disk group.
vdq -iH
Example:
Mappings:
DiskMapping[0]:
SSD: naa.5002538b225cc2f0
MD: naa.5000039c181a6de9
MD: naa.5000039c181a707d
MD: naa.5000039c181a7001
MD: naa.5000039c181a7005
MD: naa.5000039c181a6e29
MD: naa.5000039c181a7011
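Comparing the number of capacity disks per disk group against the expected count is a quick way to spot a missing disk. The sketch below counts the "MD:" lines under each "SSD:" line, assuming the output is laid out as in the example above; count_md_per_group is a hypothetical helper name.

```shell
# Reads `vdq -iH` output on stdin and prints one line per disk group:
#   <cache SSD name> <number of MD (capacity) disks>
count_md_per_group() {
    awk '
        $1 == "SSD:" { ssd = $2; count[ssd] = 0; order[++n] = ssd }  # new disk group
        $1 == "MD:"  { count[ssd]++ }                                # capacity disk in it
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

# Usage on the host:
#   vdq -iH | count_md_per_group
```

For the example output above, this would report six capacity disks for the group whose cache disk is naa.5002538b225cc2f0.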
Check the "In CMMDS" field. If it is false, communication with the disk has been lost.
esxcli vsan storage list
Example:
naa.5000039c181a6de9
Device: naa.5000039c181a6de9
Display Name: naa.5000039c181a6de9
Is SSD: false
VSAN UUID: 527b32db-a6c2-d457-5132-e4c2a2241368
VSAN Disk Group UUID: 52874f04-d659-0f52-8ac2-35aa05702568
VSAN Disk Group Name: naa.5002538b225cc2f0
Used by this host: true
In CMMDS: true //if false, communication with the disk has been lost
On-disk format version: 17
Deduplication: false
Compression: false
Checksum: 13704721513334665797
Checksum OK: true
Is Capacity Tier: true
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: true
Creation Time: Sat Dec 10 17:16:48 2022
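The listing repeats this block for every disk, so a filter helps on larger hosts. This is a small sketch that prints only the devices reporting "In CMMDS: false", assuming the "Device:" and "In CMMDS:" labels shown in the example; find_disks_not_in_cmmds is an illustrative name.

```shell
# Reads `esxcli vsan storage list` output on stdin and prints every device
# whose "In CMMDS" field is false.
find_disks_not_in_cmmds() {
    awk '
        $1 == "Device:" { dev = $2 }                                   # current device
        $1 == "In" && $2 == "CMMDS:" && $3 == "false" { print dev }    # not in CMMDS
    '
}

# Usage on the host:
#   esxcli vsan storage list | find_disks_not_in_cmmds
```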
Use the smart get command to check for read/write errors.
List the NAA identifiers of all disks:
esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa"
Example:
naa.5000039c181a6de9
naa.5000039c181a707d
naa.5002538b225cc2f0
naa.5000039c181a7001
naa.5000039c181a7005
naa.5000039c181a6e29
naa.5000039c181a7011
View the S.M.A.R.T. information for a disk:
esxcli storage core device smart get -d naa.5000039c181a6de9
Example:
Parameter Value Threshold Worst Raw
----------------- ----- --------- ----- ---
Health Status OK N/A N/A N/A
Write Error Count 0 N/A N/A N/A
Read Error Count 557 N/A N/A N/A
Power Cycle Count 31 N/A N/A N/A
Drive Temperature 27 N/A N/A N/A
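The two steps above can be combined so that every disk is checked in one pass. The sketch below parses one disk's S.M.A.R.T. table and prints a line only when the read or write error count is non-zero. It assumes the row labels "Read Error Count" and "Write Error Count" as shown in the example; report_smart_errors is a hypothetical helper name.

```shell
# Reads `esxcli storage core device smart get` output on stdin.
# $1 is the device name, used only for labelling the result.
# Prints "<device> read=<n> write=<n>" when either count is above zero.
report_smart_errors() {
    awk -v dev="$1" '
        $1 == "Read"  && $2 == "Error" && $3 == "Count" { r = $4 }
        $1 == "Write" && $2 == "Error" && $3 == "Count" { w = $4 }
        # Non-numeric values such as N/A evaluate to 0 and are ignored.
        END { if (r + 0 > 0 || w + 0 > 0) print dev, "read=" r, "write=" w }
    '
}

# Usage on the host, reusing the listing pipeline from the step above:
#   esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa" |
#   while read d; do
#       esxcli storage core device smart get -d "$d" | report_smart_errors "$d"
#   done
```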
Check the available disk groups.
esxcli vsan storage list | grep "VSAN Disk Group UUID:" | sort | uniq -c
Example:
7 VSAN Disk Group UUID: 52874f04-d659-0f52-8ac2-35aa05702568
Check whether any resynchronization operations are in progress or stalled.
while true;do echo " ****************************************** "; echo "" > /tmp/resyncStats.txt ;cmmds-tool find -t DOM_OBJECT -f json |grep uuid |awk -F \" '{print $4}' |while read i;do pendingResync=$(cmmds-tool find -t DOM_OBJECT -f json -u $i|grep -o "\"bytesToSync\": [0-9]*,"|awk -F " |," '{sum+=$2} END{print sum / 1024 / 1024 / 1024;}');if [ ${#pendingResync} -ne 1 ]; then echo "$i: $pendingResync GiB";fi;done |tee -a /tmp/resyncStats.txt;total=$(cat /tmp/resyncStats.txt |awk '{sum+=$2} END{print sum}');echo "Total: $total GiB" |tee -a /tmp/resyncStats.txt;total=$(cat /tmp/resyncStats.txt |grep Total);totalObj=$(cat /tmp/resyncStats.txt|grep -vE " 0 GiB|Total"|wc -l);echo "`date +%Y-%m-%dT%H:%M:%SZ` $total ($totalObj objects)" >> /tmp/totalHistory.txt; echo `date `; sleep 60; done
Example:
Total: 0 GiB
Mon Mar 11 02:14:59 UTC 2024
Press Ctrl+C to stop the command.
Check the state of the components.
cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
Healthy: state 7
Inaccessible: state 13
Absent or degraded: state 15
Example:
71 state\": 7
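The counts are easier to read when each state code is printed with its meaning. The sketch below annotates the output of the pipeline above using only the three codes listed in this guide; any other code is reported as not listed. The function name label_component_states is an assumption for illustration.

```shell
# Reads the "count state" lines produced by the pipeline above on stdin
# and appends the meaning of each state code.
label_component_states() {
    awk '
        {
            count = $1; state = $NF      # first field is the count, last is the code
            if (state == 7)       label = "healthy"
            else if (state == 13) label = "inaccessible"
            else if (state == 15) label = "absent or degraded"
            else                  label = "not listed in this guide"
            print count, "components in state", state, "(" label ")"
        }
    '
}

# Usage on the host: append "| label_component_states" to the pipeline above.
```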
Determine the location of the failed disk:
List the NAA identifiers of all disks:
esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa"
Example:
naa.5000039c181a6de9
naa.5000039c181a707d
naa.5002538b225cc2f0
naa.5000039c181a7001
naa.5000039c181a7005
naa.5000039c181a6e29
naa.5000039c181a7011
Use the NAA identifier to look up the disk's physical location:
esxcli storage core device physical get -d naa.5000039c181a6de9
Example:
Physical Location: enclosure 0 slot 6
Find disks that have already been lost:
Use the following script:
echo "=============Physical disks placement=============="
echo ""
esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa" | while read in; do
echo "$in"
esxcli storage core device physical get -d "$in"
sleep 1
echo "===================================================="
done
Any disk for which no physical location is returned is the failed disk. You can also check this in the server's iDRAC.
Example:
=============Physical disks placement==============
naa.5000039c181a6de9
Physical Location: enclosure 0 slot 6
====================================================
naa.5000039c181a707d
Physical Location: enclosure 0 slot 2
====================================================
naa.5002538b225cc2f0
Physical Location: enclosure 0 slot 0
====================================================
naa.5000039c181a7001
Physical Location: enclosure 0 slot 1
====================================================
naa.5000039c181a7005
Physical Location: enclosure 0 slot 3
====================================================
naa.5000039c181a6e29
Physical Location: enclosure 0 slot 5
====================================================
naa.5000039c181a7011
Physical Location: enclosure 0 slot 4
====================================================
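With many disks, the entry that lacks a location is easy to overlook. The sketch below scans the script's output and prints every NAA identifier that is not followed by a "Physical Location:" line. What esxcli prints for a lost disk is not shown in this guide, so the sketch only assumes that such output contains no "Physical Location:" line and does not start with "naa."; find_unlocated_disks is a hypothetical name.

```shell
# Reads the output of the placement script on stdin and prints every
# naa.* identifier that has no "Physical Location:" line after it.
find_unlocated_disks() {
    awk '
        /^naa\./ {
            if (dev != "" && !found) print dev   # previous disk had no location
            dev = $1; found = 0
        }
        /Physical Location:/ { found = 1 }
        END { if (dev != "" && !found) print dev }
    '
}

# Usage on the host: save the script above to a file (the path is an example),
# then run:
#   sh /tmp/placement.sh 2>&1 | find_unlocated_disks
```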
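Related logs
/var/log/vmkernel.log
Records problems reading from and writing to vSAN disks, along with vSAN host heartbeats, PDL events, SCSI sense codes, I/O requests (read/write), and cluster membership information.
Example: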
2024-03-09T18:50:51.413Z Wa(180) vmkwarning: cpu6:2098013)WARNING: ScsiDeviceIO: 1774: Device naa.5000039c181a7005 performance has deteriorated. I/O latency increased from average value of 11487 microseconds to 7116618 microseconds.
2024-03-09T18:51:06.727Z Wa(180) vmkwarning: cpu61:2098012)WARNING: HPP: HppThrottleLogForDevice:1133: Cmd 0x28 (0x45dbf966c400, 0) to dev "naa.5000039c181a7005" on path "vmhba3:C0:T4:L0" Failed:
2024-03-09T18:51:06.727Z Wa(180) vmkwarning: cpu61:2098012)WARNING: HPP: HppThrottleLogForDevice:1141: Error status H:0x5 D:0x0 P:0x0 . hppAction = 3
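Latency warnings like the first line in this example can be pulled out of a large log and reduced to the numbers that matter. This is a sketch that assumes the message wording is exactly as in the sample line; latency_warnings is an illustrative name.

```shell
# Reads vmkernel.log lines on stdin and, for each "performance has
# deteriorated" warning, prints:
#   <device> <average latency in microseconds> <new latency in microseconds>
latency_warnings() {
    sed -n 's/.*Device \(naa\.[0-9a-f]*\) performance has deteriorated.*from average value of \([0-9]*\) microseconds to \([0-9]*\) microseconds.*/\1 \2 \3/p'
}

# Usage on the host:
#   latency_warnings < /var/log/vmkernel.log
```

For the sample warning above, the output would be "naa.5000039c181a7005 11487 7116618", that is, a jump from roughly 11 ms to roughly 7.1 s.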
/var/log/vobd.log
Reports disk health, permanent device loss (PDL) on disks, and disk latency, and records when the host enters and exits maintenance mode.
Example:
2024-03-09T18:08:10.611Z In(14) vobd[2097697]: [vSANCorrelator] 20883483894278us: [vob.vsan.lsom.devicerepair] vSAN device 5234107b-5200-c452-6c05-99f3bb102a7f is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2024-03-09T18:08:10.611Z In(14) vobd[2097697]: [vSANCorrelator] 20883202034798us: [esx.problem.vob.vsan.lsom.devicerepair] Device 5234107b-5200-c452-6c05-99f3bb102a7f is in offline state and is getting repaired.
2024-03-09T18:08:10.621Z In(14) vobd[2097697]: [vSANCorrelator] 20883483904364us: [vob.vsan.pdl.offline] vSAN device 5234107b-5200-c452-6c05-99f3bb102a7f has gone offline.
2024-03-09T18:08:10.621Z In(14) vobd[2097697]: [vSANCorrelator] 20883202044628us: [esx.problem.vob.vsan.pdl.offline] vSAN device 5234107b-5200-c452-6c05-99f3bb102a7f has gone offline.
/var/log/vsandevicemonitord.log
Helps you determine whether a disk was marked unhealthy because of excessive log congestion or I/O latency.
Example:
2024-03-09T18:08:38Z In(14) vsandevicemonitord[2100160]: Unmount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:08:38Z In(14) vsandevicemonitord[2100160]: Device naa.5000039c181a7005 was already unmounted.
2024-03-09T18:28:49Z In(14) vsandevicemonitord[2100160]: stderr Errors:
2024-03-09T18:28:49Z In(14)[+] vsandevicemonitord[2100160]: Unable to mount: Disk with vSAN uuid 5234107b-5200-c452-6c05-99f3bb102a7f failed to appear in CMMDS
2024-03-09T18:28:49Z In(14)[+] vsandevicemonitord[2100160]: , stdout from command vsan storage diskgroup mount -d naa.5000039c181a7005.
2024-03-09T18:28:49Z In(14) vsandevicemonitord[2100160]: Mounting failed on VSAN device naa.5000039c181a7005.
2024-03-09T18:28:49Z In(14) vsandevicemonitord[2100160]: Repair attempt 1 for device 5234107b-5200-c452-6c05-99f3bb102a7f
2024-03-09T18:38:50Z In(14) vsandevicemonitord[2100160]: Sample latency intervals for naa.5002538b225cc2f0 are [0, 2, 5, 7, 9, 10].
2024-03-09T18:38:50Z In(14) vsandevicemonitord[2100160]: Resetting repair attempt for device 5234107b-5200-c452-6c05-99f3bb102a7f
2024-03-09T18:38:52Z In(14) vsandevicemonitord[2100160]: Unmount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:38:52Z In(14) vsandevicemonitord[2100160]: Device naa.5000039c181a7005 was already unmounted.
2024-03-09T18:40:10Z In(14) vsandevicemonitord[2100160]: Mount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:40:10Z In(14) vsandevicemonitord[2100160]: Repair successful for device 5234107b-5200-c452-6c05-99f3bb102a7f
2024-03-09T18:50:13Z In(14) vsandevicemonitord[2100160]: Unmount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:50:13Z In(14) vsandevicemonitord[2100160]: Device naa.5000039c181a7005 was already unmounted.
2024-03-09T18:50:27Z In(14) vsandevicemonitord[2100160]: Mount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:50:27Z In(14) vsandevicemonitord[2100160]: Repair successful for device 5234107b-5200-c452-6c05-99f3bb102a7f
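A disk that is repeatedly unmounted and remounted, as in this example, is worth flagging. The sketch below tallies successful and failed mounts per device, assuming the "Mount succeeded on VSAN device" and "Mounting failed on VSAN device" wording seen in the sample lines; mount_summary is a hypothetical helper name.

```shell
# Reads vsandevicemonitord.log lines on stdin and prints, per device:
#   <device> mount_ok=<n> mount_failed=<n>
mount_summary() {
    awk '
        /Mount succeeded on VSAN device/ { d = $NF; sub(/\.$/, "", d); ok[d]++;  seen[d] = 1 }
        /Mounting failed on VSAN device/ { d = $NF; sub(/\.$/, "", d); bad[d]++; seen[d] = 1 }
        END {
            # Order of devices is unspecified when more than one is present.
            for (d in seen) print d, "mount_ok=" (ok[d] + 0), "mount_failed=" (bad[d] + 0)
        }
    '
}

# Usage on the host:
#   mount_summary < /var/log/vsandevicemonitord.log
```

"Unmount succeeded" lines are not counted because the pattern is case-sensitive and matches only the capitalized "Mount".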
Source:
https://www.dell.com/support/kbdoc/en-us/000209262/vsan-physical-disk-troubleshooting-guide?lang=zh