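vSAN Cluster Partition Troubleshooting
Symptom
Skyline Health reports a vSAN cluster partition and shows each of the cluster's three nodes in a separate partition.
Troubleshooting steps
Log in to each of the three nodes over SSH.
1. Run esxcli vsan cluster get. Every node reports "Sub-Cluster Member Count: 1", which does not match the three-node cluster and is consistent with the Skyline Health result.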
[root@SH-VSAN02:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2024-10-01T17:40:17Z
Local Node UUID: 63f4e01c-84c4-ada9-4667-00620b925480
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 63f4e01c-84c4-ada9-4667-00620b925480
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52a99051-c929-d59e-7cf3-7a049138ef11
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 63f4e01c-84c4-ada9-4667-00620b925480
Sub-Cluster Member HostNames: SH-VSAN02
Sub-Cluster Membership UUID: 7d26fc66-3648-56d1-2dd7-00620b925480
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: None 0 0.0
Mode: REGULAR
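The member-count check in step 1 can be scripted so that it is easy to repeat on every node. The following is a minimal sketch, assuming the field is printed as "Sub-Cluster Member Count: N" as in the output above; the function names vsan_member_count and vsan_check_members are hypothetical helpers, not part of esxcli.

```shell
# Hypothetical helper: reads `esxcli vsan cluster get` output on stdin and
# prints the number that follows "Sub-Cluster Member Count:".
vsan_member_count() {
    awk -F':' '/Sub-Cluster Member Count/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Hypothetical helper: compares the reported count with the expected number
# of nodes (first argument) and returns non-zero when they differ.
vsan_check_members() {
    expected="$1"
    actual="$(vsan_member_count)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual of $expected members visible"
    else
        echo "PARTITIONED: ${actual:-0} of $expected members visible"
        return 1
    fi
}

# Usage on an ESXi host (not executed here):
#   esxcli vsan cluster get | vsan_check_members 3
```

Run against the output shown above, the check would report a partition because only 1 of 3 members is visible.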
2. Run esxcli vsan network list to find the name of the VMkernel port used for vSAN.
[root@SH-VSAN01:~] esxcli vsan network list
Interface
VmkNic Name: vmk1
IP Protocol: IP
Interface UUID: 52690f43-2a19-2c3f-a1ee-d93214e0a3bc
Agent Group Multicast Address: 224.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 224.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Data-in-Transit Encryption Key Exchange Port: 0
Multicast TTL: 5
Traffic Type: vsan
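On hosts with more than one tagged interface, the vSAN VMkernel port name can be pulled out of this output instead of being read by eye. A small sketch, assuming each interface block lists "VmkNic Name" before "Traffic Type" as shown above; vsan_vmknics is a hypothetical helper name.

```shell
# Hypothetical helper: reads `esxcli vsan network list` output on stdin and
# prints the VmkNic name of every interface whose traffic type is vsan.
vsan_vmknics() {
    awk -F': *' '
        /VmkNic Name/  { nic = $2; sub(/[ \t\r]+$/, "", nic) }
        /Traffic Type/ && $2 ~ /vsan/ { print nic }
    '
}

# Usage on an ESXi host (not executed here):
#   esxcli vsan network list | vsan_vmknics
```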
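3. Run esxcli network ip interface ipv4 get | grep vmk1 to find the IP address of the vSAN VMkernel port.
4. Run vmkping -I vmk1 <Host_VSAN_IP> against each of the other nodes to check that the vSAN network is reachable between nodes.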
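The connectivity test in step 4 has to be repeated for every peer, so a loop helps. This is a sketch under stated assumptions: check_vsan_peers and the PING_CMD override are hypothetical (the override exists only so the loop can be dry-run away from an ESXi host), and the two addresses in the usage comment are the ones that appear in the unicast agent list in this document.

```shell
# Hypothetical helper: pings each peer vSAN IP through the given VMkernel
# port and returns non-zero if any peer is unreachable.
# PING_CMD is an assumed override for dry runs; it defaults to vmkping.
check_vsan_peers() {
    vmk="$1"
    shift
    failed=0
    for ip in "$@"; do
        if ${PING_CMD:-vmkping} -I "$vmk" -c 3 "$ip" >/dev/null 2>&1; then
            echo "reachable: $ip"
        else
            echo "UNREACHABLE: $ip"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on an ESXi host (not executed here):
#   check_vsan_peers vmk1 172.16.99.207 172.16.99.208
```

If every peer is reachable but the cluster is still partitioned, as in this case, the network itself is unlikely to be the cause and the unicast agent list is the next thing to check.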
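5. Run esxcli vsan cluster unicastagent list on each node to view the unicast agent list. One node has a correct list, while the lists on the other two nodes are empty. A correct list looks like this: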
[root@SH-VSAN01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name  Cert Thumbprint                                              SubClusterUuid
------------------------------------  ---------  ----------------  -------------  -----  ----------  -----------------------------------------------------------  --------------
63f4e01c-84c4-ada9-4667-00620b925480          0              true  172.16.99.207  12321              95:D3:03:9F:AB:4F:AF:DB:5D:2D:42:1B:7D:8A:5D:1C:F7:69:0B:31  52a99051-c929-d59e-7cf3-7a049138ef11
63f4f2ec-35c2-e5e9-e648-00620b9254b0          0              true  172.16.99.208  12321              BF:C0:B3:7A:66:99:9A:A5:B8:64:A4:FD:4D:69:56:89:72:31:5A:25  52a99051-c929-d59e-7cf3-7a049138ef11
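A correct list on a node contains one row for each of the other nodes and no row for the node itself. The comparison can be sketched as below, assuming the node UUID is the first column of each row as in the listing above; missing_unicast_peers is a hypothetical helper name.

```shell
# Hypothetical helper: reads `esxcli vsan cluster unicastagent list` output
# on stdin and prints every expected peer UUID (given as arguments) that has
# no row in the list. Prints nothing when the list is complete.
missing_unicast_peers() {
    listing="$(cat)"
    for uuid in "$@"; do
        # Match on the first column only, so the SubClusterUuid column
        # cannot produce a false hit.
        printf '%s\n' "$listing" |
            awk -v u="$uuid" '$1 == u { found = 1 } END { exit !found }' ||
            echo "$uuid"
    done
}

# Usage on SH-VSAN01 (not executed here), passing the UUIDs of the other two nodes:
#   esxcli vsan cluster unicastagent list | \
#       missing_unicast_peers 63f4e01c-84c4-ada9-4667-00620b925480 \
#                             63f4f2ec-35c2-e5e9-e648-00620b9254b0
```

On a node whose list is empty, the helper prints both peer UUIDs, which are exactly the entries that need to be added.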
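Resolution
1. The steps above show that two nodes are missing their unicast agent lists, so the lists must be rebuilt.
2. Before changing the unicast agent list, run the following command on all nodes in the cluster to temporarily ignore cluster member list updates from vCenter:
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
If the unicast agent list contains an incorrect entry that needs to be deleted, remove it with:
esxcli vsan cluster unicastagent remove -a <Host_VSAN_IP>
3. On each node with a missing list, add one entry for every other node in the cluster (a node does not list itself, as the correct list above shows). <Host_UUID> is the "Local Node UUID" that esxcli vsan cluster get reports on the node being added, and <Host_VSAN_IP> is that node's vSAN VMkernel IP address:
esxcli vsan cluster unicastagent add -t node -u <Host_UUID> -U true -a <Host_VSAN_IP> -p 12321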
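Because each node needs one add command per peer, it can help to generate the commands from a small "UUID IP" table and review them before running anything. The sketch below only prints the commands; print_unicast_add_cmds is a hypothetical helper, and the table format is an assumption made for illustration.

```shell
# Hypothetical helper: reads "UUID IP" pairs on stdin, skips the local node's
# own UUID (first argument), and prints one add command per remaining peer.
# Nothing is executed; the output is meant to be reviewed and then pasted.
print_unicast_add_cmds() {
    local_uuid="$1"
    while read -r uuid ip; do
        if [ -z "$uuid" ] || [ "$uuid" = "$local_uuid" ]; then
            continue
        fi
        echo "esxcli vsan cluster unicastagent add -t node -u $uuid -U true -a $ip -p 12321"
    done
}

# Usage on SH-VSAN01 (not executed here), with the two peers from the listing above:
#   print_unicast_add_cmds 63f4d839-32da-07f5-a655-00620b784b80 <<'EOF'
#   63f4e01c-84c4-ada9-4667-00620b925480 172.16.99.207
#   63f4f2ec-35c2-e5e9-e648-00620b9254b0 172.16.99.208
#   EOF
```

Printing rather than executing keeps a typo in the table from writing a bad entry into the list.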
4. Run esxcli vsan cluster get again. Every node now reports "Sub-Cluster Member Count: 3", which matches the number of cluster nodes, and a Skyline Health retest passes.
[root@SH-VSAN02:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2024-10-01T18:57:25Z
Local Node UUID: 63f4e01c-84c4-ada9-4667-00620b925480
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 63f4e01c-84c4-ada9-4667-00620b925480
Sub-Cluster Backup UUID: 63f4d839-32da-07f5-a655-00620b784b80
Sub-Cluster UUID: 52a99051-c929-d59e-7cf3-7a049138ef11
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 63f4e01c-84c4-ada9-4667-00620b925480, 63f4d839-32da-07f5-a655-00620b784b80, 63f4f2ec-35c2-e5e9-e648-00620b9254b0
Sub-Cluster Member HostNames: SH-VSAN02, SH-VSAN01, SH-VSAN03
Sub-Cluster Membership UUID: 7d26fc66-3648-56d1-2dd7-00620b925480
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 63f4e01c-84c4-ada9-4667-00620b925480 2 2024-10-01T18:55:36.0
Mode: REGULAR
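5. After the unicast agent list changes are complete, run the following command on all nodes in the cluster to re-enable cluster member list updates:
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates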
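It is easy to leave the option set to 1 on one node, so reading the value back on every node is a useful last check. The sketch below assumes that esxcfg-advcfg -g prints a line of the form "Value of IgnoreClusterMemberListUpdates is 0"; that output format is an assumption here and should be confirmed on the host, and advcfg_value is a hypothetical helper name.

```shell
# Hypothetical helper: reads `esxcfg-advcfg -g <option>` output on stdin and
# prints the last word of the "Value of ... is N" line (assumed format).
advcfg_value() {
    awk '/^Value of/ { print $NF; exit }'
}

# Usage on an ESXi host (not executed here):
#   v="$(esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates | advcfg_value)"
#   [ "$v" = "0" ] && echo "member list updates enabled" || echo "still ignored (value: $v)"
```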
Reference: https://knowledge.broadcom.com/external/article/326427/configuring-vsan-unicast-networking-from.html