Troubleshooting a vSAN Cluster Partition

Symptom
Skyline Health reports a vSAN cluster partition and shows each of the cluster's three nodes in a separate partition.

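Troubleshooting steps
Log in to each of the three nodes over SSH.
1. Run esxcli vsan cluster get. Every node reports “Sub-Cluster Member Count: 1”, which does not match the three-node cluster and agrees with the Skyline Health result.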

[root@SH-VSAN02:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2024-10-01T17:40:17Z
   Local Node UUID: 63f4e01c-84c4-ada9-4667-00620b925480
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 63f4e01c-84c4-ada9-4667-00620b925480
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 52a99051-c929-d59e-7cf3-7a049138ef11
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 63f4e01c-84c4-ada9-4667-00620b925480
   Sub-Cluster Member HostNames: SH-VSAN02
   Sub-Cluster Membership UUID: 7d26fc66-3648-56d1-2dd7-00620b925480
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: None 0 0.0
   Mode: REGULAR
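
The member-count check in step 1 can be scripted so that it is easy to repeat on every node. The following is a minimal sketch that assumes the output format shown above; the function names vsan_member_count and vsan_partition_check are illustrative helpers, not part of any VMware tooling.

```shell
# Hypothetical helper: read `esxcli vsan cluster get` output on stdin and
# print the value of the "Sub-Cluster Member Count" field.
vsan_member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ {
        gsub(/[ \t\r]+$/, "", $2)   # drop trailing whitespace, if any
        print $2
        exit
    }'
}

# Hypothetical helper: succeed only if the reported member count equals the
# expected number of nodes passed as the first argument.
# Usage: esxcli vsan cluster get | vsan_partition_check 3
vsan_partition_check() {
    expected="$1"
    actual="$(vsan_member_count)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual of $expected members visible"
    else
        echo "PARTITIONED: ${actual:-0} of $expected members visible"
        return 1
    fi
}
```

On a partitioned node such as the one above, the check prints the PARTITIONED line and returns a non-zero status, which makes it usable in a loop over all hosts.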

2. Run esxcli vsan network list to find the name of the VMkernel interface that carries vSAN traffic (vmk1 in this case).

[root@SH-VSAN01:~] esxcli vsan network list
Interface
   VmkNic Name: vmk1
   IP Protocol: IP
   Interface UUID: 52690f43-2a19-2c3f-a1ee-d93214e0a3bc
   Agent Group Multicast Address: 224.2.3.4
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group IPv6 Multicast Address: ff19::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: vsan

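3. Run esxcli network ip interface ipv4 get | grep vmk1 to find the IP address of the vSAN VMkernel interface.

4. Run vmkping -I vmk1 <Host_VSAN_IP> against each of the other nodes to verify that the vSAN network is reachable between nodes.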
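
Steps 3 and 4 can be combined into a small loop. This is a sketch under stated assumptions: it assumes that esxcli network ip interface ipv4 get prints the interface name in the first column and the IPv4 address in the second, and the helper names vmk_ipv4 and check_vsan_peers as well as the PING_CMD override are made up for illustration.

```shell
# Hypothetical helper: print the IPv4 address of one VMkernel interface from
# `esxcli network ip interface ipv4 get` output on stdin.
# Assumes column 1 is the interface name and column 2 is the address.
vmk_ipv4() {
    awk -v nic="$1" '$1 == nic { print $2; exit }'
}

# Hypothetical helper: ping every peer vSAN IP through the given interface.
# PING_CMD defaults to vmkping; override it to dry-run away from an ESXi host.
# Usage: check_vsan_peers vmk1 <peer_ip> [<peer_ip> ...]
check_vsan_peers() {
    nic="$1"; shift
    failed=0
    for ip in "$@"; do
        if ${PING_CMD:-vmkping} -I "$nic" -c 3 "$ip" >/dev/null 2>&1; then
            echo "reachable:   $ip"
        else
            echo "unreachable: $ip"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, check_vsan_peers vmk1 172.16.99.207 172.16.99.208 pings both peers three times each and returns a non-zero status if either one does not answer. If every peer answers while the cluster is still partitioned, as in this case, the problem is more likely in the unicast configuration than in the network.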

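5. Run esxcli vsan cluster unicastagent list on each node to view its unicast agent list. One node's list was correct and the other two were empty. A correct list contains one entry for every other node in the cluster (the local node itself is not listed), as in this example: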

[root@SH-VSAN01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name  Cert Thumbprint                                              SubClusterUuid
------------------------------------  ---------  ----------------  -------------  -----  ----------  -----------------------------------------------------------  --------------
63f4e01c-84c4-ada9-4667-00620b925480          0              true  172.16.99.207  12321              95:D3:03:9F:AB:4F:AF:DB:5D:2D:42:1B:7D:8A:5D:1C:F7:69:0B:31  52a99051-c929-d59e-7cf3-7a049138ef11
63f4f2ec-35c2-e5e9-e648-00620b9254b0          0              true  172.16.99.208  12321              BF:C0:B3:7A:66:99:9A:A5:B8:64:A4:FD:4D:69:56:89:72:31:5A:25  52a99051-c929-d59e-7cf3-7a049138ef11

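Resolution
1. The checks above show that two nodes are missing their unicast agent entries, so the entries have to be rebuilt manually.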

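2. Before changing the unicast agent list, run the following command on every node in the cluster so that the hosts temporarily ignore cluster member list updates from vCenter:
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates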

If the unicast agent list contains an incorrect entry, remove it with esxcli vsan cluster unicastagent remove -a <Host_VSAN_IP>

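3. On each node with missing entries, add one entry for every other node. <Host_UUID> is the peer's “Local Node UUID” as reported by esxcli vsan cluster get on that peer, and <Host_VSAN_IP> is the peer's vSAN VMkernel IP address: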
esxcli vsan cluster unicastagent add -t node -u <Host_UUID> -U true -a <Host_VSAN_IP> -p 12321
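
When several entries are missing, it can help to generate the add commands from a short list of peers and review them before running anything. The sketch below only prints commands and does not execute them; the helper name gen_unicast_add and the "UUID IP" input format are assumptions made for this example.

```shell
# Hypothetical helper: read "UUID IP" pairs (one per line) on stdin and print
# one `unicastagent add` command per peer. The first argument is the local
# node's UUID, which is skipped because a node does not list itself.
# Nothing is executed; review the output before pasting it into the shell.
gen_unicast_add() {
    awk -v self="$1" '
        NF >= 2 && $1 != self {
            printf "esxcli vsan cluster unicastagent add -t node -u %s -U true -a %s -p 12321\n", $1, $2
        }'
}
```

For example, on the node whose “Local Node UUID” is 63f4d839-32da-07f5-a655-00620b784b80, feeding it the lines "63f4e01c-84c4-ada9-4667-00620b925480 172.16.99.207" and "63f4f2ec-35c2-e5e9-e648-00620b9254b0 172.16.99.208" prints two commands, one for each peer.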

4. Run esxcli vsan cluster get again. Every node now reports “Sub-Cluster Member Count: 3”, which matches the number of nodes in the cluster, and the Skyline Health check passes.

[root@SH-VSAN02:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2024-10-01T18:57:25Z
   Local Node UUID: 63f4e01c-84c4-ada9-4667-00620b925480
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 63f4e01c-84c4-ada9-4667-00620b925480
   Sub-Cluster Backup UUID: 63f4d839-32da-07f5-a655-00620b784b80
   Sub-Cluster UUID: 52a99051-c929-d59e-7cf3-7a049138ef11
   Sub-Cluster Membership Entry Revision: 2
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 63f4e01c-84c4-ada9-4667-00620b925480, 63f4d839-32da-07f5-a655-00620b784b80, 63f4f2ec-35c2-e5e9-e648-00620b9254b0
   Sub-Cluster Member HostNames: SH-VSAN02, SH-VSAN01, SH-VSAN03
   Sub-Cluster Membership UUID: 7d26fc66-3648-56d1-2dd7-00620b925480
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: 63f4e01c-84c4-ada9-4667-00620b925480 2 2024-10-01T18:55:36.0
   Mode: REGULAR

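5. After the unicast agent list has been updated, run the following command on every node in the cluster to re-enable cluster member list updates:
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates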
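
Because leaving this option at 1 keeps the hosts ignoring membership updates from vCenter, it is worth confirming the value on every node afterwards. The sketch below assumes that esxcfg-advcfg -g prints a single line ending in the numeric value, such as "Value of IgnoreClusterMemberListUpdates is 0"; the helper names advcfg_value and updates_reenabled are illustrative.

```shell
# Hypothetical helper: print the last field of the first line on stdin,
# which is the numeric value under the output format assumed above.
advcfg_value() {
    awk 'NR == 1 { print $NF }'
}

# Hypothetical helper: succeed only if member list updates are no longer ignored.
# Usage: esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates | updates_reenabled
updates_reenabled() {
    value="$(advcfg_value)"
    if [ "$value" = "0" ]; then
        echo "OK: cluster member list updates are enabled"
    else
        echo "WARNING: IgnoreClusterMemberListUpdates is still '$value'"
        return 1
    fi
}
```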

Sources:
https://www.dell.com/support/kbdoc/en-us/000056284/dell-emc-vxrail-node-is-showing-network-partitioned-even-it-can-ping-other-nodes-via-vmkping

https://knowledge.broadcom.com/external/article/326427/configuring-vsan-unicast-networking-from.html