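Brief description of the fault:
The ESXi servers could not recognize the LUN storage. Some ESXi servers recognized one or more of the LUNs, while others recognized none. The storage devices were visible under Configuration > Storage > Devices, but could not be mounted.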

I. System version:
ESXi 6.0.0-2494585

II. Hardware model:
IBM x3850 X5

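III. Symptoms:

  1. The ESXi hosts could not mount the storage, although the storage devices were visible under Configuration > Storage > Devices;
  2. /var/log/vmkernel.log contained the following errors (excerpt):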

    2020-12-25T09:02:53.502Z cpu26:33700)ScsiDeviceIO: 2608: Cmd(0x43aa00be5740) 0x28, CmdSN 0xd from world 91639 to dev "naa.600d02310006cc867921fc910fcfc42e" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
    2020-12-25T09:03:04.074Z cpu1:33207)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x28 failed <0/0> sid x010800, did x010000, oxid xffff iotag xa08 Abort Requested Host Abort Req
    2020-12-25T09:03:04.074Z cpu26:33700)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x28 (0x43aa00be5740, 91639) to dev "naa.600d02310006cc867921fc910fcfc42e" on path "vmhba3:C0:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
    2020-12-25T09:03:04.074Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T09:03:04.074Z cpu26:33700)ScsiDeviceIO: 2646: Cmd(0x43aa00be5740) 0x28, CmdSN 0xd from world 91639 to dev "naa.600d02310006cc867921fc910fcfc42e" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
  3. /var/log/vmkwarning.log contained the following errors (excerpt):

    2020-12-25T08:53:51.390Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc860a4fdaaa61e770b1" state in doubt; requested fast path state update...
    2020-12-25T08:53:53.516Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc862cb127cc0fe225e7" state in doubt; requested fast path state update...
    2020-12-25T08:53:53.516Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T08:54:04.049Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc862cb127cc0fe225e7" state in doubt; requested fast path state update...
    2020-12-25T08:54:04.086Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T08:54:13.515Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T08:54:20.515Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc860a4fdaaa61e770b1" state in doubt; requested fast path state update...
    2020-12-25T08:54:24.031Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T08:54:24.031Z cpu60:91639)WARNING: Partition: 1157: Partition table read from device naa.600d02310006cc867921fc910fcfc42e failed: I/O error
    2020-12-25T08:54:31.394Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc860a4fdaaa61e770b1" state in doubt; requested fast path state update...
    2020-12-25T08:54:33.515Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T08:54:33.515Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc862cb127cc0fe225e7" state in doubt; requested fast path state update...
    2020-12-25T08:54:44.052Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc862cb127cc0fe225e7" state in doubt; requested fast path state update...
    2020-12-25T08:54:44.089Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T08:54:53.514Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
    2020-12-25T08:55:00.514Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc860a4fdaaa61e770b1" state in doubt; requested fast path state update...
    2020-12-25T08:55:04.035Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...
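
The number of repeated warnings is easier to judge once it is tallied per device. The following is a minimal Python sketch and was not part of the original troubleshooting. The function name `count_state_in_doubt` and the regular expression are illustrative assumptions derived only from the line format in the warning log excerpt above. It counts `state in doubt` warnings per NAA identifier.

```python
import re
from collections import Counter

# Pattern assumed from the warning lines quoted above (not an official format spec).
STATE_IN_DOUBT = re.compile(r'NMP device "(naa\.[0-9a-f]+)" state in doubt')

def count_state_in_doubt(lines):
    """Return a Counter mapping each NAA device ID to its number of 'state in doubt' warnings."""
    counts = Counter()
    for line in lines:
        match = STATE_IN_DOUBT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Four lines copied from the excerpt above, used as sample input.
sample = [
    '2020-12-25T08:53:51.390Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc860a4fdaaa61e770b1" state in doubt; requested fast path state update...',
    '2020-12-25T08:53:53.516Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...',
    '2020-12-25T08:54:13.515Z cpu26:33700)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600d02310006cc867921fc910fcfc42e" state in doubt; requested fast path state update...',
    '2020-12-25T08:54:24.031Z cpu60:91639)WARNING: Partition: 1157: Partition table read from device naa.600d02310006cc867921fc910fcfc42e failed: I/O error',
]

# Print devices ordered from most to fewest warnings.
for device, count in count_state_in_doubt(sample).most_common():
    print(device, count)
```

On a real host the same function could be fed the lines of the log file directly, for example `count_state_in_doubt(open(path))`, where `path` is whichever log file is being examined.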

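IV. Troubleshooting:

  1. After the user reported that the storage could not be mounted, we inspected the Fibre Channel switches, found excessive optical attenuation on some of the links to the ESXi hosts, and replaced the affected cables;
  2. After the replacement, we worked with the storage engineer to check the LUN mappings and the storage logs; no fault was found;
  3. Running partedUtil getptbl /dev/disks/naa.600d02310006cc867921fc910fcfc42e to read the partition table hung for 10-20 minutes;
  4. We first rebooted some of the ESXi servers and eventually rebooted all of them; the fault persisted;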
  5. We looked up the errors reported in the logs:
    (1) Excerpt of the error log:

    2020-12-25T09:02:53.502Z cpu26:33700)ScsiDeviceIO: 2608: Cmd(0x43aa00be5740) 0x28, CmdSN 0xd from world 91639 to dev "naa.600d02310006cc867921fc910fcfc42e" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
    2020-12-25T09:03:04.074Z cpu26:33700)ScsiDeviceIO: 2646: Cmd(0x43aa00be5740) 0x28, CmdSN 0xd from world 91639 to dev "naa.600d02310006cc867921fc910fcfc42e" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

(2) The official documentation gives the following diagnosis:

| Host status | Example log entry | Description |
| - | - | - |
| VMK_SCSI_HOST_BUS_BUSY = 0x02 or 0x2 | vmkernel: 116:03:44:19.039 cpu4:4100)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100020e0b00) to NMP device "sym.029010111831353837" failed on physical path "vmhba2:C0:T0:L152" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. | Returned when the HBA driver is unable to issue a command to the device. This status can occur when FCP frames are dropped in the environment. |
| VMK_SCSI_HOST_ABORT = 0x05 or 0x5 | vmkernel: 0:00:13:23.910 cpu20:4251)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100010bf9c0) to NMP device "naa.60060480000190103838533030363542" failed on physical path "vmhba3:C0:T0:L4" H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. | Returned if the driver has to abort commands in flight to the target. This status can occur because of a command timeout or a parity error in the frame. |
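
To apply the table to a log line mechanically, the `H:`, `D:` and `P:` fields (host, device and plugin status) can be pulled out and the host status looked up. This is a minimal sketch under stated assumptions: the helper name `decode_status` and the regular expression are hypothetical, and the lookup table deliberately contains only the two host status codes described above; any other value is reported as unknown rather than guessed.

```python
import re

# Only the two host status codes from the table above are listed.
HOST_STATUS = {
    0x2: "VMK_SCSI_HOST_BUS_BUSY",
    0x5: "VMK_SCSI_HOST_ABORT",
}

# Pattern assumed from the vmkernel.log lines quoted in this article.
STATUS = re.compile(r'H:0x([0-9a-f]+) D:0x([0-9a-f]+) P:0x([0-9a-f]+)')

def decode_status(line):
    """Extract host, device and plugin status codes from one log line.

    Returns None when the line carries no H:/D:/P: triple.
    """
    match = STATUS.search(line)
    if match is None:
        return None
    host, device, plugin = (int(group, 16) for group in match.groups())
    return {
        "host": host,
        "device": device,
        "plugin": plugin,
        "host_name": HOST_STATUS.get(host, "unknown (see the VMware KB)"),
    }

# Line copied from the vmkernel.log excerpt in this article.
line = ('2020-12-25T09:02:53.502Z cpu26:33700)ScsiDeviceIO: 2608: Cmd(0x43aa00be5740) 0x28, '
        'CmdSN 0xd from world 91639 to dev "naa.600d02310006cc867921fc910fcfc42e" failed '
        'H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.')
print(decode_status(line)["host_name"])
```

Keeping the lookup table limited to documented codes means an unfamiliar status is flagged for a manual lookup instead of being silently mislabeled.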

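V. Fault analysis
The logs above show that the ESXi hosts kept sending requests to the storage in order to mount the LUNs (SCSI command 0x28, READ(10)), but never received a valid response.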
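
One way to read the two ScsiDeviceIO entries together is to compare their timestamps: both carry the same command handle (0x43aa00be5740), first with H:0x2 and later with H:0x5. The sketch below is illustrative only; `parse_entry` and its regular expression are assumptions based on the quoted log format. It measures the gap between the two entries, which is consistent with a command that got no reply and was then aborted, but it does not by itself show where the fault lies.

```python
import re
from datetime import datetime

# Pattern assumed from the ScsiDeviceIO lines quoted in this article.
ENTRY = re.compile(
    r'^(\S+Z) .*ScsiDeviceIO: \d+: Cmd\((0x[0-9a-f]+)\).* failed H:(0x[0-9a-f]+)'
)

def parse_entry(line):
    """Return (timestamp, command handle, host status) for a ScsiDeviceIO failure line."""
    match = ENTRY.search(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    return stamp, match.group(2), match.group(3)

# The two entries for the same command, copied from the log excerpt.
first = parse_entry(
    '2020-12-25T09:02:53.502Z cpu26:33700)ScsiDeviceIO: 2608: Cmd(0x43aa00be5740) 0x28, '
    'CmdSN 0xd from world 91639 to dev "naa.600d02310006cc867921fc910fcfc42e" failed '
    'H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.'
)
second = parse_entry(
    '2020-12-25T09:03:04.074Z cpu26:33700)ScsiDeviceIO: 2646: Cmd(0x43aa00be5740) 0x28, '
    'CmdSN 0xd from world 91639 to dev "naa.600d02310006cc867921fc910fcfc42e" failed '
    'H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.'
)

# Both entries refer to the same command handle, so the gap between them is meaningful.
assert first[1] == second[1]
gap_seconds = (second[0] - first[0]).total_seconds()
print(f"{first[2]} -> {second[2]} after {gap_seconds:.3f} s")  # 0x2 -> 0x5 after 10.572 s
```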

VI. Solution
After the storage array was restarted, all ESXi hosts mounted all LUNs successfully.

PS:

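  1. While handling this fault, experience suggested that the ESXi hosts themselves were unlikely to be at fault; even so, we advised the user to reboot the ESXi hosts to help narrow down the fault location.
  2. To isolate the problem further, we installed a fresh ESXi 6.5 server; it showed the same symptoms;
  3. We repeatedly asked the storage engineer to confirm whether the storage was faulty, and the storage vendor kept reporting that it was healthy. As a last resort we restarted the storage, and the fault cleared immediately (Inspur storage).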

References:
https://kb.vmware.com/s/article/1029039?lang=zh_CN
https://kb.vmware.com/s/article/2004086
https://virtuallyhyper.com/2012/11/host-with-emulex-nc553i-cna-disconnects-from-strorage/
https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/?host=2&device=0&plugin=0&sensekey=0&asc=0&ascq=0&opcode=28
https://kb.vmware.com/s/article/2014155?lang=zh_CN



Last modified: January 17, 2023