The original post: /r/homelab by /u/Same-Cardiologist-58 on 2025-01-18 07:39:36.
I have a Dell PowerEdge XC730xd-12.
It has a valiant PCIe x4 adaptor with a 4TB Crucial NVMe installed in slot 3.
The NVMe drive has Proxmox installed on it and has been running without issue for the last 6 months.
The RAID controller is in HBA mode, and all of the drives are in non-RAID mode and passed through to a TrueNAS VM. (This was rather fun to set up.)
Today the Proxmox interface was no longer reachable at its IP address. However, all of the VMs running on the NVMe were still working: I could access every VM, just not Proxmox itself, so I restarted the server. It hung on initialising firmware for ages, and eventually I got an error saying there was an issue with a storage controller and to power down the machine and reboot. I turned off the machine, pulled the power for 15 seconds, and restarted it. The storage controller error went away, but the machine no longer sees the NVMe drive over PCIe. The lights on the PCIe card are working.
In the Lifecycle Controller logs, I can see that the machine detects the PCIe SSD card reader and gives a UEFI message that it's detected, but it still would not load Proxmox. I then moved the PCIe card to a different slot (slot 2). The machine detected that the card was moved, but I still could not see Proxmox in the boot menu. The NVMe appears to be fine but is a bit hot.
I have run the Lifecycle Controller hardware diagnostic check, and all of the checks passed except one unrelated one. The only issue was a cable error with the intrusion detection system, which has been a problem for ages and has not caused any issues with the system other than the constant warning that the chassis is open when it isn’t.
If anyone has any recommendations for how I can get the server to boot from the NVMe drive again, that would be amazing. The drive is clearly still working, as before the reboot all the VMs were still fine, just not Proxmox for some reason.