HPE Gen10 DL385 power issues

Hi,

I have a weird, possibly power-related issue with our HPE server. It has happened twice in the last 3-4 days: the server randomly turns off and the iLO becomes unreachable. When I check it, both PSU lights are on, but I can’t power the server on.
After unplugging both power cables (I also tried unplugging only one), the server starts again.

  • Proxmox has no log entries indicating an OS-triggered restart.
  • Today I updated everything with the latest SPP and brought the BIOS to v4.02.
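
For anyone who wants to double-check the first point on their own host: one rough approach is to look at the `last -x` output, where a `reboot` record with no `shutdown` record right before it (in time) suggests the power was cut rather than the OS shutting down. Below is a minimal Python sketch, assuming the usual newest-first `last -x` layout. The function name `find_unclean_boots` and the sample lines are made up for illustration and are not taken from the affected server.

```python
def find_unclean_boots(last_output: str) -> list[str]:
    """Return the 'reboot' lines that were not preceded by a clean shutdown.

    Assumes newest-first output in the style of `last -x`, where each boot
    is a line starting with 'reboot' and each clean shutdown is a line
    starting with 'shutdown'.
    """
    # Keep only boot/shutdown records; ignore logins, runlevel changes, etc.
    events = [
        line for line in last_output.splitlines()
        if line.startswith(("reboot", "shutdown"))
    ]
    unclean = []
    for i, line in enumerate(events):
        if not line.startswith("reboot"):
            continue
        # The next record in the list is the previous event in time.
        older = events[i + 1] if i + 1 < len(events) else None
        # Two boots in a row with no shutdown in between: unclean power-off.
        if older is not None and older.startswith("reboot"):
            unclean.append(line)
    return unclean


# Illustrative sample only (hypothetical dates and kernel version).
SAMPLE = """\
reboot   system boot  6.8.12-4-pve  Thu Jan  9 08:15   still running
reboot   system boot  6.8.12-4-pve  Mon Jan  6 07:40 - 08:15 (3+00:35)
shutdown system down  6.8.12-4-pve  Sun Jan  5 22:10 - 07:40  (09:30)
reboot   system boot  6.8.12-4-pve  Wed Jan  1 12:00 - 22:10 (4+10:10)
"""

for boot in find_unclean_boots(SAMPLE):
    print("no clean shutdown before:", boot)
```

The oldest boot in the log is deliberately not flagged, because there is no earlier record to compare it against.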

iLO has the following log, but it shows nothing useful. When I try to run the hardware diagnostics, I get this error, no matter whether I run them from the BIOS or from Intelligent Provisioning.

Do you have HP SmartMemory? Is the heat sink compound still good on the CPUs? Might be something there. Check iLO for the heat level, etc. Might be a dead end, though…

I have original, certified memory modules. They all report ok in iLO.

Regarding the temperature, I don’t see any unusually hot spots. I found a loose piece of paper (just the small serial number sticker) and I changed the RTC battery just in case. (The old one measured 3.11 V and the new one 3.2 V, so that should not be the problem.)

PCI Zone1 seems rather high. You have an add-in card in that slot? I wonder if you have an overheating problem for that card.

I was thinking about that. I ran some tests, pushing the CPU to 50-60% and running fio over the 10Gb card, but nothing happened.
I will make another attempt today, when I can observe it in person.

Pushing it for 20 minutes only raised the temperature from 67 C to 73 C.

Moving the network card down one PCI-E slot and slightly increasing the fan speed decreased the temp from 68 C to 54 C.

Previously the fans only occasionally ramped up above 17%, but the server is in a temperature-controlled room. Now the fans idle around 34%.
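
To keep an eye on zone temperatures like these over time, the sensor readings can be filtered with a small script. This is only a sketch, assuming pipe-separated lines in the style of `ipmitool sdr type Temperature` run against the iLO. The sensor names, the sample readings, and the 65 C limit are illustrative assumptions, not HPE thresholds or values from this server.

```python
import re

# Matches readings such as "68 degrees C" in the last column.
READING = re.compile(r"(\d+(?:\.\d+)?) degrees C")


def hot_sensors(sdr_output: str, limit_c: float) -> dict[str, float]:
    """Return {sensor name: temperature} for readings above limit_c."""
    hot = {}
    for line in sdr_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2:
            continue  # not a sensor line
        match = READING.search(fields[-1])
        if match is None:
            continue  # sensor present but reporting no temperature
        temp = float(match.group(1))
        if temp > limit_c:
            hot[fields[0]] = temp
    return hot


# Illustrative sample only (hypothetical sensor names and values).
SAMPLE = """\
01-Inlet Ambient | 01h | ok  | 55.1 | 21 degrees C
02-CPU 1         | 02h | ok  |  3.1 | 40 degrees C
20-PCI 1 Zone    | 14h | ok  | 11.1 | 68 degrees C
21-PCI 2 Zone    | 15h | ns  | 11.2 | No Reading
"""

print(hot_sensors(SAMPLE, limit_c=65.0))
```

Running something like this before and after moving a card makes it easier to compare zones than reading the iLO page by eye.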

Only one thing left…convince HPE Support to talk with us on an hourly rate without purchasing a support contract.

Not sure if this is the fix? Worth a look..

Support said I should start with a minimal config (1 RAM stick per CPU), remove everything else, and gradually add devices back. (But I haven’t seen the problem since Monday.)

I removed all drives and was then able to start the built-in diagnostic tool from the BIOS.

Which indicates that something is not OK with the Proxmox install.

Have you tried failing back to the previous firmware/BIOS to see if the issue persists on the previous version?