I’m looking for some basic troubleshooting procedures to diagnose a failing server.
One of my Xen Server nodes frequently becomes unresponsive overnight. When it happens, the server has dropped out of the HA cluster and the system fans are running full blast.
What tools would you use to figure out what’s up?
What model of server?
Does it have any diagnostic tools?
It’s one of these:
The only things I have installed are XCP-ng and netdata.
I believe you have IPMI on that server. If you connect to the IPMI interface, you can view the server logs even if the OS has crashed or the server is off.
The logs on that version are not great, but they are a starting point. Supermicro's IPMI viewer is a useful tool to have, but you can also get in from the web client.
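If you would rather pull the event log from a script than click through the web client, here is a minimal Python sketch. It assumes ipmitool is installed on the machine you run it from, that IPMI over LAN is enabled on the BMC, and that `sel elist` prints one pipe-delimited entry per line. The function names and the "ecc"/"memory" keyword filter are my own choices for illustration, not anything specific to this board.

```python
import subprocess


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into a list of stripped fields."""
    return [field.strip() for field in line.split("|")]


def memory_events(lines):
    """Return the parsed entries whose text mentions memory or ECC.

    The keyword match is a rough filter (an assumption); adjust it to
    whatever wording your BMC actually uses.
    """
    hits = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = parse_sel_line(line)
        text = " ".join(fields).lower()
        if "ecc" in text or "memory" in text:
            hits.append(fields)
    return hits


def read_sel(host, user, password):
    """Fetch the System Event Log from the BMC over the network.

    Runs `ipmitool -I lanplus ... sel elist`; raises CalledProcessError
    if ipmitool exits with a non-zero status.
    """
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-P", password,
        "sel", "elist",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Calling memory_events(read_sel(host, user, password)) with your own BMC address and credentials should give you just the memory-related entries. Because this talks to the BMC rather than the OS, it should still work after the hypervisor has locked up.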
Thanks for the suggestions! I was able to view the server logs and see that a stick of RAM had errors. Swapped it out and I’m back in action.
Well… out of action again. I’m getting a “Single bit ECC memory error” in the same slot, on a module that I’ve already replaced. Is it possible that the replacement stick also went bad?
I could just get another used server for $300, but I’d like to see this through to a valid conclusion.
What would you do next?
Can you move the memory to another slot to see if you get the same issue?
Pull the CPU and look for bent pins. If any are present, straighten them and try again. I had some Intel-branded boards that did this all the time: heat cycles or fan vibrations would cause the pins to move, and the boards would report phantom missing-RAM issues.
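To check whether errors keep piling up after a swap, one option is to compare the kernel's corrected-error counters before and after. This is only a sketch, and it makes some assumptions: the Linux EDAC driver has to be loaded so that /sys/devices/system/edac/mc/mc*/ce_count exists, which may not be the case in a Xen dom0 (a live Linux USB stick may be the easier place to try it), and the counters here are per memory controller, so they narrow things down less than the slot named in the IPMI log. The helper names are made up for the example.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: corrected_error_count} from EDAC sysfs.

    Returns an empty dict if the EDAC directory is missing, for example
    when no EDAC driver is loaded.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for ce_file in sorted(root.glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric counter, skip it
    return counts


def diff_counts(before, after):
    """Return only the controllers whose count changed, with the delta."""
    changed = {}
    for name, value in after.items():
        delta = value - before.get(name, 0)
        if delta != 0:
            changed[name] = delta
    return changed
```

Take one reading with read_ce_counts() right after the swap, another after the box has run overnight, and pass both to diff_counts(). A counter that keeps climbing on the same controller no matter which stick is installed points away from the DIMM itself.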
I tried all of your suggestions. I even replaced all the RAM and swapped processors. Same error at the same memory address… so I can only assume it’s the RAM slot on the motherboard. I’m going to swap the board out and see what happens.
Can I just swap out the motherboard and boot the existing XCP-ng installation, or is this a reinstall situation?
Typically yes, you can just swap the main board. I had to edit this because I realized this was a Xen thing, not a TrueNAS thing.