I recently finished building my NAS. Not knowing any better, I’ve been using the stock heatsink that came with the CPU. Every few weeks, I come home and find the server is no longer running. At first I didn’t know what was happening, but then it started beeping while I was home. I looked up the error code and found that the CPU was overheating. When I opened the case, I found the heatsink was a little loose. I haven’t assembled a computer in about two decades, so I assumed I hadn’t mounted it correctly. I made sure that all four pins were clicked all the way in, but yesterday the server was powered down again when I got home from work. I assumed the heatsink had come loose again, but when I checked, it was tight and secure.
I’m running the latest stable version of FreeNAS with only two jails (Plex and Nextcloud). I am happy to provide other details if needed. I’m not overclocking the server, and I barely put any load on it at all. I’d really like to get to the bottom of this shutdown issue so I can have a reliable server. Does anybody have a suggestion?
I have the same CPU and motherboard in a system and have not had overheating issues.
Is the server in an enclosed space that is heating up?
What chassis are you using?
What other fans do you have connected?
Is the CPU Fan plugged into the “A” fan port? (FANA is the CPU fan, FAN1 through FAN4 are for anything else)
In IPMI, what does it report for fan speed and CPU temperature? (Note that the temperature sensor reported in IPMI isn’t inside the CPU; it is on the motherboard in the middle of the socket.)
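If you would rather grab those readings from a shell than from the web interface, here is a rough sketch that parses the pipe-separated output of `ipmitool sensor` and keeps only the temperature and fan rows. It assumes `ipmitool` is installed and can reach the BMC; the function names are made up for illustration, and sensor names vary by board.

```python
import subprocess


def parse_ipmitool_sensor(text):
    """Parse `ipmitool sensor` output into {name: (value, unit, status)}.

    Rows without a numeric reading (for example "na") are skipped.
    """
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        name, value, unit, status = fields[:4]
        try:
            readings[name] = (float(value), unit, status)
        except ValueError:
            # No numeric reading for this sensor (e.g. an unplugged fan header).
            continue
    return readings


def temps_and_fans(readings):
    """Keep only temperature (degrees C) and fan speed (RPM) sensors."""
    return {n: r for n, r in readings.items() if r[1] in ("degrees C", "RPM")}


def read_sensors():
    """Run ipmitool locally and return the parsed temperature and fan readings."""
    out = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    ).stdout
    return temps_and_fans(parse_ipmitool_sensor(out))
```

Calling `read_sensors()` from a cron job and appending the result to a file would give a temperature history leading up to the next shutdown, which a spot check every day or two can miss.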
The heatsink came with thermal paste pre-applied, so I didn’t think I needed any extra. I’ve checked the CPU usage and other stats. Everything is low as expected.
After making sure the pins were fully clicked in (around Thanksgiving), I checked the temperatures every day or two for a couple of weeks. The CPU was always around 30C.
What IPMI and BIOS versions are you running? My stable system has IPMI 01.45 and BIOS 2.1a. I see the latest versions are 01.58 and 2.2a.
In the IPMI, under Server Health > Event Log, you should be able to see exactly why it shut down, as well as a history of any CPU health or other alerts.
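The same event log can usually be pulled from the command line with `ipmitool sel list`. The sketch below filters that output down to thermal, fan, and processor entries. It assumes the usual pipe-separated layout (record ID, date, time, sensor, event, ...); the keyword list and function names are my own guesses and may need adjusting for your board.

```python
import subprocess

# Assumed keywords for the sensor column; adjust to match what your BMC reports.
KEYWORDS = ("temperature", "fan", "processor")


def filter_sel(text, keywords=KEYWORDS):
    """Return SEL rows (as lists of fields) whose sensor column matches a keyword."""
    hits = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        sensor = fields[3].lower()
        if any(k in sensor for k in keywords):
            hits.append(fields)
    return hits


def read_sel():
    """Run ipmitool locally and return the raw System Event Log text."""
    return subprocess.run(
        ["ipmitool", "sel", "list"], capture_output=True, text=True, check=True
    ).stdout
```

Something like `for row in filter_sel(read_sel()): print(" | ".join(row))` would print just the entries worth reading.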
Server Health -> Event Log doesn’t show anything after 2019-08-04. I don’t know why; there appears to be room for 512 entries, and only 15 are listed. Maintenance -> System Event Log appears to have current logs, but none seem very interesting.
Is the overheating message gone?
If yes…
Other things to check.
Loose power cable. (to the wall, and internally)
Are you using a UPS?
Check for leaking or bulging capacitors on the motherboard or power supply. (Be careful: there are dangerous voltages inside power supplies, and capacitors can hold a charge for a long time, even when unplugged.)
Reseat the memory modules.
Try running with only half the memory modules, then with the other half.