I have a Supermicro X10SDV-TLN4F server on which I installed TrueNAS last week. However, since then it has had unscheduled reboots at random moments.
TL;DR: I think it has something to do with the HDDs and possibly the PSU. How should I proceed to find and fix the problem?
To find the reason I:
- Checked the logs in /var/log/* but found nothing of interest
- Checked the BMC logs and saw Watchdog Timer Interruption - Assertion, followed by PowerCycle - Assertion
- Checked system sensor values but all were good
- Updated the BMC firmware and the BIOS to the latest versions
- Performed memtest (passed 100%)
- Searched a lot of forums and found many similar issues with Supermicro and FreeBSD (TrueNAS), but since the server also rebooted while in the BIOS settings, that points to a cause other than the combination of Supermicro and OS.
- Running ipmitool mc watchdog off did not prevent the reboots, and neither did disabling the watchdog via the jumper.
- Thought it could be the SSD and disconnected it for one boot, but that made no difference.
- Tried to live-boot Ubuntu, but it rebooted every time during the boot process.
- Installed XCP-NG and, except for 1 reboot during the installation (don’t know if that was intended), it has been running stably since.
- Once it was stable in XCP-NG, I realized it was not using the two HDDs intended for TrueNAS pool storage. So I installed TrueNAS in XCP-NG with HDD passthrough, and the moment I started importing the existing pool, the system immediately shut down and then rebooted.
So it seems to be (one of) the HDDs (which are brand new), or maybe the PSU when the HDDs are under load? How should I proceed to find and fix the problem?
2x Seagate IronWolf Pro 20TB
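To sanity-check the PSU theory, a rough 12 V power budget can be worked out. This is only a sketch: the function name and every number in the example call are placeholders I made up, not values from the drive datasheet or the PSU label, so substitute the real figures before drawing any conclusion.

```python
def headroom_12v_watts(n_drives, peak_amps_12v, base_load_w, psu_12v_watts):
    """Return the 12 V headroom in watts left when all drives draw peak current at once."""
    # Peak power of all drives together on the 12 V rail (P = I * V).
    drive_peak_w = n_drives * peak_amps_12v * 12.0
    return psu_12v_watts - (base_load_w + drive_peak_w)


# Placeholder inputs only: 2 drives, an assumed 2.0 A peak per drive,
# an assumed 60 W for board/CPU/fans, and an assumed 200 W available on 12 V.
headroom = headroom_12v_watts(
    n_drives=2, peak_amps_12v=2.0, base_load_w=60.0, psu_12v_watts=200.0
)
print(f"12 V headroom at drive peak: {headroom:.1f} W")
```

A small or negative result with the real numbers would make the PSU a more plausible suspect; a large positive one would point elsewhere, although an aging or faulty supply can still sag below its rated output.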
If you had 2 other drives, you could swap them in and see if it keeps happening, but it would be odd for a data drive to cause a reboot.
True. I have just disconnected the signal cables from both data drives, live-booted Ubuntu successfully, and then reconnected the two drives. I formatted them and ran ‘SMART Data & Self-Tests’, which all passed. So now I’m confused as to what it could be.
The power error in the BMC (IPMI) can result from any power off/on situation; every time you shut the server down, it should log something similar in the “error” log.
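Since a power event shows up after any shutdown, it can help to separate power cycles that directly follow a watchdog event from ordinary ones. A minimal Python sketch, assuming the event log has been exported as pipe-separated text lines; the sample entries are hypothetical and only reuse the event wording quoted earlier in the thread, and `parse_entry` and `watchdog_then_powercycle` are names made up for illustration.

```python
def parse_entry(line):
    """Split one pipe-separated log line into a list of stripped fields."""
    return [field.strip() for field in line.split("|")]


def watchdog_then_powercycle(entries):
    """Return indices of watchdog entries that are directly followed by a power cycle entry."""
    hits = []
    for i in range(len(entries) - 1):
        current = " ".join(entries[i]).lower()
        # Drop spaces so both "PowerCycle" and "Power Cycle" match.
        following = " ".join(entries[i + 1]).lower().replace(" ", "")
        if "watchdog" in current and "powercycle" in following:
            hits.append(i)
    return hits


# Hypothetical sample log (not real output from this server).
sample = [
    "1 | Watchdog Timer Interruption | Assertion",
    "2 | PowerCycle | Assertion",
    "3 | PowerCycle | Assertion",
]
entries = [parse_entry(line) for line in sample]
print(watchdog_then_powercycle(entries))
```

Entries flagged this way are the unscheduled reboots; power cycle entries with no watchdog event right before them are more likely normal shutdowns.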
The power supplies are not cheap (if new) so just swapping a new one in is not an option. Is it a redundant PS chassis? If so you could pull one of them and see what happens, then switch after a while and again see what happens.
I have run TrueNAS on 10 different X10 board models and never really had a problem; pretty much every Supermicro system I’ve had has been solid.
The only thing that once bothered me was RAM. It seemed like a bad socket, so I moved the modules to the other channel and it ran for many years. It’s unsupported this way, but it did work.
Also check and make sure you have RDIMM or UDIMM, not a mix of them and not regular DIMM modules.
And finally, if it is under warranty, I’d contact Supermicro. Even if it is not under warranty, they should still offer email support. It is still a current product, and maybe they have some utilities to help figure out where the problem might be.
I happen to have an identical server at home with TrueNAS and no problems.
It has no redundant PS chassis.
I did do the memtest, wouldn’t any socket problems have appeared in that test?
I’m back at the office tomorrow and can check RDIMM and UDIMM.
There’s no warranty anymore, but I’m more interested in finding the root of the problem, which I still don’t completely understand. So far it does seem to correlate with the hard drives. It was very interesting to see the system shut down the moment I tried to import the pool that was on the hard drives. I wonder if it could have something to do with encryption, because later, after live-booting Ubuntu and formatting the drives, nothing happened. Are there any sensible options I can still research to find the problem? Or should I start thinking about alternative hardware? It’s for a client, and at some point it’s just cheaper/better to buy different hardware.
The RAM is four modules of ‘32gb 2Rx4 PC4-2400T-RB2-11’, so all the same.
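For anyone checking RDIMM versus UDIMM from the sticker, the module type can be read off the label. A small Python sketch, assuming the usual DDR4 label layout where the first letter after the speed grade encodes the module type; the letter table below is my understanding of that convention and is not exhaustive, and `dimm_type` is just an illustrative name.

```python
import re

# Assumed mapping of module-type letters on DDR4 labels (not a complete list).
MODULE_TYPES = {
    "R": "RDIMM",
    "L": "LRDIMM",
    "U": "UDIMM",
    "E": "ECC UDIMM",
    "S": "SO-DIMM",
}


def dimm_type(label):
    """Extract the module type from a label such as 'PC4-2400T-RB2-11'."""
    match = re.search(r"PC4-\d{4}[A-Z]-([A-Z])", label)
    if not match:
        return "unknown"
    return MODULE_TYPES.get(match.group(1), "unknown")


# The four modules from this server, as printed on the label.
labels = ["32gb 2Rx4 PC4-2400T-RB2-11"] * 4
types = {dimm_type(label) for label in labels}
print(types)  # a single entry means no mix of module types
```

With the label above this reports RDIMM for all four modules, so there is no RDIMM/UDIMM mix in this system.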
I came across the following: FAQ Entry | Online Support | Support - Super Micro Computer, Inc.
However, the Seagate ST20000NE000 does not seem to be native 4k, but I will open a support query with Supermicro and as you said, perhaps they have utilities to help figure out the problem.
Thank you for your help! Always good to be able to exchange thoughts on these kinds of problems.