I’m looking at building a new Homelab XCP-ng server that has low power usage but is still reliable.
My current thoughts on the Homelab hardware are:
Ryzen 9 7900
Motherboard with a B650M chipset and 1 PCIe slot (for running an LLM in the future), with a 2.5 Gb network card and the option to expand to a 10 Gb card later.
64 GB ECC RAM
1 TB Gen4 M.2 SSD
Case and 500 W power supply
I want to run 8 VMs max; I will be migrating 2 VMs using 2 vCPUs and 2 GB of RAM each.
Is ECC RAM worth the price?
Is it worth going to a Gen5 SSD?
Is the power supply powerful enough for a mid-level graphics card, given that the processor is only 65 W TDP?
Thanks Tom, I want to save my pennies. I have heard that ECC would be more of a priority for ZFS, so when I get around to getting TrueNAS up and running, that might be where I spend my pennies.
If you know that you will need ECC eventually, not buying it now will cost you more money in the long run. I would personally look at your workloads and see if you really need an AM5 platform; ECC memory is much cheaper on AM4. Also, regardless of which platform you choose, in my experience a 12-core/24-thread CPU is really overkill for 64 GB of total memory. You will run out of memory long before you run out of CPU cores. My server has a 6-core/12-thread Ryzen 5 PRO 5650GE and 64 GB of memory, and my CPU barely goes above 25% utilization. I am not sure about XCP-ng, but I assume it is similar to Proxmox: in Proxmox I can oversubscribe CPU cores without any real negative effects, but I cannot oversubscribe memory.
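As a rough back-of-the-envelope using the numbers already in this thread (purely for illustration: the OP's 8 VMs at 2 vCPUs and 2 GB each, on a 12-core/24-thread host with 64 GB), a toy check like this shows the asymmetry between the two limits:

```python
# Toy capacity check: vCPUs can be oversubscribed (total vCPUs may exceed
# physical threads), but memory cannot (assigned RAM must fit in physical RAM).
# The figures are just the ones mentioned in this thread, not a recommendation.

def fits(vms, host_threads, host_ram_gb, cpu_oversub=3.0):
    """vms is a list of (vcpus, ram_gb) tuples; 3:1 CPU oversubscription
    is only a common rule of thumb, pick your own."""
    total_vcpus = sum(v for v, _ in vms)
    total_ram = sum(r for _, r in vms)
    cpu_ok = total_vcpus <= host_threads * cpu_oversub   # soft limit
    ram_ok = total_ram <= host_ram_gb                     # hard limit
    return cpu_ok, ram_ok, total_vcpus, total_ram

vms = [(2, 2)] * 8   # 8 VMs, 2 vCPUs / 2 GB each
cpu_ok, ram_ok, vcpus, ram = fits(vms, host_threads=24, host_ram_gb=64)
print(f"vCPUs assigned: {vcpus} of 24 threads (over-assigning is fine: {cpu_ok})")
print(f"RAM assigned:   {ram} GB of 64 GB (must not exceed it: {ram_ok})")
```

Memory is the hard ceiling; the CPU limit is soft, which is why the core count matters a lot less than the amount of RAM.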
TL;DR version of the scenario: ZFS is on a system with non-ECC RAM that has a stuck bit, its user initiates a scrub, and as a result of in-memory corruption good blocks fail checksum tests and are overwritten with corrupt data, thus instantly murdering an entire pool. As far as I can tell, this idea originates with a very prolific user on the FreeNAS forums named Cyberjock, and he lays it out in this thread here. It’s a scary idea – what if the very thing that’s supposed to keep your system safe kills it? A scrub gone mad! Nooooooo!
The problem is, the scenario as written doesn’t actually make sense. For one thing, even if you have a particular address in RAM with a stuck bit, you aren’t going to have your entire filesystem run through that address. That’s not how memory management works, and if it were how memory management works, you wouldn’t even have managed to boot the system: it would have crashed and burned horribly when it failed to load the operating system in the first place. So no, you might corrupt a block here and there, but you’re not going to wring the entire filesystem through a shredder block by precious block.
However, one could then argue that if ECC can prevent this, that alone would be reason enough to use it in every NAS system, regardless of the file system.
Read the entire post from the link, or at least this part:
let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub:
First, you read a block. This block is good. It is perfectly good data written to a perfectly good disk with a perfectly matching checksum. But that block is read into evil RAM, and the evil RAM flips some bits. Perhaps those bits are in the data itself, or perhaps those bits are in the checksum. Either way, your perfectly good block now does not appear to match its checksum, and since we’re scrubbing, ZFS will attempt to actually repair the “bad” block on disk. Uh-oh! What now?
Next, you read a copy of the same block – this copy might be a redundant copy, or it might be reconstructed from parity, depending on your topology. The redundant copy is easy to visualize – you literally stored another copy of the block on another disk. Now, if your evil RAM leaves this block alone, ZFS will see that the second copy matches its checksum, and so it will overwrite the first block with the same data it had originally – no data was lost here, just a few wasted disk cycles. OK. But what if your evil RAM flips a bit in the second copy? Since it doesn’t match the checksum either, ZFS doesn’t overwrite anything. It logs an unrecoverable data error for that block, and leaves both copies untouched on disk. No data has been corrupted. A later scrub will attempt to read all copies of that block and validate them just as though the error had never happened, and if this time either copy passes, the error will be cleared and the block will be marked valid again (with any copies that don’t pass validation being overwritten from the one that did).
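Here is a toy sketch of that scrub decision logic as I read it (my own simplification in Python, not ZFS's actual code; the list of copies and the `rewrite` callback are hypothetical stand-ins):

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # Stand-in for ZFS's block checksum (fletcher4/sha256 in reality).
    return hashlib.sha256(data).digest()

def scrub_block(copies: list[bytes], expected: bytes, rewrite):
    """Simplified version of the behaviour described in the quote above.

    copies   -- every stored (or parity-reconstructed) copy of the block,
                as they arrive in RAM (possibly bit-flipped there)
    expected -- the checksum recorded for the block
    rewrite  -- callback that overwrites a bad on-disk copy with good data
    """
    good = [c for c in copies if checksum(c) == expected]

    if len(good) == len(copies):
        return "ok"                          # everything validates, nothing to do
    if good:
        for i, c in enumerate(copies):       # repair only from a copy that
            if checksum(c) != expected:      # actually matches the checksum
                rewrite(i, good[0])
        return "repaired"
    # No copy validates: log an unrecoverable error and touch nothing on disk.
    # A later scrub re-reads the copies and can still clear the error.
    return "unrecoverable error logged, disk left untouched"
```

The key point is that nothing gets written back unless it passes the checksum first, so a bit flip during a scrub can waste a rewrite or log a spurious error, but it cannot propagate garbage across the pool.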
So if I understand this correctly, one could actually argue that ECC memory is less critical with ZFS than with other file systems that lack such advanced data integrity features.
That’s pretty much what I’ve always believed. But of course, having ECC doesn’t hurt. And when you consider that a NAS in a homelab might run for 10 years or more, especially if it’s just used as a data archive, then ECC is probably a worthwhile investment.
The same goes for using a reliable motherboard (e.g., Supermicro) and a high-quality power supply (e.g., Seasonic). In total, such a setup might cost maybe 200–300 bucks more, but spread over 10 years (120 months) that’s only around $2 to $2.50 per month, which I think is well worth it for peace of mind.
Btw. While ZFS can detect and handle many types of data corruption, it still depends on RAM being trustworthy. If the RAM silently flips bits before the data is written to disk, ZFS might not always be able to detect or recover from it. Or that’s at least how I understand it.
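To illustrate what I mean with a little Python toy (purely hypothetical, just showing the write path):

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

original = b"my precious data"

# A bit flips in RAM *before* the block is checksummed and written out:
corrupted_in_ram = bytearray(original)
corrupted_in_ram[0] ^= 0x01              # silent single-bit flip
corrupted_in_ram = bytes(corrupted_in_ram)

# The filesystem checksums whatever buffer it is handed, so the corrupt
# data and its checksum end up self-consistent on disk:
stored_block, stored_sum = corrupted_in_ram, checksum(corrupted_in_ram)

print(checksum(stored_block) == stored_sum)   # True  -- every scrub will pass
print(stored_block == original)               # False -- the data is silently wrong
```

Because the checksum is computed from the buffer after the flip, the corrupt block looks perfectly valid on disk, and no later scrub can tell the difference.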