Proxmox Cluster Storage: ZFS vs Ceph

To help you choose which setup works for you, we first need to talk about what the requirements are for each.

What Ceph actually requires

Ceph is a distributed storage system, which means it is designed from the ground up to spread data across multiple nodes simultaneously. That is a genuine strength at scale, but it comes with real infrastructure requirements that are easy to underestimate.

For a Ceph cluster to run correctly you need a minimum of three nodes for proper quorum. With the default replication factor of three, every write has to be confirmed across all three nodes before it is acknowledged. That means your storage performance is directly tied to the latency between nodes. A dedicated low-latency storage network is not optional; it is a requirement. That means dedicated NICs on every node and a separate switch for Ceph traffic. If you are running Ceph on the same network as your VM traffic, you are going to have a bad time.

Each node also needs to run multiple daemons. At minimum you are looking at a MON (monitor) daemon for cluster quorum, a MGR (manager) daemon for cluster state, and one OSD (object storage daemon) process per drive. All of these need to be healthy for the cluster to function normally. When a node goes down, the cluster runs degraded, and the recovery and rebalancing that follow put significant load on your network and your remaining drives. If a second node goes down during that recovery window on a three-node cluster, you lose quorum and your storage goes offline entirely.

That is not a hypothetical edge case. That is the expected failure mode of a three node Ceph cluster under real conditions.
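To put numbers on the quorum point, here is a minimal sketch. It assumes one monitor per node and that quorum is a strict majority of monitors; the function names are mine, purely for illustration.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of the given number of monitors."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority still remains."""
    return monitors - quorum_size(monitors)


# Assumption: one monitor per node, so node count equals monitor count.
for nodes in (3, 5, 7):
    print(
        f"{nodes} nodes: quorum needs {quorum_size(nodes)}, "
        f"can lose {tolerated_failures(nodes)}"
    )
```

With three nodes the cluster tolerates exactly one failure, so a second failure during recovery leaves no majority. Five nodes is the first size that survives two.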

For further reading, check out the Proxmox Ceph docs.

What ZFS gives you instead

ZFS operates at the individual node level. Each node manages its own pool independently. You get checksumming and automatic data integrity verification, compression, snapshots, and the ability to send incremental snapshots to another node with zfs send. That last part is how you get off-node copies without shared infrastructure: Proxmox has native ZFS replication built into the UI that handles this simply and reliably.
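As a rough illustration of what incremental replication boils down to, here is a minimal sketch that only builds the command line and does not run anything. The dataset, snapshot, and host names are hypothetical, and Proxmox's built-in replication manages its own snapshots and options, so treat this as the underlying idea rather than what the UI literally executes.

```python
import shlex


def incremental_send_command(
    dataset: str,
    prev_snap: str,
    new_snap: str,
    target_host: str,
    target_dataset: str,
) -> str:
    """Build a 'zfs send -i ... | ssh ... zfs receive ...' pipeline as a string."""
    # Send only the blocks that changed between the two snapshots.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # Receive the stream into a dataset on the other node over SSH.
    recv = ["ssh", target_host, "zfs", "receive", target_dataset]
    return f"{shlex.join(send)} | {shlex.join(recv)}"


# Hypothetical names, for illustration only.
print(
    incremental_send_command(
        dataset="rpool/data/vm-100-disk-0",
        prev_snap="rep1",
        new_snap="rep2",
        target_host="pve2",
        target_dataset="rpool/data/vm-100-disk-0",
    )
)
```

The target must already hold the earlier snapshot for the incremental stream to apply, which is why the first replication run is always a full send.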

When a ZFS node has a problem, that problem stays on that node. Your other nodes keep running. Recovery is well-documented, the tooling is straightforward, and you do not need deep expertise in a complex distributed system to get yourself back to a healthy state.

The tradeoff is that ZFS is not shared storage. VMs are tied to the node they live on and live migration requires you to move the disk as well, which takes time. For most small cluster setups that is an acceptable tradeoff given what you get in return.

Shared storage as an alternative HA approach

A dedicated storage server gives you a middle ground that neither ZFS local storage nor Ceph offers at small scale. All three nodes mount the same storage, so a VM is not tied to the node it started on. Live migration is nearly instant because the disk does not move, only the running state transfers between nodes. If a compute node goes down, any VM on it can be restarted on another node in seconds.
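For a feel of how much the disk copy matters, here is a back-of-the-envelope estimate. The VM size, link speed, and the 80% link efficiency are assumptions I picked for illustration, and the model ignores memory pages that get dirtied and re-sent during a live migration.

```python
def transfer_seconds(gigabytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Rough time to push `gigabytes` over a link of `link_gbps` gigabits per second."""
    bits = gigabytes * 8e9
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second


# Hypothetical VM: 200 GB disk, 8 GB RAM, 10 Gbps migration network.
disk_gb, ram_gb, link_gbps = 200, 8, 10

local_storage = transfer_seconds(disk_gb + ram_gb, link_gbps)  # disk and RAM both move
shared_storage = transfer_seconds(ram_gb, link_gbps)           # only running state moves

print(f"local ZFS storage: about {local_storage:.0f} s")
print(f"shared storage:    about {shared_storage:.0f} s")
```

The exact figures will vary, but the gap scales with disk size, while the shared storage case only ever scales with RAM.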

You are writing to one server over a dedicated storage network rather than coordinating writes across a cluster, so there is no cross-node acknowledgment in the write path. This can give much better write performance.

The big tradeoff is that the storage server is a single point of failure. If it goes down, all three compute nodes lose access to their VMs simultaneously. You can reduce the risk with good hardware and redundancy on the storage server itself, but you cannot eliminate the dependency. For most small deployments that risk is acceptable and the operational simplicity makes it a strong option, but it needs to be part of your planning going in.

Where Ceph actually makes sense

Ceph becomes the right answer when you have the infrastructure to support it properly. That means enough nodes that you can lose one during a recovery event without risking quorum, a dedicated storage network with proper switching, and someone on your team who understands Ceph well enough to debug a degraded cluster under pressure. As a rough rule I would not consider Ceph until you are at five nodes or more, and even then only if those other conditions are met.

Summary

The decision between ZFS and Ceph for a small Proxmox cluster really comes down to one question: do you need the complexity that Ceph requires, or does it just sound like the right answer because that is what the large deployments use?

For most small deployments the answer is no. Ceph’s write path is genuinely more expensive than people realize. Every write has to be confirmed across all three nodes before it is acknowledged and the network round-trip is in the critical path of every single write operation. ZFS writes locally and it is done. No network round-trip, no cross-cluster acknowledgment, no tuning a distributed journal per OSD. That difference shows up in real workloads.
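To make the write path difference concrete, here is a toy latency model, not a benchmark. It assumes a replicated write costs two sequential network round trips (client to primary OSD, then primary to the replicas in parallel) plus one disk write, and the millisecond figures are made up for illustration.

```python
def local_write_ms(disk_ms: float) -> float:
    """Local ZFS-style write: no network in the path."""
    return disk_ms


def replicated_write_ms(disk_ms: float, rtt_ms: float, sequential_round_trips: int = 2) -> float:
    """Toy model of a replicated write.

    Assumption: replicas write in parallel, so only one disk write is on the
    critical path, but each sequential network round trip adds its full RTT.
    """
    return disk_ms + sequential_round_trips * rtt_ms


# Hypothetical numbers: fast NVMe write and a 0.2 ms round trip on the storage network.
disk_ms, rtt_ms = 0.05, 0.2
print(f"local write:      {local_write_ms(disk_ms):.2f} ms")
print(f"replicated write: {replicated_write_ms(disk_ms, rtt_ms):.2f} ms")
```

The faster your drives are, the more the network round trips dominate, which is why the storage network matters so much for Ceph.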

Ceph does have a legitimate strength on reads. A well-tuned cluster can stripe reads across multiple OSDs and deliver impressive sequential throughput. But “well-tuned” requires the right hardware, a proper dedicated storage network, and the operational knowledge to actually get there. Most small clusters never do.

ZFS gives you predictable, consistent performance without needing to tune a distributed system. The ARC cache means frequently accessed data is fast, checksumming means your data is what you think it is, and zfs send means you have off-node copies without shared infrastructure. When something goes wrong – and eventually something always goes wrong – the failure domain stays on the node where the problem is, and the tools to recover are well-documented and straightforward.

Ceph is excellent software. It is the right answer at scale with the infrastructure to support it. For two or three nodes in a homelab or on-prem for a small client, ZFS is the better choice and I would take it every time.

I use ceph for Windows AD DC, DNS, and AdGuard Home so that I can learn ceph.

With applications that are not I/O intensive like this, it is fine to run ceph with three OASLOA mini PCs with the Intel N95 processor, 16 GB of RAM per node, and a 2242 M.2 512 GB NVMe SSD over dual GbE NICs.

It’s fine.

I also use erasure coding rather than the default replicate rule that the Proxmox GUI provides to you.

As such, it’s a little bit better.
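On the capacity side, the difference between replication and erasure coding can be sketched like this. The k=2, m=1 profile is an assumption (it is what fits on three nodes); the post does not say which profile is in use.

```python
def usable_gb_replicated(raw_gb: float, size: int = 3) -> float:
    """Usable capacity when every object is stored `size` times."""
    return raw_gb / size


def usable_gb_erasure(raw_gb: float, k: int, m: int) -> float:
    """Usable capacity with k data chunks and m coding chunks per object."""
    return raw_gb * k / (k + m)


# Three nodes with one 512 GB SSD each, as described above.
raw_gb = 3 * 512

print(f"replicated (size 3): {usable_gb_replicated(raw_gb):.0f} GB usable")
# Assumed profile: k=2, m=1.
print(f"erasure coded (2+1): {usable_gb_erasure(raw_gb, k=2, m=1):.0f} GB usable")
```

That extra usable space is paid for with more work per read and write, since chunks have to be gathered from several OSDs.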

My biggest problem with ceph is that it has a significant performance problem that is too hard to ignore, especially if you’re using an erasure coding CRUSH rule, and it is inherent to ceph.

(Hint: think/ask yourself “what does ceph do for read caching?” Then ask yourself, “what can I do to claw some of that (performance degradation) back?”)

I agree. ZFS is the ideal choice for most small-scale cluster configurations. That said, I know people who set up shared storage using InfiniBand.

Tom, when you mentioned a dedicated storage server, what exactly are you thinking of? Is it something based on NFS, where all the nodes can read the same disk, or is it block storage somehow? Or are you thinking of something else? That part of the narrative wasn’t clear to me, and I have been considering a Proxmox cluster.

A shared storage server would be something like TrueNAS acting as an NFS storage target, as it’s a simple option.

What you said above also applies to Harvester HCI, except for the fact that shared storage (NFS) does not offer live migration. Not sure why they don’t have that part working, and maybe in version 1.8.0 they have more of that figured out, but the latest version only dropped yesterday and I haven’t had time to read up on everything yet.

One thing not mentioned above is that if using an HCI configuration, you may need more CPU to handle all the data back and forth. In testing with nvme disks and a 10g network, I get a certain performance out of Longhorn, and it is nearly maximum on the management network. When I moved that to 25g, there really wasn’t a change in speeds, even though I know the nvme drives are capable of MUCH more. Working theory is that the Longhorn workers are out of CPU because I have an underpowered system in my lab. I would expect similar with CEPH since each has a worker process (and maybe I’ll try it some day).

I’ve set up shared storage over 100 Gbps IB. Other than the fact that the “inbox” nfs-kernel-server package on Debian based distros doesn’t natively support NFSoRDMA, whereas the nfs server package for RHEL-derived distros natively supports NFSoRDMA, it works.

uh…..again, it depends on your workload.

If you have a fair bit of data traversing over the network, then maybe.

But one of my mini PCs in my HA Proxmox/ceph cluster runs Plex, so when it is scanning my “do-it-all” Proxmox server where the media is hosted, there’s a fair bit of CPU/disk/network I/O, but most of that is actually just the Plex scanning process itself.

KISS philosophy….

if u can afford to be down for an hour or so ….. stick to just two unlinked pmox servers, both linked to a separate pbs server with the datastores on a mirrored zfs volume, even better a pbs vm running on truenas (snapshotting remotely)

making sure you religiously check up on your verify and pruning jobs :)

Thank you for posting this. I have been using CEPH for high availability. It works great, but I have no clue what to do if something breaks. This is an excellent starting point to try other modes of HA.
