RaidZ(X) throughput limit?

Hello All.

I was wondering if anybody knew of a throughput throttle/bottleneck in ZFS RaidZ(X). I have been experiencing the following:

System 1

  • Dell T620
    • Dual Xeon E5-2697 V2 procs
    • 768GB DDR3 LR-DIMMs, all slots populated with 32GB DIMMs
    • Fully up to date Proxmox VE
    • 8 × 3.84TB Samsung enterprise SSDs
    • RaidZ2
    • Connected via 2 HBAs each connected to half of the SSDs
    • Maximum throughput: 1.25GB/sec

System 2

  • Custom SuperMicro Chassis
    • Dual Xeon 8180M procs
    • 768GB DDR4 LR-DIMMs, all slots populated with 64GB DIMMs
    • Fully up to date Proxmox VE
    • 508 × 22TB Western Digital nearline SAS drives
    • RaidZ3
      • 3 pools
      • 1 vdev per pool
      • 170 drives in 2 pools and 168 in the third
    • Connected via LSI 9300-8e HBA
    • Maximum throughput: 1GB/sec

On the T620, I know from experience that the hardware can do WAY more than 1.25GB/sec: I once ran the same system with S2D under Windows, and it managed almost 3x the total throughput in a mirror configuration. I also know that performance of mirror != performance of parity, and I would never expect the two to be equal. Still, I find it strange that RaidZ2 on the older system is only marginally faster than RaidZ3 on the newer one.

On the SuperMicro, I know that one can only expect the IOPS of a single drive from a RaidZ(X) vdev; that isn't my problem. I also know that the more drives in a pool, the more likely that pool is to fail. The data in those pools would be annoying to regenerate, and doing so would cost some revenue, but it wouldn't be the end of the world. The point of the large vdevs is to reduce the likelihood of lost data while maximizing available space and reducing administrative overhead.

The thing I've been struggling with on this system is that total SYSTEM throughput into any RaidZ3 pool is 1GB/sec. I can attain this by reading/writing to any single pool, but if I read/write from/to more than one pool, that throughput is shared between the pools. Once the space is full, this use case won't have much of an issue with that kind of throughput under normal operating conditions, but it severely hampers my ability to fill the space in the first place, since I can only write at 1GB/sec. Assuming no resilver or scrub is running, that means almost 4 months to fill the space, when I have supporting infrastructure that could feed it at 5x that pace.

Also, if a resilver/scrub runs at the same time that I'm trying to write data, reading that data slows to the point where revenue generation basically grinds to a halt. This is true even when the write goes to one pool, the resilver/scrub runs on another, and I limit revenue generation to the third, which should not happen in my view.
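For what it's worth, the "almost 4 months" figure checks out against the layout described above. A quick sketch (decimal units, raw data capacity only, ignoring ZFS allocation overhead):

```python
# Back-of-the-envelope fill time for the three RAIDZ3 pools described above
# (170 + 170 + 168 drives, 22 TB each), assuming 1 TB = 1e12 bytes and the
# observed ~1 GB/sec system-wide ceiling.

DRIVE_TB = 22
PARITY = 3  # RAIDZ3: 3 parity drives per vdev

# Data drives per pool = drives in the vdev minus parity drives
data_drives = (170 - PARITY) + (170 - PARITY) + (168 - PARITY)  # 499

usable_tb = data_drives * DRIVE_TB      # raw data capacity in TB
usable_gb = usable_tb * 1000            # decimal GB

throughput_gb_s = 1.0                   # observed system-wide limit
fill_days = usable_gb / throughput_gb_s / 86_400

print(f"usable: {usable_tb} TB, fill time at 1 GB/sec: {fill_days:.0f} days")
# ~127 days of continuous writing, i.e. "almost 4 months"
```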

Also, does anybody know if there is anything I need to do to get the HBA to use active optical cables to the first JBOD, other than flashing the AOC firmware referenced in the documentation? Given the number of JBODs and drives, I've had some signal integrity issues using copper, and I'd like to use AOCs throughout the whole system rather than just between the JBODs.

Any assistance anyone can provide would be greatly appreciated.

You should not have that many drives per VDEV. They should be broken up into 9-12 drives per VDEV, which will perform much better. Leaving them in one large VDEV will likely cause issues when a drive has to be replaced.
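To put rough numbers on the trade-off, here is a sketch comparing one 170-drive shelf laid out as a single wide RAIDZ3 vdev versus the suggested narrower vdevs, using 17 × 10-disk RAIDZ2 purely as an illustration (real ZFS padding/allocation overhead is ignored):

```python
# Illustrative space comparison for a 170-drive pool: one wide RAIDZ3 vdev
# vs. 17 narrow 10-disk RAIDZ2 vdevs. Raw data-drive counts only; actual
# usable space will be somewhat lower in both cases.

DRIVE_TB = 22
TOTAL_DRIVES = 170

# Single 170-wide RAIDZ3 vdev: 3 parity drives total
wide_data = TOTAL_DRIVES - 3            # 167 data drives

# 17 RAIDZ2 vdevs of 10 drives each: 2 parity drives per vdev
narrow_data = 17 * (10 - 2)             # 136 data drives

print(f"170-wide RAIDZ3: {wide_data * DRIVE_TB} TB")
print(f"17 x 10 RAIDZ2:  {narrow_data * DRIVE_TB} TB")
print(f"capacity given up: {(wide_data - narrow_data) * DRIVE_TB} TB")
```

The narrower layout gives up roughly 680 TB of capacity per 170-drive pool, but throughput and IOPS scale with vdev count, and a resilver only has to touch one 10-drive vdev instead of the whole pool.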

There is an in-depth forum post here about ZFS setup and performance:

Hey Tom. Thanks for the reply. Your input is much appreciated.

I read the forum post you refer to, as well as anything else I could find on the Internet on the subject, before posting. I much prefer to find solutions to problems on my own, but I'm out of my depth on this one and I'm not ashamed to admit it.

I would tend to agree with your assessment that the vdevs shouldn't be so large under normal circumstances. In this case, though, the norm is to just use bare drives. As you can imagine, 508 drives, directories, mounts… per system… pretty insane in terms of administrative overhead. (FYI, it should actually be 510, but 2 drives aren't showing up for some reason and I haven't had the time to track them down.) Also, since the files won't perfectly fill the drives, the small amounts of wasted space spread over that many drives mean most of the space lost to redundancy is recovered through increased efficiency.

My goal with this particular system was to maximize available space while minimizing administrative overhead, with at least some redundancy so as to minimize having to regenerate data from bit flips and the like. It's a "fill the space once and read forever" kind of workload, so ZFS's resistance to bit rot made it a no-brainer choice. Once the space is full, the steady state involves a handful of reads every 10 seconds with no writes; it's a very light load from an IO performance perspective.

In terms of performance from a single vdev/pool in my situation, I'm actually happy enough as it is. The issue comes in when I start using more than one pool: as I said in my original post, throughput on a system level seems to be limited to about 1GB/sec, which can be attained from any single vdev/pool. As soon as I start interacting with more than one pool, that throughput is split among them.
For this system, the issue is more of an inconvenience/nuisance, though I am looking forward to when I get more capital to build another system with the same hardware (only perhaps with the 30+TB Seagate drives coming to market soon™). There I plan to create 51 RaidZ2 pools, each consisting of a single 10-disk vdev, for a different workload where performance very much matters. I'd need at least around 40Gb/sec aggregate throughput from that configuration, so if the whole system will be limited to about 10Gb/sec, that will be a major problem.

If I can get the issue figured out on the current system, it would affect not only this system but also the T620 I spoke of in my original post, as well as a T610 with an 8-drive Z2 which is experiencing the same bottleneck (I forgot to mention it in the original post), and it would let me hit the ground running with the next system.

I'm wondering if this may be a Debian/Proxmox issue, or if I have found a limitation in ZoL. Since I can attain the maximum the system will do with a single Z3 vdev on the SuperMicro chassis (and none of the systems in question are being stressed in the slightest), and I see similar throughput on older systems each with only 8 drives in their vdevs under Z2, I would tend to think the cause is somewhere other than the number of devices in the vdev, but you're much more of an expert on both Linux and ZoL than I am. I used to love the Microsoft (and to a lesser extent, Cisco) Kool-Aid, but I've been working for the past 18 months or so to ditch them everywhere possible for open-source alternatives, which almost across the board work better.

Any ideas as to why I have this system-level bottleneck, with a similar limit across systems ranging from X5600 to Gen 2 Xeon Scalable procs? It almost feels like there is an arbitrary throughput limit coded somewhere in ZoL, Debian, or Proxmox.
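Just to spell out why the current ceiling would sink the planned build (assuming the 40gb/sec and 10gb/sec figures are gigabits, as the context suggests):

```python
# Sanity check on the planned build: 51 RAIDZ2 pools of 10 disks each,
# against a ~40 Gb/sec aggregate target, if the observed system-wide
# ceiling (~10 Gb/sec, i.e. ~1.25 GB/sec) carries over. Decimal units.

pools = 51
disks_per_pool = 10
total_disks = pools * disks_per_pool    # 510 drives

target_gbyte = 40 / 8                   # 40 Gb/sec target = 5 GB/sec
ceiling_gbyte = 10 / 8                  # observed ceiling = 1.25 GB/sec

shortfall = target_gbyte / ceiling_gbyte

print(f"{total_disks} drives, need {target_gbyte} GB/s, "
      f"ceiling {ceiling_gbyte} GB/s -> {shortfall:.0f}x short")
```

In other words, if the bottleneck is real and system-wide rather than per-vdev, the next build would come up 4x short of its target no matter how the 510 drives are laid out.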
In my mind, if it were a physical system-level limitation, it would vary far more across such a huge difference in system age, particularly since the T610 is PCIe Gen 1…