Live migrations in Proxmox 8.2.4 over 100 Gbps Infiniband

Not really asking for help – this is more of a project/build log (if you can call it that).

Set up the 3-node Proxmox cluster about 4 hours ago. That took a little while because the Proxmox 8.2 installer wouldn’t boot with my 3090 installed in my 5950X system, so I had to swap it out for a GTX 660 to get the install going. (The 5950X doesn’t have an iGPU, hence the need for a dGPU.)

The 7950X is in a Supermicro AS-3015I-A “server”, which uses the Supermicro H13SAE-MF motherboard with built-in IPMI, so no dGPU was needed for that one.

Two nodes are 5950Xes, one 7950X.

They all have 128 GB of RAM (DDR4, DDR4, and DDR5 respectively), with a Silicon Power US70 1 TB PCIe 4.0 x4 NVMe SSD in each 5950X system and an Intel 670p 2 TB PCIe 3.0 x4 NVMe SSD in the 7950X.

All of them have a Mellanox ConnectX-4 dual port VPI 100 Gbps Infiniband NIC (MCX456A-ECAT) in them, connected to a Mellanox 36-port externally managed Infiniband switch (MSB-7890). My main Proxmox server is running the OpenSM subnet manager that ships with Debian.

Got that clustered up over IB.

And then installed Ceph Quincy 17.2 and got the erasure-coded pools up and running.

I can live-migrate a Win11 VM (with 16 GB of RAM provisioned to it) in 21 seconds after learning how to make it use a specific network for the live migration.
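For anyone curious, the setting that controls which network migrations use lives in the cluster-wide datacenter config. A minimal sketch, assuming your IPoIB interfaces sit on a dedicated subnet (the 10.10.10.0/24 CIDR below is just a placeholder):

```bash
# /etc/pve/datacenter.cfg (cluster-wide, shared to all nodes via pmxcfs)
#
#   migration: secure,network=10.10.10.0/24
#
# "network" pins migration traffic to that subnet; the type can be
# "secure" (SSH-tunnelled, the default) or "insecure" (plain TCP, for
# trusted links). Quick check that the setting is in place:
grep migration /etc/pve/datacenter.cfg
```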

edit
I can live-migrate an Ubuntu 24.04 VM (which also has 64 GB of RAM provisioned to it) in 10 seconds at an average migration speed of 12.8 GiB/s (102.4 Gbps).

edit #2
Here is a live migration of a Win11 VM with 64 GiB of RAM provisioned to it. (16 GiB/s = 128 Gbps)

edit #3
Three things:

  1. In late testing last night/early this morning, I did manage to hit a peak of 16.7 GiB/s (i.e. 133.6 Gbps) during a live migration of the Win11 VM. (I also set up an Ubuntu 24.04 VM to play with live migrations, and got basically the same speeds.)

  2. Live migrating from the 7950X to a 5950X is where I can hit these speeds. Live migrations where the source is a 5950X node tended to peak at around 11.1 GiB/s (88.8 Gbps).

So it’s kind of crazy (and something NO ONE seems to have tested or talked about) that the generational jump from the 5950X to the 7950X can result in up to a 33.5% bandwidth improvement for high-speed system interconnect traffic.

The Mellanox ConnectX-4 cards are only PCIe 3.0 x16 cards, so the fact that the 5950X platform has PCIe 4.0 and the 7950X has PCIe 5.0 is irrelevant here – the NIC can’t use the newer PCIe generations anyway.

Which means that even when limited by PCIe 3.0, the 7950X is STILL faster than the 5950X.

  3. The amount of RAM that’s provisioned to the VM makes a difference.

Originally, the VMs only had 16 GB, and the migration cachesize appears to be set to 1/8th of that (so 2 GB), which meant that at best I was only getting maybe 5-6 GB/s during live migrations. It was difficult to push much beyond that.

But when I bumped the RAM up to 64 GB, the migration cachesize window (still 1/8th of the RAM) opened up to 8 GB, and this is where I started seeing the significantly faster live migration speeds. (I didn’t spend a lot of time studying the effect of RAM size on migration speed.)
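If you want to see what a given VM actually ended up with, one way (a rough sketch, assuming VMID 100 and the standard QEMU monitor commands exposed through qm monitor) is:

```bash
# Open the QEMU human monitor for the VM; 100 is a placeholder VMID.
qm monitor 100

# Inside the monitor prompt, these show what the migration code is using;
# xbzrle-cache-size in the parameters output is the "migration cachesize"
# that Proxmox derives from the VM's configured RAM.
#   (qemu) info migrate_capabilities
#   (qemu) info migrate_parameters
```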

This morning I tried bumping that up to 100 GB of RAM, and that actually ended up being slightly slower, at 14-ish GiB/s. Still decent and respectable – that’s still ~112 Gbps – but it didn’t push things any further.


Other than having the parts, what made you decide on InfiniBand at 100 Gbps?

The short story as to why I have 100 Gbps IB running in the basement of my home is that Linus Tech Tips had a video showing that you could set up a 100 Gbps point-to-point connection for a few hundred bucks at the time (which, compared to retail pricing for 100 Gbps IB gear, is like half price!).

That was actually the true origin story as to why I have it running in the basement of my home.

(There is, of course, a longer version to this story.)

I guess that beats my few hundred bucks for 10 Gbps by a factor of ten. It would be nice to have more speed, but with most of my lab operating at slower than 10 Gbps, it’s probably not worth doing.

The spinning drives in the NAS are the real choke point right now; I just can’t justify the odd few hundred dollars to upgrade them to SATA SSDs in the hope that things will be faster.

So yeah… in terms of absolute cost, the slower speeds are still cheaper overall.

On a $/Gbps basis, the 100 Gbps IB ended up being the cheaper and thus more cost-efficient option.

And I chose IB because, even though my Mellanox ConnectX-4 cards can do either ETH or IB, the 100 GbE switches back then (and even now) are still more expensive than their IB counterparts, especially once you factor in total switching capacity.

(My 36-port IB switch has 7.2 Tbps of total switching capacity – 3.6 Tbps in each direction, full duplex.)

You can get cheaper 100 GbE – e.g. Mikrotik has a 4-port 100 GbE switch for around $700, I think – but that’s only 400 Gbps for $700 = $1.75/Gbps, whereas my 36-port Mellanox 100 Gbps IB switch was $2950 CAD at the time (call it around $2269 USD at 1.3 CAD per USD) for 3.6 Tbps (single direction), which works out to roughly $0.63/Gbps.

So that’s why I went the IB route rather than ethernet.

But yes, a lot of the time people look to these really fast system interconnect technologies for storage, which probably isn’t a great use case unless the other bottlenecks have been alleviated to the point where the interconnect is still the limiting factor.

For most people, that’s not the case.

Even my main Proxmox server is running 36 HDDs/spinning rust drives.

But HPC/CFD/FEA/CAE applications can really take advantage of it, because you’re often running one problem/case/solution across MULTIPLE nodes in a cluster. I had my own micro HPC cluster at home, and the longer version of my story is that I bought the 100 Gbps InfiniBand gear mainly for that. The LTT video showed me that prices had dropped SIGNIFICANTLY since the hardware launched (I used to browse Colfax Direct regularly to look at how much I couldn’t afford to buy), and that’s what kicked off my research into what it would cost to deploy 100 Gbps IB in the basement of my home – which I was able to do relatively cost-efficiently.

Thus for me, being able to also use it for storage is just a fringe benefit/perk. That wasn’t the original intent of the acquisition though.

At the cost of that switch, I don’t feel bad now. My switch was $200 used, with ten 1/10 Gbps SFP+ ports and an extra 10/100/1000 port off the CPU – it’s a Mikrotik CRS309-xx-xx. I also have some used Cisco 2960s with SFP+ that I can bring out for expansion and things that need PoE+.

I only have 8 spinning drives, and they are pretty slow when I test the “C” drive on a VM. My production system is only slightly faster, and still with 8 drives (newer, faster drives on a newer, faster interface). My whole lab is what someone working in an IT space would consider old junk, but for a home lab or even a work lab, it seems able to handle a lot, just a little slowly. Some of the slowness might be from running RouterOS on the CRS309; I need to change it over to SwOS and see how the speed changes. Eventually I’ll get to that – other things are on fire right now that I need to address.


Yeah – I skipped all of the other speeds and jumped straight from GbE to 100 Gbps.

Over time, my lab grew and grew. But I mean, I’ve been doing this since at least like 2007, so I’ve been “building” my homelab for like 17 years.

Hello,

We are running a Proxmox 8.4 cluster with Ceph, using Mellanox ConnectX-4 VPI MCX456A-ECAT adapters and a Mellanox MSB7800-ES2F 36-port 100 Gb QSFP28 EDR switch. The interfaces are negotiating at 100 Gbps; however, in testing we are only able to achieve a maximum throughput of ~54 Gbps.

Could you please advise if there are any recommended tuning parameters, firmware updates, or configuration changes we should apply in order to reach closer to line-rate performance? Any guidance or best practices would be greatly appreciated.

Three things:

  1. It depends on what you are using for testing. I have found that iperf3 performs very poorly when trying to confirm that you can attain 100 Gbps (or close to it). In my testing, iperf tended to work better, but I did need to push around 8 parallel streams to get near that number (example invocations below).

  2. If you’re worried that you’re not getting anywhere close to 100 Gbps over InfiniBand at all, then ib_send_bw might be a better test tool, as it uses the actual InfiniBand verbs for data transfer rather than conventional Ethernet/TCP/UDP.
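For reference, these are the kinds of invocations I mean (the IP addresses are placeholders for your IPoIB addresses; ib_send_bw comes from the perftest package):

```bash
# TCP-over-IPoIB test with iperf (not iperf3); ~8 parallel streams was what
# it took for me to get near line rate.
iperf -s                    # on the receiving node
iperf -c 10.10.10.2 -P 8    # on the sending node

# Raw InfiniBand verbs bandwidth test (perftest package); this bypasses the
# TCP/IP stack entirely, so it's the better "is the fabric healthy?" check.
ib_send_bw -a               # on the receiving node
ib_send_bw -a 10.10.10.2    # on the sending node, pointed at the receiver
```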

If you run ib_send_bw and you’re not getting anywhere close to 100 Gbps (you should be able to hit at least 94-98 Gbps regularly), then I would check to make sure that your NICs are seated in their slots properly, as well as your cables.

If those check out, the next thing I would check is the BIOS of each system, to make sure that the PCIe slot is running at at least Gen3 speed (or is set to Auto, in which case it will usually negotiate Gen3 on its own).

If that all checks out, then I would use lspci -vv (or something similar) to check that the card actually established a PCIe 3.0 x16 link. If you physically plugged the NIC into a x16 slot, but the slot is only wired x8 electrically, that could help explain why you aren’t getting the full bandwidth.

(On systems that are older, have a limited number of slots or lanes, and/or have a lot of PCIe add-in cards, you can run out of lanes; and on some consumer-grade motherboards, the PCIe slot furthest from the CPU socket may physically be a x16 slot but is electrically only x4.)

So you can check that.
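Something along these lines works for that check (the 81:00.0 address is a placeholder; use whatever lspci reports for your card):

```bash
# Find the ConnectX-4 and note its PCI address.
lspci | grep -i mellanox

# LnkCap is what the card supports, LnkSta is what was actually negotiated.
# For full bandwidth you want "Speed 8GT/s" (Gen3) and "Width x16" on LnkSta.
sudo lspci -vv -s 81:00.0 | grep -E 'LnkCap:|LnkSta:'
```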

Lastly, since you have the MSB7800, you can also SSH into the switch itself (or use the MLNX-OS web UI) to confirm that it negotiated a 100 Gbps link.

You can find instructions for that in the MLNX-OS user manual on Nvidia’s website.

For the Ceph cluster, it will depend on whether you’re using HDDs (SAS/SATA) or SSDs (M.2 NVMe? U.2 NVMe? E1SFF? SAS? SATA?).

It will also depend on how many nodes you have in your cluster, and how many OSDs.

More OSDs tend to perform better. (Sometimes.)

And of course, faster storage (i.e. not HDDs) also tends to help the whole thing run faster. (You can run a Ceph cluster with HDDs, but you will need a whole army of them to push anywhere close to 100 Gbps over IB.)
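If you want to take the VMs out of the equation and see what the Ceph layer itself can do, a simple starting point is rados bench run from one of the nodes (the pool name here is a placeholder):

```bash
# 30 seconds of 4 MiB object writes to the pool, keeping the objects around
# so the read pass has something to work with, then a sequential-read pass.
rados bench -p testpool 30 write --no-cleanup
rados bench -p testpool 30 seq

# Clean up the benchmark objects afterwards.
rados -p testpool cleanup
```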

I don’t know whether Ceph has an RDMA option or not (I never spent much time testing it, since NVMe SSDs were fast enough and I was working off a shared pool, so LXC/VM migrations were already super quick – only the live memory state has to move – and thanks to compression I was able to achieve 133 Gbps out of a 100 Gbps link).

But if RDMA is an option, you’ll want to enable that.

Do note that, for example, NFSoRDMA is predominantly supported on RHEL-based Linux distros. Debian-based distros don’t really support NFSoRDMA that well out of the box.

You can also always install the Mellanox version of the driver (yes, I still call them Mellanox), as that might give you NFSoRDMA capabilities where some “inbox” drivers don’t – but you also run the risk of Nvidia/Mellanox removing that feature in the future, without notice.

Quote: “Our own driver, Mellanox OFED, does not support NFSoRDMA.” (Source: here)

Since then, I’ve tried to stick with the “inbox” driver, as much as possible.

But hopefully, this might give you at least some suggestions for things to test/try/look out for.

On some systems, if you set LINK_TYPE_P2=2, the driver might not understand that the link itself is actually comprised of four 25 Gbps lanes (EDR). As a result, when I set one of the VPI ports to ETH and then tried to use it in a Linux network bridge in Proxmox, the best I was able to get out of it was 23 Gbps (out of 25 Gbps) because of this.
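For context, LINK_TYPE_P2 is the per-port VPI personality on these cards, set with the Mellanox/Nvidia mlxconfig tool (the PCI address is a placeholder, and the change only takes effect after a reboot or firmware reset):

```bash
# Show the current port personalities (1 = InfiniBand, 2 = Ethernet).
mlxconfig -d 81:00.0 query | grep LINK_TYPE

# Set port 2 to Ethernet while leaving port 1 on InfiniBand.
mlxconfig -d 81:00.0 set LINK_TYPE_P2=2
```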

So yeah, sometimes, using IB results in some funkiness, especially with Debian-based distros.

No, I haven’t, because most consumer-grade platforms only support 128 GB of RAM max. It’s only very recently that newer consumer-grade platforms may support more than 128 GB, but based on the testing Wendell was doing with 256 GB DDR5 kits, support for that isn’t always guaranteed, especially if you want to hit the rated speeds printed on the box of said kit (e.g. DDR5-6000).

I only have one system (total) that has more than 128 GB of RAM (my main “do-it-all” Proxmox server, which has 768 GB of RAM installed), but running that test there wouldn’t really be the same as running it between nodes (vs. within the same “pizza box”).

I will say/add this though:

Per Wendell, in his recent video testing 256 GB kits, he did say that you can get them to run at the JEDEC speed of DDR5-3600. Of course, that’s super slow (vs. what’s printed on the box), so most of his testing centered around actually getting the EXPO profile to work properly.

So, if the difference in RAM speed is largely immaterial to you, you can in theory just run it at DDR5-3600 and call it a day.

(I run my 7950X system at, I think, DDR5-4800 with 128 GB of RAM. The kit itself may actually be qualified to run faster, but I’m after capacity and stability rather than sheer speed, which can come at the cost of stability.)

I was originally looking at potentially getting a dual AMD EPYC system with 1 TB of RAM, but our house ended up not selling for as much as I was hoping, so that plan is getting shelved again.

C’est la vie.

For now, I have two 5950X nodes, a 7950X node, four dual Xeon E5-2690 (v1) nodes, and two HP Z420 tower workstations (also with E5-2690 (v1)), all with 128 GB of RAM each. So in theory I can just use what I have for the time being, which across the 9 systems works out to 1,152 GB of RAM.

Of course the older systems aren’t as efficient/performant, but if I am running something that has a high-ish RAM requirement, and if the task can be distributed between multiple nodes, then my cluster is technically an option.

Hi, we’ve achieved stable performance of 90 Gbps using the InfiniBand protocol. I believe this limit can’t be exceeded due to the maximum MTU of 4096 supported by this switch. However, we’re facing another issue: on the Ceph cluster, we’re unable to reach the expected performance levels when running tests inside virtual machines. 4x nodes, each with 2x Xeon, 1 TB RAM, 6x U.2 NVMe + 100 Gbps NICs.

The usable IB MTU also depends on whether you’re using datagram mode or connected mode (for IPoIB).
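On the Linux side, a quick way to check and change that (ib0 is a placeholder for your IPoIB interface name):

```bash
# IPoIB mode limits the interface MTU: datagram mode tops out around 4092
# bytes (tied to the fabric's 4K IB MTU), while connected mode allows 65520.
cat /sys/class/net/ib0/mode

# Switch to connected mode and raise the MTU accordingly.
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520
```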

So…it can depend a little bit on what you mean by this:

  1. Are the VMs using the 100 Gbps IB NIC directly (i.e. you passed the NIC through to your VMs and/or you’re using SR-IOV VFs over IB), or does the (presumably Proxmox) host have control of the IB NIC?

  2. Are you using the Debian “inbox” drivers for your IB NICs or are you using the MLNX_OFED drivers?

  3. Do the disks for your VMs reside on the nodes under local storage or do the VM disks reside on the Ceph cluster?

I ask this because if the VM disks live on the nodes as local storage and have to be moved during a migration, that can be quite slow by comparison. (In practice, with M.2 PCIe 4.0 x4 NVMe SSDs, I can only move my VM disks at around 1-2 GB/s (< 20 Gbps) nominally.)

But once the VM disk lives on the Ceph cluster, migration is only about moving the memory state of the VM, and with compression you can see > 100 Gbps transfer rates, as shown above (since it doesn’t have to move the VM disk around at all – it already lives on shared network storage).
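In Proxmox terms, the difference roughly looks like this (the VMID and node name are placeholders):

```bash
# VM disk already on Ceph (shared storage): only the memory state moves.
qm migrate 100 node2 --online

# VM disk on local storage: the disk has to be copied too, which is the
# slow path described above.
qm migrate 100 node2 --online --with-local-disks
```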

To the best of my knowledge, and from what I’ve been able to find, Ceph still doesn’t support IB verbs for data transfer (i.e. no Ceph-over-RDMA). That may contribute to why you aren’t seeing the expected IB performance out of the system versus, say, just running ib_send_bw.