Live migrations in Proxmox 8.2.4 over 100 Gbps Infiniband

Not really asking for help – this is more of a project/build log (if you can call it that).

Set up the 3-node Proxmox cluster about 4 hours ago. That took a little while because the Proxmox 8.2 installer wouldn’t boot with my 3090 installed in my 5950X system, so I had to swap it out for a GTX 660 to get the install going. (The 5950X doesn’t have an iGPU, hence the need for a dGPU.)

The 7950X is in a Supermicro AS-3015I-A “server”, which uses the Supermicro H13SAE-MF motherboard with built-in IPMI, so no dGPU was needed for that one.

Two nodes are 5950Xes, one 7950X.

They all have 128 GB of RAM (DDR4, DDR4, and DDR5 respectively), with a Silicon Power US70 1 TB PCIe 4.0 x4 NVMe SSD in each 5950X system and an Intel 670p 2 TB PCIe 3.0 x4 NVMe SSD in the 7950X.

All of them have a Mellanox ConnectX-4 dual port VPI 100 Gbps Infiniband NIC (MCX456A-ECAT) in them, connected to a Mellanox 36-port externally managed Infiniband switch (MSB-7890). My main Proxmox server is running the OpenSM subnet manager that ships with Debian.

Got that clustered up over IB.

And then installed Ceph Quincy 17.2 and got the erasure-coded pools up and running.

I can live-migrate a Win11 VM (with 16 GB of RAM provisioned to it) in 21 seconds after learning how to make it use a specific network for the live migration.
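For anyone curious, the setting that controls which network migrations use lives in the cluster-wide datacenter config. A minimal sketch, assuming your IPoIB interfaces sit on a dedicated subnet (the 10.10.10.0/24 CIDR below is just a placeholder):

```bash
# /etc/pve/datacenter.cfg (cluster-wide, shared to all nodes via pmxcfs)
#
#   migration: secure,network=10.10.10.0/24
#
# "network" pins migration traffic to that subnet; the type can be
# "secure" (SSH-tunnelled, the default) or "insecure" (plain TCP, for
# trusted links). Quick check that the setting is in place:
grep migration /etc/pve/datacenter.cfg
```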

edit
I can live-migrate an Ubuntu 24.04 VM (which also has 64 GB of RAM provisioned to it) in 10 seconds at an average migration speed of 12.8 GiB/s (102.4 Gbps).

edit #2
Here is a live migration of a Win11 VM with 64 GiB of RAM provisioned to it. (16 GiB/s = 128 Gbps)

edit #3
Three things:

  1. In late testing last night/early this morning, I did manage to hit a peak of 16.7 GiB/s (i.e. 133.6 Gbps) during a live migration of the Win11 VM. (I also set up an Ubuntu 24.04 VM to play with live migrations, and got basically the same speeds.)

  2. Live migrating from the 7950X to a 5950X is where I can hit these speeds. Live migrations where the source is a 5950X node tended to peak at around 11.1 GiB/s (88.8 Gbps).

So it’s kind of crazy (and something NO ONE seems to have tested or talked about) that the generational jump from the 5950X to the 7950X can result in up to a 33.5% bandwidth improvement for high-speed system interconnect traffic.

The Mellanox ConnectX-4 cards are only PCIe 3.0 x16 cards, so the fact that the 5950X platform has PCIe 4.0 and the 7950X has PCIe 5.0 is irrelevant here – the NIC can’t use the newer PCIe generations anyway.

Which means that even when limited by PCIe 3.0, the 7950X is STILL faster than the 5950X.

  3. The amount of RAM that’s provisioned to the VM makes a difference.

Originally, the VMs only had 16 GB, and the migration cachesize appears to be set to 1/8th of that (so 2 GB), which meant that at best I was only getting maybe 5-6 GB/s during live migrations. It was difficult to push much beyond that.

But when I bumped the RAM up to 64 GB, the migration cachesize window (still 1/8th of the RAM) opened up to 8 GB, and this is where I started seeing the significantly faster live migration speeds. (I didn’t spend a lot of time studying the effect of RAM size on migration speed.)
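If you want to see what a given VM actually ended up with, one way (a rough sketch, assuming VMID 100 and the standard QEMU monitor commands exposed through qm monitor) is:

```bash
# Open the QEMU human monitor for the VM; 100 is a placeholder VMID.
qm monitor 100

# Inside the monitor prompt, these show what the migration code is using;
# xbzrle-cache-size in the parameters output is the "migration cachesize"
# that Proxmox derives from the VM's configured RAM.
#   (qemu) info migrate_capabilities
#   (qemu) info migrate_parameters
```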

This morning I tried bumping that up to 100 GB of RAM, and that actually ended up being slightly slower, at 14-ish GiB/s. Still decent and respectable – that’s still ~112 Gbps – but it didn’t push things any further.


Other than having the parts, what made you decide on InfiniBand at 100 Gbps?

The short story as to why I have 100 Gbps IB running in the basement of my home is that Linus Tech Tips had a video showing that you could set up a 100 Gbps point-to-point connection for a few hundred bucks at the time (which, compared to retail pricing for 100 Gbps IB gear, is like half price!).

That was actually the true origin story as to why I have it running in the basement of my home.

(There is, of course, a longer version to this story.)

I guess that beats my few hundred bucks for 10 Gbps by a factor of ten. It would be nice to have more speed, but with most of my lab operating at slower than 10 Gbps, it’s probably not worth doing.

The spinning drives in the NAS are the real choke point right now; I just can’t justify the odd few hundred dollars to upgrade them to SATA SSDs in the hope that things will be faster.

So yeah… in terms of absolute cost, the slower speeds are still cheaper overall.

On a $/Gbps basis, the 100 Gbps IB ended up being the cheaper and thus more cost-efficient option.

And I chose IB because, even though my Mellanox ConnectX-4 cards can do either ETH or IB, the 100 GbE switches back then (and even now) are still more expensive than their IB counterparts, especially once you factor in total switching capacity.

(My 36-port IB switch has 7.2 Tbps of total switching capacity – 3.6 Tbps in each direction, full duplex.)

You can get cheaper 100 GbE – e.g. Mikrotik has a 4-port 100 GbE switch for around $700, I think – but that’s only 400 Gbps for $700 = $1.75/Gbps, whereas my 36-port Mellanox 100 Gbps IB switch was $2950 CAD at the time (call it around $2269 USD at 1.3 CAD per USD) for 3.6 Tbps (single direction), which works out to roughly $0.63/Gbps.

So that’s why I went the IB route rather than ethernet.

But yes, a lot of the time people look to these really fast system interconnect technologies for storage, which probably isn’t a great use case unless the other bottlenecks have been alleviated to the point where the interconnect is still the limiting factor.

For most people, that’s not the case.

Even my main Proxmox server is running 36 HDDs/spinning rust drives.

But HPC/CFD/FEA/CAE applications can really take advantage of it, because you’re often running one problem/case/solution across MULTIPLE nodes in a cluster. I had my own micro HPC cluster at home, and the longer version of my story is that I bought the 100 Gbps InfiniBand gear mainly for that. The LTT video showed me that prices had dropped SIGNIFICANTLY since the hardware launched (I used to browse Colfax Direct regularly to look at how much I couldn’t afford to buy), and that’s what kicked off my research into what it would cost to deploy 100 Gbps IB in the basement of my home – which I was able to do relatively cost-efficiently.

Thus for me, being able to also use it for storage is just a fringe benefit/perk. That wasn’t the original intent of the acquisition though.

At the cost of that switch, I don’t feel bad now. My switch was $200 used, with ten 1/10 Gbps SFP+ ports and an extra 10/100/1000 port off the CPU – it’s a Mikrotik CRS309-xx-xx. I also have some used Cisco 2960s with SFP+ that I can bring out for expansion and things that need PoE+.

I only have 8 spinning drives, and they are pretty slow when I test the “C” drive on a VM. My production system is only slightly faster, and still with 8 drives (newer, faster drives on a newer, faster interface). My whole lab is what someone working in an IT space would consider old junk, but for a home lab or even a work lab, it seems able to handle a lot, just a little slowly. Some of the slowness might be from running RouterOS on the CRS309; I need to change it over to SwOS and see how the speed changes. Eventually I’ll get to that – other things are on fire right now that I need to address.


Yeah – I skipped all of the other speeds and jumped straight from GbE to 100 Gbps.

Over time, my lab grew and grew. But I mean, I’ve been doing this since at least like 2007, so I’ve been “building” my homelab for like 17 years.

Hello,

We are running a Proxmox 8.4 cluster with Ceph, using Mellanox ConnectX-4 VPI MCX456A-ECAT adapters and a Mellanox MSB7800-ES2F 36-port 100 Gb QSFP28 EDR switch. The interfaces are negotiating at 100 Gbps; however, in testing we are only able to achieve a maximum throughput of ~54 Gbps.

Could you please advise if there are any recommended tuning parameters, firmware updates, or configuration changes we should apply in order to reach closer to line-rate performance? Any guidance or best practices would be greatly appreciated.

Three things:

  1. It depends on what you are using for testing. I have found that iperf3 performs very poorly when trying to confirm that you can attain 100 Gbps (or close to it). In my testing, iperf tended to work better, but I did need to push around 8 parallel streams to get near that number (example invocations below).

  2. If you’re worried that you’re not getting anywhere close to 100 Gbps over InfiniBand at all, then ib_send_bw might be a better test tool, as it uses the actual InfiniBand verbs for data transfer rather than conventional Ethernet/TCP/UDP.
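For reference, these are the kinds of invocations I mean (the IP addresses are placeholders for your IPoIB addresses; ib_send_bw comes from the perftest package):

```bash
# TCP-over-IPoIB test with iperf (not iperf3); ~8 parallel streams was what
# it took for me to get near line rate.
iperf -s                    # on the receiving node
iperf -c 10.10.10.2 -P 8    # on the sending node

# Raw InfiniBand verbs bandwidth test (perftest package); this bypasses the
# TCP/IP stack entirely, so it's the better "is the fabric healthy?" check.
ib_send_bw -a               # on the receiving node
ib_send_bw -a 10.10.10.2    # on the sending node, pointed at the receiver
```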

If you run ib_send_bw and you’re not getting anywhere close to 100 Gbps (you should be able to hit at least 94-98 Gbps regularly), then I would check to make sure that your NICs are seated in their slots properly, as well as your cables.

If those check out, the next thing I would check is the BIOS of each system, to make sure that the PCIe slot is running at at least Gen3 speed (or is set to Auto, in which case it will usually negotiate Gen3 on its own).

If that all checks out, then I would use lspci -vv (or something similar) to check that the card actually established a PCIe 3.0 x16 link. If you physically plugged the NIC into a x16 slot, but the slot is only wired x8 electrically, that could help explain why you aren’t getting the full bandwidth.

(On systems that are older, have a limited number of slots or lanes, and/or have a lot of PCIe add-in cards, you can run out of lanes; and on some consumer-grade motherboards, the PCIe slot furthest from the CPU socket may physically be a x16 slot but is electrically only x4.)

So you can check that.
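Something along these lines works for that check (the 81:00.0 address is a placeholder; use whatever lspci reports for your card):

```bash
# Find the ConnectX-4 and note its PCI address.
lspci | grep -i mellanox

# LnkCap is what the card supports, LnkSta is what was actually negotiated.
# For full bandwidth you want "Speed 8GT/s" (Gen3) and "Width x16" on LnkSta.
sudo lspci -vv -s 81:00.0 | grep -E 'LnkCap:|LnkSta:'
```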

Lastly, since you have the MSB7800, you can also SSH into the switch itself (or use the MLNX-OS web UI) to confirm that it negotiated a 100 Gbps link.

You can find instructions for that in the MLNX-OS user manual on Nvidia’s website.

For the Ceph cluster, it will depend on whether you’re using HDDs (SAS/SATA) or SSDs (M.2 NVMe? U.2 NVMe? E1SFF? SAS? SATA?).

It will also depend on how many nodes you have in your cluster, and how many OSDs.

More OSDs tend to perform better. (Sometimes.)

And of course, faster storage (i.e. not HDDs) also tends to help the whole thing run faster. (You can run a Ceph cluster with HDDs, but you will need a whole army of them to push anywhere close to 100 Gbps over IB.)
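If you want to take the VMs out of the equation and see what the Ceph layer itself can do, a simple starting point is rados bench run from one of the nodes (the pool name here is a placeholder):

```bash
# 30 seconds of 4 MiB object writes to the pool, keeping the objects around
# so the read pass has something to work with, then a sequential-read pass.
rados bench -p testpool 30 write --no-cleanup
rados bench -p testpool 30 seq

# Clean up the benchmark objects afterwards.
rados -p testpool cleanup
```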

I don’t know whether Ceph has an RDMA option or not (I never spent much time testing it, since NVMe SSDs were fast enough and I was working off a shared pool, so LXC/VM migrations were already super quick – only the live memory state has to move – and thanks to compression I was able to achieve 133 Gbps out of a 100 Gbps link).

But if RDMA is an option, you’ll want to enable that.

Do note that, for example, NFSoRDMA is predominantly supported on RHEL-based Linux distros. Debian-based distros don’t really support NFSoRDMA that well out of the box.

You can also always install the Mellanox version of the driver (yes, I still call them Mellanox), as that might give you NFSoRDMA capabilities where some “inbox” drivers don’t – but you also run the risk of Nvidia/Mellanox removing that feature in the future, without notice.

Quote: “Our own driver, Mellanox OFED, does not support NFSoRDMA.” (Source: here)

Since then, I’ve tried to stick with the “inbox” driver, as much as possible.

But hopefully, this might give you at least some suggestions for things to test/try/look out for.

On some systems, if you set LINK_TYPE_P2=2, the driver might not understand that the link itself is actually comprised of four 25 Gbps lanes (EDR). As a result, when I set one of the VPI ports to ETH and then tried to use it in a Linux network bridge in Proxmox, the best I was able to get out of it was 23 Gbps (out of 25 Gbps) because of this.
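For context, LINK_TYPE_P2 is the per-port VPI personality on these cards, set with the Mellanox/Nvidia mlxconfig tool (the PCI address is a placeholder, and the change only takes effect after a reboot or firmware reset):

```bash
# Show the current port personalities (1 = InfiniBand, 2 = Ethernet).
mlxconfig -d 81:00.0 query | grep LINK_TYPE

# Set port 2 to Ethernet while leaving port 1 on InfiniBand.
mlxconfig -d 81:00.0 set LINK_TYPE_P2=2
```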

So yeah, sometimes, using IB results in some funkiness, especially with Debian-based distros.

No, I haven’t, because most consumer-grade platforms only support 128 GB of RAM max. It’s only very recently that newer consumer-grade platforms may support more than 128 GB, but based on the testing Wendell was doing with 256 GB DDR5 kits, support for that isn’t always guaranteed, especially if you want to hit the rated speeds printed on the box of said kit (e.g. DDR5-6000).

I only have one system (total) that has more than 128 GB of RAM (my main “do-it-all” Proxmox server, which has 768 GB of RAM installed), but running that test there wouldn’t really be the same as running it between nodes (vs. within the same “pizza box”).

I will say/add this though:

Per Wendell, in his recent video testing 256 GB kits, he did say that you can get them to run at the JEDEC speed of DDR5-3600. Of course, that’s super slow (vs. what’s printed on the box), so most of his testing centered around actually getting the EXPO profile to work properly.

So, if the difference in RAM speed is largely immaterial to you, you can in theory just run it at DDR5-3600 and call it a day.

(I run my 7950X system at, I think, DDR5-4800 with 128 GB of RAM. The kit itself may actually be qualified to run faster, but I’m after capacity and stability rather than sheer speed, which can come at the cost of stability.)

I was originally looking at potentially getting a dual AMD EPYC system with 1 TB of RAM, but our house ended up not selling for as much as I was hoping, so that plan is getting shelved again.

C’est la vie.

For now, I have two 5950X nodes, a 7950X node, four dual Xeon E5-2690 (v1) nodes, and two HP Z420 tower workstations (also with E5-2690 (v1)), all with 128 GB of RAM each. So in theory I can just use what I have for the time being, which across the 9 systems works out to 1,152 GB of RAM.

Of course the older systems aren’t as efficient/performant, but if I am running something that has a high-ish RAM requirement, and if the task can be distributed between multiple nodes, then my cluster is technically an option.

Hi, we’ve achieved stable performance of 90 Gbps using the InfiniBand protocol. I believe this limit can’t be exceeded due to the maximum MTU of 4096 supported by this switch. However, we’re facing another issue: on the Ceph cluster, we’re unable to reach the expected performance levels when running tests inside virtual machines. 4x nodes, each with 2x Xeon, 1 TB RAM, 6x U.2 NVMe + 100 Gbps NICs.

The usable IB MTU also depends on whether you’re using datagram mode or connected mode (for IPoIB).
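On the Linux side, a quick way to check and change that (ib0 is a placeholder for your IPoIB interface name):

```bash
# IPoIB mode limits the interface MTU: datagram mode tops out around 4092
# bytes (tied to the fabric's 4K IB MTU), while connected mode allows 65520.
cat /sys/class/net/ib0/mode

# Switch to connected mode and raise the MTU accordingly.
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520
```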

So…it can depend a little bit on what you mean by this:

  1. Are the VMs using the 100 Gbps IB NIC directly (i.e. you passed the NIC through to your VMs and/or you’re using SR-IOV VFs over IB), or does the (presumably Proxmox) host have control of the IB NIC?

  2. Are you using the Debian “inbox” drivers for your IB NICs or are you using the MLNX_OFED drivers?

  3. Do the disks for your VMs reside on the nodes under local storage or do the VM disks reside on the Ceph cluster?

I ask this because if the VM disks live on the nodes as local storage and have to be moved during a migration, that can be quite slow by comparison. (In practice, with M.2 PCIe 4.0 x4 NVMe SSDs, I can only move my VM disks at around 1-2 GB/s (< 20 Gbps) nominally.)

But once the VM disk lives on the Ceph cluster, migration is only about moving the memory state of the VM, and with compression you can see > 100 Gbps transfer rates, as shown above (since it doesn’t have to move the VM disk around at all – it already lives on shared network storage).
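In Proxmox terms, the difference roughly looks like this (the VMID and node name are placeholders):

```bash
# VM disk already on Ceph (shared storage): only the memory state moves.
qm migrate 100 node2 --online

# VM disk on local storage: the disk has to be copied too, which is the
# slow path described above.
qm migrate 100 node2 --online --with-local-disks
```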

To the best of my knowledge, and from what I’ve been able to find, Ceph still doesn’t support IB verbs for data transfer (i.e. no Ceph-over-RDMA). That may contribute to why you aren’t seeing the expected IB performance out of the system versus, say, just running ib_send_bw.