XCP-ng SR / RAID problems

Hi all

I have a bunch of 5900 servers with 2x2TB NVMe drives which run “production” VMs. This is bombproof and I love it.

However, in each server I also have 4x2TB SATA drives to act as general storage / backup targets etc. These are configured as md0 and attached through XO as usual as EXT-based storage.

However, these are horribly unstable. Out of 6 servers, only 1 still seems to be working as expected. Some show endless coalesce chains that won't rebuild. Some will run VMs, others won't. For a few I have simply had to destroy md0, create md1, and rename the SR to “dead do not use” as I can't even unplug them. (So far, reattaching as LVM seems to be working.)
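Before blaming the SR layer, it may be worth confirming that the md arrays themselves are healthy. Below is a minimal Python sketch that reads text in the `/proc/mdstat` format and flags arrays with missing members. The sample text is illustrative only (it is not output from these servers), and `md_health` is a hypothetical helper name.

```python
import re

# Illustrative sample only; NOT output from the servers in this thread.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid10]
md0 : active raid10 sdd[3] sdc[2] sdb[1] sda[0]
      3906764800 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

md1 : active raid10 sdh[3] sdg[2] sdf[1]
      3906764800 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]

unused devices: <none>
"""

HEADER_RE = re.compile(r"^(md\d+) : (\S+)")               # e.g. "md0 : active ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")    # e.g. "[4/3] [_UUU]"


def md_health(mdstat_text):
    """Return {array: {"state", "members", "degraded"}} parsed from mdstat text."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            arrays[current] = {
                "state": header.group(2),
                "members": None,
                "degraded": None,
            }
            continue
        status = STATUS_RE.search(line)
        if current and status:
            # "U" marks a member that is up, "_" a missing or failed one.
            arrays[current]["members"] = status.group(3)
            arrays[current]["degraded"] = "_" in status.group(3)
    return arrays


for name, info in md_health(SAMPLE_MDSTAT).items():
    print(name, info)
```

On a host, the same function could be fed `open("/proc/mdstat").read()` instead of the sample. A degraded or resyncing array would not by itself explain a failed SR scan, but it is a cheap thing to rule out first.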

The crazy incident that caused me to reach out is that I have a test Windows VM set up on one of the arrays (no other disks). This VM can be stopped, started, etc. and works perfectly well (for a Windows VM running on RAID 10 SATA).

However, I still can't scan the SR.

SMlog shows

May 13 10:56:13 48000 SM: [1886528] ***** Local EXT3 VHD: EXCEPTION <class 'xs_errors.SROSError'>, The SR scan failed [opterr=uuid=9847acc7-8268-4b01-856d-670d6256fff5]
May 13 10:56:13 48000 SM: [1886528] File "/opt/xensource/sm/SRCommand.py", line 385, in run
May 13 10:56:13 48000 SM: [1886528] ret = cmd.run(sr)
May 13 10:56:13 48000 SM: [1886528] File "/opt/xensource/sm/SRCommand.py", line 111, in run
May 13 10:56:13 48000 SM: [1886528] return self._run_locked(sr)
May 13 10:56:13 48000 SM: [1886528] File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
May 13 10:56:13 48000 SM: [1886528] rv = self._run(sr, target)
May 13 10:56:13 48000 SM: [1886528] File "/opt/xensource/sm/SRCommand.py", line 370, in _run
May 13 10:56:13 48000 SM: [1886528] return sr.scan(self.params['sr_uuid'])
May 13 10:56:13 48000 SM: [1886528] File "/opt/xensource/sm/FileSR.py", line 208, in scan
May 13 10:56:13 48000 SM: [1886528] self._loadvdis()
May 13 10:56:13 48000 SM: [1886528] File "/opt/xensource/sm/FileSR.py", line 294, in _loadvdis
May 13 10:56:13 48000 SM: [1886528] raise xs_errors.XenError('SRScan', opterr='uuid=%s' % uuid)
May 13 10:56:13 48000 SM: [1886528]
May 13 10:56:13 48000 SM: [1886528] lock: closed /var/lock/sm/760e624c-4383-6326-9138-958c14d59030/sr
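One detail in the excerpt above is easy to miss: the UUID in `opterr` is not the same as the UUID in the lock path. A possible reading (an assumption, not confirmed anywhere in this thread) is that the lock path carries the SR UUID while `opterr` names a single VHD that the scan could not load. The Python sketch below pulls both out of SMlog-style lines so they can be compared; `summarize` and the regexes are hypothetical helpers written for this log shape only.

```python
import re

# Two lines copied from the SMlog excerpt in this post.
SAMPLE = """\
May 13 10:56:13 48000 SM: [1886528] ***** Local EXT3 VHD: EXCEPTION <class 'xs_errors.SROSError'>, The SR scan failed [opterr=uuid=9847acc7-8268-4b01-856d-670d6256fff5]
May 13 10:56:13 48000 SM: [1886528] lock: closed /var/lock/sm/760e624c-4383-6326-9138-958c14d59030/sr
"""

UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
LINE_RE = re.compile(r"SM: \[(\d+)\]\s?(.*)$")
OPTERR_RE = re.compile(r"EXCEPTION .*?, (.*?) \[opterr=uuid=(" + UUID + r")\]")
LOCK_RE = re.compile(r"/var/lock/sm/(" + UUID + r")/sr")


def summarize(log_text):
    """Group SM log lines by process ID and extract the failure details."""
    summary = {}
    for raw in log_text.splitlines():
        match = LINE_RE.search(raw)
        if not match:
            continue  # not an SM line in the expected shape
        pid, message = match.group(1), match.group(2)
        entry = summary.setdefault(
            pid, {"error": None, "opterr_uuid": None, "lock_uuid": None}
        )
        exc = OPTERR_RE.search(message)
        if exc:
            entry["error"] = exc.group(1)
            entry["opterr_uuid"] = exc.group(2)
        lock = LOCK_RE.search(message)
        if lock:
            entry["lock_uuid"] = lock.group(1)
    return summary


for pid, info in summarize(SAMPLE).items():
    print(pid, info)
```

If the `opterr` UUID does turn out to match a `.vhd` file in the SR's mount directory, that file would be the first candidate to inspect.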

Any ideas?

I have only used the Linux mdadm for the boot mirror and never tested with other drives. I have used ZFS for other drives:

This was going to be my next stop if the LVM tests don't go well. Will ZFS run fine with SATA drives?

Yes, I did that video with NVMe drives.

OK, ZFS in place… will test to death.