zfs/module
Serapheim Dimitropoulos 7bf4c97a36
Bypass metaslab throttle for removal allocations
Context:
We recently had a scenario where a customer with 2x10TB disks at 95+%
fragmentation and capacity, wanted to migrate their disks to a 2x20TB
setup. So they added the 2 new disks and submitted the removal of the
first 10TB disk.  The removal took a lot more than expected (order of
more than a week to 2 weeks vs a couple of days) and once it was done it
generated a huge indirect mappign table in RAM (~16GB vs expected ~1GB).

Root-Cause:
The removal code calls `metaslab_alloc_dva()` to allocate a new block
for each evacuating block in the removing device and it tries to batch
them into 16MB segments. If it can't find such a segment it tries for
8MBs, 4MBs, all the way down to 512 bytes.

In our scenario what would happen is that `metaslab_alloc_dva()` from
the removal thread pick the new devices initially but wouldn't allocate
from them because of throttling in their metaslab allocation queue's
depth (see `metaslab_group_allocatable()`) as these devices are new and
favored for most types of allocations because of their free space. So
then the removal thread would look at the old fragmented disk for
allocations and wouldn't find any contiguous space and finally retry
with a smaller allocation size until it would to the low KB range. This
caused a lot of small mappings to be generated blowing up the size of
the indirect table. It also wasted a lot of CPU while the removal was
active making everything slow.

This patch:
Make all allocations coming from the device removal thread bypass the
throttle checks. These allocations are not even counted in the metaslab
allocation queues anyway so why check them?

Side-Fix:
Allocations with METASLAB_DONT_THROTTLE in their flags would not be
accounted at the throttle queues but they'd still abide by the
throttling rules which seems wrong. This patch fixes this by checking
for that flag in `metaslab_group_allocatable()`. I did a quick check to
see where else this flag is used and it doesn't seem like this change
would cause issues.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14159
2022-12-09 10:48:33 -08:00
..
avl Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
icp Fix Clang 15 compilation errors 2022-11-30 13:46:26 -08:00
lua Fix Clang 15 compilation errors 2022-11-30 13:46:26 -08:00
nvpair Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
os Linux: Cleanup unnecessary NULL check in __vdev_disk_physio() 2022-12-08 13:52:47 -08:00
unicode Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zcommon Micro-optimize fletcher4 calculations 2022-12-05 11:00:34 -08:00
zfs Bypass metaslab throttle for removal allocations 2022-12-09 10:48:33 -08:00
zstd zstd: Refactor prefetching for the decoding loop 2022-11-29 10:05:30 -08:00
.gitignore FreeBSD: Ignore symlink to i386 includes 2022-08-02 16:34:23 -07:00
Kbuild.in Fix Clang 15 compilation errors 2022-11-30 13:46:26 -08:00
Makefile.bsd Cleanup dead spa_boot code 2022-09-13 16:40:10 -07:00
Makefile.in autoconf: use include directives instead of recursing down lib 2022-05-10 10:18:11 -07:00