From ffd2e15d656e47f8d82ffbe46af1f0899ff889e5 Mon Sep 17 00:00:00 2001 From: George Wilson Date: Fri, 2 Dec 2022 19:46:29 -0600 Subject: [PATCH] zio can deadlock during device removal When doing a device removal on a pool with gang blocks, the zio pipeline can deadlock when trying to free blocks from a device which is being removed with a stack similar to this: 0xffff8ab9a13a1740 UNINTERRUPTIBLE 4 __schedule+0x2e5 __schedule+0x2e5 schedule+0x33 schedule_preempt_disabled+0xe __mutex_lock.isra.12+0x2a7 __mutex_lock.isra.12+0x2a7 __mutex_lock_slowpath+0x13 mutex_lock+0x2c free_from_removing_vdev+0x61 metaslab_free_impl+0xd6 metaslab_free_dva+0x5e metaslab_free+0x196 zio_free_sync+0xe4 zio_free_gang+0x38 zio_gang_tree_issue+0x42 zio_gang_tree_issue+0xa2 zio_gang_issue+0x6d zio_execute+0x94 zio_execute+0x94 taskq_thread+0x23b kthread+0x120 ret_from_fork+0x1f Since there are gang blocks we have to read the gang members as part of the free. This can be seen with a zio dependency tree that looks like this: sdb> echo 0xffff900c24f8a700 | zio -rc | zio ADDRESS TYPE STAGE WAITER 0xffff900c24f8a700 NULL CHECKSUM_VERIFY 0xffff900ddfd31740 0xffff900c24f8c920 FREE GANG_ASSEMBLE - 0xffff900d93d435a0 READ DONE In the illustration above we are processing frees but because of gang block we have to read the constituents blocks. Once we finish the READ in the zio pipeline we will execute the parent. In this case the parent is a FREE but the zio taskq is a READ and we continue to process the pipeline leading to the stack above. In the stack above, we are blocked waiting for the svr_lock so as a result a READ interrupt taskq thread is now consumed. Eventually, all of the READ taskq threads end up blocked and we're unable to complete any read requests. In zio_notify_parent there is an optimization to continue to use the taskq thread to exectue the parent's pipeline. To resolve the deadlock above, we only allow this optimization if the parent's zio type matches the child which just completed. Reviewed-by: Brian Behlendorf Reviewed-by: Matthew Ahrens Signed-off-by: George Wilson External-issue: DLPX-80130 Closes #14236 --- module/zfs/zio.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/module/zfs/zio.c b/module/zfs/zio.c index 2ae42e2df4..9ae2458669 100644 --- a/module/zfs/zio.c +++ b/module/zfs/zio.c @@ -724,7 +724,9 @@ zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait, /* * If we can tell the caller to execute this parent next, do - * so. Otherwise dispatch the parent zio as its own task. + * so. We only do this if the parent's zio type matches the + * child's type. Otherwise dispatch the parent zio in its + * own taskq. * * Having the caller execute the parent when possible reduces * locking on the zio taskq's, reduces context switch @@ -743,7 +745,8 @@ zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait, * parent-child relationships, as we do with the "mega zio" * of writes for spa_sync(), and the chain of ZIL blocks. */ - if (next_to_executep != NULL && *next_to_executep == NULL) { + if (next_to_executep != NULL && *next_to_executep == NULL && + pio->io_type == zio->io_type) { *next_to_executep = pio; } else { zio_taskq_dispatch(pio, type, B_FALSE);