diff --git a/ZFS-Transaction-Delay.md b/ZFS-Transaction-Delay.md
new file mode 100644
index 0000000..ff7312a
--- /dev/null
+++ b/ZFS-Transaction-Delay.md
@@ -0,0 +1,98 @@
+### ZFS Transaction Delay
+
+ZFS delays write operations when the backend storage cannot accommodate
+the rate of incoming writes. This delay mechanism is known as the ZFS
+write throttle.
+
+If there is already a write transaction waiting, the delay is relative to
+when that transaction will finish waiting. The calculated delay time is
+therefore independent of the number of threads concurrently executing
+transactions.
+
+If there is only one waiter, the delay is relative to when the transaction
+started rather than to the current time. This credits the transaction for
+"time already served." For example, if a write transaction must first read
+indirect blocks, the delay is measured from the start of the transaction,
+just prior to those reads, so the time spent reading counts toward the
+delay.
+
+The minimum time for a transaction to take is calculated as:
+```
+min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
+min_time is then capped at 100 milliseconds
+```
+Here dirty is the current amount of dirty data, max is zfs_dirty_data_max,
+and min is the amount of dirty data at which delaying begins
+(zfs_delay_min_dirty_percent of zfs_dirty_data_max). Both this formula and
+the waiter rules above are sketched in code at the end of this page.
+
+The delay has two degrees of freedom that can be adjusted via tunables:
+1. The percentage of dirty data at which we start to delay is defined by
+zfs_delay_min_dirty_percent. This should typically be at or above
+zfs_vdev_async_write_active_max_dirty_percent, so that delays only occur
+after writing at full speed has failed to keep up with the incoming write
+rate.
+2. The scale of the curve is defined by zfs_delay_scale. Roughly speaking,
+this variable determines the amount of delay at the midpoint of the curve.
+
+```
+delay
+ 10ms +-------------------------------------------------------------*+
+      |                                                             *|
+  9ms +                                                             *+
+      |                                                             *|
+  8ms +                                                             *+
+      |                                                            * |
+  7ms +                                                            * +
+      |                                                            * |
+  6ms +                                                            * +
+      |                                                            * |
+  5ms +                                                            * +
+      |                                                            * |
+  4ms +                                                            * +
+      |                                                            * |
+  3ms +                                                            * +
+      |                                                           *  |
+  2ms +                                          (midpoint)      *   +
+      |                                               |        **    |
+  1ms +                                               v     ***      +
+      +             zfs_delay_scale ---------->     ********         +
+    0 +-------------------------------------*********----------------+
+        0%                            <- zfs_dirty_data_max ->   100%
+```
+
+Note that since the delay is added to the outstanding time remaining on the
+most recent transaction, the delay is effectively the inverse of IOPS.
+Here the midpoint of 500 microseconds (the default zfs_delay_scale)
+translates to 2000 IOPS. The shape of the curve was chosen such that small
+changes in the amount of accumulated dirty data in the first 3/4 of the
+curve yield relatively small differences in the amount of delay.
+
+The effect is easier to see when the delay is plotted on a log scale:
+```
+delay
+100ms +-------------------------------------------------------------++
+      +                                                              +
+      |                                                              |
+      +                                                             *+
+ 10ms +                                                             *+
+      +                                                           ** +
+      |                                              (midpoint)  **  |
+      +                                                  |     **    +
+  1ms +                                                  v ****      +
+      +             zfs_delay_scale ---------->     ********         +
+      |                                          ****                |
+      +                                       ****                   +
+100us +                                     **                       +
+      +                                    *                         +
+      |                                   *                          |
+      +                                   *                          +
+ 10us +                                  *                           +
+      +                                                              +
+      |                                                              |
+      +                                                              +
+      +--------------------------------------------------------------+
+        0%                            <- zfs_dirty_data_max ->   100%
+```
+Note that only as the amount of dirty data approaches its limit does the
+delay start to increase rapidly. The goal of a properly tuned system is to
+keep the amount of dirty data out of that range: first ensure that the
+appropriate limits are set for the I/O scheduler to reach optimal
+throughput on the backend storage, and then raise zfs_delay_scale to
+increase the steepness of the curve.
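+
+To make the delay curve concrete, here is a minimal, self-contained C
+sketch of the min_time calculation described above. It is not the OpenZFS
+implementation: the tunable values below are example defaults and the
+helper name delay_min_time_ns is illustrative only.
+```
+/*
+ * Sketch of the write-throttle delay curve. Hypothetical values and
+ * helper names; not the OpenZFS code.
+ */
+#include <stdio.h>
+#include <stdint.h>
+
+#define NSEC_PER_MSEC 1000000ULL
+
+/* Example tunable values (assumed, not authoritative). */
+static uint64_t zfs_dirty_data_max = 4ULL << 30;   /* 4 GiB dirty-data limit */
+static uint64_t zfs_delay_min_dirty_percent = 60;  /* start delaying at 60% */
+static uint64_t zfs_delay_scale = 500000;          /* ~500us at the midpoint */
+static uint64_t zfs_delay_max_ns = 100 * NSEC_PER_MSEC;  /* 100 ms cap */
+
+/*
+ * min_time = zfs_delay_scale * (dirty - min) / (max - dirty),
+ * zero below the delay threshold and capped at 100 milliseconds.
+ */
+static uint64_t
+delay_min_time_ns(uint64_t dirty)
+{
+    uint64_t delay_min_bytes =
+        zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100;
+    uint64_t min_time;
+
+    if (dirty <= delay_min_bytes)
+        return (0);
+    if (dirty >= zfs_dirty_data_max)
+        return (zfs_delay_max_ns);      /* avoid dividing by zero */
+
+    min_time = zfs_delay_scale * (dirty - delay_min_bytes) /
+        (zfs_dirty_data_max - dirty);
+    return (min_time < zfs_delay_max_ns ? min_time : zfs_delay_max_ns);
+}
+
+int
+main(void)
+{
+    /* Walk the dirty-data range to show the shape of the curve. */
+    for (int pct = 55; pct <= 99; pct += 4) {
+        uint64_t dirty = zfs_dirty_data_max / 100 * pct;
+        printf("%3d%% dirty -> min_time %8llu ns\n",
+            pct, (unsigned long long)delay_min_time_ns(dirty));
+    }
+    return (0);
+}
+```
+With these example values the printed min_time stays below a millisecond
+until dirty data is well past the 80% midpoint, and only approaches the
+100-millisecond cap as dirty data nears zfs_dirty_data_max, which is the
+behaviour the log-scale plot above illustrates.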
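+
+The two waiter rules at the top of this page can be sketched in a few
+lines as well. The sketch below is illustrative only: the throttle_t type
+and its last_wakeup field are invented for the example, and the real
+OpenZFS code differs in detail. Each transaction sleeps until the later of
+its own start time plus min_time and the previous waiter's wakeup time
+plus min_time, which both credits time already served and keeps the delay
+independent of the number of waiters.
+```
+/* Hypothetical sketch of the waiter rules; not the OpenZFS code. */
+#include <pthread.h>
+#include <stdint.h>
+#include <stdio.h>
+
+typedef struct throttle {
+    pthread_mutex_t lock;
+    uint64_t last_wakeup;   /* when the most recent waiter wakes up */
+} throttle_t;
+
+static uint64_t
+max_u64(uint64_t a, uint64_t b)
+{
+    return (a > b ? a : b);
+}
+
+/* Absolute time (ns) until which the calling transaction should sleep. */
+static uint64_t
+throttle_wakeup(throttle_t *t, uint64_t tx_start, uint64_t min_time)
+{
+    uint64_t wakeup;
+
+    pthread_mutex_lock(&t->lock);
+    /*
+     * Delay relative to the later of:
+     *  - this transaction's start time ("time already served", e.g.
+     *    time spent reading indirect blocks), and
+     *  - when the most recent waiter wakes up, so the delay does not
+     *    depend on how many threads are waiting.
+     */
+    wakeup = max_u64(tx_start + min_time, t->last_wakeup + min_time);
+    t->last_wakeup = wakeup;
+    pthread_mutex_unlock(&t->lock);
+
+    return (wakeup);
+}
+
+int
+main(void)
+{
+    throttle_t t = { PTHREAD_MUTEX_INITIALIZER, 0 };
+    uint64_t now = 1000000;     /* pretend "now" is 1 ms into the run */
+
+    /* Two concurrent transactions that each owe 500us of delay. */
+    uint64_t w1 = throttle_wakeup(&t, now, 500000);
+    uint64_t w2 = throttle_wakeup(&t, now, 500000);
+
+    printf("first waiter wakes at %llu ns, second at %llu ns\n",
+        (unsigned long long)w1, (unsigned long long)w2);
+    return (0);
+}
+```
+In this toy run the second waiter wakes 500 microseconds after the first,
+showing how each new delay queues behind the most recent transaction
+rather than behind the wall clock.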