initial commit, migrating from draft status

2018-09-01 17:01:21 -07:00 · 5344d899e4
parent 6eeeba662d
commit 5344d899e4
1 changed files with 98 additions and 0 deletions
--- a/ZFS-Transaction-Delay.md
+++ b/ZFS-Transaction-Delay.md
@@ -0,0 +1,98 @@
+### ZFS Transaction Delay
+
+ZFS write operations are delayed when the 
+backend storage isn't able to accommodate the rate of incoming writes.
+This delay process is known as the ZFS write throttle.
+
+If there is already a write transaction waiting, the delay is relative to 
+when that transaction will finish waiting. Thus the calculated delay time
+is independent of the number of threads concurrently executing
+transactions.
+
+If there is only one waiter, the delay is relative to when the transaction
+started, rather than the current time. This credits the transaction for
+"time already served." For example, if a write transaction requires reading 
+indirect blocks first, then the delay is counted from the start of the
+transaction, just prior to the indirect block reads.
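
The two rules above can be folded into one wake-up calculation. The following is a simplified model written for illustration, not the ZFS implementation; the function names are hypothetical and all times are assumed to be in nanoseconds.

```python
def next_wakeup_ns(tx_start_ns, last_wakeup_ns, min_time_ns):
    """Return the time at which a delayed transaction may proceed.

    If an earlier waiter's wake-up time is later than this transaction's
    start, the delay is stacked on top of that wake-up time. Otherwise the
    delay is measured from when this transaction started.
    """
    return max(tx_start_ns, last_wakeup_ns) + min_time_ns


def sleep_needed_ns(now_ns, wakeup_ns):
    """Return how long the caller still has to sleep (never negative)."""
    return max(0, wakeup_ns - now_ns)


# Sole waiter: started at t=1000, spent 1000 ns reading indirect blocks,
# so a 500 ns delay has already been "served" and no sleep is needed.
sole = next_wakeup_ns(tx_start_ns=1_000, last_wakeup_ns=0, min_time_ns=500)
print(sleep_needed_ns(now_ns=2_000, wakeup_ns=sole))

# Another waiter finishes at t=4000: the new delay is relative to that time.
queued = next_wakeup_ns(tx_start_ns=1_000, last_wakeup_ns=4_000, min_time_ns=500)
print(queued)
```

Because each new wake-up time is chained to the previous one, the spacing between released transactions depends only on the calculated delay, not on how many threads are waiting.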
+
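+The minimum time for a transaction to take is calculated as follows, where
+dirty is the current amount of dirty data, min is the amount of dirty data
+at which delays begin (zfs_delay_min_dirty_percent of zfs_dirty_data_max),
+and max is zfs_dirty_data_max: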
+```
+min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
+min_time is then capped at 100 milliseconds
+```
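
As a minimal sketch, the formula and its cap could be written as below. The function and constant names are hypothetical, the 500,000 ns default for the scale is chosen to match the 500 microsecond midpoint discussed in this document, and the handling of `dirty >= dirty_max` is an assumption made only to avoid dividing by zero.

```python
DELAY_CAP_NS = 100_000_000  # the 100 millisecond cap, in nanoseconds


def min_tx_time_ns(dirty, delay_min, dirty_max, delay_scale_ns=500_000):
    """Return the minimum transaction time in nanoseconds.

    dirty, delay_min, and dirty_max are amounts of dirty data in the same
    unit (for example bytes); delay_scale_ns plays the role of
    zfs_delay_scale.
    """
    if dirty <= delay_min:
        return 0  # below the threshold, no delay is applied
    if dirty >= dirty_max:
        # Assumption for this sketch: treat a full buffer as the maximum delay.
        return DELAY_CAP_NS
    raw = delay_scale_ns * (dirty - delay_min) // (dirty_max - dirty)
    return min(raw, DELAY_CAP_NS)


print(min_tx_time_ns(dirty=50, delay_min=60, dirty_max=100))
print(min_tx_time_ns(dirty=80, delay_min=60, dirty_max=100))
print(min_tx_time_ns(dirty=9_999, delay_min=6_000, dirty_max=10_000))
```

The denominator shrinks as dirty data approaches the maximum, which is what makes the curve rise so sharply near 100% and why the cap is needed.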
+
+The delay has two degrees of freedom that can be adjusted via tunables:
+1. The percentage of dirty data at which we start to delay is defined by
+zfs_delay_min_dirty_percent. This should typically be at or above
+zfs_vdev_async_write_active_max_dirty_percent so that delays occur only
+after writing at full speed has failed to keep up with the incoming write
+rate.
+2. The scale of the curve is defined by zfs_delay_scale. Roughly speaking,
+this variable determines the amount of delay at the midpoint of the curve.
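
To make the two tunables concrete, the sketch below converts them into positions on the curve. The helper names are hypothetical, and the values used (a 4,000,000,000 byte zfs_dirty_data_max and a 60 percent threshold) are illustrative assumptions, not recommendations.

```python
def delay_threshold_bytes(dirty_data_max, delay_min_dirty_percent):
    """Amount of dirty data at which delays begin."""
    return dirty_data_max * delay_min_dirty_percent // 100


def curve_midpoint_bytes(dirty_data_max, delay_min_dirty_percent):
    """Amount of dirty data halfway between the threshold and the maximum."""
    start = delay_threshold_bytes(dirty_data_max, delay_min_dirty_percent)
    return (start + dirty_data_max) // 2


dirty_data_max = 4_000_000_000  # illustrative value only
start = delay_threshold_bytes(dirty_data_max, 60)
mid = curve_midpoint_bytes(dirty_data_max, 60)

# At the midpoint, (dirty - min) equals (max - dirty), so the ratio in the
# delay formula is 1 and the delay is exactly zfs_delay_scale.
ratio = (mid - start) / (dirty_data_max - mid)
print(start, mid, ratio)
```

This is why zfs_delay_scale can be read as the delay at the midpoint: the fraction in the formula is 1 there, leaving only the scale.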
+
+```
+delay
+ 10ms +-------------------------------------------------------------*+
+      |                                                             *|
+  9ms +                                                             *+
+      |                                                             *|
+  8ms +                                                             *+
+      |                                                            * |
+  7ms +                                                            * +
+      |                                                            * |
+  6ms +                                                            * +
+      |                                                            * |
+  5ms +                                                           *  +
+      |                                                           *  |
+  4ms +                                                           *  +
+      |                                                           *  |
+  3ms +                                                          *   +
+      |                                                          *   |
+  2ms +                                              (midpoint) *    +
+      |                                                  |    **     |
+  1ms +                                                  v ***       +
+      |             zfs_delay_scale ---------->     ********         |
+    0 +-------------------------------------*********----------------+
+      0%                    <- zfs_dirty_data_max ->               100%
+```
+
+Note that since the delay is added to the outstanding time remaining on the
+most recent transaction, the delay is effectively the inverse of IOPS.
+Here the midpoint of 500 microseconds translates to 2000 IOPS. 
+The shape of the curve was chosen such that small changes in the amount of 
+accumulated dirty data in the first 3/4 of the curve yield relatively small 
+differences in the amount of delay.
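
The relationship between delay and IOPS can be tabulated with a short sketch. It works in percent of zfs_dirty_data_max rather than bytes, and it assumes a 60 percent threshold and a 500,000 ns scale to match the plot above; all names are hypothetical.

```python
NS_PER_SECOND = 1_000_000_000


def delay_ns(dirty_pct, min_pct=60, scale_ns=500_000, cap_ns=100_000_000):
    """Delay for a given dirty-data percentage, using the formula above."""
    if dirty_pct <= min_pct:
        return 0
    if dirty_pct >= 100:
        return cap_ns  # assumption: a full buffer maps to the capped delay
    return min(cap_ns, scale_ns * (dirty_pct - min_pct) // (100 - dirty_pct))


def iops_for_delay(delay):
    """Transactions per second when each one is spaced `delay` ns apart."""
    if delay <= 0:
        raise ValueError("delay must be positive")
    return NS_PER_SECOND // delay


for pct in (65, 70, 80, 90, 95, 99):
    d = delay_ns(pct)
    print(f"{pct:3d}% dirty: delay {d:>10,d} ns, about {iops_for_delay(d):,d} IOPS")
```

Between 60% and 90% dirty the delay stays at or below 1.5 ms in this model, while the last tenth of the range accounts for the rest of the climb toward the cap.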
+
+The effects can be easier to understand when the amount of delay is
+represented on a log scale:
+```
+delay
+100ms +-------------------------------------------------------------++
+      +                                                              +
+      |                                                              |
+      +                                                             *+
+ 10ms +                                                             *+
+      +                                                           ** +
+      |                                              (midpoint)  **  |
+      +                                                  |     **    +
+  1ms +                                                  v ****      +
+      +             zfs_delay_scale ---------->        *****         +
+      |                                             ****             |
+      +                                          ****                +
+100us +                                        **                    +
+      +                                       *                      +
+      |                                      *                       |
+      +                                     *                        +
+ 10us +                                     *                        +
+      +                                                              +
+      |                                                              |
+      +                                                              +
+      +--------------------------------------------------------------+
+      0%                    <- zfs_dirty_data_max ->               100%
+```
+Note here that only as the amount of dirty data approaches its limit does
+the delay start to increase rapidly. The goal of a properly tuned system
+should be to keep the amount of dirty data out of that range by first
+ensuring that the appropriate limits are set for the I/O scheduler to reach
+optimal throughput on the backend storage, and then by changing the value
+of zfs_delay_scale to increase the steepness of the curve.
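
Before changing anything, it helps to check the current values. The sketch below assumes a Linux system where the ZFS kernel module exposes its parameters as files under /sys/module/zfs/parameters; other platforms expose tunables differently, and the helper name is hypothetical.

```python
from pathlib import Path

# Assumption: Linux with the zfs module loaded.
DEFAULT_PARAM_DIR = "/sys/module/zfs/parameters"


def read_zfs_tunable(name, param_dir=DEFAULT_PARAM_DIR):
    """Read an integer module parameter such as zfs_delay_scale."""
    return int(Path(param_dir, name).read_text().strip())


if __name__ == "__main__":
    for tunable in ("zfs_delay_scale", "zfs_delay_min_dirty_percent"):
        try:
            print(tunable, "=", read_zfs_tunable(tunable))
        except OSError as err:
            print(f"could not read {tunable}: {err}")
```

On such a system, writing a new value to the same file as root changes the tunable at runtime; the change does not persist across a module reload unless it is also set as a module option.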