zfs/include/sys
Serapheim Dimitropoulos 93e28d661e Log Spacemap Project
= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe

Change target size of metaslabs from 256GB to 16GB
c853f382db

zdb -L should skip leak detection altogether
21e7cf5da8

vs_alloc can underflow in L2ARC vdevs
7558997d2f

Simplify log vdev removal code
6c926f426a

Get rid of space_map_update() for ms_synced_length
425d3237ee

Introduce auxiliary metaslab histograms
928e8ad47d

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 10:11:49 -07:00
..
crypto Add support for selecting encryption backend 2018-08-02 11:59:24 -07:00
fm Add zpool status -s (slow I/Os) and -p (parseable) 2018-11-08 16:47:24 -08:00
fs Log Spacemap Project 2019-07-16 10:11:49 -07:00
lua Fix coverity defects: zfs channel programs 2018-02-20 11:19:42 -08:00
sysevent Add TRIM support 2019-03-29 09:13:20 -07:00
Makefile.am Log Spacemap Project 2019-07-16 10:11:49 -07:00
abd.h single-chunk scatter ABDs can be treated as linear 2019-06-11 09:02:31 -07:00
aggsum.h OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
arc.h Linux 4.19-rc3+ compat: Remove refcount_t compat 2018-09-26 10:29:26 -07:00
arc_impl.h Linux 4.19-rc3+ compat: Remove refcount_t compat 2018-09-26 10:29:26 -07:00
avl.h Remove dead code from AVL tree 2017-10-05 19:28:00 -07:00
avl_impl.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
blkptr.h OpenZFS 8067 - zdb should be able to dump literal embedded block pointer 2017-07-07 11:28:01 -07:00
bplist.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
bpobj.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
bptree.h Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_t 2014-08-06 14:48:41 -07:00
bqueue.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
cityhash.h OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
dataset_kstats.h port async unlinked drain from illumos-nexenta 2019-02-12 10:41:15 -08:00
dbuf.h Decrease contention on dn_struct_rwlock 2019-07-08 13:18:50 -07:00
ddt.h Remove dedupditto functionality 2019-06-19 14:54:02 -07:00
dmu.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
dmu_impl.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_objset.h Pool allocation classes 2018-09-05 18:33:36 -07:00
dmu_recv.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_redact.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_send.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_traverse.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_tx.h Linux 4.19-rc3+ compat: Remove refcount_t compat 2018-09-26 10:29:26 -07:00
dmu_zfetch.h Decrease contention on dn_struct_rwlock 2019-07-08 13:18:50 -07:00
dnode.h Remove code for zfs remap 2019-06-24 16:44:01 -07:00
dsl_bookmark.h Fix comments on zfs_bookmark_phys 2019-06-22 16:32:26 -07:00
dsl_crypt.h Allow unencrypted children of encrypted datasets 2019-06-20 12:29:51 -07:00
dsl_dataset.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dsl_deadlist.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_deleg.h Remove code for zfs remap 2019-06-24 16:44:01 -07:00
dsl_destroy.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dsl_dir.h Remove code for zfs remap 2019-06-24 16:44:01 -07:00
dsl_pool.h port async unlinked drain from illumos-nexenta 2019-02-12 10:41:15 -08:00
dsl_prop.h Illumos 6171 - dsl_prop_unregister() slows down dataset eviction. 2016-01-12 10:53:12 -08:00
dsl_scan.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_synctask.h OpenZFS 9425 - channel programs can be interrupted 2019-06-22 16:51:46 -07:00
dsl_userhold.h Illumos #3740 2013-11-04 11:17:48 -08:00
edonr.h OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R 2016-10-03 14:51:15 -07:00
efi_partition.h Fix spelling 2017-01-03 11:31:18 -06:00
frame.h Suppress incorrect objtool warnings 2017-12-07 10:28:50 -08:00
hkdf.h Encryption patch follow-up 2017-10-11 16:54:48 -04:00
metaslab.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
metaslab_impl.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
mmp.h MMP interval and fail_intervals in uberblock 2019-03-21 12:47:57 -07:00
mntent.h Make zfs mount according to relatime config in dataset 2016-04-05 18:55:59 -07:00
multilist.h Avoid extra taskq_dispatch() calls by DMU 2019-06-25 12:03:38 -07:00
note.h Update build system and packaging 2018-05-29 16:00:33 -07:00
nvpair.h Add new fnvlist_lookup_* functions 2018-10-03 15:30:55 -07:00
nvpair_impl.h OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations 2018-07-30 11:30:03 -07:00
objlist.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
pathname.h Disable unused pathname::pn_path* (unneeded in Linux) 2019-07-15 13:57:56 -07:00
policy.h Add `zfs allow` and `zfs unallow` support 2016-06-07 09:16:52 -07:00
range_tree.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
refcount.h Add zfs_refcount_transfer_ownership_many() 2018-10-09 10:05:48 -07:00
rrwlock.h Linux 4.19-rc3+ compat: Remove refcount_t compat 2018-09-26 10:29:26 -07:00
sa.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
sa_impl.h Linux 4.19-rc3+ compat: Remove refcount_t compat 2018-09-26 10:29:26 -07:00
sdt.h Add line info and SET_ERROR() to ZFS debug log 2017-07-25 23:09:48 -07:00
sha2.h OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R 2016-10-03 14:51:15 -07:00
skein.h OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R 2016-10-03 14:51:15 -07:00
spa.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
spa_boot.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
spa_checkpoint.h Serialize ZTHR operations to eliminate races 2019-01-13 10:09:46 -08:00
spa_checksum.h Implementation of AVX2 optimized Fletcher-4 2016-06-02 14:30:51 -07:00
spa_impl.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
spa_log_spacemap.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
space_map.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
space_reftree.h Illumos #4101, #4102, #4103, #4105, #4106 2014-07-22 09:39:16 -07:00
sysevent.h OpenZFS 6939 - add sysevents to zfs core for commands 2017-07-12 21:28:13 -07:00
trace.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_acl.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_arc.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_common.h OpenZFS 6531 - Provide mechanism to artificially limit disk performance 2016-05-26 10:11:51 -07:00
trace_dbgmsg.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_dbuf.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_dmu.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_dnode.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_multilist.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_rrwlock.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_txg.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_vdev.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_zil.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_zio.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
trace_zrlock.h 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
txg.h OpenZFS 9425 - channel programs can be interrupted 2019-06-22 16:51:46 -07:00
txg_impl.h OpenZFS 9464 - txg_kick() fails to see that we are quiescing 2018-06-04 14:56:06 -07:00
u8_textprep.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
u8_textprep_data.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
uberblock.h Multi-modifier protection (MMP) 2017-07-13 13:54:00 -04:00
uberblock_impl.h MMP interval and fail_intervals in uberblock 2019-03-21 12:47:57 -07:00
uio_impl.h deadlock between mm_sem and tx assign in zfs_write() and page fault 2018-10-16 11:11:24 -07:00
unique.h Illumos #3742 2013-11-04 10:55:25 -08:00
uuid.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
vdev.h Add TRIM support 2019-03-29 09:13:20 -07:00
vdev_disk.h Add support for autoexpand property 2018-07-23 15:40:15 -07:00
vdev_file.h Use a dedicated taskq for vdev_file 2016-12-21 10:47:15 -08:00
vdev_impl.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
vdev_indirect_births.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
vdev_indirect_mapping.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
vdev_initialize.h Add TRIM support 2019-03-29 09:13:20 -07:00
vdev_raidz.h Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_impl.h Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_removal.h panic in removal_remap test on 4K devices 2019-06-13 13:12:39 -07:00
vdev_trim.h Add TRIM support 2019-03-29 09:13:20 -07:00
xvattr.h Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zap.h fat zap should prefetch when iterating 2019-06-12 13:13:09 -07:00
zap_impl.h OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space 2017-03-07 09:51:59 -08:00
zap_leaf.h Fix ENOSPC in "Handle zap_add() failures in ..." 2018-04-18 14:19:50 -07:00
zcp.h OpenZFS 9425 - channel programs can be interrupted 2019-06-22 16:51:46 -07:00
zcp_global.h OpenZFS 7431 - ZFS Channel Programs 2018-02-08 15:28:18 -08:00
zcp_iter.h OpenZFS 7431 - ZFS Channel Programs 2018-02-08 15:28:18 -08:00
zcp_prop.h OpenZFS 7431 - ZFS Channel Programs 2018-02-08 15:28:18 -08:00
zfeature.h Revert "zhack: Add 'feature disable' command" 2016-05-17 11:52:07 -07:00
zfs_acl.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_context.h OpenZFS 9425 - channel programs can be interrupted 2019-06-22 16:51:46 -07:00
zfs_ctldir.h Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_debug.h Log Spacemap Project 2019-07-16 10:11:49 -07:00
zfs_delay.h Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_dir.h port async unlinked drain from illumos-nexenta 2019-02-12 10:41:15 -08:00
zfs_fuid.h Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ioctl.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
zfs_onexit.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
zfs_project.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_ratelimit.h Change checksum & IO delay ratelimit values 2018-03-04 17:34:51 -08:00
zfs_rlock.h OpenZFS 9689 - zfs range lock code should not be zpl-specific 2018-10-11 10:19:33 -07:00
zfs_sa.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_stat.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
zfs_sysfs.h Fix in-kernel sysfs entries 2018-09-06 21:44:52 -07:00
zfs_vfsops.h Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
zfs_vnops.h RHEL 7.5 compat: FMODE_KABI_ITERATE 2018-05-02 15:01:24 -07:00
zfs_znode.h Fix `zfs set atime|relatime=off|on` behavior on inherited datasets 2019-05-07 10:06:30 -07:00
zil.h make zil max block size tunable 2019-06-10 11:48:42 -07:00
zil_impl.h make zil max block size tunable 2019-06-10 11:48:42 -07:00
zio.h Remove dedupditto functionality 2019-06-19 14:54:02 -07:00
zio_checksum.h Remove dependency on linear ABD 2017-03-29 12:24:51 -07:00
zio_compress.h lz4_decompress_abd declared but not defined 2019-06-13 13:14:34 -07:00
zio_crypt.h Add support for decryption faults in zinject 2018-05-02 15:36:20 -07:00
zio_impl.h Add TRIM support 2019-03-29 09:13:20 -07:00
zio_priority.h Add TRIM support 2019-03-29 09:13:20 -07:00
zpl.h Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zrlock.h OpenZFS 6328 - Fix cstyle errors in zfs codebase 2017-01-12 09:42:11 -08:00
zthr.h Serialize ZTHR operations to eliminate races 2019-01-13 10:09:46 -08:00
zvol.h Add port of FreeBSD 'volmode' property 2017-07-12 13:05:37 -07:00