History

Serapheim Dimitropoulos 93e28d661e Log Spacemap Project = Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent `db587941c5` Update vdev_is_spacemap_addressable() for new spacemap encoding `419ba59145` Simplify spa_sync by breaking it up to smaller functions `8dc2197b7b` Factor metaslab_load_wait() in metaslab_load() `b194fab0fb` Rename range_tree_verify to range_tree_verify_not_present `df72b8bebe` Change target size of metaslabs from 256GB to 16GB `c853f382db` zdb -L should skip leak detection altogether `21e7cf5da8` vs_alloc can underflow in L2ARC vdevs `7558997d2f` Simplify log vdev removal code `6c926f426a` Get rid of space_map_update() for ms_synced_length `425d3237ee` Introduce auxiliary metaslab histograms `928e8ad47d` Error path in metaslab_load_impl() forgets to drop ms_sync_lock `8eef997679` = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8442		2019-07-16 10:11:49 -07:00
..
runfiles	Log Spacemap Project	2019-07-16 10:11:49 -07:00
test-runner	Fix ZTS killed processes detection	2019-07-10 11:44:52 -07:00
zfs-tests	Log Spacemap Project	2019-07-16 10:11:49 -07:00
Makefile.am	Add the ZFS Test Suite	2016-03-16 13:46:16 -07:00
README.md	Update `tests/README.md` and fix markdown	2018-05-15 09:01:28 -07:00

README.md

ZFS Test Suite README

Building and installing the ZFS Test Suite

The ZFS Test Suite runs under the test-runner framework. This framework is built along side the standard ZFS utilities and is included as part of zfs-test package. The zfs-test package can be built from source as follows:

$ ./configure
$ make pkg-utils

The resulting packages can be installed using the rpm or dpkg command as appropriate for your distributions. Alternately, if you have installed ZFS from a distributions repository (not from source) the zfs-test package may be provided for your distribution.

- Installed from source
$ rpm -ivh ./zfs-test*.rpm, or
$ dpkg -i ./zfs-test*.deb,

- Installed from package repository
$ yum install zfs-test
$ apt-get install zfs-test

Running the ZFS Test Suite

The pre-requisites for running the ZFS Test Suite are:

Three scratch disks
- Specify the disks you wish to use in the $DISKS variable, as a space delimited list like this: DISKS='vdb vdc vdd'. By default the zfs-tests.sh sciprt will construct three loopback devices to be used for testing: DISKS='loop0 loop1 loop2'.
A non-root user with a full set of basic privileges and the ability to sudo(8) to root without a password to run the test.
Specify any pools you wish to preserve as a space delimited list in the $KEEP variable. All pools detected at the start of testing are added automatically.
The ZFS Test Suite will add users and groups to test machine to verify functionality. Therefore it is strongly advised that a dedicated test machine, which can be a VM, be used for testing.

Once the pre-requisites are satisfied simply run the zfs-tests.sh script:

$ /usr/share/zfs/zfs-tests.sh

Alternately, the zfs-tests.sh script can be run from the source tree to allow developers to rapidly validate their work. In this mode the ZFS utilities and modules from the source tree will be used (rather than those installed on the system). In order to avoid certain types of failures you will need to ensure the ZFS udev rules are installed. This can be done manually or by ensuring some version of ZFS is installed on the system.

$ ./scripts/zfs-tests.sh

The following zfs-tests.sh options are supported:

-v          Verbose zfs-tests.sh output When specified additional
            information describing the test environment will be logged
            prior to invoking test-runner.  This includes the runfile
            being used, the DISKS targeted, pools to keep, etc.

-q          Quiet test-runner output.  When specified it is passed to
            test-runner(1) which causes output to be written to the
            console only for tests that do not pass and the results
            summary.

-x          Remove all testpools, dm, lo, and files (unsafe).  When
            specified the script will attempt to remove any leftover
            configuration from a previous test run.  This includes
            destroying any pools named testpool, unused DM devices,
            and loopback devices backed by file-vdevs.  This operation
            can be DANGEROUS because it is possible that the script
            will mistakenly remove a resource not related to the testing.

-k          Disable cleanup after test failure.  When specified the
            zfs-tests.sh script will not perform any additional cleanup
            when test-runner exists.  This is useful when the results of
            a specific test need to be preserved for further analysis.

-f          Use sparse files directly instread of loopback devices for
            the testing.  When running in this mode certain tests will
            be skipped which depend on real block devices.

-c          Only create and populate constrained path

-I NUM      Number of iterations

-d DIR      Create sparse files for vdevs in the DIR directory.  By
            default these files are created under /var/tmp/.

-s SIZE     Use vdevs of SIZE (default: 4G)

-r RUNFILE  Run tests in RUNFILE (default: linux.run)

-t PATH     Run single test at PATH relative to test suite

-T TAGS     Comma separated list of tags (default: 'functional')

-u USER     Run single test as USER (default: root)

The ZFS Test Suite allows the user to specify a subset of the tests via a runfile or list of tags.

The format of the runfile is explained in test-runner(1), and the files that zfs-tests.sh uses are available for reference under /usr/share/zfs/runfiles. To specify a custom runfile, use the -r option:

$ /usr/share/zfs/zfs-tests.sh -r my_tests.run

Otherwise user can set needed tags to run only specific tests.

Test results

While the ZFS Test Suite is running, one informational line is printed at the end of each test, and a results summary is printed at the end of the run. The results summary includes the location of the complete logs, which is logged in the form /var/tmp/test_results/[ISO 8601 date]. A normal test run launched with the zfs-tests.sh wrapper script will look something like this:

$ /usr/share/zfs/zfs-tests.sh -v -d /tmp/test

--- Configuration ---
Runfile:         /usr/share/zfs/runfiles/linux.run
STF_TOOLS:       /usr/share/zfs/test-runner
STF_SUITE:       /usr/share/zfs/zfs-tests
STF_PATH:        /var/tmp/constrained_path.G0Sf
FILEDIR:         /tmp/test
FILES:           /tmp/test/file-vdev0 /tmp/test/file-vdev1 /tmp/test/file-vdev2
LOOPBACKS:       /dev/loop0 /dev/loop1 /dev/loop2 
DISKS:           loop0 loop1 loop2
NUM_DISKS:       3
FILESIZE:        4G
ITERATIONS:      1
TAGS:            functional
Keep pool(s):    rpool


/usr/share/zfs/test-runner/bin/test-runner.py  -c /usr/share/zfs/runfiles/linux.run \
    -T functional -i /usr/share/zfs/zfs-tests -I 1
Test: /usr/share/zfs/zfs-tests/tests/functional/arc/setup (run as root) [00:00] [PASS]
...more than 1100 additional tests...
Test: /usr/share/zfs/zfs-tests/tests/functional/zvol/zvol_swap/cleanup (run as root) [00:00] [PASS]

Results Summary
SKIP	  52
PASS	 1129

Running Time:	02:35:33
Percent passed:	95.6%
Log directory:	/var/tmp/test_results/20180515T054509