Commit Graph

4637 Commits

Author SHA1 Message Date
Matthew Ahrens 2e5dc449c1 OpenZFS 9337 - zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:

The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf.  Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.

Performance Testing:

 - Disabled kmem_flags
 - Tuned dbuf_cache_max_bytes very low (128K)
 - Tuned zfs_arc_max very low (64M)

Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.

Results:

    | Test                   | Time (trunk / diff) | I/Os (trunk / diff) |
    +------------------------+---------------------+---------------------+
    | zpool import           |     0:05 / 0:06     |    12.9k / 12.9k    |
    | zfs get all (uncached) |     1:36 / 0:53     |    16.7k / 5.7k     |
    | zfs get all (cached)   |     1:36 / 0:51     |    16.0k / 6.0k     |

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
2018-07-12 10:49:27 -07:00
Don Brady e4e94ca315 OpenZFS 9426 - metaslab size can exceed offset addressable by spacemap
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9426
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f1c88afb1
Closes #7700
2018-07-11 15:55:48 -07:00
Andriy Gapon e902ddb0f8 OpenZFS 9479 - fix wrong format specifier for vdev_id
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9479
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/20aa447c
Closes #7699
2018-07-11 15:53:02 -07:00
Brian Behlendorf ac09630d8b
Fix zpl_mount() deadlock
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock.  This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7598 
Closes #7659 
Closes #7691 
Closes #7693
2018-07-11 15:49:10 -07:00
Brian Behlendorf 33a19e0fd9
Fix kernel unaligned access on sparc64
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time.  Otherwise
fallback to using the sa_copy_data() function.

The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space.  In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7642 
Closes #7684
2018-07-11 13:10:40 -07:00
Matthew Ahrens 2dca37d8dc OpenZFS 9424 - ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')"
Ztest failed with the following crash.

    ::status

    debugging core file of ztest (64-bit) from clone-dc-slave-280-bc7947b1.dcenter
    file: /usr/bin/amd64/ztest
    initial argv: /usr/bin/amd64/ztest
    threading model: raw lwps
    status: process terminated by SIGABRT (Abort), pid=2150 uid=1025 code=-1
    panic message: failure for thread 0xfffffd7fff112a40, thread-id 1: unprotected error in call to Lua API (Invalid
    value type 'function' for key 'error')

    ::stack

    libc.so.1`_lwp_kill+0xa()
    libc.so.1`_assfail+0x182(fffffd7fffdfe8d0, 0, 0)
    libc.so.1`assfail+0x19(fffffd7fffdfe8d0, 0, 0)
    libzpool.so.1`vpanic+0x3d(fffffd7ffaa58c20, fffffd7fffdfeb00)
    0xfffffd7ffaa28146()
    0xfffffd7ffaa0a109()
    libzpool.so.1`luaD_throw+0x86(3011a48, 2)
    0xfffffd7ffa9350d3()
    0xfffffd7ffa93e3f1()
    libzpool.so.1`zcp_lua_to_nvlist+0x33(3011a48, 1, 2686470, fffffd7ffaa2e2c3)
    libzpool.so.1`zcp_convert_return_values+0xa4(3011a48, 2686470, fffffd7ffaa2e2c3, fffffd7fffdfedd0)
    libzpool.so.1`zcp_pool_error+0x59(fffffd7fffdfedd0, 1e0f450)
    libzpool.so.1`zcp_eval+0x6f8(1e0f450, fffffd7ffaa483f8, 1, 0, 6400000, 1d33b30)
    libzpool.so.1`dsl_destroy_snapshots_nvl+0x12c(2786b60, 0, 484750)
    libzpool.so.1`dsl_destroy_snapshot+0x4f(fffffd7fffdfef70, 0)
    ztest_dsl_dataset_cleanup+0xea(fffffd7fffdff4c0, 1)
    ztest_dataset_destroy+0x53(1)
    ztest_run+0x59f(fffffd7fff0e0498)
    main+0x7ff(1, fffffd7fffdffa88)
    _start+0x6c()

The problem is that zcp_convert_return_values() assumes that there's
exactly one value on the stack, but that isn't always true. It ends up
putting the wrong thing on the stack which is then consumed by
zcp_convert_return values, which either adds the wrong message to the
nvlist, or blows up.

The fix is to make sure that callers of zcp_convert_return_values()
clear the stack before pushing their error message, and
zcp_convert_return_values() should VERIFY that the stack is the expected
size.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9424
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb7e57429
Closes #7696
2018-07-10 21:29:23 -07:00
Brian Behlendorf e2cc448b60
Reduce zdb output when pool contains checkpoint
When running zdb without additional arguments against a pool containing
a checkpoint the entire checkpoint spacemap should not be dumped.  Make
this behavior conditional upon passing the -mmmm option as described in
the zdb(8) man page.

     -mmmm   Display every spacemap record.

Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7702
2018-07-10 21:23:17 -07:00
Matthew Ahrens 00c405b4b5 OpenZFS 9454 - ::zfs_blkstats should count embedded blocks
When we do a scrub or resilver, ZFS counts the different types of blocks,
which can be printed by the ::zfs_blkstats mdb dcmd. However, it fails to
count embedded blocks.

Porting notes:
* Commit d4a72f23 moved count_blocks under a BP_IS_EMBEDDED conditional
  as part of the sequential resilver functionality.  Since phys_birth
  would be zero that case should never happen as described above.  This
  is confirmed by the code coverage analysis.  Remove the conditional
  to realign that aspect of this function with OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9454
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d8a447a7
Closes #7697
2018-07-10 10:41:38 -07:00
Prakash Surya ab11916583 OpenZFS 9456 - ztest failure in zil_commit_waiter_timeout
Problem
=======

Illumos bug 8373 was integrated, which now presents a code path where
"dmu_tx_assign" can fail.  When "dmu_tx_assign" fails, it will not issue
the lwb that was passed in to "zil_lwb_write_issue".  As a result, when
"zil_lwb_write_issue" returns, the lwb will still be in the "opened"
state, just as it was when "zil_lwb_write_issue" was originally called.

Solution
========

As a result of this new call path, the failed assertion needs to be
modified to be aware of this new possibility. Thus, we can only assert
that the lwb is no longer in the "opened" state if the returned lwb is
non-null, since we cannot differentiate between the case of
"dmu_tx_assign" failing or "zio_alloc_zil" failing within the call to
"zil_lwb_write_issue".

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9456
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8b09f4e
Closes #7695
2018-07-10 10:25:14 -07:00
Serapheim Dimitropoulos a7ed98d8b5 OpenZFS 9330 - stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.

The problem can be seen by attempting to create a dataset using the -p
option with many levels:

    panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff

    fffffffffbc3aa60 unix:die+100 ()
    fffffffffbc3ab70 unix:trap+157d ()
    ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
    ffffff00083d7050 zfs:dbuf_rele+2e ()
    ...
    ffffff00083d7080 zfs:dsl_dir_close+32 ()
    ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
    ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
    ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
    ffffff00083d7130 zfs:dbuf_rele+2e ()
    ... The block above repeats once per directory in the ...
    ... create -p command, working towards the root ...
    ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
    ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
    ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
    ffffff00083dbaa0 zfs:findfunc+23 ()
    ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
    ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
    ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
    ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
    ffffff00083dbd00 genunix:cdev_ioctl+45 ()
    ffffff00083dbd40 specfs:spec_ioctl+5a ()
    ffffff00083dbdc0 genunix:fop_ioctl+7b ()
    ffffff00083dbec0 genunix:ioctl+18e ()
    ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()

Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K.  Enough time has passed
  that 16K can be reasonably assumed to be the default value.  It
  was increased in the 3.15 kernel released in June of 2014.

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>

OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a
Closes #7681
2018-07-09 13:02:50 -07:00
Brian Behlendorf 66df02497c
ZTS: clean_mirror and scrub_mirror cleanup
Remove the dependency on partitionable devices for the clean_mirror
and scrub_mirror test cases.  This allows for the setup and cleanup
of the test cases to be simplified by removing the need for complex
partitioning.

This change also resolves a issue where the clean_mirror devices
were not being properly damaged since the device name was not a
full path.  The result being loopX files were being left in the
top level test_results directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7434 
Closes #7690
2018-07-09 12:46:14 -07:00
Troels Nørgaard 94370f5955 Default ashift for Amazon EC2 NVMe devices
Add a default 4 KiB ashift for Amazon EC2 NVMe devices on instances with
NVMe ephemeral devices, such as the types c5d, f1, i3 and m5d.
As per the official documentation [1] a 4096 byte blocksize should be
used to match the underlying hardware.

The string was identified via:

$ sudo sginfo -M /dev/nvme0n1
INQUIRY response (cmd: 0x12)
----------------------------
Device Type                        0
Vendor:                    NVMe
Product:                   Amazon EC2 NVMe
Revision level:

$ lsblk -io KNAME,TYPE,SIZE,MODEL
KNAME   TYPE    SIZE MODEL
nvme0n1 disk  442.4G Amazon EC2 NVMe Instance Storage

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
    storage-optimized-instances.html
    Retrived 2018-07-03

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Troels Nørgaard <tnn@tradeshift.com>
Closes #7676
2018-07-06 16:15:19 -07:00
Paul Dagnelie 3f4e6d6f36 Cause autogen.sh to fail if autoreconf fails
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7683
2018-07-06 09:27:37 -07:00
Serapheim Dimitropoulos 4d044c4c1d OpenZFS 9238 - ZFS Spacemap Encoding V2
Motivation
==========

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
    This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
    device removal, zpool checkpoint) we've started imposing limits in the
    vdevs that can be used with them based on the maximum addressable offset
    (currently 64PB for a top-level vdev).

New encoding
============

The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.

Other references:
=================

The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
2018-07-05 12:02:34 -07:00
Ahmed Gahnem 4e82b4be78 OpenZFS 9184 - Add ZFS performance test for fixed blocksize random read/write IO
This change introduces a new performance test which does random reads
and writes, but instead of using `bssplit` to determine the block size,
it uses a fixed blocksize. Additionally, some new IO sizes are added to
other tests and timestamp data is recorded with the performance data.

Authored by: Ahmed Gahnem <ahmedg@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Requires-builders: perf

OpenZFS-issue: https://www.illumos.org/issues/9184
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/659
External-issue: DLPX-46724
Closes #7660
2018-07-02 13:46:06 -07:00
Tom Caputi 370bbf66ae Fix coverity defects: CID 176037
CID 176037: Uninitialized scalar variable

This patch fixes an uninitialized variable defect caught by
coverity and introduced in 69830602

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7667
2018-07-02 13:37:48 -07:00
Brian Behlendorf e03a41a604
ZTS: Improve enospc tests
The enospc_002_pos test case would frequently fail due a command
succeeding when it was expected to fail due to lack of space.
In order to make this far less likely, files are created across
multiple transaction groups in order to consume as many unused
blocks as possible.

The dependency that the tests run on a partitioned block device
has been removed.  It's simpler to use sparse files.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7663
2018-06-29 09:40:32 -07:00
Tom Caputi da2feb42fb Fix 'zfs recv' of non large_dnode send streams
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that that a value of 0 is treated as
DNODE_MIN_SLOTS.

Tested-by:  DHE <git@dehacked.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7617 
Closes #7662
2018-06-28 14:55:11 -07:00
Chunwei Chen edf60b8645 Enforce PROP_ONETIME on zpool properties
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #7661
2018-06-28 14:49:17 -07:00
Tom Caputi 69830602de Raw receive fix and encrypted objset security fix
This patch fixes two problems with the encryption code. First, the
current code does not correctly prohibit the DMU from updating
dn_maxblkid during object truncation within a raw receive. This
usually only causes issues when the truncating DRR_FREE record is
aggregated with DRR_FREE records later in the receive, so it is
relatively hard to hit.

Second, this patch fixes a security issue where reading blocks
within an encrypted object did not guarantee that the dnode block
itself had ever been verified against its MAC. Usually the
verification happened anyway when the bonus buffer was read, but
some use cases (notably zvols) might never perform the check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7632
2018-06-28 09:20:34 -07:00
Tim Chase 3be1eb29da Fix formatting in zpool-features(5)
The formatting of the features beginning with large_blocks was broken
when the zpool_checkpoint feature was added.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7658
2018-06-27 09:34:25 -07:00
Eitan Adler fb8a10d5be OpenZFS 9521 - Add checkpoint field
Add checkpoint field in the default list of the zpool-list man page

Authored by: Eitan Adler <lists@eitanadler.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: kpande <github@tripleback.net>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9521
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5a860f7b
Closes #7658
2018-06-27 09:33:37 -07:00
Low-power aebfb84851 Fix missing option '-e' in zpool online usage
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #7655
2018-06-26 10:17:55 -07:00
Serapheim Dimitropoulos d2734cce68 OpenZFS 9166 - zfs storage pool checkpoint
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in big block comment of
spa_checkpoint.c

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames members of "struct metaslab trees to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defauls to 4KB. This is ok for metaslab
  space maps in terms of bandwirdth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_chedkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Porting notes:

* The part of dsl_scan_sync() which handles async destroys has
  been moved into the new dsl_process_async_destroys() function.

* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
  to block device backed pools.

* ZTS:
  * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

  * Don't use large dd block sizes on /dev/urandom under Linux in
    checkpoint_capacity.

  * Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
    SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
    its attempts to fill the pool

  * Create the base and nested pools with sync=disabled to speed up
    the "setup" phase.

  * Clear labels in test pool between checkpoint tests to avoid
    duplicate pool issues.

  * The import_rewind_device_replaced test has been marked as "known
    to fail" for the reasons listed in its DISCLAIMER.

  * New module parameters:

      zfs_spa_discard_memory_limit,
      zfs_remove_max_bytes_pause (not documented - debugging only)
      vdev_max_ms_count (formerly metaslabs_per_vdev)
      vdev_min_ms_count

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8
Closes #7570
2018-06-26 10:07:42 -07:00
Tim Chase 88eaf610d9 Use "eval" in history_002_pos for log_must
Otherwise the output is consumed by the output redirection.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7570
2018-06-26 09:58:05 -07:00
ajs124 a8577bdb32 Fix duplicate "fB" typo
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: ajs124 <git@ajs124.de>
Closes #7649
2018-06-25 09:50:01 -07:00
Serapheim Dimitropoulos 7637ef8d23 OpenZFS 9591 - ms_shift can be incorrectly changed
ms_shift can be incorrectly changed changed in MOS config for
indirect vdevs that have been historically expanded

According to spa_config_update() we expect new vdevs to have
vdev_ms_array equal to 0 and then we go ahead and set their metaslab
size. The problem is that indirect vdevs also have vdev_ms_array == 0
because their metaslabs are destroyed once their removal is done.

As a result, if a vdev was expanded and then removed may have its
ms_shift changed if another vdev was added after its removal.
Fortunately this behavior does not cause any type of crash or bad
behavior in the kernel but it can confuse zdb and anyone doing any kind
of analysis of the history of the pools.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/pull/651
OpenZFS-issue: https://illumos.org/issues/9591a
External-issue: DLPX-58879
Closes #7644
2018-06-21 09:35:26 -07:00
Matthew Ahrens af43029484 Remove suffix from zio taskq names
For zio taskq's which have multiple instances (e.g. z_rd_int_0,
z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix).
This makes performance analysis more difficult, because by default,
`perf` includes the thread name (which is the same as the taskq name) in
the stack trace.  This means that we get 8 different stacks, all of
which are doing the same thing, but are executed from different taskq's.

We should remove the suffix of the taskq name, so that all the
read-interrupt threads are named z_rd_int.

Note that we already support multiple taskq's with the same name.  This
happens when there are multiple pools.  In this case the taskq has a
different tq_instance, which shows up in /proc/spl/taskq-all.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7646
2018-06-20 14:07:50 -07:00
Brian Behlendorf e4a3297a04
ZTS: Adopt OpenZFS test analysis script
Adopt and extend the OpenZFS ZTS results analysis script for use
with ZFS on Linux.  This allows for automatic analysis of tests
which may be skipped for a variety or reasons or which are not
entirely reliable.

In addition to the list of 'known' failures, which have been updated
for ZFS on Linux, there in a new 'maybe' section.  This mapping
include tests which might be correctly skipped depending on the
test environment.  This may be because of a missing dependency or
lack of required kernel support.  This list also includes tests
which normally pass but might on occasion fail for a harmless
reason.

The script was also extended include a reason for why a given test
might be skipped or may fail.  The reason will be included after
the test in the "results other than PASS that are expected" section.
For failures it is preferable to set the reason to the GitHub issue
number and for skipped tests several generic reasons are available.
You may also specify a custom reason if needed.

All tests were added back in to the linux.run file even if they are
expected to failed.  There is value in running tests which may not
pass, the expected results for these tests has been encoded in
the new analysis script.

All tests which were disabled because they ran more slowly on a
32-bit system have been re-enabled.  Developers working on 32-bit
systems should assess what it reasonable for their environment.

The unnecessary dependency on physical block devices was removed for
the checksum, grow_pool, and grow_replicas test groups so they are
no longer skipped.  Updated the filetest_001_pos test case to run
properly now that it is enabled and moved the grow tests in to a
single directory.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7638
2018-06-20 14:03:13 -07:00
Brian Behlendorf 1c38ac61e1
Linux 4.14 compat: blk_queue_stackable()
The blk_queue_stackable() function was replaced in the 4.14 kernel
by queue_is_rq_based(), commit torvalds/linux@5fdee212.  This change
resulted in the default elevator being used which can negatively
impact performance.

Rather than adding additional compatibility code to detect the
new interface unconditionally attempt to set the elevator.  Since
we expect this to fail for block devices without an elevator the
error message has been moved in to zfs_dbgmsg().

Finally, it was observed that the elevator_change() was removed
from the 4.12 kernel, commit torvalds/linux@c033269.  Update the
comment to clearly specify which are expected to export the
elevator_change() symbol.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7645
2018-06-19 21:52:45 -07:00
Brian Behlendorf 6413c95fbd
Linux 4.18 compat: inode timespec -> timespec64
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe.  As part of this change the current_time() function was
also updated to return the timespec64 type.

Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode.  It should
be used when working with inode timestamps to ensure matching types.

The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t.  Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.

Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other.  They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7643
2018-06-19 21:51:18 -07:00
kpande aeb39df726 Fix typo in comment, handeling->handling
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #7641
2018-06-18 21:51:06 -07:00
Tom Caputi cd32e5db8b Add ASSERT to debug encryption key mapping issues
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7637
2018-06-18 14:10:54 -07:00
Richard Yao 517d247192 copy-builtin: SPL must be in Kbuild first
The recent SPL merge caused a regression in kernels with ZFS integrated
into the sources where our modules would be initialized in alphabetical
order, despite icp requiring the spl module be loaded first. This caused
kernels with ZFS builtin to fail to boot.

We resolve this by adding a special case for the spl that lists it
first. It is somewhat ugly, but it works.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #7595 
Closes #7606
2018-06-15 15:16:29 -07:00
John Gallagher 917f475fba Add tunables for channel programs
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes #7618
2018-06-15 15:10:42 -07:00
Brian Behlendorf 7b98f0d91f
Linux compat 4.18: check_disk_size_change()
Added support for the bops->check_events() interface which was
added in the 2.6.38 kernel to replace bops->media_changed().
Fully implementing this functionality allows the volume resize
code to rely on revalidate_disk(), which is the preferred
mechanism, and removes the need to use check_disk_size_change().

In order for bops->check_events() to lookup the zvol_state_t
stored in the disk->private_data the zvol_state_lock needs to
be held.  Since the check events interface may poll the mutex
has been converted to a rwlock for better concurrently.  The
rwlock need only be taken as a writer in the zvol_free() path
when disk->private_data is set to NULL.

The configure checks for the block_device_operations structure
were consolidated in a single kernel-block-device-operations.m4
file.

The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks
and assoicated dead code was removed.  This interface was added
to the 2.6.28 kernel which predates the oldest supported 2.6.32
kernel and will therefore always be available.

Updated maximum Linux version in META file.  The 4.17 kernel
was released on 2018-06-03 and ZoL is compatible with the
finalized kernel.

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7611
2018-06-15 15:05:21 -07:00
Allan Jude 29445fe3a0 Reserve DMU_BACKUP_FEATURE for ZSTD
Reserve bit 25 for the ZSTD compression feature from FreeBSD.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #7626
2018-06-14 09:47:26 -07:00
Brian Behlendorf 7e0594a3da
Remove libefi __linux__ wrappers
The ZoL version of libefi has been modified for Linux in several
places outside the existing __linux__ wrappers.  Remove them to
make the code easier to read and so as not to mislead anyone that
these are the sole modifications for Linux.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7625
2018-06-14 09:43:32 -07:00
Brian Behlendorf c91cf36fc2
Fix ztest_vdev_add_remove() test case
Commit 2ffd89fc allowed two new errors to be reported by zil_reset()
in order to provide a descriptive error message regarding why a log
device could not be removed.  However, the new return values were
not handled in the ztest_vdev_add_remove() test case resulting in
ztest failures during automated testing.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7630
2018-06-14 09:41:27 -07:00
Matthew Ahrens 1fac63e56f OpenZFS 9577 - remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary -
we can instead pass a flag down in a few places to prevent recursive
dbuf eviction. Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing
   TSD values (by setting to NULL vs non-NULL) is expensive, and
   we do it very often.
3. According to Nexenta, the current semantics can cause a
   deadlock when concurrently calling dmu_objset_evict_dbufs()
   (which is rare today, but they are working on a "parallel
   unmount" change that triggers this more easily):

Porting Notes:
* Minor conflict with OpenZFS 9337 which has not yet been ported.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9577
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645
External-issue: DLPX-58547
Closes #7602
2018-06-13 11:05:06 -07:00
Brian Behlendorf 232dd8b956
Fix efi_get_info() zvol detection
Partition detection for zvol devices was not working correctly
resulting inconsistent partitioning behavior when layering pools
on top of zvols.  This isn't a supported configuration but we'd
still like it to work properly.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7624
2018-06-13 10:20:58 -07:00
John Gallagher ab24877bd3 ZTS: deletes home directories in /export/home
In the cleanup for the privilege tests, an empty variable, empty because
the corresponding setup is skipped on Linux, results in /export/home
being deleted. This patch adds an assertion that the variable is not
empty, and causes the cleanup to be skipped on Linux as well.

Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1099
Closes #7615
2018-06-12 10:42:26 -07:00
bunder2015 5277571260 ZTS: cleanup user_run
user_run leaves two files in /tmp, moving them to $TEST_BASE_DIR and
adding them to the default cleanup routine.

Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7614
2018-06-12 10:37:12 -07:00
Tom Caputi c634808ebb Add pyzfs build directories to gitignore
The recent addition of pyzfs does not include the generated 'build'
and 'pyzfs.egg-info' directories in the pyzfs .gitignore or the
'make clean' target. This patch simply corrects this problem.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7612
2018-06-11 18:42:12 -07:00
Paul Zuchowski 2ffd89fcb9 Wrong error message when removing log device
In the case where the pool is loaded without the crypto
keys necessary to playback the intent log, and log device
removal is attempted, a generic busy message is received.
Change the message to inform the user that the datasets
must be mounted.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7518
2018-06-07 18:07:29 -07:00
Brian Behlendorf 174bcd581d
Fix preemptible warning in aggsum_add()
In the new aggsum counters the CPU_SEQID macro should be surrounded by
kpreempt_disable)() and kpreempt_enable() calls to prevent a Linux
kernel BUG warning.  The addsum_add() function use the cpuid to
minimize lock contention when selecting a bucket, after selection
the bucket is protected by a mutex and it is safe to reschedule the
process to a different processor at any time.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7609
Closes #7610
2018-06-07 15:55:11 -07:00
Antonio Russo 39042f9736 Tunable directory for zfs runtime scripts
zpool and zed place scripts in subdirectories of libexecdir. Some
distributions locate architecture independent scripts in other locations
(e.g. Debian). To avoid these paths getting out of sync, centralize the
definitions.

Build zfs-test's default.cfg by Makefile.  Use the new directory
logic building tests/zfs-tests/include/default.cfg.in.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7597
2018-06-07 09:59:59 -07:00
Nathaniel Clark fba33c3819 Don't panic on bad SA_MAGIC in sa_build_index
If sa_build_index() encounters a corrupt buffer, don't panic.
Add info to zfs ring buffer and return EIO.  This allows for a cleaner
error recovery path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Issue #6500 
Closes #7487
2018-06-07 09:51:56 -07:00
Antonio Russo 7106b23640 Minor documentation, logging, and testing typos
This patch collects some minor inconsistencies and typos in the
documentation, logging and testing infrastructure.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7608
2018-06-07 09:38:39 -07:00
Tom Caputi b405837a6c Update the correct abd in l2arc_read_done()
This patch fixes an issue where l2arc_read_done() would always
write data to b_pabd, even if raw encrypted data was requested.
This only occured in cases where the L2ARC device had a different
ashift than the main pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7586 
Closes #7593
2018-06-06 10:17:50 -07:00