Commit Graph

3048 Commits

Author SHA1 Message Date
Matthew Ahrens d64c6a2eee
Use zfs_dbgmsg to log metaslab_load/unload
Metaslabs are now (usually) loaded and unloaded infrequently, but when
that is not the case, it is useful to have a log of when and why these
events happened.

This commit enables the zfs_dbgmsg() in metaslab_load(), and adds a
zfs_dbgmsg() in metaslab_unload().

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10683
2020-08-12 10:10:50 -07:00
Matthew Macy e111c80247
Restore ARC MFU/MRU pressure
The arc_adapt() function tunes LRU/MLU balance according to 4 types of
cache hits (which is passed as state agrument): ghost LRU, LRU, MRU,
ghost MRU. If this function is called with wrong cache hit (state),
adaptation will be sub-optimal and performance will suffer.

Some time ago upstream received this commit:

6950 ARC should cache compressed data) in arc_read() do next
sequence (access to ghost buffer)

Before this commit, hit to any ghost list was passed arc_adapt() before
call to arc_access() which revive element in cache and change state from
ghost to real hit.

After this commit, the order of calls was reverted and arc_adapt() is
now called only with «real» hits even if hit was in one of two ghost
lists, which renders ghost lists useless and breaks the ARC algorithm.

FreeBSD fixed this problem locally in Change D19094 / Commit r348772.

This change is an adaptation of the above commit to the current arc
code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10548 
Closes #10618
2020-08-12 10:03:24 -07:00
Coleman Kane d817c17100 Linux 5.9 compat: make_request_fn replaced with submit_bio interface
The make_request_fn and associated API was replaced recently in a
Linux 5.9 merge, to replace its functionality with a new submit_bio
member in struct block_device_operations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #10696
2020-08-11 13:37:33 -07:00
Allan Jude 9777044f1c
Fix typo
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #10694
2020-08-11 13:16:57 -07:00
Ryan Moeller ed726fb063
Move ZVOL_DIR back to zfs.h
This was previously moved because nothing else in-tree uses it, but
evidently DilOS uses it out of tree.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <freqlabs@freebsd.org>
Closes #10361 
Closes #10685
2020-08-11 13:12:12 -07:00
Matthew Macy 0f95ddcc0c
FreeBSD: update vaccess signature on most recent HEAD
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10682
2020-08-07 14:16:01 -07:00
Paul Dagnelie 12045d0278
Clarify error message when a range-tree double-add occurs
In various other pieces of logic have resulted in situations where 
we double-free space in ZFS. This in turn results in a double-add 
to the range trees. These issues have been much more difficult to 
diagnose than they should have been, because the error handling 
around this case is much weaker than around the double remove case.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10654
2020-08-07 14:13:13 -07:00
Matthew Ahrens 0cab7970f9 Remove commented-out code
Remove dead code to make the implementation easier to understand.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes #10650
2020-08-05 10:28:18 -07:00
Matthew Ahrens c6f2b942be Remove KMC_NOMAGAZINE
Remove dead code to make the implementation easier to understand.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes #10650
2020-08-05 10:28:07 -07:00
Matthew Ahrens 3c09f6949a Remove KMC_QCACHE
Remove dead code to make the implementation easier to understand.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes #10650
2020-08-05 10:28:01 -07:00
Matthew Ahrens d519c10575 Remove KMC_NOHASH
Remove dead code to make the implementation easier to understand.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes #10650
2020-08-05 10:27:56 -07:00
Matthew Ahrens f68af67a0c Remove KMC_NOTOUCH
Remove dead code to make the implementation easier to understand.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes #10650
2020-08-05 10:27:46 -07:00
Matthew Ahrens 492db125dc Remove KMC_OFFSLAB
Remove dead code to make the implementation easier to understand.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes #10650
2020-08-05 10:25:37 -07:00
Matthew Ahrens d87676a9fa
Fix i/o error handling of livelists and zap iteration
Pool-wide metadata is stored in the MOS (Meta Object Set).  This
metadata is stored in triplicate, in addition to any pool-level
reduncancy (e.g. RAIDZ).  However, if all 3+ copies of this metadata are
not available, we can still get EIO/ECKSUM when reading from the MOS.
If we encounter such an error in syncing context, we have typically
already committed to making a change that we now can't do because of the
corrupt/missing metadata.  We typically "handle" this with a `VERIFY()`
or `zfs_panic_recover()`.  This prevents the system from continuing on
in an undefined state, while minimizing the amount of error-handling
code.

However, there are some code paths that ignore these i/o errors, or
`ASSERT()` that they don't happen.  Since assertions are disabled on
non-debug builds, they effectively ignore them as well.  This can lead
to ZFS continuing on in an incorrect state, potentially leading to
on-disk inconsistencies.

This commit adds handling for these i/o errors on MOS metadata,
typically with a `VERIFY()`:

* Handle error return from `zap_cursor_retrieve()` in 4 places in
`dsl_deadlist.c`.

* Handle error return from `zap_contains()` in `dsl_dir_hold_obj()`.
Turns out this call isn't necessary because we can always call
`zap_lookup()`.

* Handle error return from `zap_lookup()` in `dsl_fs_ss_limit_check()`.

* Handle error return from `zap_remove()` in `dsl_dir_rename_sync()`.

* Handle error return from `zap_lookup()` in
`dsl_dir_remove_livelist()`.

* Handle error return from `dsl_process_sub_livelist()` in
`spa_livelist_delete_cb()`.

Additionally:

* Augment the internal history log message for `zfs destroy` to note
which method is used (e.g. bptree, livelist, or, synchronous) and the
mintxg.

* Correct a comment in `dbuf_init()`.

* Correct indentation in `dsl_dir_remove_livelist()`.

Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10643
2020-08-05 10:22:09 -07:00
Matthew Macy 1b376d176e
FreeBSD: Add support for lockless lookup
Authored-by: mjg <mjg@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10657
2020-08-05 10:19:51 -07:00
Matthew Macy 22dcf89181
Add missed thread_exit() to vdev_{autotrim,rebuild}_thread
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10668
2020-08-05 10:17:07 -07:00
George Amanakis da60484db5
Fix logging in l2arc_rebuild()
In case the L2ARC rebuild was canceled, do not log to spa history
log as the pool may be in the process of being removed and a panic
may occur:

BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:spa_history_log_internal+0xb1/0x120 [zfs]
Call Trace:
 l2arc_rebuild+0x464/0x7c0 [zfs]
 l2arc_dev_rebuild_start+0x2d/0x130 [zfs]
 ? l2arc_rebuild+0x7c0/0x7c0 [zfs]
 thread_generic_wrapper+0x78/0xb0 [spl]
 kthread+0xfb/0x130
 ? IS_ERR+0x10/0x10 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10659
2020-08-01 11:17:18 -07:00
Matthew Macy fe628bc21d
Fix page fault in zfsctl_snapdir_getattr
Must acquire the z_teardown_lock before accessing the zfsvfs_t object.
I can't reproduce this panic on demand, but this looks like the
correct solution.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: asomers <asomers@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10656
2020-08-01 08:42:55 -07:00
Allan Jude 8fb79fdddb
Change the error handling for invalid property values
ZFS recv should return a useful error message when an invalid index
property value is provided in the send stream properties nvlist

With a compression= property outside of the understood range:

Before:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
internal error: Invalid argument
Aborted (core dumped)
```
Note: the recv completes successfully, the abort() is likely just to
make it easier to track the unexpected error code.

After:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
cannot receive compression property on testpool/recv: invalid property
value received 28.9M stream in 1 seconds (28.9M/sec)
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10631
2020-08-01 08:41:31 -07:00
Matthew Macy 47ed79ff60
Changes to make openzfs build within FreeBSD buildworld
A collection of header changes to enable FreeBSD to build
with vendored OpenZFS.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10635
2020-07-31 21:30:31 -07:00
Ryan Moeller 0cc3454821
Convert Linux-isms to FreeBSD-isms in platform zfs_debug.c
Change some comments copied from the Linux code to describe 
the appropriate methods on FreeBSD.

Convert some tunables to ZFS_MODULE_PARAM so they get created 
on FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10647
2020-07-31 21:25:35 -07:00
Matthew Ahrens 3442c2a02d
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages.  This happens via 2 code paths:

1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory.  The ARC shrinker callback is invoked from the
page-allocation code path.

2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released.  However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`).  Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).

Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`.  This has several
negative impacts:

1. ZFS doesn't use RAM to cache data effectively.

2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.

3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.

To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.

With this commit:

1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker.  This limits the amount of time it takes
to allocate a single page.

2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system.  Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).

3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing.  This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.

4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.

5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-07-31 21:10:52 -07:00
Ryan Moeller 25499e2139
lua: Increase reserved stack space for FreeBSD in debug config
FreeBSD uses more stack space in debug configurations and can overflow
the stack while formatting the error message when the call depth limit
of 20 frames is reached.  This is readily reproduced by running the
gsub recursion test with increased kstack size.  I hit the panic with
16 pages per kstack, and noticed it go away when bumped to 17.

Reserve an additional 64 bytes on the stack when building for FreeBSD.
This is enough to avoid the panic with a deep stack while not wasting
too much space when the default stack size is used.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10634
2020-07-31 09:17:37 -07:00
Allan Jude eabf270b2c
Remove duplicate include of sys/zfeature.h in dmu_objset.c
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10636
2020-07-31 09:04:45 -07:00
Matthew Ahrens 948423a3d1
zfs promote does not delete livelist of origin
When a clone is promoted, its livelist is no longer accurate, so it is
discarded.  If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.

Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B

If we promote C, it discards C's livelist.  It should discard B's
livelist, but that is not happening.  The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C.  The incorrectly-freed blocks can be reallocated causing
checksum errors.  And when C is destroyed it can double-free the
incorrectly-freed blocks.

The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir.  So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed.  As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10652
2020-07-31 08:59:00 -07:00
Matthew Ahrens 3a92552f75
Fix error handling of vdev_top_zap
In `vdev_load()`, we look up several entries in the `vdev_top_zap`
object.  In most cases, if we encounter an i/o error, it will be
returned to the caller.  However, when handling
`VDEV_TOP_ZAP_ALLOCATION_BIAS`, if we get an i/o error, we may continue
on, which in theory could cause us to not realize that a vdev should be
used only for `special` allocations.

In practice, if we encountered an i/o error while looking for
`VDEV_TOP_ZAP_ALLOCATION_BIAS` in the `vdev_top_zap`, we'd also get an
i/o error while looking for other entries in the same object, and thus
the zpool open/import would fail.  Therefore the impact of this problem
is negligible.

This commit adds error handling for i/o errors while accessing the
`vdev_top_zap`, so that we aren't relying on unrelated code to fail for
us.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10637
2020-07-29 17:04:34 -07:00
Matthew Macy 27d96d2254
Rename refcount.h to zfs_refcount.h
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10620
2020-07-29 16:35:33 -07:00
Serapheim Dimitropoulos 843e9ca2e1
Introduce names for ZTHRs
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #10630
2020-07-29 09:43:33 -07:00
Matthew Macy 5678d3f593
Prefix zfs internal endian checks with _ZFS
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10621
2020-07-28 13:02:49 -07:00
Matthew Ahrens 3eabed74c0
Fix lua stack overflow on recursive call to gsub()
The `zfs program` subcommand invokes a LUA interpreter to run ZFS
"channel programs".  This interpreter runs in a constrained environment,
with defined memory limits.  The LUA stack (used for LUA functions that
call each other) is allocated in the kernel's heap, and is limited by
the `-m MEMORY-LIMIT` flag and the `zfs_lua_max_memlimit` module
parameter.  The C stack is used by certain LUA features that are
implemented in C.  The C stack is limited by `LUAI_MAXCCALLS=20`, which
limits call depth.

Some LUA C calls use more stack space than others, and `gsub()` uses an
unusually large amount.  With a programming trick, it can be invoked
recursively using the C stack (rather than the LUA stack).  This
overflows the 16KB Linux kernel stack after about 11 iterations, less
than the limit of 20.

One solution would be to decrease `LUAI_MAXCCALLS`.  This could be made
to work, but it has a few drawbacks:

1. The existing test suite does not pass with `LUAI_MAXCCALLS=10`.

2. There may be other LUA functions that use a lot of stack space, and
the stack space may change depending on compiler version and options.

This commit addresses the problem by adding a new limit on the amount of
free space (in bytes) remaining on the C stack while running the LUA
interpreter: `LUAI_MINCSTACK=4096`.  If there is less than this amount
of stack space remaining, a LUA runtime error is generated.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10611 
Closes #10613
2020-07-27 16:11:47 -07:00
Matthew Macy e64cc4954c
Refactor ccompile.h to not include system headers
This is a step toward being able to vendor the OpenZFS code in FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10625
2020-07-25 20:09:50 -07:00
Matthew Macy 6d8da84106
Make use of ZFS_DEBUG consistent within kmod sources
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10623
2020-07-25 20:07:44 -07:00
Matthew Macy f5b189f937
FreeBSD: Fixes required to build ZFS on PowerPC
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10622
2020-07-25 11:00:23 -07:00
Ryan Moeller d364de7a89
FreeBSD: Remove accidental ARC size limiter
i386 has some additional memory reservation logic that limits the size
of the reported available memory.  This was accidentally being used on
all arches due to a missing header.

Include machine/vmparam.h in freebsd/zfs/arc_os.c to pull in the
missing UMA_MD_SMALL_ALLOC definition.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10616
2020-07-25 10:49:49 -07:00
Ryan Moeller cb18d88060
FreeBSD: Implement arc_free_memory
This is only used for the kstat, but something other than 0 is nice.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10626
2020-07-25 10:47:18 -07:00
Brian Atkinson 6fba7bfd0e
Add gang ABD child to parent gang ABD
By design a gang ABD can not have another gang ABD as a child. This is
to make sure the logical offset in a gang ABD is consistent with the
individual ABDS it contains as children. If a gang ABD is added as a
child of a gang ABD we will add the individual children of the gang ABD
to the parent gang ABD. This allows for a consistent view of offsets
within the parent gang ABD.

Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10430
2020-07-24 21:09:20 -07:00
Ryan Moeller 8348fac30c
Limit dbuf cache sizes based only on ARC target size by default
Set the initial max sizes to ULONG_MAX to allow the caches to grow
with the ARC.

Recalculate the metadata cache size on demand so it can adapt, too.

Update descriptions in zfs-module-parameters(5).

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10563 
Closes #10610
2020-07-24 20:38:48 -07:00
Matthew Ahrens 4fbdb10c7b
remove kmem_cache module parameter KMC_EXPIRE_AGE
By default, `spl_kmem_cache_expire` is `KMC_EXPIRE_MEM`, meaning that
objects will be removed from kmem cache magazines by
`spl_kmem_cache_reap_now()`.

There is also a module parameter to change this to `KMC_EXPIRE_AGE`,
which establishes a maximum lifetime for objects to stay in the
magazine.  This setting has rarely, if ever, been used, and is not
regularly tested.

This commit removes the code for `KMC_EXPIRE_AGE`, and associated module
parameters.

Additionally, the unused module parameter
`spl_kmem_cache_obj_per_slab_min` is removed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10608
2020-07-24 09:39:26 -07:00
Ryan Moeller f7a68f99d0
FreeBSD: Remove some code duplication in sysctl_os.c
Drop unnecessary redefinition's of several arcstat values.
Put missing extern declaration of arc_no_grow_shift in arc_impl.h.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10609
2020-07-23 17:35:34 -07:00
Matthew Ahrens 5dd92909c6
Adjust ARC terminology
The process of evicting data from the ARC is referred to as
`arc_adjust`.

This commit changes the term to `arc_evict`, which is more specific.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10592
2020-07-22 09:51:47 -07:00
Ryan Moeller 0421f257b2
FreeBSD: Add legacy arc_min and arc_max
These tunables were renamed from vfs.zfs.arc_min and 
vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max.
Add legacy compat tunables for the old names.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10579
2020-07-19 10:15:34 -07:00
Matthew Ahrens 026e529cb3
Remove skc_reclaim, hdr_recl, kmem_cache shrinker
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure.  This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function.  This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.

The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`.  When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op.  When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it.  Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.

The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer.  Therefore, additionally waking it from the
shrinker calbacks has little benefit.

The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second.  Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times.  Therefore, this mechanism consumes
significant amounts of CPU time.

The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
 1. Return free objects from the magazine layer to the slab layer
 2. Return entirely-free slabs to the page layer (i.e. free pages)

These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches.  The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.

These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine.  So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.

We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected.  Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.

This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10576
2020-07-19 09:58:30 -07:00
Brian Behlendorf e862b7ecfc
Linux 4.10 compat: has_capability()
Stock kernels older than 4.10 do not export the has_capability()
function which is required by commit e59a377.  To avoid breaking
the build on older kernels revert to the safe legacy behavior and
return EACCES when privileges cannot be checked.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10565
Closes #10573
2020-07-19 09:56:21 -07:00
Matthew Ahrens 8fbf432ae2
anon_pages are not free/evictable
`arc_free_memory()` returns the amount of memory that the ARC considers
to be free.  This includes pages that are not actually free, but can be
evicted with essentially zero cost (without doing any i/o), for example
the page cache.  The ARC can "squeeze out" any pages included in this
calculation, leaving only `arc_sys_free` (1/64th of RAM) for these
free/evictable pages.

Included in the count of free/evictable pages is
`nr_inactive_anon_pages()`, which is described as "Anonymous memory that
has not been used recently and can be swapped out".  These pages would
have to be written out to disk (swap) in order to evict them, and they
are not included in `/proc/meminfo`'s `MemAvailable`.

Therefore it is not appropriate for `nr_inactive_anon_pages()` to be
included in the free/evictable memory returned by `arc_free_memory()`,
because the ARC shouldn't (intentionally) make the system swap.

This commit removes `nr_inactive_anon_pages()` from the memory returned
by `arc_free_memory()`.  This is a step towards enabling the ARC to
manage free memory by monitoring it and reducing the ARC size as we
notice that there is insufficient free memory (in the `arc_reap_zthr`),
rather than the current method of relying on the `arc_shrinker`
callback.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10575
2020-07-16 10:11:26 -07:00
Matthew Macy 23c871671c
FreeBSD: zfs commands backward compatibility
Update the zfs commands such that they're backwards compatible with
the version of ZFS is the base FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10542
2020-07-15 21:32:50 -07:00
Romain Dolbeau 01a4852ecb
Fix early include of <linux/percpu_compat.h>
Move/add include of <linux/percpu_compat.h> to satisfy missing
requirements.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain@dolbeau.org>
Closes #10568 
Closes #10569
2020-07-15 15:58:15 -07:00
Matthew Ahrens 6774931dfa
Extend zdb to print inconsistencies in livelists and metaslabs
Livelists and spacemaps are data structures that are logs of allocations
and frees.  Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.

These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).

ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab.  However, it
generally halts processing when the error is detected.

When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.

The specific checks added are:

Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]

Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
  spacemap log, then report any double FREEs and double ALLOCs

Verify that livelists are consistenet with spacemaps.  The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes #10515
2020-07-14 17:51:05 -07:00
Alexander Motin 1743c737f5
Fix LOR between dp_config_rwlock and spa_props_lock
Our QE team during automated API testing hit deadlock in ZFS, caused
by lock order reversal.  From one side dsl_sync_task_sync() locks
dp_config_rwlock as writer and calls spa_sync_props(), which waits
for spa_props_lock.  From another spa_prop_get() locks spa_props_lock
and then calls dsl_pool_config_enter(), trying to lock dp_config_rwlock
as reader.

This patch makes spa_prop_get() lock dp_config_rwlock before
spa_props_lock, making the order consistent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #10553
2020-07-14 12:21:57 -07:00
Brian Atkinson e4d3d77684
Fixing gang ABD child removal race condition
On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.

Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10511
2020-07-14 11:04:35 -07:00
Matthew Ahrens e59a377a8f
filesystem_limit/snapshot_limit is incorrectly enforced against root
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit."  Two types of users are allowed to change
the limit:

1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`.  This works properly.

2. A user with elevated system privileges (e.g. root).  This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.

The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process.  This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.

This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged.  Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.

This problem only impacts Linux.  Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8226
Closes #10545
2020-07-11 17:18:02 -07:00