Commit Graph

473 Commits

Author SHA1 Message Date
Rob Norris 6b9de042e6 vdev_disk: fix alignment check when buffer has non-zero starting offset
If a linear buffer spans multiple pages, and the first page has a
non-zero starting offset, the checker would not include the offset, and
so would think there was an alignment gap at the end of the first page,
rather than at the start.

That is, for a 16K buffer spread across five pages with an initial 512B
offset:

    [.XXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXX.]

It would be interpreted as:

    [XXXXXXX.][XXXXXXXX]...

And be rejected as misaligned.

Since it's already a linear ABD, the "linearising" copy would just reuse
the buffer as-is, and the second check would failing, tripping the
VERIFY in vdev_disk_io_rw().

This commit fixes all this by including the offset in the check for
end-of-page alignment.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 52f60c29d4)
2024-05-07 02:15:18 +00:00
Rob Norris c85d31a88b taskq: add taskq_try_dispatch_ent()
Non-blocking form of taskq_dispatch_ent(), returns false if it can't
acquire the taskq lock.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris 2fec57bf03 trace: spa_taskqs_ent trace class, dispatch and dispatched tracepoints
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris 657c1f75ad vdev_disk: don't touch vbio after its handed off to the kernel
After IO is unplugged, it may complete immediately and vbio_completion
be called on interrupt context. That may interrupt or deschedule our
task. If its the last bio, the vbio will be freed. Then, we get
rescheduled, and try to write to freed memory through vbio->.

This patch just removes the the cleanup, and the corresponding assert.
These were leftovers from a previous iteration of vbio_submit() and were
always "belt and suspenders" ops anyway, never strictly required.

Reported-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 34f662ad22206af6852020fd923ceccd836a855f)
2024-04-03 10:22:54 +11:00
Rob Norris 9e1afd0d91 abd_iter_page: don't use compound heads on Linux <4.5
Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages
in a compound page were refcounted separately. This means that using the
head page without taking a reference to it could see it cleaned up later
before we're finished with it. Specifically, bio_add_page() would take a
reference, and drop its reference after the bio completion callback
returns.

If the zio is executed immediately from the completion callback, this is
usually ok, as any data is referenced through the tail page referenced
by the ABD, and so becomes "live" that way. If there's a delay in zio
execution (high load, error injection), then the head page can be freed,
along with any dirty flags or other indicators that the underlying
memory is used. Later, when the zio completes and that memory is
accessed, its either unmapped and an unhandled fault takes down the
entire system, or it is mapped and we end up messing around in someone
else's memory. Both of these are very bad.

The solution on these older kernels is to take a reference to the head
page when we use it, and release it when we're done. There's not really
a sensible way under our current structure to do this; the "best" would
be to keep a list of head page references in the ABD, and release them
when the ABD is freed.

Since this additional overhead is totally unnecessary on 4.5+, where
head and tail pages share refcounts, I've opted to simply not use the
compound head in ABD page iteration there. This is theoretically less
efficient (though cleaning up head page references would add overhead),
but its safe, and we still get the other benefits of not mapping pages
before adding them to a bio and not mis-splitting pages.

There doesn't appear to be an obvious symbol name or config option we
can match on to discover this behaviour in configure (and the mm/page
APIs have changed a lot since then anyway), so I've gone with a simple
version check.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit c6be6ce175)
2024-04-03 10:15:02 +11:00
Rob Norris ce5719554f vdev_disk: use bio_chain() to submit multiple BIOs
Simplifies our code a lot, so we don't have to wait for each and
reassemble them.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 72fd834c47)
2024-04-03 10:12:51 +11:00
Rob Norris ba6d0fb060 vdev_disk: add module parameter to select BIO submission method
This makes the submission method selectable at module load time via the
`zfs_vdev_disk_classic` parameter, allowing this change to be backported
to 2.2 safely, and disabled in favour of the "classic" submission method
if new problems come up.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 2382fdb0a83a5a3c6cf3860695d3f29281773170)
2024-04-03 10:10:20 +11:00
Rob Norris 1b16b90ae1 vdev_disk: rewrite BIO filling machinery to avoid split pages
This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.

### BIO segment limits are set incorrectly

The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`, and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.

Leaving aside the fact that page count method is wrong (see below), not
limiting to the device segment max means that the device driver will
need to split the BIO in half. This is alone is not necessarily a
problem, but it interacts with another issue to cause a much larger
problem.

### BIOs are filled inefficiently

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
In can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This is also
further contributes to the problem caused by the incorrect segment limit
calculation, as its much easier to go past the device limit, and so
require a split.

Again, this is not a problem on its own.

### Incomplete pages are submitted to BIOs

The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.

In practice, this is fine, as most ABDs are scattered, that is a list of
complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:

     -- 4K -- -- 4K -- -- 4K -- -- 4K --
    |        |        |        |        |
          :## ######## ######## ######:    [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

    [1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).

---

In ideal circumstances, these shortcomings don't cause any particular
problems. Its when they start to interact with other ZFS features that
things get interesting.

### Aggregation

Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.

Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:

    [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up either having an
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.

### Gang blocks

If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.

Each gang header is a separate 512-byte memory allocation from a slab,
that needs to be written down to disk. When the gang header is added to
the BIO, its a single 512-byte segment.

### Aggregation with gang blocks

Pulling all this together, consider a large aggregated write of gang
blocks. This results a BIO containing lots of 512-byte segments. Given
our tendency to overfill the BIO, a split is likely, and most possible
split points will yield a pair of BIOs that are misaligned. Drivers that
care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.

### Configure maximum segment size for device

`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.

### ABDs checked up front for page count and alignment

We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point with creating a misaligned BIO. Along the
way, we determine how many BIO segments we'll need to handle the entire
ABD, accounting for BIO fill limits (including segment and byte limits).

If the pages in the BIO are not usable (as in any of the above
situations), the ABD is linearised, and then checked again. This is the
same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's
variable page size and allocator quirks.

In the end, a count of segments is produced, which is then used to
determine how many BIOs will be allocated.

### Virtual block IO object

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 588a6a2d38f20cd6e0d458042feda1831b302207)
2024-04-03 10:10:12 +11:00
Rob Norris 21f6808780 vdev_disk: make read/write IO function configurable
This is just setting up for the next couple of commits, which will add a
new IO function and a parameter to select it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 7ee83696cffac172eea89844ccc5e6b6899781ac)
2024-04-03 10:10:08 +11:00
Rob Norris 1feddf1ed6 vdev_disk: reorganise vdev_disk_io_start
Light reshuffle to make it a bit more linear to read and get rid of a
bunch of args that aren't needed in all cases.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit ad847ff6acb77fbba0f3ab2e864784225fd41007)
2024-04-03 10:10:04 +11:00
Rob Norris d00ab549d4 vdev_disk: rename existing functions to vdev_classic_*
This is just renaming the existing functions we're about to replace and
grouping them together to make the next commits easier to follow.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 9bf6a7c8c3bdcc4e5975fa5baf6e9ff6f279a553)
2024-04-03 10:09:57 +11:00
Rob Norris edf4cf0ce7 abd: add page iterator
The regular ABD iterators yield data buffers, so they have to map and
unmap pages into kernel memory. If the caller only wants to count
chunks, or can use page pointers directly, then the map/unmap is just
unnecessary overhead.

This adds adb_iterate_page_func, which yields unmapped struct page
instead.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 930b785c61e9724f0a3a0e09571032ed397f368c)
2024-04-03 10:09:53 +11:00
Richard Yao 609550bc5f Convert enum zio_flag to uint64_t
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14086
2023-09-07 18:54:12 +00:00
Rob Norris d6d343bb87 Revert debug code from bio-seg-limit
This patch reverts parts of bf08bc108dce6f1ecd0820c8b5a67b6fb7962c7e.
2023-08-10 19:10:59 +00:00
Rob Norris bf18873541 bio-seg-limit debug
Various bits of output for catching broken bios.

(cherry picked from commit b1a5bc49acce3cbec56f3bf0638539f836aa2208)
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-31 15:05:56 +00:00
Rob Norris 74e8091130 abd_bio_map_off: avoid splitting scatter pages across bios
This is the same change as the previous commit, but for scatter abds.

Its less clear if this change is needed. Since scatter abds are only
ever added a page at time, both sides of the split should always be
added in consecutive segments.

Intuitively though, it may be possible for a partially-filled bio to be
used, or a bio with an odd number of iovecs, and that then could lead to
a misaligned bio. While I've not been able to reproduce this at all, it
seems to make sense to protect against it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit cbdf21fd1a32a5e696a22cad497d9211221fa309)
2023-07-31 15:05:56 +00:00
Rob Norris 1bd475d6c6 bio_map: avoid splitting abd pages across bios
If we encounter a split page, we add two iovecs to the bio, one for the
fragment of the buffer on each side of the split. In order to do this
safely, we must be sure that we always have room for both fragments.

Its possible for a linear abd to have multiple pages, in which case we
want to add the "left" fragment, then a run of proper 4K pages. then
then "right" fragment. In this way we can keep whole pages together as
much possible.

This change handles both cases by noticing a split page. If we don't
have at least two iovecs remaining in the bio, then we abort outright
(allowing the caller to allocate a new bio and try again). We add the
"left" fragment, and note how big we expect the right fragment to be.
Then we load in as many full pages as are available.

When we reach the last iovec, we close out the bio by taking as uch as
is necessary to restore alignment.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 173cafcc3d8b6c94c61844c705d7a410f412a18e)
2023-07-31 15:05:56 +00:00
Rob Norris 98bcc390e8 vdev_disk: rework bio max segment calculation
A single "page" in an ABD does not necessarily correspond to one segment
in a bio, because of how ZFS does ABD allocations and how it breaks them
up with adding them to a bio. Because of this, simply dividing the ABD
size by the page size can only ever give a minimum number of segments
required, rather than the correct number.

Until we can fix that, we'll just make each bio as large as they can be
for as many segments as the device queue will permit without needing to
split the the bio. This is a little wasteful if we don't intend to put
that many segments in the bio, but its not a lot of memory and its only
lost until the bio is completed.

This also adds a tuneable, vdev_disk_max_segs, to allow setting this
value to be set by the operator. This is very useful for debugging.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit a3a438d1bedb0626417cd73ba10b1479a06bef7f)
2023-07-31 15:05:56 +00:00
Allan Jude 066532da51 Add module parameter to block 0 byte writes
Some hardware has issues when issues a write of 0 bytes
Add a new module paramter, zio_suppress_zero_writes
That when enabled (default) will just complete these I/Os
without sending them to the hardware.

Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
Allan Jude 2284c4d200 Add module parameter to block 0 byte writes
Some hardware has issues when issues a write of 0 bytes
Add a new module paramter, zio_suppress_zero_writes
That when enabled (default) will just complete these I/Os
without sending them to the hardware.

Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
Rob Norris 2724bcb3d6 zil: allow the ZIL to fail and restart independently of the pool
zil_commit() has always returned void, and thus, cannot fail. Everything
inside it assumed that if anything ever went wrong, it could fall back
on txg_wait_synced() until the txg covering the operations being flushed
from the ZIL has fully committed. This meant that if the pool failed and
failmode=continue was set, syncing operations like fsync() would still
block.

Unblocking zil_commit() means largely the same approach. The difficulty
is that the ZIL carries the record of uncommitted VFS operations (vs the
changed data), and attached to those, callbacks and cvs that will
release userspace callers once the data is on disk. So if we can't write
the ZIL, we also can't release those records until the data is on disk.

This wasn't a problem before, because the zil_commit() would block. If
we change zil_commit() to return error, we still need to track those
entries until the data they represent hits the disk. We also need to
accept new records; just because the ZIL fails may not necessarily mean
the pool itself is unavailable.

This commit reorganises the ZIL to allow zil_commit() to return failure.
If ZIL writes or flushes fail, the ZIL is moved into a "failed" state,
and no further writes are done; all zil_commit() calls are serviced by
the regular txg mechanism. Outstanding records (itx_ts) are held until
the main pool writes their associated txg out. The records are then
released. Once all records are cleared, the ZIL is reset and reopened.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit af821006f6602261e690fe6635689cabdeefcadf)
2023-07-05 13:27:31 +00:00
Rob Norris 7b7af8ba02 vnops: thread DMU_TX_ASSIGN_CONTINUE to a bunch of vnops
These are ones that I'm reasonably sure connect to a real syscall and
have a reasonable error response.

I've left stuff like `dirty_inode`, `zfs_inactive`, etc, which are
internal kernel housekeeping things, as well as anything that looks like
it belongs to zvols, ioctls, admin commands, etc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 39c2801c611e27b521d716fea8f771307820362e)
2023-07-05 13:27:30 +00:00
Rob Norris 48a48059c7 dmu: rename dmu_tx_assign flags
Their names clash with those for txg_wait_synced_tx, and they aren't
directly compatible, leading to confusion.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 1f0fb1dae7c1e84de3b39e669e09b8b3d5b80b87)
2023-07-05 13:27:30 +00:00
Rob Norris 3aea149bf8 linux: reject syncing ops if the filesystem is unmounting
The kernel can call these during unmount, so we have to handle them
directly to prevent any further IO being issued.

zfs_fsync reorganised slightly to not set up zfs_fsyncer_key until after
the teardown lock is acquired, just in case we don't get it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 900c26570ddcdd1d3ca135e6aee5df6456f6bfd6)
2023-07-05 13:27:30 +00:00
Mariusz Zaborski 40a9efd0e8 zfs: support force exporting pools
This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.

Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will
  be alerted if their txg can't be committed.  This is primarily of
  interest for callers that would normally pass TXG_WAIT, but don't want
  to wait if the pool becomes suspended, which allows unwinding in some
  cases, specifically when one is attempting a non-forced export.
  Without this, the non-forced export would preclude a forced export
  by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually
  being force exported.  Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset initiator which may be set on an objset being forcibly
  exported / unmounted.
- SPA export initiator may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the
  SPA export initiator being able to enumerate datasets and closing any
  send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to
  fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any
  changes that were not able to be synced out.
- Linux specific: introduce a new tunable,
  zfs_forced_export_unmount_enabled, which allows the filesystem to
  remain in a modified 'unmounted' state upon exiting zpl_umount_begin,
  to achieve parity with FreeBSD and illumos,
  which have VFS-level support for yanking filesystems out from under
  users.  However, this only helps when the user is actively performing
  I/O, while not sitting on the filesystem.  In particular, this allows
  test #3 below to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead
  of crashing due to lack of config, etc.

Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress

This change modifies the libzfs ABI:
- New ZPOOL_STATUS_FORCE_EXPORTING zpool_status_t enum value.
- New field libzfs_force_export for libzfs_handle.

Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by:  Klara, Inc.
Sponsored-by:  Catalogics, Inc.
Sponsored-by:  Wasabi Technology, Inc.
Closes #3461
(cherry picked from commit 852e633772217d779a63e8c46fe3c5f81dd8960e)
2023-07-05 13:27:30 +00:00
Brian Behlendorf fec407fb69 Linux 5.19 compat: aops->read_folio()
As of the Linux 5.19 kernel the readpage() address space operation
has been replaced by read_folio().

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-06-01 14:24:49 -07:00
Brian Behlendorf 7ae5ea8864 Linux 5.19 compat: blkdev_issue_secure_erase()
Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure
erase functionality from the blkdev_issue_discard() function.
The blkdev_issue_secure_erase() must now be issued to issue
a secure erase.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-06-01 14:24:49 -07:00
Brian Behlendorf 048301b6dc Linux 5.19 compat: bdev_max_secure_erase_sectors()
Linux 5.19 commit torvalds/linux@44abff2c0 removed the
blk_queue_secure_erase() helper function.  The preferred
interface is to now use the bdev_max_secure_erase_sectors()
function to check for discard support.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-06-01 14:24:49 -07:00
Brian Behlendorf 9ce5eb18ef Linux 5.19 compat: bdev_max_discard_sectors()
Linux 5.19 commit torvalds/linux@70200574cc removed the
blk_queue_discard() helper function.  The preferred interface
is to now use the bdev_max_discard_sectors() function to check
for discard support.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-06-01 14:24:49 -07:00
Brian Behlendorf 5a639f0802 Linux 5.18 compat: bio_alloc()
As for the Linux 5.18 kernel bio_alloc() expects a block_device struct
as an argument.  This removes the need for the bio_set_dev() compatibility
code for 5.18 and newer kernels.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-06-01 14:24:49 -07:00
hping b28c0c4bf8 abd_os: remove redundant refcount creation for abd_children
Refcount creation for abd_zero_scatter->abd_children is redundant in
abd_alloc_zero_scatter, as it has been done in abd_init_struct.

In addition, abd_children is undefined when ZFS_DEBUG is disabled, the
reference of abd_children in abd_alloc_zero_scatter breaks build of
libzpool when ZFS_DEBUG is disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Ping Huang <huangping@smartx.com>
Closes #13429
2022-05-20 10:33:24 -07:00
Aidan Harris eee389ba2e Fix functions without a prototype
clang-15 emits the following error message for functions without
a prototype:

fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error:
  a function declaration without a prototype is deprecated
  in all versions of C [-Werror,-Wstrict-prototypes]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aidan Harris <me@aidanharr.is>
Closes #13421
2022-05-20 10:33:24 -07:00
Mateusz Guzik 2c5c8bb0a6 FreeBSD: use zero_region instead of allocating a dedicated page
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #13406
2022-05-20 10:33:24 -07:00
szubersk 756c3e085b autoconf: Fail when __copy_from_user_inatomic is a non-GPL symbol
A followup to 849c14e048
Fix https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1009242

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #13389
2022-05-20 10:33:24 -07:00
Damian Szuberski 13b1f336d3 PPC get_user workaround
Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic()
inline helpers very indirectly include a reference to the GPL'd
array mmu_feature_keys[] and fails to build. Workaround this by
using copy_from_user() and throwing EFAULT for any calls to
__copy_from_user_inatomic(). This is a workaround until a fix
for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458
"powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #11958
Closes #12590
Closes #13367
2022-05-20 10:33:24 -07:00
Brian Atkinson 60fc173251 Adding ZERO_PAGE detection
On some architectures ZERO_PAGE is unavailable because it references
a GPL exported symbol of empty_zero_page. Originally e08b993 removed
the call to PAGE_ZERO(0) for assignment to the abd_zero_page. However,
a simple check can be done to avoid a kernel allocation and free for
the abd_zero_page if ZERO_PAGE is available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #13199
2022-05-20 10:33:24 -07:00
Ka Ho Ng 1f31889046 FreeBSD: Implement hole-punching support
This adds supports for hole-punching facilities in the FreeBSD kernel
starting from __FreeBSD_version 1400032.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ka Ho Ng <khng@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #12458
2022-05-17 11:15:29 -07:00
наб ecec151c14 module: zfs: freebsd: fix unused, remove argsused
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12844
2022-05-02 15:42:58 -07:00
наб a4f582f0b6 FreeBSD: remove unused variable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #12899
2022-05-02 15:42:58 -07:00
наб b8e1366ee6 zvol: remove unused variable
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12917
2022-05-02 15:42:58 -07:00
Brian Behlendorf 49c1346c10
Linux 5.18 compat: replace __set_page_dirty_nobuffers
Replace __set_page_dirty_nobuffers with filemap_dirty_folio.

Upstream-commit: 6b1f86f8e9c7f9de7ca1cb987b2cf25e99b1ae3a
("Merge tag 'folio-5.18b' of
git://git.infradead.org/users/willy/pagecache ")

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Satadru Pramanik <satadru@gmail.com>
Signed-off-by: Satadru Pramanik <satadru@gmail.com>
Closes #13325
Closes #13380
2022-04-28 15:17:38 -07:00
Brian Behlendorf 71cd3726c0
Fix O_APPEND for Linux 3.15 and older kernels
When using a Linux kernel which predates the iov_iter interface the
O_APPEND flag should be applied in zpl_aio_write() via the call to
generic_write_checks().  The updated pos variable  was incorrectly
ignored resulting in the current offset being used.

This issue should only realistically impact the RHEL/CentOS 7.x
kernels which are based on Linux 3.10.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13370 
Closes #13377
2022-04-28 15:15:28 -07:00
наб 642426095a Linux 5.18 compat: kobj_type.default_attrs replaced with default_groups
Upstream-commit: cdb4f26a63c391317e335e6e683a614358e70aeb ("kobject:
 kobj_type: remove default_attrs")
Upstream-commit: 0cdda2edb3
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13357
2022-04-25 10:00:09 -07:00
Alexander Motin 972637dc06 FreeBSD: Fix translation from ABD to physical pages.
In hypothetical case of non-linear ABD with single segment, multiple
to page size but not aligned to it, vdev_geom_fill_unmap_cb() could
fill one page less into bio_ma array.

I am not sure it is expoitable, but better to be safe than sorry.

Reported-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit 5352f85cdd)
2022-04-21 16:59:09 -07:00
Rich Ercolani c220771a47 Corrected oversight in ZERO_RANGE behavior
It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do
have differing semantics in some ways - in particular,
one requires KEEP_SIZE, and the other does not.

Also added a zero-range test to catch this, corrected a flaw
that made the punch-hole test succeed vacuously, and a typo
in file_write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #13329 
Closes #13338
2022-04-21 16:58:07 -07:00
Brian Behlendorf aa1c3c1d1d Linux 5.17 compat: GENHD_FL_EXT_DEVT / GENHD_FL_NO_PART_SCAN
As of the 5.17 kernel the GENHD_FL_EXT_DEVT flag has been removed
and the GENHD_FL_NO_PART_SCAN flag renamed GENHD_FL_NO_PART. Update
zvol_alloc() to set GENHD_FL_NO_PART for the newer kernels which
is sufficient.  The behavior for prior kernels remains unchanged.

1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT")
46e7eac6 ("block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART")

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13294
Closes #13297
2022-04-20 13:44:19 -07:00
Mark Johnston b7546f92ea FreeBSD: Return Mach error codes from VOP_(GET|PUT)PAGES
FreeBSD's memory management system uses its own error numbers and gets
confused when these VOPs return EIO.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #13311
2022-04-19 10:42:54 -07:00
Mark Johnston e9cd90f6e5 FreeBSD: Parameterize ZFS_ENTER/ZFS_VERIFY_VP with an error code
For legacy reasons, a couple of VOPs have to return error numbers that
don't come from the usual errno namespace.  To handle the cases where
ZFS_ENTER or ZFS_VERIFY_ZP fail, we need to be able to override the
default error return value of EIO.  Extend the macros to permit this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #13311
2022-04-19 10:42:54 -07:00
Riccardo Schirone 35ddd8ee2e Linux 5.18 compat: use address_space_operations->readahead
->readpages was removed and replaced by ->readahead. Define
zpl_readahead for kernels that don't have ->readpages.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Riccardo Schirone <rschirone91@gmail.com>
Closes #13278
2022-04-06 13:15:27 -07:00
Riccardo Schirone 10a9f5fc47 Linux 5.18 compat: blkg_tryget is moved to private headers
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Riccardo Schirone <rschirone91@gmail.com>
Closes #13278
2022-04-06 13:15:27 -07:00