Commit Graph

1057 Commits

Author SHA1 Message Date
Chunwei Chen 7043281906 Linux 4.7 compat: use iterate_shared for concurrent readdir
Register iterate_shared if it exists so the kernel will used shared
lock and allowing concurrent readdir.

Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE
to allow concurrent seeking.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4664
Closes #4665
2016-09-05 16:07:08 -07:00
Chunwei Chen 1aff4bb235 Linux 4.7 compat: replace blk_queue_flush with blk_queue_write_cache
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4665
2016-09-05 16:07:08 -07:00
Chunwei Chen 703c9f5893 Remove dummy znode from zvol_state
struct zvol_state contains a dummy znode, which is around 1KB on x64,
only for zfs_range_lock. But in reality, other than z_range_lock and
z_range_avl, zfs_range_lock only need znode on regular file, which
means we add 1KB on a structure and gain nothing.

In this patch, we remove the dummy znode for zvol_state. In order to
do that, we also need to refactor zfs_range_lock a bit. We move
z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t.
This new struct replaces znode_t as the main handle inside the range
lock functions.

We also add pointers to z_size, z_blksz, and z_max_blksz so range lock
code doesn't depend on znode_t.  This allows non-ZPL consumers like
Lustre to use the range locks with their equivalent znode_t structure.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4510
2016-09-05 16:07:08 -07:00
Brian Behlendorf efde19487c Fix ztest truncated cache file
Commit efc412b updated spa_config_write() for Linux 4.2 kernels to
truncate and overwrite rather than rename the cache file.  This is
the correct fix but it should have only been applied for the kernel
build.  In user space rename(2) is needed because ztest depends on
the cache file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4129
2016-09-05 16:07:08 -07:00
AndCycle 347cdb6e61 Obey arc_meta_limit default size when changing arc_max
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit.  Otherwise, the arc_meta_limit
may be set the same as arc_max.

Signed-off-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4001
2016-09-05 16:07:08 -07:00
Tim Chase 52475b507a Enable PF_FSTRANS for ioctl secpolicy callbacks (#4571)
At the very least, the zfs_secpolicy_write_perms ioctl security policy
callback, which calls dsl_dataset_hold(), can require freeing memory and,
therefore, re-enter ZFS.  This patch enables PF_FSTRANS for all of the
security policy callbacks similarly to the manner in which it's enabled
for the actual ioctl callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4554
2016-05-06 18:22:34 -04:00
Brian Behlendorf 2cb77346cb Use udev for partition detection
When ZFS partitions a block device it must wait for udev to create
both a device node and all the device symlinks.  This process takes
a variable length of time and depends on factors such how many links
must be created, the complexity of the rules, etc.  Complicating
the situation further it is not uncommon for udev to create and
then remove a link multiple times while processing the udev rules.

Given the above, the existing scheme of waiting for an expected
partition to appear by name isn't 100% reliable.  At this point
udev may still remove and recreate think link resulting in the
kernel modules being unable to open the device.

In order to address this the zpool_label_disk_wait() function
has been updated to use libudev.  Until the registered system
device acknowledges that it in fully initialized the function
will wait.  Once fully initialized all device links are checked
and allowed to settle for 50ms.  This makes it far more likely
that all the device nodes will exist when the kernel modules
need to open them.

For systems without libudev an alternate zpool_label_disk_wait()
was updated to include a settle time.  In addition, the kernel
modules were updated to include retry logic for this ENOENT case.
Due to the improved checks in the utilities it is unlikely this
logic will be invoked.  However, if the rare event it is needed
it will prevent a failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #4523
Closes #3708
Closes #4077
Closes #4144
Closes #4214
Closes #4517
2016-05-06 18:22:34 -04:00
Chunwei Chen 21ea9460fa Remove wrong ASSERT in annotate_ecksum
When using large blocks like 1M, there will be more than UINT16_MAX qwords in
one block, so this ASSERT would go off. Also, it is possible for the histogram
to overflow. We cap them to UINT16_MAX to prevent this.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4257
2016-05-01 18:29:51 -04:00
Colin Ian King 354424de5a Add support 32 bit FS_IOC32_{GET|SET}FLAGS compat ioctls
We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS
compat ioctls for systems such as powerpc64.  We use the normal
compat ioctl idiom as used by a variety of file systems to provide
this support.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4477
2016-05-01 18:26:09 -04:00
Brian Behlendorf d746e2ea0e Linux 4.6 compat: PAGE_CACHE_SIZE removal
As described in torvalds/linux@4a2d057e the macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced
to make it possible to add bigger chunks to the page cache.  This
never panned out and it has therefore been removed from the kernel.

ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros
and calls to page_cache_release() have been replaced with put_page().

There was no need to introduce a configure check for this because
these interfaces have existed for a very long time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4489
2016-05-01 18:24:30 -04:00
Colin Ian King 60a4ea3f94 Fix inverted logic on none elevator comparison
Commit d1d7e2689d ("cstyle: Resolve C style issues") inverted
the logic on the none elevator comparison.  Fix this and make it
cstyle warning clean.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4507
2016-05-01 18:23:29 -04:00
Ned Bass 6400ae85ee Fix ZPL miswrite of default POSIX ACL
Commit 4967a3e introduced a typo that caused the ZPL to store the
intended default ACL as an access ACL. Due to caching this problem
may not become visible until the filesystem is remounted or the inode
is evicted from the cache. Fix the typo.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4520
2016-05-01 18:19:54 -04:00
Chunwei Chen 31dbe4b404 Linux 4.5 compat: Use xattr_handler->name for acl
Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to
whole name rather than prefix should use "name" instead of "prefix".
Otherwise, kernel will return with EINVAL when it tries to resolve handlers.

Also, we remove the strcmp checks when xattr_handler has name, because
xattr_resolve_name will do the check for us.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4549
Closes #4537
2016-05-01 18:17:57 -04:00
Boris Protopopov d0337e80ca Fix lock order inversion with zvol_open()
zfsonlinux issue #3681 - lock order inversion between zvol_open() and
dsl_pool_sync()...zvol_rename_minors()

Remove trylock of spa_namespace_lock as it is no longer needed when
zvol minor operations are performed in a separate context with no
prior locking state; the spa_namespace_lock is no longer held
when bdev->bd_mutex or zfs_state_lock might be taken in the code
paths originating from the zvol minor operation callbacks.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3681
2016-03-22 18:08:04 -07:00
Boris Protopopov 9d02f9557f Add support for asynchronous zvol minor operations
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset

zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()

Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.

* Each taskq must be single threaded to ensure tasks are always
  processed in the order in which they were dispatched.

* There is a taskq per-pool in order to keep the pools independent.
  This way if one pool is suspended it will not impact another.

* The preferred location to dispatch a zvol minor task is a sync
  task.  In this context there is easy access to the spa_t and
  minimal error handling is required because the sync task must
  succeed.

Support for asynchronous zvol minor operations address issue #3681.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2217
Closes #3678
Closes #3681
2016-03-22 18:08:04 -07:00
Boris Protopopov 682be6e0c9 Make zvol minor functionality more robust
Close the race window in zvol_open() to prevent removal of
zvol_state in the 'first open' code path. Move the call to
check_disk_change() under zvol_state_lock to make sure the
zvol_media_changed() and zvol_revalidate_disk() called by
check_disk_change() are invoked with positive zv_open_count.

Skip opened zvols when removing minors and set private_data
to NULL for zvols that are not in use whose minors are being
removed, to indicate to zvol_open() that the state is gone.
Skip opened zvols when renaming minors to avoid modifying
zv_name that might be in use, e.g. in zvol_ioctl().

Drop zvol_state_lock before calling add_disk() when creating
minors to avoid deadlocks with zvol_open().

Wrap dmu_objset_find() with spl_fstran_mark()/unmark().

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4344
2016-03-22 17:54:07 -07:00
Richard Sharpe 82ff881071 Handling negative dentries in a CI file system.
For a Case Insensitive file system we must avoid creating negative
entries in the dentry cache. We must also pass the FIGNORECASE into
zfs_lookup so that special files are handled correctly.

We must also prevent negative dentries from being created when files are
unlinked.

Tested by running fsstress from LTP (10 loops, 10 processes, 10,000 ops.)

Also tested with printks (now removed) to ensure that lookups come to
zpl_lookup when negative should not exist.

Tests:
1.   ls Some-file.txt; touch some-file.txt; ls Some-file.txt
  and ensure no errors.

2.   touch Some-file.txt; rm some-file.txt; ls Some-file.txt
  and ensure that the last ls shows log messages showing the lookup
  went all the way to zpl_lookup.

Thanks to tuxoko for helping me get this correct.

Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4243
2016-03-22 17:54:07 -07:00
Richard Sharpe a99c845fdc Fix casesensitivity=insensitive deadlock
When casesensitivity=insensitive is set for the
file system, we can deadlock in a rename if the user uses different case
for each path. For example rename("A/some-file.txt", "a/some-file.txt").

The simple test for this is:

1. mkdir some-dir in a ZFS file system
2. touch some-dir/some-file.txt
3. mv Some-dir/some-file.txt some-dir/some-other-file.txt

This last request deadlocks trying to relock the i_mutex on the inode for
the parent directory.

The solution is to use d_add_ci in zpl_lookup if we are on a file system
that has the casesensitivity=insensitive attribute set.

This patch checks if we are working on a case insensitive file system and if
so, allocates storage for the case insensitive name and passes it to
zfs_lookup and then calls d_add_ci instead of d_splice_alias.

The performance impact seems to be minimal even though we have introduced a
kmalloc and kfree in the lookup path.

The problem was found when running Microsoft's FSCT against Samba on top of
ZFS On Linux.

Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4136
2016-03-22 17:54:07 -07:00
Paul Dagnelie 63ce7b6fcf Illumos 6370 - ZFS send fails to transmit some holes
6370 ZFS send fails to transmit some holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Stefan Ring <stefanrin@gmail.com>
Reviewed by: Steven Burgess <sburgess@datto.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6370
  https://github.com/illumos/illumos-gate/commit/286ef71

In certain circumstances, "zfs send -i" (incremental send) can produce
a stream which will result in incorrect sparse file contents on the
target.

The problem manifests as regions of the received file that should be
sparse (and read a zero-filled) actually contain data from a file that
was deleted (and which happened to share this file's object ID).

Note: this can happen only with filesystems (not zvols, because they do
not free (and thus can not reuse) object IDs).

Note: This can happen only if, since the incremental source (FromSnap),
a file was deleted and then another file was created, and the new file
is sparse (i.e. has areas that were never written to and should be
implicitly zero-filled).

We suspect that this was introduced by 4370 (applies only if hole_birth
feature is enabled), and made worse by 5243 (applies if hole_birth
feature is disabled, and we never send any holes).

The bug is caused by the hole birth feature. When an object is deleted
and replaced, all the holes in the object have birth time zero. However,
zfs send cannot tell that the holes are new since the file was replaced,
so it doesn't send them in an incremental. As a result, you can end up
with invalid data when you receive incremental send streams. As a
short-term fix, we can always send holes with birth time 0 (unless it's
a zvol or a dataset where we can guarantee that no objects have been
reused).

Ported-by: Steven Burgess <sburgess@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4369
Closes #4050
2016-03-15 10:01:48 -07:00
Brian Behlendorf 9842008fc0 Linux 4.5 compat: xattr list handler
The registered xattr .list handler was simplified in the 4.5 kernel
to only perform a permission check.  Given a dentry for the file it
must return a boolean indicating if the name is visible.  This
differs slightly from the previous APIs which also required the
function to copy the name in to the provided list and return its
size.  That is now all the responsibility of the caller.

This should be straight forward change to make to ZoL since we've
always required the caller to make the copy.  However, this was
slightly complicated by the need to support 3 older APIs.  Yes,
between 2.6.32 and 4.5 there are 4 versions of this interface!

Therefore, while the functional change in this patch is small it
includes significant cleanup to make the code understandable and
maintainable.  These changes include:

- Improved configure checks for .list, .get, and .set interfaces.
  - Interfaces checked from newest to oldest.
  - Strict checking for each possible known interface.
  - Configure fails when no known interface is available.
  - HAVE_*_XATTR_LIST renamed HAVE_XATTR_LIST_* for consistency
    with similar iops and fops configure checks.

- POSIX_ACL_XATTR_{DEFAULT|ACCESS} were removed forcing callers to
  move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT|ACCESS}.
  Compatibility wrapper were added for old kernels.

- ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing
  ZPL_XATTR_{GET|SET} WRAPPERs.  Only the inode is guaranteed to be
  a valid pointer, passing NULL for the 'list' and 'name' variables
  is allowed and must be checked for.  All .list functions were
  updated to use the wrapper to aid readability.

- zpl_xattr_filldir() updated to use the .list function for its
  permission check which is consistent with the updated Linux 4.5
  interface.  If a .list function is registered it should return 0
  to indicate a name should be skipped, if there is no registered
  function the name will be added.

- Additional documentation from xattr(7) describing the correct
  behavior for each namespace was added before the relevant handlers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
2016-01-29 09:52:13 -08:00
Brian Behlendorf b3c9e2caf5 Linux 4.5 compat: get_link() / put_link()
The follow_link() interface was retired in favor of get_link().
In the process of phasing in get_link() the Linux kernel went
through two different versions.  The first of which depended
on put_link() and the final version on a delayed done function.

- Improved configure checks for .follow_link, .get_link, .put_link.
  - Interfaces checked from newest to oldest.
  - Strict checking for each possible known interface.
  - Configure fails when no known interface is available.

- Both versions .get_link are detected and supported as well
  two previous versions of .follow_link.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
2016-01-29 09:52:13 -08:00
Tim Chase 6f7acfc9c9 Prevent arc_c collapse
Adjusting arc_c directly is racy because it can happen in the context
of multiple threads.  It should always be >= 2 * maxblocksize.  Set it
to a known valid value rather than adjusting it directly.

In addition refactor arc_shrink() to a simpler structure, protect against
underflow in the calculation of the new arc_c value.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reverts: 935434ef
Closes: #3904
Closes: #4161
2016-01-29 09:52:12 -08:00
Chunwei Chen a08add3067 Prevent duplicated xattr between SA and dir
When replacing an xattr would cause overflowing in SA, we would fallback
to xattr dir. However, current implementation don't clear the one in SA,
so we would end up with duplicated SA.

For example, running the following script on an xattr=sa filesystem
would cause duplicated "user.1".

-- dup_xattr.sh begin --
randbase64()
{
        dd if=/dev/urandom bs=1 count=$1 2>/dev/null | openssl enc -a -A
}

file=$1
touch $file
setfattr -h -n user.1 -v `randbase64 5000` $file
setfattr -h -n user.2 -v `randbase64 20000` $file
setfattr -h -n user.3 -v `randbase64 20000` $file
setfattr -h -n user.1 -v `randbase64 20000` $file
getfattr -m. -d $file
-- dup_xattr.sh end --

Also, when a filesystem is switch from xattr=sa to xattr=on, it will
never modify those in SA. This would cause strange behavior like, you
cannot delete an xattr, or setxattr would cause duplicate and the result
would not match when you getxattr.

For example, the following shell sequence.

-- shell begin --
$ sudo zfs set xattr=sa pp/fs0
$ touch zzz
$ setfattr -n user.test -v asdf zzz
$ sudo zfs set xattr=on pp/fs0
$ setfattr -x user.test zzz
setfattr: zzz: No such attribute
$ getfattr -d zzz
user.test="asdf"
$ setfattr -n user.test -v zxcv zzz
$ getfattr -d zzz
user.test="asdf"
user.test="asdf"
-- shell end --

We fix this behavior, by first finding where the xattr resides before
setxattr. Then, after we successfully updated the xattr in one location,
we will clear the other location. Note that, because update and clear
are not in single tx, we could still end up with duplicated xattr. But
by doing setxattr again, it can be fixed.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3472
Closes #4153
2016-01-29 09:52:12 -08:00
Brian Behlendorf 0a2f95748d Close possible zfs_znode_held() race
Check if the lock is held while holding the z_hold_locks() lock.
This prevents a possible use-after-free bug for callers which are
not holding the lock.  There currently are no such callers so this
can't cause a problem today but it has been fixed regardless.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4244
Issue #4124
2016-01-29 09:52:12 -08:00
Brian Behlendorf 3b9fd93d0b Fix zsb->z_hold_mtx deadlock
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexs and AVLs trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4106
2016-01-29 09:52:12 -08:00
Brian Behlendorf 05c3401e3f Add zfs_object_mutex_size module option
Add a zfs_object_mutex_size module option to facilitate resizing the
the per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in #4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #4106
2016-01-29 09:52:12 -08:00
Ned Bass a5dae61721 Prevent SA length overflow
The function sa_update() accepts a 32-bit length parameter and
assigns it to a 16-bit field in sa_bulk_attr_t, potentially
truncating the passed-in value. This could lead to corrupt system
attribute (SA) records getting written to the pool. Add a VERIFY to
sa_update() to detect cases where overflow would occur. The SA length
is limited to 16-bit values by the on-disk format defined by
sa_hdr_phys_t.

The function zfs_sa_set_xattr() is vulnerable to this bug if the
unpacked nvlist of xattrs is less than 64k in size but the packed
size is greater than 64k. Fix this by appropriately checking the
size of the packed nvlist before calling sa_update(). Add error
handling to zpl_xattr_set_sa() to keep the cached list of SA-based
xattrs consistent with the data on disk.

Lastly, zfs_sa_set_xattr() calls dmu_tx_abort() on an assigned
transaction if sa_update() returns an error, but the DMU only allows
unassigned transactions to be aborted. Wrap the sa_update() call in a
VERIFY0, remove the transaction abort, and call dmu_tx_commit()
unconditionally. This is consistent practice with other callers
of sa_update().

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4150
2016-01-29 09:41:14 -08:00
Chunwei Chen d621aa5431 Make xattr dir truncate and remove in one tx
We need truncate and remove be in the same tx when doing zfs_rmnode on xattr
dir. Otherwise, if we truncate and crash, we'll end up with inconsistent zap
object on the delete queue. We do this by skipping dmu_free_long_range and let
zfs_znode_delete to do the work.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4114
Issue #4052
Issue #4006
Issue #3018
Issue #2861
2015-12-30 16:13:37 -08:00
Chunwei Chen 19d991a99e Fix empty xattr dir causing lockup
During zfs_rmnode on a xattr dir, if the system crash just after
dmu_free_long_range, we would get empty xattr dir in delete queue. This would
cause blkid=0 be passed into zap_get_leaf_byblk when doing zfs_purgedir during
mount, and would try to do rw_enter on a wrong structure and cause system
lockup.

We fix this by returning ENOENT when blkid is zero in zap_get_leaf_byblk.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4114
Closes #4052
Closes #4006
Closes #3018
Closes #2861
2015-12-30 16:13:30 -08:00
Chunwei Chen 0bf37725f8 Fix fail path in zfs_znode_alloc
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().

Since we never call insert_inode_locked in fail path, I_NEW is never set, the
inode is never hashed. So unlock_new_inode() can be safely remove it.

We set z_sa_hdl to NULL in fail path so that iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent double sa_handle_destroy().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
2015-12-23 17:37:20 -08:00
Brian Behlendorf 3445a340d5 Fix z_xattr_lock/z_teardown_lock inversion
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock.  Resolve this by taking the z_teardown_lock in
all registered xattr callbacks prior to taking the z_xattr_lock.
This ensures the locks are always taken is the same order thus
preventing a deadlock.  Note the z_teardown_lock is taken again
in zfs_lookup() and this is safe because the z_teardown lock is
a re-entrant read reader/writer lock.

* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
  __zpl_xattr_get
    zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro

* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
  zfs_resume_fs
    zfs_rezget -> Takes zp->z_xattr_lock

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #3943
Closes #3969
Closes #4121
2015-12-23 17:37:08 -08:00
Brian Behlendorf a0dba38cd4 Follow 0/-E convention for module load errors
Because errors during module load are so rare it went unnoticed that
it was possible that a positive errno was returned.  This would result
in the module being loaded, nothing being initialized, and a system
panic shortly thereafter.  This is what was causing the hard failures
in the automated testing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-23 17:29:35 -08:00
tuxoko a3fcc7c48b Fix zfs_dirty_data_max overflow on 32-bit
On 32 bit, the calculation of zfs_dirty_data_max from phymem will overflow,
causing it to be smaller than zfs_dirty_data_sync, and will cause txg being
delayed while no one write to disk. The end result is horrendous write speed.

On 4G ram 32-bit VM, before this patch, simple dd results in ~7MB/s. Now it
can reach speed on par with 64-bit VM.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3973
2015-12-23 17:29:35 -08:00
tuxoko becc31dda7 Fix null pointer in arc_kmem_reap_now on 32-bit
On 32 bit system, zio_buf_cache is limit to 1M. Larger than that is all NULL.
So we need to avoid reaping them.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3973
2015-12-23 17:29:35 -08:00
tuxoko 13a9527913 Prevent rm modules.* when make install
This was originally in fe0ed8f910, but somehow
was changed and not working anymore. And it will cause the following error:

modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4027
2015-12-23 17:29:34 -08:00
Brian Behlendorf 627b35a68d Add zap_prefetch() interface
Provide a generic interface to prefetch ZAP entries by name.  This
functionality is being added for external consumers such as Lustre.
It is based of the existing zap_prefetch_uint64() version which is
used by the deduplication code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4061
2015-12-23 17:29:34 -08:00
Brian Behlendorf 8bdac257b8 Fix zfsctl_lookup_objset() deadlock
The zfsctl_snapshot_unmount_delay() function must not be called
from zfsctl_lookup_objset() while it is currently holding the
zfs_snapshot_lock.  This will result in a deadlock.  It is safe
to call zfsctl_snapshot_unmount_delay_impl() directly because the
function already has a reference on the zfs_snapentry_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3997
2015-12-23 17:29:34 -08:00
Brian Behlendorf cb98d1ef27 Hold the zfs_snapentry_t before dispatch
While exceptionally unlikely to cause a problem the zfs_snapentry_t
hold should be taken before the dispatch to prevent any possibility
of the task being processed before the hold.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-23 17:29:34 -08:00
Chunwei Chen 813a4af70e Fix snapshot automount race cause EREMOTE
When a concorrent mount finishes just before calling to
zfsctl_snapshot_ismounted, if we return EISDIR, the VFS will return
with EREMOTE. We should instead just return 0, so VFS may retry and
would likely notice the dentry is alreadly mounted. This will be
inline with when usermode helper return EBUSY.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-23 17:29:34 -08:00
Brian Behlendorf e16c04d643 Change zfs_snapshot_lock from mutex to rw lock
By changing the zfs_snapshot_lock from a mutex to a rw lock the
zfsctl_lookup_objset() function can be allowed to run concurrently.
This should reduce the latency of fh_to_dentry lookups in ZFS
snapshots which are being accessed over NFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-23 17:29:34 -08:00
Chunwei Chen 9b41d9c1b2 Use spa as key besides objsetid for snapentry
objsetid is not unique across pool, so using it solely as key would cause
panic when automounting two snapshot on different pools with the same
objsetid. We fix this by adding spa pointer as additional key.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3948
Issue #3786
Issue #3887
2015-12-23 17:29:34 -08:00
Chunwei Chen c5e30a0ff9 Fix snapshot automount behavior when concurrent or fail
When concurrent threads accessing the snapdir, one will succeed the user
helper mount while others will get EBUSY. However, the original code treats
those EBUSY threads as success and goes on to do zfsctl_snapshot_add, which
causes repeated avl_add and thus panic.

Also, if the snapshot is already mounted somewhere else, a thread accessing
the snapdir will also get EBUSY from user helper mount. And it will cause
strange things as doing follow_down_one will fail and then follow_up will jump
up to the mountpoint of the filesystem and confuse the hell out of VFS.

The patch fix both behavior by returning 0 immediately for the EBUSY threads.
Note, this will have a side effect for the second case where the VFS will
retry several times before returning ELOOP.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4018
2015-12-23 17:29:34 -08:00
Brian Behlendorf 279e27db23 Set 'zfs_expire_snapshot=0' to disable auto-unmount
There are cases where it's desirable that auto-mounted snapshots
not expire after a fixed duration.  They should be unmounted only
when the filesystem they are a snapshot of is unmounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-23 17:29:34 -08:00
Brian Behlendorf 5f4004efc0 Fix vdev_queue_aggregate() deadlock
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline.  This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread.  However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced which
allocation flags to be passed.  This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer.  Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly.  Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_cache_grow_work
  kv_alloc
    spl_vmalloc
      ...
      evict
        zpl_evict_inode
          zfs_inactive
            dmu_tx_wait
              txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3808
Closes #3867
2015-12-23 17:29:34 -08:00
DHE f7dfb8b07a Make zio_taskq_batch_pct user configurable
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4110
2015-12-23 17:29:34 -08:00
Chunwei Chen 15126e5d08 Linux 4.4 compat: xattr operations takes xattr_handler
The xattr_hander->{list,get,set} were changed to take a xattr_handler,
and handler_flags argument was removed and should be accessed by
handler->flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
2015-12-04 14:59:10 -08:00
Chunwei Chen e909a45d22 Linux 4.4 compat: make_request_fn returns blk_qc_t
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
2015-12-04 14:58:32 -08:00
Justin T. Gibbs f9f5394f74 Illumos 6267 - dn_bonus evicted too early
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6267
  https://github.com/illumos/illumos-gate/commit/d205810

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Ned Bass <bass6@llnl.gov>
Issue #3865
Issue #3443
2015-10-13 15:32:16 -07:00
Chunwei Chen cd887ab869 Fix use-after-free in vdev_disk_physio_completion
Currently, vdev_disk_physio_completion will try to wake up an waiter without
first checking the existence. This creates a race window in which complete is
called after dr is freed.

We add dr_wait in dio_request to indicate the existence of waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact in 64bit.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-10-13 15:31:44 -07:00
Chunwei Chen eecb43fe8e Fix uioskip crash when skip to end
When doing uioskip to skip an iovec to the very end, the current loop
condition will falsely check pass the end of iovec. We fix this checking
uio_iovcnt first.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3806
Closes #3850
2015-09-29 15:27:14 -07:00