This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.
As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.
This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.
It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.
The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes#4801
Explicitly cast type in splat-rwlock.c test case to silence
the following warning.
warning: format ‘%ld’ expects argument of type ‘long int’,
but argument N has type ‘int’
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#574
In the default case the function must return to avoid dereferencing
'prov_mech' which will be NULL.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: candychencan <chen.can2@zte.com.cn>
Closes#5134
coverity scan CID:147531,type: Argument cannot be negative
- may copy data with negative size
coverity scan CID:147532,type: resource leaks
- may close a fd which is negative
coverity scan CID:147533,type: resource leaks
- may call pwrite64 with a negative size
coverity scan CID:147535,type: resource leaks
- may call fdopen with a negative fd
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes#5176
Cppcheck 1.63 erroneously complains about an uninitialized value
in buf_init(). Newer versions of cppcheck (1.72) handle this
correctly but we'll initialize the value anyway to silence the
warning.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5203
Avoid calculating (1<<64) if lh_prefix_len == 0. Semantics of the method remain
the same.
Assert (lh_prefix_len > 0) in zap_expand_leaf() to detect possibly the same
problem.
Issue #4883
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Explicitly promote variables to correct type. Undefined behavior is
reported because length of int is not well defined by C standard.
Issue #4883
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Undefined operation is reported by running ztest (or zloop) compiled with GCC
UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection
with large enough offset values. Logically, left shift operation would work,
but bit shift semantics in C, and limitation of uint64_t, do not produce desired
result.
Issue #5059, #4883
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Without plugging, the default 'noop' scheduler will not merge
the BIOs which are part of a large ZIO.
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#5181
Refactor the code in such a way so that inode->i_mode is being set
at the same time zp->z_mode is being changed. This has the effect of
keeping both in sync without relying on zfs_inode_update.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes#5158
In arc_state_fini() the `arc_l2c_only->arcs_list[*]` multilists
must be destroyed. This accidentally regressed in d3c2ae1c.
Reviewed by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5151Closes#5152
We must not use d_add_ci if the dentry already has the real name. Otherwise,
d_add_ci()->d_alloc_parallel() will find itself on the lookup hash and wait
on itself causing deadlock.
Tested-by: satmandu
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5124Closes#5141Closes#5147Closes#5148
coverity scan CID:147633,type: sizeof not portable
coverity scan CID:147637,type: sizeof not portable
coverity scan CID:147638,type: sizeof not portable
coverity scan CID:147640,type: sizeof not portable
In these particular cases sizeof (XX **) happens to be equal to sizeof (X *),
but this is not a portable assumption.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes#5144
dbuf_read_impl() returns (SET_ERROR(err)) when err can be 0, which adds
lots of noise in tracing logs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#4430Closes#5146
The type of "adjustmnt" was erroneously changed to unsigned when the compressed
ARC code was ported in d3c2ae1c08.
As a result of it being unsigned, the balanced metadata eviction logic
would evict all of the non-metadata.
Reviewed-by: Chris Severance <github.severach@spamgourmet.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: David Quigley <david.quigley@intel.com>
Signed-off-by: Tim Chase <tim@onlight.com>
Closes#5128Closes#5129
In order to support ABD with large blocks the spl_kmem_alloc_warn
limit needs to be increased to 64K.
A 16M block requires that pointers be stored for 4096 4K-pages
on an x86_64 system. Each of these pointers is 8 bytes requiring
an allocation of 8*4096=32,768 bytes. The addition of a small
header to this structure pushes the allocation over the default
32K warning threshold.
In addition, fix a small bug where MAX was used instead of MIN
when setting the default. This ensures a reasonable limit is
still set on systems with page sizes larger then 4K.
Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#571
Enable ignore_hole_birth by default until all known hole birth bugs
have been resolved and relevant test cases added.
Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4809Closes#5099
Simplify time handling in zfs_setattr by mimicking the logic in
setattr_copy from the linux kernel. In order to achieve this
in the case when ZFS' log is being replayed it is necessary
to unconditionally set the ctime in zfs_replay_setattr.
Also use the timespec_trunc function when assigning values to the
generic inode struct. This is currently a noop since zfs sets
s_time_gran to 1, however in the future rules about precision might
change.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes#4916
ZFS doesn't provide a custom update_time method meaning it delegates
this job to the generic VFS layer. The only time when it needs to
set the various *time values is when the inode is being marshalled
to/from the disk. Do this by moving the relevant code from
zfs_inode_update_impl to zfs_node_alloc and zfs_rezget. As a result
from this change it is no longer necessary to have multiple versions
of the zfs_inode_update function - so just nuke them and leave only
one.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Issue #227Closes#4916
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.
I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.
Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.
When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.
OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
Several assignments to arc_c had no effect because it is ultimately
initialized to arc_c_max.
This aligns ZoL better with the upstream code which removed these
assignments some time ago.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@onlight.com>
Closes#5081
In case sav->sav_config was NULL the body of the function
would skip the iteration of the l2 cache devices and will
just cleanup the old devices. However, this wasn't very obvious
since the null check was performed after the loop body and after
the old devices were cleaned. Refactor the code so that it's now
obvious when the iteration of the l2cache devices is skipped.
This fixes the following cppcheck warning:
[module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes#5087
Since they're allocated with spa_strdup(), they should be freed with
spa_strfree() so the proper length buffer is freed.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#5082Closes#5086
This first phase brings over the ZFS SLM module, zfs_mod.c, to handle
auto operations in response to disk events. Disk event monitoring is
provided from libudev and generates the expected payload schema for
zfs_mod. This work leverages the recently added devid and phys_path
strings in the vdev label.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4673
These allocations can never fail. Leaving the error handling
code here gives the impression they can so it has been removed.
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5048
perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).
Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:
old 6.85789 s
new 2.49089 s
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals
perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.
perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5033
`zpool get guid,freeing,leaked` shows SOURCE as `default`, it should
be `-` as those props are not editable.
Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it
stays `ZPROP_SRC_NONE`. Make src const to avoid future mistakes
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4170
zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7
this is declaration remains even though the implementation was
removed in commit 278bee93. Removed fastreboot_disable_highpil
which is also unused.
Signed-off-by: caoxuewen cao.xuewen@zte.com.cn
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5042
From user perspective, I would expect that ZFS is always able
to remove files and directories even when the quota is exceeded.
Authored by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6940
OpenZFS-issue: https://www.illumos.org/issues/6334
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916Closes#5044
For quite some time I was thinking about possibility to prefetch
ZFS indirection tables while doing sequential reads or writes.
Recent changes in predictive prefetcher made that much easier to
do. My tests on zvol with 16KB block size on 5x striped and 2x
mirrored pool of 10 disks show almost double throughput on sequential
read, and almost tripple on sequential rewrite. While for read alike
effect can be received from increasing maximal prefetch distance
(though at higher memory cost), for rewrite there is no other
solution so far.
Authored by: Alexander Motin <mav@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6322
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413Closes#5040
Porting notes:
- Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due
to commit 5f6d0b6 'Handle block pointers with a corrupt logical size'
- Difference from upstream in module/zfs/dmu_zfetch.c,
uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance
- Variables have been initialized at the beginning of the function
(void dmu_zfetch) to resemble the order of occurrence and account
for C99, C11 mode errors.
In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at
the db_blkptr, to prevent it from being changed by syncing context.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7086
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/98fa317Closes#5039
This fix resolves warnings reported during compiling with different gcc
optimization levels in debug mode,
Test tools:
gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
Linux version: 2.6.32-573.18.1.el6.x86_64, Red Hat Enterprise Linux Server release 6.1 (Santiago)
List of warnings:
CFLAGS=-O1 ./configure --enable-debug ;make
../../module/icp/core/kcf_sched.c: In function ‘kcf_aop_done’:
../../module/icp/core/kcf_sched.c:499: error: ‘fg’ may be used uninitialized in this function
../../module/icp/core/kcf_sched.c:499: note: ‘fg’ was declared here
CFLAGS=-Os ./configure --enable-debug ; make
libzfs_dataset.c: In function ‘zfs_prop_set_list’:
libzfs_dataset.c:1575: error: ‘nvl_len’ may be used uninitialized in this function
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5022
ARC will evict meta buffers that exceed the arc_meta_limit. Before a further
investigating on whether we should take special protection on meta buffers,
this tunable make arc_meta_limit adjustable for different workloads.
People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko,
so some range check is added to guarantee a suitable arc_meta_limit.
Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style
tunable as well.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4957
As is the case with traverse_prefetch_thread(), the deep stacks caused
by traversal require disabling reclaim in the send traverse thread.
Also, do the same for receive_writer_thread() in which similar problems
have been observed.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4912Closes#4998
API Change: Module parameter set/get methods take const parameter in
Grsecurity kernel v4.7.1
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4997Closes#5001
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).
dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:
dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()
This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.
Sponsored by: Intel Corp.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7004
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109Closes#4641Closes#4972
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.
Sponsored by: Intel Corp.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7003
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108Closes#4972
When spa retry load succeeds and spa recovery is requested it may
leak in spa_load_best function. Always free the generated config
when it is not assigned to the spa.
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4940
When DEBUG_KMEM_TRACKING is enabled in SPL, we keep tracking all
the buffers alloced by kmem_alloc() and kmem_zalloc(). If a NULL
pointer which indicates no track info in SPL is passed to
spl_kmem_free_track, we just ignore it.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4967
Closes#567
This is another bug in the long line of hole-birth related issues. In
this particular case, it was discovered that a previous hole-birth fix
(illumos bug 6513, commit bc77ba73) did not cover as many cases as we
thought it did. While the issue worked in the case of hole-punching
(writing zeroes to a large part of a file), it did not deal with
truncation, and then writing beyond the new end of the file.
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7176
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad
Upstream-bugs: DLPX-46009
Porting notes:
- Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ;
keep previous position of the initialization
Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a
considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to
`kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`,
which is around 2KiB. This structure is used as a stack, to limit the size of
the C stack as dbuf_hold() calls itself recursively. We make a recursive call
to hold the parent's dbuf when the requested dbuf is not found. The vast
majority of the time, the parent or grandparent indirect dbuf is cached, so the
number of recursive calls is very low. However, we initialize this entire
array for every call to dbuf_hold().
To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()`
instead of `kmem_zalloc()`. __dbuf_hold_impl_init is changed to initialize all
members of the struct before they are used. I observed ~5% performance
improvement on a workload which creates many files.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4974
- Benchmark memory block is increased to 128kiB to reflect real block sizes more
accurately. Measurements include all three stages needed for checksum generation,
i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset
overhead of time function.
- Fastest implementation selects native and byteswap methods independently in
benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()`
are introduced.
- Implementation mutex lock is replaced by atomic variable.
- To save time, benchmark is not executed in userspace. Instead, highest supported
implementation is used for fastest. Default userspace selector is still 'cycle'.
- `fletcher_4_native/byteswap()` methods use incremental methods to finish
calculation if data size is not multiple of vector stride (currently 64B).
- Added `fletcher_4_native_varsize()` special purpose method for use when buffer size
is not known in advance. The method does not enforce 4B alignment on buffer size, and
will ignore last (size % 4) bytes of the data buffer.
- Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows
throughput for all supported implementations (in B/s), native and byteswap,
as well as the code [fastest] is running.
Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`:
implementation native byteswap
scalar 4768120823 3426105750
sse2 7947841777 4318964249
ssse3 7951922722 6112191941
avx2 13269714358 11043200912
fastest avx2 avx2
Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`:
implementation native byteswap
scalar 1291115967 1031555336
sse2 2539571138 1280970926
ssse3 2537778746 1080016762
avx2 4950749767 1078493449
avx512f 9581379998 4010029046
fastest avx512f avx512f
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4952
Adds a module option which disables the hole_birth optimization
which has been responsible for several recent bugs, including
issue #4050.
Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4833
Import a raidz pool which has a vdev with a bad label, zpool status
shows the right state of the dev, but the wrong state of the pool.
The pool state should be DEGRADED, not ONLINE.
We examine the label in vdev_validate while in spa_load_impl, the bad
label can be detected but doesn't propagate its state to the parent.
There are other chances to propagate state in the following vdev_load
if we failed to load DTL, but our pool is raidz1 which can tolerate a
faulted disk. So we lost the last chance to correct the pool state.
Propagate the leaf vdev's state to parent if its label was corrupted,
as is done elsewhere in vdev_validate.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#4948
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283
Porting Notes:
In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.
Events which would be logged as sysevents on illumos have been
been mapped to the 'sysevent' class for Linux. In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.
The scripts provided by ZoL have been updated, however users which
provide scripts for any of the following events will need to rename
them based on the new subclass names.
ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync
ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy
ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid
ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove
ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear
ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check
ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare
ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand
ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start
ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish
ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start
ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish
ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach
If there is no explicit note in the .S files, the obj file will mark it
as requiring an executable stack. This is unneeded and causes issues on
hardened systems.
More info:
https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4947Closes#4962
The constify plugin will automatically constify a class of types that contain
only function pointers. The icp structs fail to build if this is enabled with
the following error. The no_const attribute makes the plugin skip those
structs.
module/icp/spi/kcf_spi.c: In function ‘copy_ops_vector_v1’:
module/icp/spi/kcf_spi.c:61:16: error: assignment of read-only location ‘*dst_ops->cou.cou_v1.co_control_ops’
*((dst)->ops) = *((src)->ops);
^
module/icp/spi/kcf_spi.c:74:2: note: in expansion of macro ‘KCF_SPI_COPY_OPS’
KCF_SPI_COPY_OPS(src_ops, dst_ops, co_control_ops);
^
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4947Closes#4962
nvlist_pack() and nvlist_unpack are implemented recursively, which can
cause the stack to overflow with a deeply nested nvlist; i.e. an nvlist
which contains an nvlist, which contains an nvlist, which...
Unprivileged users can pass an nvlist to the kernel via certain ioctls
on /dev/zfs, which the kernel will unpack without additional permission
checking or validation. Therefore, an unprivileged user can cause the
kernel's stack to overflow and panic.
Ideally, these functions would be implemented non-recursively. As a
quick fix, this patch limits the depth of the recursion and returns an
error when attempting to pack and unpack a deeply-nested nvlist.
Signed-off-by: Adam Leventhal <ahl@delphix.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/7263
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0511d6d
-
Fix bugs due to kernel change in torvalds/linux@4bacc9c923 ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").
This problem crashes system when use zfs as a layer of overlayfs.
Signed-off-by: Chen Haiquan <oc@yunify.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4914Closes#4935
The indefinite article before nvlist should be "an", not "a".
We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should
stay the same as we are such a strict filesystem.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4941
Non-Linux OpenZFS implementations require additional support to be
used a root pool. This code should simply be removed to avoid
confusion and improve readability.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4951
All users of bio->bi_rw have been replaced with compatibility wrappers.
This allows the kernel specific logic to be abstracted away, and for
each of the supported cases to be documented with the wrapper. The
updated interfaces are as follows:
* void blk_queue_set_write_cache(struct request_queue *, bool, bool)
* boolean_t bio_is_flush(struct bio *)
* boolean_t bio_is_fua(struct bio *)
* boolean_t bio_is_discard(struct bio *)
* boolean_t bio_is_secure_erase(struct bio *)
* VDEV_WRITE_FLUSH_FUA
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4951
This fix resolves warnings reported during compiling of user-space
libraries with different gcc optimization levels.
Tested with gcc versions: 4.9.2 (Debian), and 6.1.1 (Fedora).
The patch enables use of following opt levels: O0, O1, O2, O3, Og, Os, Ofast.
List of warnings:
[GCC 4.9.2 -Os]
libzfs_sendrecv.c:3726:26: error: 'clp' may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 4.9.2 -Og]
fs_fletcher.c:323:26: error: 'idx' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1290:12: error: 'atp' may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 4.9.2 -Ofast]
u8_textprep.c:1310:9: error: 'tc[3ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
u8_textprep.c:177:23: error: 'u8t[0ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:2089:37: error: ‘hds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3216:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1591:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3341:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1153:8: error: 'dcount[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1167:17: error: 'dst[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
kernel.c:1005:2: error: ‘resid’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:2826:8: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1584:13: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1792:66: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3986:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 6.1.1]
Resolved in PR #4907
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4937
Starting from Linux 4.7, get_acl will set acl cache pointer to temporary
sentinel value before calling i_op->get_acl. Therefore we can't compare
against ACL_NOT_CACHED and return.
Since from Linux 3.14, get_acl already check the cache for us, so we
disable this in zpl_get_acl.
Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4944Closes#4946
The posix_acl_valid() function has been updated to require a
user namespace. Filesystem callers should normally provide the
user_ns from the super block associcated with the ACL; the
zpl_posix_acl_valid() wrapper has been added for this purpose.
See https://github.com/torvalds/linux/commit/0d4d717f for
complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4922
Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING
configure checks, all supported kernel provide this functionality.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4922
* When the uid/gid change is handled in zfs_setattr we want to
actually adjust the user passed uid to a KUID and write that to disk.
* In trace points use the i_uid member without doing translation,
since it has already been performed.
* Use kuid in zfs_aclset_common
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4928
When arc_max is increased, arc_meta_limit will not be updated to 3/4
of the new arc_c_max value. This was done originally to preserve any
existing maximum value. This turned out to be counter intuitive to
users and this fix changes that behavior. If zfs_arc_meta_limit is
non-default, it will be picked up later in the ARC tuning function.
Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4893
As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) a self-comparison is
detected by gcc in metaslab_alloc(). Resolve the warning by passing
a physical size of 0 to BP_SET_BIRTH() as it done by other callers.
module/zfs/metaslab.c: In function ‘metaslab_alloc’:
module/zfs/metaslab.c:2575:184: error: self-comparison always evaluates
to true [-Werror=tautological-compare]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4907
Fix a possible VDEV statistics array overflow when ZIOs with
ZIO_PRIORITY_NOW complete.
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4883Closes#4917
Currently i_blkbits is always set to SPA_MINBLOCKSHIFT every time
zfs_inode_update_impl is called. Since this value never changes
move its assignment to at inode creation time.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4906
The newly added icp module uses a hardcoded value of CDDL for the license,
however in local development one might want to change that to something
else in order to facilitate compiling against lock debugging enabled kernel.
All modules of the zfs use the ZFS_META_LICNSE string which is replaced with
the value held in the META file. One can modify the value in the META file
once and then rerun the configure to have all modules' licenses changed.
Change the icp module license string to be ZFS_META_LICENSE so that it
falls under the same paradigm.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4905
In zfs_ioc_log_history() function the tsd_set() function is called
with NULL which causes the zfs_allow_log_destroy() to be run. In
this case the passed value will be NULL. This is normally entirely
safe because strfree() maps directly to kfree() which may be passed
a NULL. However, since alternate implementations of strfree() may
not handle this gracefully add a check for NULL.
Observed under an embedded Linux 2.6.32.41 kernel running the
automated testing while running the ZFS Test Suite.
Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4872
New REQ_OP_* definitions have been introduced to separate the
WRITE, READ, and DISCARD operations from the flags. This included
changing the encoding of bi_rw. It places REQ_OP_* in high order
bits and other stuff in low order bits. This encoding is done
through the new helper function bio_set_op_attrs. For complete
details refer to:
https://github.com/torvalds/linux/commit/f215082https://github.com/torvalds/linux/commit/4e1b2d5
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4892Closes#4899
The rw argument has been removed from submit_bio/submit_bio_wait.
Callers are now expected to set bio->bi_rw instead of passing it
in. See https://github.com/torvalds/linux/commit/4e49ea4a for
complete details.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4892
Issue #4899
For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type. A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro. See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#563
The memory allocation and locking in `spa_txg_history_*()` can
potentially block txg_hold_open for arbitrarily long periods of time.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4333
Here's the problem - on 4K native devices in userland on
Linux using O_DIRECT, buffers must be 4K aligned or I/O
will fail with EINVAL, causing zdb (and others) to coredump.
Since userland probably doesn't need optimized buffer caches,
we just force 4K alignment on everything.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#4479
DMU_MAX_ACCESS should be cast to a uint64_t otherwise the
multiplication of DMU_MAX_ACCESS with spa_asize_inflation will
be 32 bit and may lead to an overflow. Currently DMU_MAX_ACCESS
is 64 * 1024 * 1024, so spa_asize_inflation being 64 or more will
lead to an overflow.
Found by static analysis with CoverityScan 0.8.5
CID 150942 (#1 of 1): Unintentional integer overflow
(OVERFLOW_BEFORE_WIDEN)
overflow_before_widen: Potentially overflowing expression
67108864 * spa_asize_inflation with type int (32 bits, signed)
is evaluated using 32-bit arithmetic, and then used in a context
that expects an expression of type uint64_t (64 bits, unsigned).
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4889
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_mata_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_limit as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773Closes#4858
Prior to b39c22b, which was first generally available in the 0.6.5
release as b39c22b, ZoL never actually submitted synchronous read or write
requests to the Linux block layer. This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.
In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b. In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.
The original rationale for introducing synchronous operations in b39c22b
was to hurry certains requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency. This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.
To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.
For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
ise used for the same purpose.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4858
Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.
module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
avl_insert(tree, new_node, where);
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4267
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
When zfs_domount fails zsb will be freed, and its caller
mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into
zfs_preumount.
In order to make sure we don't touch any nonexistent stuff, we must make sure
s_fs_info is NULL in the fail path so zfs_preumount can easily check that.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4867
Issue #4854
Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4860
- Implementation lock replaced with atomic variable
- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`
- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813
- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392
- Minor fixes and cleanups
- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
Wait for iput_async before entering evict_inodes in
generic_shutdown_super. The reason we must finish before
evict_inodes is when lazytime is on, or when zfs_purgedir calls
zfs_zget, iput would bump i_count from 0 to 1. This would race
with the i_count check in evict_inodes. This means it could
destroy the inode while we are still using it.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4854
In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue
was fixed in gcc 4.6.
To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.
Disable zimport testing against master where this flaw exists:
TEST_ZIMPORT_VERSIONS="installed"
Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4862
It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from the previously interrupted
'zfs receieve'. Try to remove the hidden child (%recv) and after
that try to remove the target dataset. If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4818
Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.
The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswaped
version. The average performance of the two is used to select the
the fastest implementation available on the host system.
Adds a pair of fields to an existing zcommon module parameter:
- zfs_fletcher_4_impl (str)
"sse2" - new SSE2 implementation if available
"ssse3" - new SSSE3 implementation if available
Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4789
A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.
We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.
In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4838
Issue #227
Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4846
In arc_buf_info(), the arc_buf_t may have no header. If not, don't try
to fetch the arc buffer stats and instead just zero them.
The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4837
zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist, its callers are responsible for freeing it.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4829
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.
Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
db_blkptr = NULL and it's dirtied.
Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
xattr fits into the bonus buffer, so it's removed. The dbuf is
undirtied in this txg, but it's still referenced and cannot be
destroyed.
Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
(not done yet).
* A new change makes the spill buffer necessary again.
sa_build_layouts() ends up calling dbuf_find() to locate the
dbuf. It finds the old dbuf because it has not been destroyed yet
(it will be destroyed when the previous write is done and there
are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
referenced, so it's not destroyed.
Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
directly copied into the dnode, overwriting the blkptr area because,
in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
gets corrupted.
Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3937
zp->z_xattr_parent will pin the parent. This will cause huge issue
when unlink a file with xattr. Because the unlinked file is pinned, it
will never get purged immediately. And because of that, the xattr
stuff will never be marked as unlinked. So the whole unlinked stuff
will stay there until shrink cache or umount.
This change partially reverts e89260a. This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
We need to set inode->i_nlink to zero so iput will purge it. Without this, it
will get purged during shrink cache or umount, which would likely result in
deadlock due to zfs_zget waiting forever on its children which are in the
dispose_list of the same thread.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
When generation mismatch, it usually means the file pointed by the file handle
was deleted. We should return ESTALE to indicate this. We return ENOENT in
zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
Allow accessing XATTR through export handle is a very bad idea. It
would allow user to write whatever they want in fields where they
otherwise could not.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
Certain ioctl operations will call get_zfs_sb, which will holds an active
count on sb without checking whether it's active or not. This will result
in use-after-free. We fix this by using atomic_inc_not_zero to make sure
we got an active sb.
P1 P2
--- ---
deactivate_locked_super(): s_active = 0
zfs_sb_hold()
->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
zfs_sb_rele(zsb)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, new opt level -Og is introduced for debugging, which
does not trigger this warning.
Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4799
The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned. This is because both the AVX2 and SSE implementations process
four 32-bit words per iterations. Fallback to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4330
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.
OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e
Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface. The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.
ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler. Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6
Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f
Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3542
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates. As summarized by
@mahrens in #4636:
"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects). When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.
The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."
Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limit the rescan to at most once per TXG.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>a
Ported by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0
If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.
Fix is to modify the dbuf_read code to fill in birth time data.
The commit f74b821 caused a regression where creating file through NFS will
always create a file owned by root. This is because the patch enables the KSID
code in zfs_acl_ids_create, which it would use euid and egid of the current
process. However, on Linux, we should use fsuid and fsgid for file operations,
which is the original behaviour. So we revert this part of code.
The patch also enables secpolicy_vnode_*, since they are also used in file
operations, we change them to use fsuid and fsgid.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4772Closes#4758
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4328
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned. The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg. This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably. This patch reverts to the d_prune_aliaes() method
in case the kernel's per-superblock shrinker is not able to free anything.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: #4726
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands. In addition, non-
privileged users should be able to run all of the following commands:
* zpool [list | iostat | status | get]
* zfs [list | get]
Historically this functionality was not available on Linux. In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability. Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.
Even with this change some limitations remain. Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace). This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code. It
may be possible to add this functionality in the future.
This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite. These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.
Two minor bug fixes were required for test-running.py. First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it. Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.
Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#362Closes#434Closes#4100Closes#4394Closes#4410Closes#4487
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.
New zcommon module parameters:
- zfs_fletcher_4_impl (str): selects the implementation to use.
"fastest" - use the fastest version available
"cycle" - cycle trough all available impl for ztest
"scalar" - use the original version
"avx2" - new AVX2 implementation if available
Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar: 4216 MB/s
- AVX2: 14499 MB/s
See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4330
The policy is to try to allocate with KM_NOSLEEP, which will lead to
memory allocation with GFP_ATOMIC, and if it fails, it will launch
an taskq to expand slab space.
This way it should be able to get better NUMA memory locality and
reduce the overhead of context switch.
Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#551
This splat_vprint is using tq_arg->name after tq_arg is freed.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#557
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then
does rw_enter(RW_READER) if it fails. This violate the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.
This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes#554
Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization. This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().
George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl(). Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.
However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes(). This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.
Signed-off-by: GeLiXin <47034221@qq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4652
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130
Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent
the underlying zsb from disappearing if there's a concurrent umount. We fix
this by force the caller of arc_remove_prune_callback to wait for
arc_prune_taskq to finish.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4687Closes#4690
wait_event is a macro, so the current implementation will cause re-
evaluation of tq_next_id every time it wakes up. This would cause
taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
While taskq_destroy would wait for dynamic_taskq to finish its tasks, but it
does not implies the thread being spawned is up and running. This will cause
taskq to be freed before the thread can exit.
We fix this by using tq_nspawn to indicate how many threads are being spawned
before they are inserted to the thread list. And have taskq_destroy to wait
for it to drop to zero.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553Closes#550
In 39cd90e, I mistakenly disabled the ability of using absolute expire time in
cv_timedwait_hires. I don't quite sure why I did that, so let's restore it.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This
will cause funny behaviour for the mounted snapdirs. Especially for
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
automount it again as long as someone is still using the detached mount.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4514Closes#4661Closes#4672
Register iterate_shared if it exists so the kernel will used shared
lock and allowing concurrent readdir.
Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE
to allow concurrent seeking.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4664Closes#4665
Linux 4.7 changes i_mutex to i_rwsem, and we should used inode_lock and
inode_lock_shared to do exclusive and shared lock respectively.
We use spl_inode_lock{,_shared}() to hide the difference. Note that on older
kernel you'll always take an exclusive lock.
We also add all other inode_lock friends. And nested users now should
explicitly call spl_inode_lock_nested with correct subclass.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4665
Closes#549
This field is a duplicate of the inode->i_generation, so just
kill it.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4538Closes#4654
This reverts commit 8302528617 and
ebecfcd699 which broke the build.
While these patches do apply cleanly and passed previous test
runs they need to be updated to account for the changes made in
commit 241b541574.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
Fixed bug introduced in commit #c35b1882. Hinted by gcc:
zio_inject.c: In function ‘zio_handle_io_delay’:
zio_inject.c:382:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
if (handler->zi_record.zi_freq != 0 &&
^~
zio_inject.c:384:4: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
continue;
^~~~~~~~
Signed-off-by: Marcel Huber <marcelhuberfoo@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4632
struct zvol_state contains a dummy znode, which is around 1KB on x64,
only for zfs_range_lock. But in reality, other than z_range_lock and
z_range_avl, zfs_range_lock only need znode on regular file, which
means we add 1KB on a structure and gain nothing.
In this patch, we remove the dummy znode for zvol_state. In order to
do that, we also need to refactor zfs_range_lock a bit. We move
z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t.
This new struct replaces znode_t as the main handle inside the range
lock functions.
We also add pointers to z_size, z_blksz, and z_max_blksz so range lock
code doesn't depend on znode_t. This allows non-ZPL consumers like
Lustre to use the range locks with their equivalent znode_t structure.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4510
The was originally using interruptible cv_timedwait_sig, but was changed
to uninterruptible cv_timedwait_hires in ae6d0c6. Use _sig_hires instead
to allow interruptible sleep.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4633Closes#4634
6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6093
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0f92170Closes#4630
This function was always implemented slightly differently under Linux
and therefore never suffered from this issue. The patch has been
updated and applied as cleanup in order to minimize differences with
the upstream OpenZFS code.
This reverts commit 4cd77889b6. The
i_generation field in the inode is 32-bit and the SA code expects
64-bit fixed values. Revert this optimization for now until
this is cleanly addressed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4538
The receive_writer_arg and receive_arg structures become large
when ZFS is compiled with debugging enabled. This results in
gcc throwing an error about excessive stack usage:
module/zfs/dmu_send.c: In function ‘dmu_recv_stream’:
module/zfs/dmu_send.c:2502:1: error: the frame size of 1256 bytes is
larger than 1024 bytes [-Werror=frame-larger-than=]
Fix this by allocating those functions on the heap, rather than
on the stack.
With patch: dmu_send.c:2350:1:dmu_recv_stream 240 static
Without patch: dmu_send.c:2350:1:dmu_recv_stream 1336 static
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4620
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Denys Rtveliashvili <denys@rtveliashvili.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
An initial version of this patch was applied in commit 29572cc and
subsequently refined upstream. Since the implementations do not
conflict with each other both are left applied for now.
OpenZFS-issue: https://www.illumos.org/issues/6842
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/02525cdCloses#4615
Commit e0ab3ab introduced two blocks of code which are only needed
when debugging is enabled. These blocks should be wrapped with
ZFS_DEBUG for clarity and to prevent unused variable warnings in
a production build.
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4515
6286 ZFS internal error when set large block on bootfs
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6286
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6de9bb5Closes#4585
6736 ZFS per-vdev ZAPs
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/6736https://github.com/openzfs/openzfs/commit/215198a
Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4515
This field is a duplicate of the inode->i_generation, so just kill it
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4538
At the very least, the zfs_secpolicy_write_perms ioctl security policy
callback, which calls dsl_dataset_hold(), can require freeing memory and,
therefore, re-enter ZFS. This patch enables PF_FSTRANS for all of the
security policy callbacks similarly to the manner in which it's enabled
for the actual ioctl callback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4554
These files get generated when CONFIG_DEBUG_INFO_DWARF4 is enabled in
Linux .config.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4580
6844 dnode_next_offset can detect fictional holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
dnode_next_offset is used in a variety of places to iterate over the
holes or allocated blocks in a dnode. It operates under the premise that
it can iterate over the blockpointers of a dnode in open context while
holding only the dn_struct_rwlock as reader. Unfortunately, this premise
does not hold.
When we create the zio for a dbuf, we pass in the actual block pointer
in the indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the
bp is now inconsistent from the perspective of dnode_next_offset: the bp
will appear to be a hole until zio_dva_allocate finally finishes filling
it in. In the meantime, dnode_next_offset can detect a hole in the dnode
when none exists.
I was able to experimentally demonstrate this behavior with the
following setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the
first L0 block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the
middle of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent
hole.
The fix is to ensure that it is valid to iterate over the indirect
blocks in a dnode while holding the dn_struct_rwlock by passing the zio
a copy of the BP and updating the actual BP in dbuf_write_ready while
holding the lock.
References:
https://www.illumos.org/issues/6844https://github.com/openzfs/openzfs/pull/82
DLPX-35372
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4548
6659 nvlist_free(NULL) is a no-op
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6659https://github.com/illumos/illumos-gate/commit/aab83bb
Ported-by: David Quigley <dpquigl@davequigley.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4566
The problem described in 2a5d574 also applies to XFS's file or inode
fallocate method. Both paths may trigger writeback and expose this
issue, see the full stack below.
When layered on XFS a warning will be emitted under CentOS7 when entering
either the file or inode fallocate method with PF_FSTRANS already set.
To avoid triggering this error PF_FSTRANS is cleared and then reset
in vn_space().
WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0
Call Trace:
[<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0
[<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
[<ffffffff81173ed7>] __writepage+0x17/0x40
[<ffffffff81176f81>] write_cache_pages+0x251/0x530
[<ffffffff811772b1>] generic_writepages+0x51/0x80
[<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs]
[<ffffffff81177300>] do_writepages+0x20/0x30
[<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100
[<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0
[<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs]
[<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs]
[<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl]
[<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs]
[<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs]
[<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs]
[<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl]
[<ffffffff810c1777>] kthread+0xd7/0xf0
[<ffffffff8167de2f>] ret_from_fork+0x3f/0x70
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closeszfsonlinux/zfs#4529
When ZFS partitions a block device it must wait for udev to create
both a device node and all the device symlinks. This process takes
a variable length of time and depends on factors such how many links
must be created, the complexity of the rules, etc. Complicating
the situation further it is not uncommon for udev to create and
then remove a link multiple times while processing the udev rules.
Given the above, the existing scheme of waiting for an expected
partition to appear by name isn't 100% reliable. At this point
udev may still remove and recreate think link resulting in the
kernel modules being unable to open the device.
In order to address this the zpool_label_disk_wait() function
has been updated to use libudev. Until the registered system
device acknowledges that it in fully initialized the function
will wait. Once fully initialized all device links are checked
and allowed to settle for 50ms. This makes it far more likely
that all the device nodes will exist when the kernel modules
need to open them.
For systems without libudev an alternate zpool_label_disk_wait()
was updated to include a settle time. In addition, the kernel
modules were updated to include retry logic for this ENOENT case.
Due to the improved checks in the utilities it is unlikely this
logic will be invoked. However, if the rare event it is needed
it will prevent a failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#4523Closes#3708Closes#4077Closes#4144Closes#4214Closes#4517
Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to
whole name rather than prefix should use "name" instead of "prefix".
Otherwise, kernel will return with EINVAL when it tries to resolve handlers.
Also, we remove the strcmp checks when xattr_handler has name, because
xattr_resolve_name will do the check for us.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4549Closes#4537
In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and
pn_free() functions must be implemented. The existing illumos
implementation were used for this purpose.
The `flags` argument which was used in places wrapped by the
HAVE_PN_UTILS condition has beed added back to zfs_remove() and
zfs_link() functions. This removes a small point of divergence
between the ZoL code and upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4522
Commit 4967a3e introduced a typo that caused the ZPL to store the
intended default ACL as an access ACL. Due to caching this problem
may not become visible until the filesystem is remounted or the inode
is evicted from the cache. Fix the typo and add a regression test.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4520
Commit d1d7e2689d ("cstyle: Resolve C style issues") inverted
the logic on the none elevator comparison. Fix this and make it
cstyle warning clean.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4507
Linux 4.0 introduces lazytime. The idea is that when we update the atime, we
delay writing it to disk for as long as it is reasonably possible.
When lazytime is enabled, dirty_inode will be called with only I_DIRTY_TIME
flag whenever i_atime is updated. So under such condition, we will set
z_atime_dirty. We will only write it to disk if file is closed, inode is
evicted or setattr is called. Ideally, we should also write it whenever SA
is going to be updated, but it is left for future improvement.
There's one thing that we should take care of now that we allow i_atime to be
dirty. In original implementation, whenever SA is modified, zfs_inode_update
will be called to overwrite every thing in inode. This will cause dirty
i_atime to be discarded. We fix this by don't overwrite i_atime in
zfs_inode_update. We only overwrite i_atime when allocating new inode or doing
zfs_rezget with zfs_inode_update_new.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
The problem for atime:
We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its
handling is a mess. A huge part of mess regarding atime comes from
zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave
inconsistently with those three values.
zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you
don't pass ATTR_ATIME. Which means every write(2) operation which only updates
ctime and mtime will cause atime changes to not be written to disk.
Also zfs_inode_update from write(2) will replace inode->i_atime with what's
inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2).
You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0.
Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's
inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new),
SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll
leave with a stale atime.
The problem for relatime:
We do have a relatime config inside ZFS dataset, but how it should interact
with the mount flag MS_RELATIME is not well defined. It seems it wanted
relatime mount option to override the dataset config by showing it as
temporary in `zfs get`. But at the same time, `zfs set relatime=on|off` would
also seems to want to override the mount option. Not to mention that
MS_RELATIME flag is actually never passed into ZFS, so it never really worked.
How Linux handles atime:
The Linux kernel actually handles atime completely in VFS, except for writing
it to disk. So if we remove the atime handling in ZFS, things would just work,
no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever
VFS updates the i_atime, it will notify the underlying filesystem via
sb->dirty_inode().
And also there's one thing to note about atime flags like MS_RELATIME and
other flags like MS_NODEV, etc. They are mount point flags rather than
filesystem(sb) flags. Since native linux filesystem can be mounted at multiple
places at the same time, they can all have different atime settings. So these
flags are never passed down to filesystem drivers.
What this patch tries to do:
We remove znode->z_atime, since we won't gain anything from it. We remove most
of the atime handling and leave it to VFS. The only thing we do with atime is
to write it when dirty_inode() or setattr() is called. We also add
file_accessed() in zpl_read() since it's not provided in vfs_read().
After this patch, only the MS_RELATIME flag will have effect. The setting in
dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME
set according to the setting in dataset in future patch.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
As described in torvalds/linux@4a2d057e the macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced
to make it possible to add bigger chunks to the page cache. This
never panned out and it has therefore been removed from the kernel.
ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros
and calls to page_cache_release() have been replaced with put_page().
There was no need to introduce a configure check for this because
these interfaces have existed for a very long time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4489
We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS
compat ioctls for systems such as powerpc64. We use the normal
compat ioctl idiom as used by a variety of file systems to provide
this support.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4477
When debugging is enabled on a very recent version of gcc
(tested with 5.3.0), DVA_SET_GANG(dva, !!(flags)) fails
because an assertion causes a comparison between what is
technically a boolean and an integer.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4465
The zpool_scrub_002, zpool_scrub_003, zpool_scrub_004 test cases fail
reliably when running against small pools or fast storage. This
occurs because the scrub/resilver operation completes before subsequent
commands can be run.
A one second delay has been added to 10% of zio's in order to ensure
the scrub/resilver operation will run for at least several seconds.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4450
When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another
thread to be spawned. This will cause TQ_NOQUEUE to behave similarly
as it does with non-dynamic taskqs.
Add support for TQ_NOQUEUE to taskq_dispatch_ent().
Signed-off-by: Tim Chase <tim@onlight.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#530
6681 zfs list burning lots of time in dodefault() via dsl_prop_*
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
References:
https://www.illumos.org/issues/6681https://github.com/illumos/illumos-gate/commit/d09e447
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4406
6370 ZFS send fails to transmit some holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Stefan Ring <stefanrin@gmail.com>
Reviewed by: Steven Burgess <sburgess@datto.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6370https://github.com/illumos/illumos-gate/commit/286ef71
In certain circumstances, "zfs send -i" (incremental send) can produce
a stream which will result in incorrect sparse file contents on the
target.
The problem manifests as regions of the received file that should be
sparse (and read a zero-filled) actually contain data from a file that
was deleted (and which happened to share this file's object ID).
Note: this can happen only with filesystems (not zvols, because they do
not free (and thus can not reuse) object IDs).
Note: This can happen only if, since the incremental source (FromSnap),
a file was deleted and then another file was created, and the new file
is sparse (i.e. has areas that were never written to and should be
implicitly zero-filled).
We suspect that this was introduced by 4370 (applies only if hole_birth
feature is enabled), and made worse by 5243 (applies if hole_birth
feature is disabled, and we never send any holes).
The bug is caused by the hole birth feature. When an object is deleted
and replaced, all the holes in the object have birth time zero. However,
zfs send cannot tell that the holes are new since the file was replaced,
so it doesn't send them in an incremental. As a result, you can end up
with invalid data when you receive incremental send streams. As a
short-term fix, we can always send holes with birth time 0 (unless it's
a zvol or a dataset where we can guarantee that no objects have been
reused).
Ported-by: Steven Burgess <sburgess@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4369Closes#4050
This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms. It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held. On other platforms the lock is never released during
the upgrade process. This is necessary under Linux because the kernel
does not provide an upgrade function.
There are currently no callers in the ZFS code where this change in
behavior is a problem. In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes#534
zfsonlinux issue #3681 - lock order inversion between zvol_open() and
dsl_pool_sync()...zvol_rename_minors()
Remove trylock of spa_namespace_lock as it is no longer needed when
zvol minor operations are performed in a separate context with no
prior locking state; the spa_namespace_lock is no longer held
when bdev->bd_mutex or zfs_state_lock might be taken in the code
paths originating from the zvol minor operation callbacks.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3681
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset
zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()
Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.
* Each taskq must be single threaded to ensure tasks are always
processed in the order in which they were dispatched.
* There is a taskq per-pool in order to keep the pools independent.
This way if one pool is suspended it will not impact another.
* The preferred location to dispatch a zvol minor task is a sync
task. In this context there is easy access to the spa_t and
minimal error handling is required because the sync task must
succeed.
Support for asynchronous zvol minor operations address issue #3681.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2217Closes#3678Closes#3681
Since they both evaluate to zero, this is a semi-cosmetic change
but the latter is the proper value to use as an argument to
taskq_dispatch_delay().
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4393
There is a race condition when new transaction group is added
to dp->dp_dirty_datasets list by the zap_update in the zvol_update_volsize.
Meanwhile, before these dirty data are synchronized, the receive process
can cause that dmu_recv_end_sync is executed. Then finally dirty data
are going to be synchronized but the synchronization ends with the NULL
pointer dereference error.
Signed-off-by: ab-oe <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4116
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
References:
https://github.com/freebsd/freebsd@5c7a6f5dhttps://github.com/freebsd/freebsd@31b7f68dhttps://github.com/freebsd/freebsd@e186f564
Performance Testing:
https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141
Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc by my
collegue Andriy Gapon has been included. It applied perfectly, but
added a cstyle regression.
- This replaces 556011dbec entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4334
Set a limit for the largest compressed block which can be written
to an L2ARC device. By default this limit is set to 16M so there
is no change in behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#4323
Close the race window in zvol_open() to prevent removal of
zvol_state in the 'first open' code path. Move the call to
check_disk_change() under zvol_state_lock to make sure the
zvol_media_changed() and zvol_revalidate_disk() called by
check_disk_change() are invoked with positive zv_open_count.
Skip opened zvols when removing minors and set private_data
to NULL for zvols that are not in use whose minors are being
removed, to indicate to zvol_open() that the state is gone.
Skip opened zvols when renaming minors to avoid modifying
zv_name that might be in use, e.g. in zvol_ioctl().
Drop zvol_state_lock before calling add_disk() when creating
minors to avoid deadlocks with zvol_open().
Wrap dmu_objset_find() with spl_fstran_mark()/unmark().
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4344
The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is
that the former takes a hold while the latter uses an existing hold.
`zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while
our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no
apparent reason, we inherited a `zvol_read()` function from
OpenSolaris that does `dmu_read_uio()`. illumos-gate also still
uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`,
which is more performant.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4316
In illumos-gate, `zvol_read` and `zvol_write` are both passed uio_t
rather than bio_t. Since we are translating from bio to uio for both, we
might as well unify the logic and have code more similar to its illumos
counterpart. At the same time, we can fix some regressions that occurred
versus the original code from illumos-gate.
We refactor zvol_write to take uio and also correct the
following problems:
1. We did `dnode_hold()` on each IO when we already had a hold.
2. We would attempt to send writes that exceeded `DMU_MAX_ACCESS` to the
DMU.
3. We could call `zil_commit()` twice. In this case, this is because
Linux uses the `->write` function to send flushes and can aggregate the
flush with a write. If a synchronous write occurred with the flush, we
effectively flushed twice when there is no need to do that.
zvol_read also suffers from the first two problems. Other platforms
suffer from the first, so we leave that for a second patch so that there
is a discrete patch for them to cherry-pick.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4316
When using large blocks like 1M, there will be more than UINT16_MAX qwords in
one block, so this ASSERT would go off. Also, it is possible for the histogram
to overflow. We cap them to UINT16_MAX to prevent this.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4257
Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.
The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.
http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf
Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:
http://xorshift.di.unimi.it/#speed
The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.
Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.
We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:
1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.
2. Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.
3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.
4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.
5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.
6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:
https://en.wikipedia.org/wiki/Seqlock
The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPU == 1`). In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#372
5809 Blowaway full receive in v1 pool causes kernel panic
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5809https://github.com/illumos/illumos-gate/commit/f40b29c
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch add a module parameter spl_taskq_kick. When writing non-zero value
to it, it will scan all the taskq, if a taskq contains a task pending for more
than 5 seconds, it will be forced to spawn a new thread. This is use as an
emergency recovery from deadlock, not a general solution.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#529
6537 Panic on zpool scrub with DEBUG kernel
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
References:
https://www.illumos.org/issues/6537https://github.com/illumos/illumos-gate/commit/8c04a1f
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
6096 ZFS_SMB_ACL_RENAME needs to cleanup better
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6096https://github.com/illumos/illumos-gate/commit/8f5190a5
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
6450 scrub/resilver unnecessarily traverses snapshots created
after the scrub started
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/6450https://github.com/illumos/illumos-gate/commit/38d6103
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For a Case Insensitive file system we must avoid creating negative
entries in the dentry cache. We must also pass the FIGNORECASE into
zfs_lookup so that special files are handled correctly.
We must also prevent negative dentries from being created when files are
unlinked.
Tested by running fsstress from LTP (10 loops, 10 processes, 10,000 ops.)
Also tested with printks (now removed) to ensure that lookups come to
zpl_lookup when negative should not exist.
Tests:
1. ls Some-file.txt; touch some-file.txt; ls Some-file.txt
and ensure no errors.
2. touch Some-file.txt; rm some-file.txt; ls Some-file.txt
and ensure that the last ls shows log messages showing the lookup
went all the way to zpl_lookup.
Thanks to tuxoko for helping me get this correct.
Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4243
6527 Possible access beyond end of string in zpool comment
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/6527https://github.com/illumos/illumos-gate/commit/2bd7a8d
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
6495 Fix mutex leak in dmu_objset_find_dp
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Albert Lee <trisk@omniti.com>
References:
https://www.illumos.org/issues/6495https://github.com/illumos/illumos-gate/commit/2bad225
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6414https://github.com/illumos/illumos-gate/commit/eb5bb58
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Reintroduce a slightly adapted version of the Illumos logic for
synchronous unlinks. The basic idea here is that only files
smaller than zfs_delete_blocks (20480) blocks should be deleted
synchronously. Unlinking larger files should be handled
asynchronously to minimize impact to the caller.
To accomplish this iput() which is responsible for calling
zfs_znode_delete() on Linux is only called in the delete_now
path. Otherwise zfs_async_iput() is used which allows the
last reference to be dropped by a taskq thread effectively
making the removal asynchronous.
Porting notes:
- Add zfs_delete_blocks module option for performance analysis.
The default value is DMU_MAX_DELETEBLKCNT which is the same
as upstream. Reducing this value means that smaller files
will be unlinked asynchronously like large files.
- All occurrences of zfsvfs changes to zsb.
Ported-by: KernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since it's set to arc_c_max / 2, it must be set after arc_c_max is set.
Also added protection against it falling below 2 * maxblocksize in
userland builds.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4268
Adjusting arc_c directly is racy because it can happen in the context
of multiple threads. It should always be >= 2 * maxblocksize. Set it
to a known valid value rather than adjusting it directly.
In addition refactor arc_shrink() to a simpler structure, protect against
underflow in the calculation of the new arc_c value.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reverts: 935434efCloses: #3904Closes: #4161
Previous commit be29e6a updated kobj_read_file() so it no longer
unconditionally passes RLIM64_INFINITY. The vn_rdwr() function
needs to be updated accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #513
I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.
Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#513
4950 files sometimes can't be removed from a full filesystem
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4950https://github.com/illumos/illumos-gate/commit/4bb7380
Porting notes:
- ZoL currently does not log discards to zvols, so the portion of
this patch that modifies the discard logging to mark it as
freeing space has been discarded.
2. may_delete_now had been removed from zfs_remove() in ZoL.
It has been reintroduced.
3. We do not try to emulate vnodes, so the following lines are
not valid on Linux:
mutex_enter(&vp->v_lock);
may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp);
mutex_exit(&vp->v_lock);
This has been replaced with:
mutex_enter(&zp->z_lock);
may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped);
mutex_exit(&zp->z_lock);
Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Check if the lock is held while holding the z_hold_locks() lock.
This prevents a possible use-after-free bug for callers which are
not holding the lock. There currently are no such callers so this
can't cause a problem today but it has been fixed regardless.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4244
Issue #4124
To prevent taskq_member holding tq_lock and doing linear search, thus causing
contention. We store the taskq pointer to which the thread belongs in tsd.
This way taskq_member will not need to touch tq_lock, and tsd has per slot
spinlock. So the contention should be reduced greatly.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#500Closes#504Closes#505
The registered xattr .list handler was simplified in the 4.5 kernel
to only perform a permission check. Given a dentry for the file it
must return a boolean indicating if the name is visible. This
differs slightly from the previous APIs which also required the
function to copy the name in to the provided list and return its
size. That is now all the responsibility of the caller.
This should be straight forward change to make to ZoL since we've
always required the caller to make the copy. However, this was
slightly complicated by the need to support 3 older APIs. Yes,
between 2.6.32 and 4.5 there are 4 versions of this interface!
Therefore, while the functional change in this patch is small it
includes significant cleanup to make the code understandable and
maintainable. These changes include:
- Improved configure checks for .list, .get, and .set interfaces.
- Interfaces checked from newest to oldest.
- Strict checking for each possible known interface.
- Configure fails when no known interface is available.
- HAVE_*_XATTR_LIST renamed HAVE_XATTR_LIST_* for consistency
with similar iops and fops configure checks.
- POSIX_ACL_XATTR_{DEFAULT|ACCESS} were removed forcing callers to
move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT|ACCESS}.
Compatibility wrapper were added for old kernels.
- ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing
ZPL_XATTR_{GET|SET} WRAPPERs. Only the inode is guaranteed to be
a valid pointer, passing NULL for the 'list' and 'name' variables
is allowed and must be checked for. All .list functions were
updated to use the wrapper to aid readability.
- zpl_xattr_filldir() updated to use the .list function for its
permission check which is consistent with the updated Linux 4.5
interface. If a .list function is registered it should return 0
to indicate a name should be skipped, if there is no registered
function the name will be added.
- Additional documentation from xattr(7) describing the correct
behavior for each namespace was added before the relevant handlers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
The follow_link() interface was retired in favor of get_link().
In the process of phasing in get_link() the Linux kernel went
through two different versions. The first of which depended
on put_link() and the final version on a delayed done function.
- Improved configure checks for .follow_link, .get_link, .put_link.
- Interfaces checked from newest to oldest.
- Strict checking for each possible known interface.
- Configure fails when no known interface is available.
- Both versions .get_link are detected and supported as well
two previous versions of .follow_link.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
5045 use atomic_{inc,dec}_* instead of atomic_add_*
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5045https://github.com/illumos/illumos-gate/commit/1a5e258
Porting notes:
- All changes to non-ZFS files dropped.
- Changes to zfs_vfsops.c dropped because they were Illumos specific.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4220
4039 zfs_rename()/zfs_link() needs stronger test for XDEV
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Kevin Crowe <kevin.crowe@nexenta.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4039https://github.com/illumos/illumos-gate/commit/18e6497
Porting notes:
- This check was updated in Linux in a similar fashion early on in
the port. Therefore, this patch just reorders the function and
updates the comment so it flows the same way as the upstream code.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4218
3557 dumpvp_size is not updated correctly when a dump zvol's size is changed
3558 setting the volsize on a dump device does not return back ENOSPC
3559 setting a volsize larger than the space available sometimes succeeds
3560 dumpadm should be able to remove a dump device
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
https://www.illumos.org/issues/3559https://github.com/illumos/illumos-gate/commit/c61ea56
Porting notes:
- Internal zvol.c changes not applied due to implementation differences.
The external interface and behavior was already consistent with the
latest upstream code.
- Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check. All
supported kernels (2.6.32 and newer) provide this interface.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4217
When replacing an xattr would cause overflowing in SA, we would fallback
to xattr dir. However, current implementation don't clear the one in SA,
so we would end up with duplicated SA.
For example, running the following script on an xattr=sa filesystem
would cause duplicated "user.1".
-- dup_xattr.sh begin --
randbase64()
{
dd if=/dev/urandom bs=1 count=$1 2>/dev/null | openssl enc -a -A
}
file=$1
touch $file
setfattr -h -n user.1 -v `randbase64 5000` $file
setfattr -h -n user.2 -v `randbase64 20000` $file
setfattr -h -n user.3 -v `randbase64 20000` $file
setfattr -h -n user.1 -v `randbase64 20000` $file
getfattr -m. -d $file
-- dup_xattr.sh end --
Also, when a filesystem is switch from xattr=sa to xattr=on, it will
never modify those in SA. This would cause strange behavior like, you
cannot delete an xattr, or setxattr would cause duplicate and the result
would not match when you getxattr.
For example, the following shell sequence.
-- shell begin --
$ sudo zfs set xattr=sa pp/fs0
$ touch zzz
$ setfattr -n user.test -v asdf zzz
$ sudo zfs set xattr=on pp/fs0
$ setfattr -x user.test zzz
setfattr: zzz: No such attribute
$ getfattr -d zzz
user.test="asdf"
$ setfattr -n user.test -v zxcv zzz
$ getfattr -d zzz
user.test="asdf"
user.test="asdf"
-- shell end --
We fix this behavior, by first finding where the xattr resides before
setxattr. Then, after we successfully updated the xattr in one location,
we will clear the other location. Note that, because update and clear
are not in single tx, we could still end up with duplicated xattr. But
by doing setxattr again, it can be fixed.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#3472Closes#4153
The fast write mutex is intended to protect accounting, but it is
redundant because all accounting is performed through atomic operations.
It also serializes all metaslab IO behind a mutex, which introduces a
theoretical scaling regression that the Illumos developers did not like
when we showed this to them. Removing it makes the selection of the
metaslab_group lock free as it is on Illumos. The selection is not quite
the same without the lock because the loop races with IO completions,
but any imbalances caused by this are likely to be corrected by
subsequent metaslab group selections.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3643
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed. This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist. Therefore the locking is handled externally
with an array of mutexs and AVLs trees which contain per-object locks.
In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held. In
zfs_znode_hold_exit() the process is reversed. The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.
This scheme has two important properties:
1) No memory allocations are performed while holding one of the z_hold_locks.
This ensures evict(), which can be called from direct memory reclaim, will
never block waiting on a z_hold_locks which just happens to have hashed
to the same index.
2) All locks used to serialize access to an object are per-object and never
shared. This minimizes lock contention without creating a large number
of dedicated locks.
On the downside it does require znode_lock_t structures to be frequently
allocated and freed. However, because these are backed by a kmem cache
and very short lived this cost is minimal.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4106
Add a zfs_object_mutex_size module option to facilitate resizing the
the per-dataset znode mutex array. Increasing this value may help
make the deadlock described in #4106 less common, but this is not a
proper fix. This patch is primarily to aid debugging and analysis.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #4106
Under RHEL6/CentOS6 the default stack size must be increased to 32K
to prevent overflowing the stack when running ztest. This isn't an
issue for other distributions due to either the version of pthreads
or perhaps the compiler. Doubling the stack size resolves the
issue safely for all distribution and leaves us some headroom.
$ sudo -E ztest -V -T 300 -f /var/tmp/
5 vdevs, 7 datasets, 23 threads, 300 seconds...
loading space map for vdev 0 of 1, metaslab 0 of 30 ...
...
loading space map for vdev 0 of 1, metaslab 14 of 30 ...
child died with signal 11
Exited ztest with error 3
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4215
If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.
We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.
This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.
Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue zfsonlinux/zfs#4166
Issue zfsonlinux/zfs#4106
3749 zfs event processing should work on R/O root filesystems
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3749https://github.com/illumos/illumos-gate/commit/3cb69f7
Porting notes:
- [include/sys/spa_impl.h]
- ffe9d38 Add generic errata infrastructure
- 1421c89 Add visibility in to arc_read
- [include/sys/fm/fs/zfs.h]
- 2668527 Add linux events
- 6283f55 Support custom build directories and move includes
- [module/zfs/spa_config.c]
- Updated spa_config_sync() to match illumos with the exception
of a Linux specific block.
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5438 zfs_blkptr_verify should continue after zfs_panic_recover
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5438https://github.com/illumos/illumos-gate/commit/5897eb4
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
6171 dsl_prop_unregister() slows down dataset eviction.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/6171https://github.com/illumos/illumos-gate/commit/03bad06
Porting notes:
- Conflicts
- 3558fd7 Prototype/structure update for Linux
- 2cf7f52 Linux compat 2.6.39: mount_nodev()
- 13fe019 Illumos #3464
- 241b541 Illumos 5959 - clean up per-dataset feature count code
- dsl_prop_unregister() preserved until out of tree consumers
like Lustre can transition to dsl_prop_unregister_all().
- Fixing 'space or tab at end of line' in include/sys/dsl_dataset.h
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
6288 dmu_buf_will_dirty could be faster
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6288https://github.com/illumos/illumos-gate/commit/0f2e7d0
Porting notes:
- [module/zfs/dbuf.c]
- Fix 'warning: ISO C90 forbids mixed declarations and code'
by moving 'dbuf_dirty_record_t *dr' to start of code block.
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
6292 exporting a pool while an async destroy is running can leave
entries in the deferred tree
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Fabian Keil <fk@fabiankeil.de>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
References:
https://www.illumos.org/issues/6292https://github.com/illumos/illumos-gate/commit/a443cc8
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix build failure accidentally introduced by 1715493. This only
results in a failure when debugging is disabled.
dsl_dataset.c: In function 'dsl_dataset_stats':
dsl_dataset.c:1698:45: error: 'dp' undeclared (first use in this function)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
4929 want prevsnap property
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4929https://github.com/illumos/illumos-gate/commit/b461c74
Porting notes:
- [include/sys/fs/zfs.h]
- f67d70 Create an 'overlay' property
- 11b9ec Add full SELinux support
- [fs/zfs/dsl_dataset.c]
- This increases the stack size of dsl_dataset_stats() but
nothing has been changed until this is shown to be an issue.
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3749 zfs event processing should work on R/O root filesystems
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3749https://github.com/illumos/illumos-gate/commit/3cb69f7
Porting notes:
- [include/sys/spa_impl.h]
- ffe9d38 Add generic errata infrastructure
- 1421c89 Add visibility in to arc_read
- [include/sys/fm/fs/zfs.h]
- 2668527 Add linux events
- 6283f55 Support custom build directories and move includes
- [module/zfs/spa_config.c]
- Updated spa_config_sync() to match illumos with the exception
of a Linux specific block.
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix an off by one error introduced by fcff0f3 which triggers an
assertion when 16M blocks are used with send/recv. This fix was
intentionally not folder in to the Illumos commit so it can be
easily cherry-picked by upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When casesensitivity=insensitive is set for the
file system, we can deadlock in a rename if the user uses different case
for each path. For example rename("A/some-file.txt", "a/some-file.txt").
The simple test for this is:
1. mkdir some-dir in a ZFS file system
2. touch some-dir/some-file.txt
3. mv Some-dir/some-file.txt some-dir/some-other-file.txt
This last request deadlocks trying to relock the i_mutex on the inode for
the parent directory.
The solution is to use d_add_ci in zpl_lookup if we are on a file system
that has the casesensitivity=insensitive attribute set.
This patch checks if we are working on a case insensitive file system and if
so, allocates storage for the case insensitive name and passes it to
zfs_lookup and then calls d_add_ci instead of d_splice_alias.
The performance impact seems to be minimal even though we have introduced a
kmalloc and kfree in the lookup path.
The problem was found when running Microsoft's FSCT against Samba on top of
ZFS On Linux.
Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4136
3139 zdb dies when it tries to determine path of unlinked file
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://github.com/illumos/illumos-gate/commit/1ce39b5https://www.illumos.org/issues/3139
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The function sa_update() accepts a 32-bit length parameter and
assigns it to a 16-bit field in sa_bulk_attr_t, potentially
truncating the passed-in value. This could lead to corrupt system
attribute (SA) records getting written to the pool. Add a VERIFY to
sa_update() to detect cases where overflow would occur. The SA length
is limited to 16-bit values by the on-disk format defined by
sa_hdr_phys_t.
The function zfs_sa_set_xattr() is vulnerable to this bug if the
unpacked nvlist of xattrs is less than 64k in size but the packed
size is greater than 64k. Fix this by appropriately checking the
size of the packed nvlist before calling sa_update(). Add error
handling to zpl_xattr_set_sa() to keep the cached list of SA-based
xattrs consistent with the data on disk.
Lastly, zfs_sa_set_xattr() calls dmu_tx_abort() on an assigned
transaction if sa_update() returns an error, but the DMU only allows
unassigned transactions to be aborted. Wrap the sa_update() call in a
VERIFY0, remove the transaction abort, and call dmu_tx_commit()
unconditionally. This is consistent practice with other callers
of sa_update().
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4150
We need truncate and remove be in the same tx when doing zfs_rmnode on xattr
dir. Otherwise, if we truncate and crash, we'll end up with inconsistent zap
object on the delete queue. We do this by skipping dmu_free_long_range and let
zfs_znode_delete to do the work.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4114
Issue #4052
Issue #4006
Issue #3018
Issue #2861
During zfs_rmnode on a xattr dir, if the system crash just after
dmu_free_long_range, we would get empty xattr dir in delete queue. This would
cause blkid=0 be passed into zap_get_leaf_byblk when doing zfs_purgedir during
mount, and would try to do rw_enter on a wrong structure and cause system
lockup.
We fix this by returning ENOENT when blkid is zero in zap_get_leaf_byblk.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4114Closes#4052Closes#4006Closes#3018Closes#2861
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock. Resolve this by taking the z_teardown_lock in
all registered xattr callbacks prior to taking the z_xattr_lock.
This ensures the locks are always taken is the same order thus
preventing a deadlock. Note the z_teardown_lock is taken again
in zfs_lookup() and this is safe because the z_teardown lock is
a re-entrant read reader/writer lock.
* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
__zpl_xattr_get
zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro
* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
zfs_resume_fs
zfs_rezget -> Takes zp->z_xattr_lock
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#3943Closes#3969Closes#4121
Commit efc412b updated spa_config_write() for Linux 4.2 kernels to
truncate and overwrite rather than rename the cache file. This is
the correct fix but it should have only been applied for the kernel
build. In user space rename(2) is needed because ztest depends on
the cache file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4129
The do_div() macro expects unsigned types and this is detected in
powerpc implementation of do_div().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#516
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.
This patch uses a mutex type defined in SPL, MUTEX_NOLOCKDEP, to mark
such mutexes when they are initialized. This mutex type causes
attempts to take or release those locks to be wrapped in lockdep_off()
and lockdep_on() calls to silence the dependency checker and allow the
use of lock_stats to examine contention.
For RW locks, it uses an analogous lock type, RW_NOLOCKDEP.
The goal is that these locks are ultimately changed back to type
MUTEX_DEFAULT or RW_DEFAULT, after the locks are annotated to reflect
their relationship (e.g. z_name_lock below) or any real problem with the
lock dependencies are fixed.
Some of the affected locks are:
tc_open_lock:
=============
This is an array of locks, all with same name, which txg_quiesce must
take all of in order to move txg to next state. All default to the same
lockdep class, and so to lockdep appears recursive.
zp->z_name_lock:
================
In zfs_rmdir,
dzp = znode for the directory (input to zfs_dirent_lock)
zp = znode for the entry being removed (output of zfs_dirent_lock)
zfs_rmdir()->zfs_dirent_lock() takes z_name_lock in dzp
zfs_rmdir() takes z_name_lock in zp
Since both dzp and zp are type znode_t, the locks have the same default
class, and lockdep considers it a possible recursive lock attempt.
l->l_rwlock:
============
zap_expand_leaf() sometimes creates two new zap leaf structures, via
these call paths:
zap_deref_leaf()->zap_get_leaf_byblk()->zap_leaf_open()
zap_expand_leaf()->zap_create_leaf()->zap_expand_leaf()->zap_create_leaf()
Because both zap_leaf_open() and zap_create_leaf() initialize
l->l_rwlock in their (separate) leaf structures, the lockdep class is
the same, and the linux kernel believes these might both be the same
lock, and emits a possible recursive lock warning.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3895
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4110
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.
Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline. This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread. However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.
To resolve this issue zio_buf_alloc_flags() is introduced which
allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer. Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly. Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.
* z_wr_iss
zio_vdev_io_start
vdev_queue_io -> Takes vq->vq_lock
vdev_queue_io_to_issue
vdev_queue_aggregate
zio_buf_alloc -> Waiting on spl_kmem_cache process
* z_wr_int
zio_vdev_io_done
vdev_queue_io_done
mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss
* txg_sync
spa_sync
dsl_pool_sync
zio_wait -> Waiting on zio being handled by z_wr_int
* spl_kmem_cache
spl_cache_grow_work
kv_alloc
spl_vmalloc
...
evict
zpl_evict_inode
zfs_inactive
dmu_tx_wait
txg_wait_open -> Waiting on txg_sync
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3808Closes#3867
For earlier versions of the kernel with memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.
Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.
This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#515
Issue zfsonlinux/zfs#4111
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock. Detect this case and return EBUSY so zfs_resume_fs()
will mark the inode stale and it can be safely revalidated on next
access.
* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
__zpl_xattr_get
zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro
* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
zfs_resume_fs
zfs_rezget -> Takes zp->z_xattr_lock
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#3969
This patch provides 2 new kstats to display task queues:
/proc/spl/taskqs-all - Display all task queues
/proc/spl/taskqs - Display only "active" task queues
A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.
If the task queue has running threads, displays each thread function's
address (symbolically, if possibly) and its argument.
If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and arguemnt.
If the task queue has any waiters, displays each waiting task's pid.
Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
Since uio now supports bvec, we can convert bio into uio and reuse
dmu_{read,write}_uio. This way, we can remove some duplicate code.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4078
Userspace can freely pass in whatever iovec it feels like, and it's perfectly
legal to pass an iovec which contains a zero length segment. In the current
implementation, uio_prefaultpages would touch an out of bound byte in the
"last byte" logic. While this probably wouldn't cause any critical error, we
would like uio_prefaultpages to be able to continue gracefully.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4078
If a bit were cleared in `bp->blk_birth` such that the txg birth
was now lower than any other txg_birth in the deadlist, then there
will be no entry before this in the tree.
This should be impossible but regardless error handling code has
been added for this case. By default this is left as a fatal case
and the blk_birth is logged. However, setting `zfs_recover=1` will
cause the bp to be placed at the start of the deadlist even though
it contains an invalid blk_birth.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4086Closes#4089
Commit 5f6d0b6 was originally added to gracefully handle block
pointers with a damaged logical size. However, it incorrectly
assumed that all passed arc_done_func_t could handle a NULL
arc_buf_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4069Closes#4080
While exceptionally unlikely to cause a problem the zfs_snapentry_t
hold should be taken before the dispatch to prevent any possibility
of the task being processed before the hold.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
When a concorrent mount finishes just before calling to
zfsctl_snapshot_ismounted, if we return EISDIR, the VFS will return
with EREMOTE. We should instead just return 0, so VFS may retry and
would likely notice the dentry is alreadly mounted. This will be
inline with when usermode helper return EBUSY.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By changing the zfs_snapshot_lock from a mutex to a rw lock the
zfsctl_lookup_objset() function can be allowed to run concurrently.
This should reduce the latency of fh_to_dentry lookups in ZFS
snapshots which are being accessed over NFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
The zfsctl_snapshot_unmount_delay() function must not be called
from zfsctl_lookup_objset() while it is currently holding the
zfs_snapshot_lock. This will result in a deadlock. It is safe
to call zfsctl_snapshot_unmount_delay_impl() directly because the
function already has a reference on the zfs_snapentry_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#3997
There are cases where it's desirable that auto-mounted snapshots
not expire after a fixed duration. They should be unmounted only
when the filesystem they are a snapshot of is unmounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
This patch only addresses the issues identified by the style checker.
It contains no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The flags argument in spin_lock_irqsave is modified out side of spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for them. This
patch removes it and uses local variable for the flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#506
When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking. This is
a false positive.
One such call chain is as follows, when a taskq needs more threads:
taskq_dispatch->taskq_thread_spawn->taskq_dispatch
The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads. The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock. Without subclassing, lockdep believes these
could potentially be the same lock and complains. A similar case occurs
when taskq_dispatch() then calls task_alloc().
This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:
subclass taskq
TQ_LOCK_DYNAMIC dynamic_taskq
TQ_LOCK_GENERAL any other
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
This reverts commit a430c11f0b. Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500
objsetid is not unique across pool, so using it solely as key would cause
panic when automounting two snapshot on different pools with the same
objsetid. We fix this by adding spa pointer as additional key.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3948
Issue #3786
Issue #3887
The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame. This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.
This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#500
While stack size will vary by architecture it has historically defaulted to
8K on x86_64 systems. However, as of Linux 3.15 the default thread stack
size was increased to 16K. These kernels are now the default in most non-
enterprise distributions which means we no longer need to assume 8K stacks.
This patch takes advantage of that fact by appropriately reverting stack
conservation changes which were made to ensure stability. Changes which
may have had a negative impact on performance for certain workloads. This
also has the side effect of bringing the code slightly more in line with
upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4059
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5959https://github.com/illumos/illumos-gate/commit/ca0cc39
Porting notes:
illumos code doesn't check for feature_get_refcount() returning
ENOTSUP (which means feature is disabled) in zdb. zfsonlinux added
a check in https://github.com/zfsonlinux/zfs/commit/784652c
due to #3468. The check was reintroduced here.
Ported-by: Witaut Bajaryn <vitaut.bayaryn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3965
Provide a generic interface to prefetch ZAP entries by name. This
functionality is being added for external consumers such as Lustre.
It is based of the existing zap_prefetch_uint64() version which is
used by the deduplication code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4061
If a vnode is released asynchronously through areleasef(), it is
possible for the user process to reuse the file descriptor before
areleasef is called. When this happens, getf() will return a stale
reference, any operations in the kernel on that file descriptor will
fail (as it is closed) and the operations meant for that fd will
never occur from userspace's perspective.
We correct this by detecting this condition in getf(), doing a putf
on the old file handle, updating the file descriptor and proceeding
as if everything was fine. When the areleasef() is done, it will
harmlessly decrement the reference counter on the Illumos file handle.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#492
This was originally in fe0ed8f910, but somehow
was changed and not working anymore. And it will cause the following error:
modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4027
This was originally in e80cd06b8e, but somehow
was changed and not working anymore. And it will cause the following error:
modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#501
This is needed for architectures that do not have a builtin prefetchw()
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#502
The xattr_hander->{list,get,set} were changed to take a xattr_handler,
and handler_flags argument was removed and should be accessed by
handler->flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
On 32 bit, the calculation of zfs_dirty_data_max from phymem will overflow,
causing it to be smaller than zfs_dirty_data_sync, and will cause txg being
delayed while no one write to disk. The end result is horrendous write speed.
On 4G ram 32-bit VM, before this patch, simple dd results in ~7MB/s. Now it
can reach speed on par with 64-bit VM.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3973
On 32 bit system, zio_buf_cache is limit to 1M. Larger than that is all NULL.
So we need to avoid reaping them.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3973
When concurrent threads accessing the snapdir, one will succeed the user
helper mount while others will get EBUSY. However, the original code treats
those EBUSY threads as success and goes on to do zfsctl_snapshot_add, which
causes repeated avl_add and thus panic.
Also, if the snapshot is already mounted somewhere else, a thread accessing
the snapdir will also get EBUSY from user helper mount. And it will cause
strange things as doing follow_down_one will fail and then follow_up will jump
up to the mountpoint of the filesystem and confuse the hell out of VFS.
The patch fix both behavior by returning 0 immediately for the EBUSY threads.
Note, this will have a side effect for the second case where the VFS will
retry several times before returning ELOOP.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4018
Because errors during module load are so rare it went unnoticed that
it was possible that a positive errno was returned. This would result
in the module being loaded, nothing being initialized, and a system
panic shortly thereafter. This is what was causing the hard failures
in the automated testing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Limit the maximum object size to 1/128 of total system memory for
the kmem cache tests. Large values can result in out of memory errors
for systems with less the 512M of memory. Additionally, use the
known number of objects per-slab for calculating the number of
objects to use for a test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit. Otherwise, the arc_meta_limit
may be set the same as arc_max.
Signed-off-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4001
Currently taskq_dispatch() will spawn new task with a condition that the caller
is also a member of the taskq. However, under this condition, it will still
cause deadlock where a task on tq1 is waiting another thread, who is trying to
dispatch a task on tq1. So this patch removes the check.
For example when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy
This will easily deadlock before this patch.
Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#496
Linux slab will automatically free empty slab when number of partial slab is
over min_partial, so we don't need to explicitly shrink it. In fact, calling
kmem_cache_shrink from shrinker will cause heavy contention on
kmem_cache_node->list_lock, to the point that it might cause __slab_free to
livelock (see zfsonlinux/zfs#3936)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3936
Closes#487
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().
Since we never call insert_inode_locked in fail path, I_NEW is never set, the
inode is never hashed. So unlock_new_inode() can be safely remove it.
We set z_sa_hdl to NULL in fail path so that iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent double sa_handle_destroy().
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
Currently, vdev_disk_physio_completion will try to wake up an waiter without
first checking the existence. This creates a race window in which complete is
called after dr is freed.
We add dr_wait in dio_request to indicate the existence of waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact in 64bit.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3917
Issue #3880
Allocate a kmem cache magazine for every possible CPU which might
be added to the system. This ensures that when one of these CPUs
is enabled it can be safely used immediately.
For many systems the number of online CPUs is identical to the
number of present CPUs so this does imply an increased memory
footprint. In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#482
Strictly enforce keeping 'arc_c >= arc_c_min'. The ASSERTs are
left in place to catch this in a debug build but logic has been
added to gracefully handle in a production build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3904
We should never block when holding a spin lock, but zfs_inode_update can
block in the critical section of a spin lock in zfs_inode_update:
zfs_inode_update -> dmu_object_size_from_db -> zrl_add -> mutex_enter
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
All users of zv_lock were removed by 37f9dac, but we forgot to remove
it. Lets remove it as clean up.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
When doing uioskip to skip an iovec to the very end, the current loop
condition will falsely check pass the end of iovec. We fix this checking
uio_iovcnt first.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3806Closes#3850
Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS are enabled. When this kernel option
is enabled slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.
Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.
Verified-by: Wuffleton <null@wuffleton.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2977
Issue #3796
Userspace can trigger an assertion by passing a zero-length segment
when assertions are enabled:
[27961.614792] VERIFY3(skip < iov->iov_len) failed (0 < 0)
[27961.614795] PANIC at zfs_uio.c:187:uio_prefaultpages()
[27961.614805] Call Trace:
[27961.614811] dump_stack+0x45/0x57
[27961.614830] spl_dumpstack+0x44/0x50 [spl]
[27961.614834] spl_panic+0xbb/0x100 [spl]
[27961.614908] uio_prefaultpages+0x134/0x140 [zcommon]
[27961.614930] zfs_write+0x1fd/0xe80 [zfs]
[27961.615014] zpl_write_common_iovec+0x7f/0x110 [zfs]
[27961.615035] zpl_iter_write+0xa0/0xd0 [zfs]
[27961.615037] do_iter_readv_writev+0x59/0x80
[27961.615063] do_readv_writev+0x11b/0x260
[27961.615098] vfs_writev+0x39/0x50
[27961.615100] SyS_writev+0x4a/0xe0
[27961.615103] system_call_fastpath+0x16/0x6e
The solution is to delete the assertion. This could potentially
occur in uiomove as well, which contains analogous assertions
that appear similarly unnecessary, so we remove those as well.
Reported-by: Jonathan Vasquez <jvasquez1011@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3792
This reverts commit 5f8e1e8505. It
was determined that this patch introduced the quota regression
described in #3789.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
Issue #3789
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in. This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.
This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in a RCU stalls when a disk
was removed from the system. Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.
This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.
Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
As described in the comment above arc_reclaim_thread() it's critical
that the reclaim thread be careful about blocking. Just like it must
never wait on a hash lock, it must never wait on a task which can in
turn wait on the CV in arc_get_data_buf(). This will deadlock, see
issue #3822 for full backtraces showing the problem.
To resolve this issue arc_kmem_reap_now() has been updated to use the
asynchronous arc prune function. This means that arc_prune_async()
may now be called while there are still outstanding arc_prune_tasks.
However, this isn't a problem because arc_prune_async() already
keeps a reference count preventing multiple outstanding tasks per
registered consumer. Functionally, this behavior is the same as
the counterpart illumos function dnlc_reduce_cache().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3808
Issue #3834
Issue #3822
The zpl_nr_cached_objects() function has been disabled because in the
current code it doesn't provide any critical functionality and it may
result in a deadlock under certain circumstances. However, because
we expect to need these hooks in the future this code has not been
entirely removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3719
Accessing a snapshot via NFS should cause an auto-unmount of that
snapshot to be deferred until such as time as the snapshot is idle.
This is analogous to the zpl_revalidate logic employed by locally
mounted snapshots.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3794
Commit torvalds/linux@4246a0b63b
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.
bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592 ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Issue #3799
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.
`zpios ... -S size | --blocksize size ...`
This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`
Example run below
(non realistic results from a VM and output abbreviated for space)
```
--regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
--threaddelay=0 --cleanup --human-readable --verbose --cleanup
--blocksize=128K,256K,512K,1M
th-cnt rg-cnt rg-sz ch-sz blksz wr-data wr-bw rd-data rd-bw
---------------------------------------------------------------------
4 750 8m 1m 128k 5g 90.06m 5g 93.37m
4 750 8m 1m 256k 5g 79.71m 5g 99.81m
4 750 8m 1m 512k 5g 42.20m 5g 93.14m
4 750 8m 1m 1m 5g 35.51m 5g 89.36m
8 750 8m 1m 128k 5g 85.49m 5g 90.81m
8 750 8m 1m 256k 5g 61.42m 5g 99.24m
8 750 8m 1m 512k 5g 49.09m 5g 108.78m
16 750 8m 1m 128k 5g 86.28m 5g 88.73m
16 750 8m 1m 256k 5g 64.34m 5g 93.47m
16 750 8m 1m 512k 5g 68.84m 5g 124.47m
16 750 8m 1m 1m 5g 53.97m 5g 97.20m
---------------------------------------------------------------------
```
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3795Closes#2071
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:
* Libzfs has historically included "xattr" in its list of default mount
options. This overrides the dataset property, so the dataset is always
configured to use directory-based xattrs even when the xattr dataset
property is set to off or sa. Address this by removing "xattr" from
the set of default mount options in libzfs.
* There was no way to enable system attribute-based extended attributes
using temporary mount options. Add the mount options "saxattr" and
"dirxattr" which enable the xattr behavior their names suggest. This
approach has the advantages of mirroring the valid xattr dataset
property values and following existing conventions for mount option
names.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3787
Passing NULL for the mount data should not result in EINVAL. It
should be treated as if an empty string were passed and succeed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3771
37f9dac592 replaced the end-start
calculation with a cached value, but neglected to update it on discard
operations. This can cause us to discard data not requested, causing
data loss on zvols.
Reported-by: Richard Connon <richard.connon@zynstra.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3798
6214 zpools going south
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
References:
https://www.illumos.org/issues/6214http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/
Porting Notes:
Reintroduce b_compress to the l2arc_buf_hdr_t. In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t. This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held. See Illumos 6214
for a detailed analysis of the race.
HDR_GET_COMPRESS() macro was removed from arc_buf_info().
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3757
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume. Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes#474
zfsonlinux/zfs@e20cd6f7a8 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503b provided
suitable symbols for restoring this functionality 4 months later. We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3741
Internally ZFS keeps a small log to facilitate debugging. By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.
$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp message
1441141408 spa=tank async request task=1
1441141408 txg 70 open pool version 5000; software version 5000/5; ...
1441141409 spa=tank async request task=32
1441141409 txg 72 import pool version 5000; software version 5000/5; ...
1441141414 command: lt-zpool import tank
Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log. As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.
$ ZFS_DEBUG=on ./cmd/ztest/ztest -V
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3728
This patch is based on the previous work done by @andrey-ve and
@yshui. It triggers the automount by using kern_path() to traverse
to the known snapshout mount point. Once the snapshot is mounted
NFS can access the contents of the snapshot.
Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server. To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally. System administators desiring this
functionally must explicitly enable it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2797Closes#1655Closes#616
Prevents NFS client from detection of different fileids of snapshot root dentry
before & after snapshot mount.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Reclaim in the traverse prefetch thread, which is run on the system
taskq, can overrun the stack.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3733
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#985Closes#3351
Commit 49ddb31506 added the
zfs_arc_average_blocksize parameter to allow control over the size of
the arc hash table. The dbuf hash table's size should be determined
similarly.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3721
Allow for easy turning of a pools reserved free space. Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity
in reserve. Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3724
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3712
Before read locking z_teardown_inactive_lock, we need to check if we have
already had write lock on it. Otherwise, we would deadlock on ourself when
doing rollback:
zfs_ioc_rollback
->zfs_suspend_fs (z_teardown_inactive_lock, RW_WRITER)
->zfs_resume_fs->zfs_rezget->zfs_iput_async->iput-> ...
->zfs_inactive (z_teardown_inactive_lock, RW_READER)
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2869
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#472
This reverts commit 076821e due to a locking issue uncovered in
subsequent testing. An ASSERT is hit due to tq->tq_nspawn being
updated outside the lock. The patch will need to be reworked.
VERIFY3(0 == tq->tq_nspawn) failed (0 == -1)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #472
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#472
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels. And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients. This patch makes
the following core improvements.
* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.
* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process. This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.
* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.
* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds. This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.
* Snapshots in active use will not be automatically unmounted. As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.
* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated. This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots. This patch modifies the
behavior such that only negative dentries are invalidated. This solves
the issue and may result in a performance improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3589Closes#3344Closes#3295Closes#3257Closes#3243Closes#3030Closes#2841
There's no metadata to write to disk for ctldir inodes. So we check if
a inode belongs to the ctldir in zpl_commit_metadata, and returns
immediately if it is.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2797
zvols should not be an entropy source for the kernel. Disable it to be
consistent with the upstream kernel.
torvalds/linux@b277da0a8a
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3713
Add a missing space to the zfs_vdev_sync_write_min_active module
parameter description.
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3714
As part of the stack reduction effort in
50b25b2187, a zio_t containing a taskq_ent
was added to struct vdev_queue which itself is part of struct vdev.
The taskq entry should be initialized as is currently done in zio_create()
for newly-created bare zio_t object. The rationale is the same as is
described in f467b05a26.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3709
If the ZAP object containing a snapshot map is corrupted due to an
unrecoverable checksum error or otherwise, dsl_dataset_name() will
normally panic the system due to its VERIFY.
This patch attempts to allow a recovery avenue from such situations by
manufacturing a descriptive snapshot name and then ignoring the error.
Scrubbing a pool with this type of corruption will then show the affected
object in the error list rather than panicking.
The recovery code is only enabled when the zfs_recover module parameter
is set.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3705
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create. Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.
In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3591
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.
This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
64K array for the dbufs. The solution is to restore the maximum
size to ~10M. This patch specifically changes it to 16M which is
close enough.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3684
Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into
iter_read/iter_write. Notably, the loop device will pass bio_vec to backend
filesystem. However, current ZFS code assumes iovec without any check, so it
will always crash when using loop device.
With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC
set. The uio* functions are modified to handle bio_vec case separately.
The const uio_iov causes some warning in xuio related stuff, so explicit
convert them to non const.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3511Closes#3640
The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it in to place. This way there can never
exist a partial version of the file, it's all or nothing.
Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms. Even doing
basic I/O to a file system from within the kernel is strongly
discouraged. In order to support this at all the vn_rename()
implementation ends up being complex and fragile. So fragile
that recent Linux 4.2 changes have broken it.
While it is possible to update vn_rename() to work with the
latest kernels a better long term strategy is to stop using
vn_rename() entirely. Then all this complex, fragile code can
be removed. Achieving this is straight forward because
config_write() is the only consumer of vn_rename().
This patch reworks spa_config_write() to update the cache file
in place. The file will be truncated, written out, and then
synced to disk. If an error is encountered the file will be
unlinked leaving the system in a consistent state.
This does expose a tiny tiny tiny window where a system could
crash at exactly the wrong moment could leave a partially written
cache file. However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.
If this were to somehow happen it poses no risk to pool. Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.
Bottom line not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3653
Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error. Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface. As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.
vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3653
Failing to lookup a name in the spa_ddt_stat_object should not result
in a panic in ddt_object_load(). The error can be safely returned to
the caller for handling resulting in a useful user error message.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3370
Without the parenthesis, this particular ASSERT will evaluate to
"(RW_READER == (!zap->zap_ismicro && fatreader)) ? RW_READER : lti"
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3685
This brings the behavior of arc_memory_throttle() back in sync with
illumos. The updated memory throttling policy roughly goes like this:
* Never throttle if more than 10% of memory is free. This threshold
is configurable with the zfs_arc_lotsfree_percent module option.
* Minimize any throttling of kswapd even when free memory is below
the set threshold. Allow it to write out pages as quickly as
possible to help alleviate the memory pressure.
* Delay all other threads when free memory is below the set threshold
in order to avoid compounding the memory pressure. Buffers will be
evicted from the ARC to reduce the issue.
The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3637
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages. This information should
be incorporated in to the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim. Conceptually
this brings arc_available_memory() back in sync with illumos.
It is also desirable that the target amount of free memory be tunable
on a system. While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.
zfs_arc_sys_free - The target number of bytes the ARC should leave
as free memory on the system. This value can
checked in /proc/spl/kstat/zfs/arcstats and
setting this module option will override the
default value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3637
The zvol_threads module option should be bounded to a reasonable
range. The taskq must have at least 1 thread and shouldn't have
more than 1,024 at most. The default value of 32 is a reasonable
default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3614
Commit d958324 fixed the deadlock between page lock and range lock by
unlocking the page lock before acquiring the range lock. However,
this created a new issue #3075.
The problem is that if we can't set the write back bit before releasing
the page lock. Then other processes will be unaware that the page is
under active write back. They may therefore truncate the page,
invalidate the page, or not honor the sync semantics.
To workaround this problem we re-dirty the page before dropping the
page lock. While this doesn't prevent the page from being truncated
it does ensure it won't be invalidated. Then the range lock and the
page lock are reacquired in the correct deadlock-free order.
Once both locks are safely held the page state can be rechecked. If
all is well and the page is in the expect state the dirty bit can be
removed, the write back bit set, and the page removed from the skip
count. If not the page will be handled as appropriate.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3075
On Linux the meaning of a processes priority is inverted with respect
to illumos. High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
Note this change also reverts some of the priorities changes in prior
commit 62aa81a. The rational is as follows:
spl_kmem_cache - High priority may result in blocked memory allocs
spl_system_taskq - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority. Non-I/O filesystem
processes run with the default priority. ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active. The priorities have been updated to the following:
$ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
- TS 10743 19 -20 [spl_kmem_cache]
- TS 10744 19 -20 [spl_system_task]
- TS 10745 19 -20 [spl_dynamic_tas]
- TS 10764 19 0 [dbu_evict]
- TS 10765 19 0 [arc_prune]
- TS 10766 19 0 [arc_reclaim]
- TS 10767 19 0 [arc_user_evicts]
- TS 10768 19 0 [l2arc_feed]
- TS 10769 39 0 [z_unmount]
- TS 10770 39 -20 [zvol]
- TS 11011 39 -20 [z_null_iss]
- TS 11012 39 -20 [z_null_int]
- TS 11013 39 -20 [z_rd_iss]
- TS 11014 39 -20 [z_rd_int_0]
- TS 11022 38 -19 [z_wr_iss]
- TS 11023 39 -20 [z_wr_iss_h]
- TS 11024 39 -20 [z_wr_int_0]
- TS 11032 39 -20 [z_wr_int_h]
- TS 11033 39 -20 [z_fr_iss_0]
- TS 11041 39 -20 [z_fr_int]
- TS 11042 39 -20 [z_cl_iss]
- TS 11043 39 -20 [z_cl_int]
- TS 11044 39 -20 [z_ioctl_iss]
- TS 11045 39 -20 [z_ioctl_int]
- TS 11046 39 -20 [metaslab_group_]
- TS 11050 19 0 [z_iput]
- TS 11121 38 -19 [z_wr_iss]
Note that under Linux the meaning of a processes priority is inverted
with respect to illumos. High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
This patch depends on https://github.com/zfsonlinux/spl/pull/466.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3607
A NULL should never be passed as the dnode_t pointer to the function
dmu_free_long_range_impl(). Regardless, because we have a reported
occurrence of this let's add some error handling to catch this.
Better to report a reasonable error to caller than panic the system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3445
As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy. They are not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.
Removing this code is desirable because there is some evidence that
contention on this atomic negative impacts performance on large-scale
NUMA systems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority. Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority. This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.
All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri. The vast majority of callers were
part of the test suite which won't have an external impact. The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri. This makes it more likely the process will be scheduled
which may help performance.
To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added. When disabled (0) all newly created kernel
threads will use the default kernel thread priority. When enabled (1)
the specified taskq priority will be used. By default this value is
enabled (1).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Address minor differences in style between upstream and ZoL. This
patch contains no functional differences and is solely designed to
minimize the delta from upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
Commit d962d5d didn't quite properly resolve the HDR_L2ONLY_SIZE
accounting. Accounting is now performed only in the constructor
and destructor which is a nice simplification. It should have
been removed the from create and destroy functions. This brings
up back in sync with upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
Originally removed because it wasn't required under Linux. However,
there may still be some utility in signaling the arc reclaim thread
under Linux via reclaim. This should already have happened by other
means but it's not harmless and reduces another point of divergence
with upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
Commit f521ce1 removed the minimum value for "arc_p" allowing it to
drop to zero or grow to "arc_c". This was done to improve specific
workload which constantly dirties new "metadata" but also frequently
touches a "small" amount of mfu data (e.g. mkdir's).
This change may still be desirable but it needs to be re-investigated.
in the context of the recent ARC changes from upstream. Therefore
this code is being restored to facilitate benchmarking. By setting
"zfs_arc_p_min_shift=64" we easily compare the performance.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5817https://github.com/illumos/illumos-gate/commit/2fd872a
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
5445 Add more visibility via arcstats; specifically arc_state_t
stats and differentiate between "data" and "metadata"
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5445https://github.com/illumos/illumos-gate/commit/4076b1b
Porting Notes:
This patch is an improved version of cc7f677 which was previously
merged in ZoL. This patch incorporates the additional improvements
which were made upstream.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5376https://github.com/illumos/illumos-gate/commit/2ec99e3
Porting Notes:
The good news is that many of the recent changes made upstream to the
ARC tackled issues previously observed by ZoL with similar solutions.
The bad news is those solution weren't identical to the ones we applied.
This patch is designed to split the difference and apply as much of the
upstream work as possible.
* The arc_available_memory() function was removed previous in ZoL but
due to the upstream changes it makes sense to add it back. This function
has been customized for Linux so that it can be used to determine a low
memory. This provides the same basic functionality as the illumos version
allowing us to minimize changes through the rest of the code base. The
exact mechanism used to detect a low memory state remains unchanged so
this change isn't a significant as it might first appear.
* This patch includes the long standing fix for arc_shrink() which was
originally proposed in #2167. Since there were related changes to this
function it made sense to include that work.
* The arc_init() function has been re-factored. As before it sets sane
default values for the ARC but then calls arc_tuning_update() to apply
user specific tuning made via module options. The arc_tuning_update()
function is then called periodically by the arc_reclaim_thread() to
apply changes to the tunings made during normal operation.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3616Closes#2167
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:
Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,
Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issues #463
When an inode is detected with invalid mode bits the safe thing to
do is panic the system. This indicates a problem with the contents
of a dnode and it should never be possible. This is the default
behavior.
Unfortunately, due to flaws in the system attribute (SA) implementation
(on all platforms) it was possible that ZFS could create a damaged dnode.
This was a rare issue which only impacted dnodes which used a spill
block. Normally only symlinks and files with ACLs would require a
spill block. However, if the dataset had the xattr=sa property set
and extended attributes were used this problem could occur.
As of the 0.6.4 tag the root cause of this issue has been fixed. For
pools which are exhibiting this damage the 'zfs_recover=1' module option
may be set. This will cause ZFS to interpret the dnode with invalid
mode bits as a normal file. This may allow the files to be accessed
for recovery purposes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3548
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure \
--with-spl=$HOME/src/git/spl/ \
--with-spl-obj=$HOME/src/git/spl/build
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1082
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082
After a successful write the inode must be updated under the range
lock. If it is updated after dropping the lock there exists a race
where the znode and inode wile disagree about the file size. This
could result in narrow window of time where read(2) is able to access
data beyond what fstat(2) reports as the file size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3601
As of Linux 4.2 the kernel has completely retired the nameidata
structure. One of the few remaining consumers of this interface
were the follow_link() and put_link() callbacks.
This patch adds the required checks to configure to detect the
interface change and updates the functions accordingly. Migrating
to the simple_follow_link() interface was considered but was decided
against ironically due to the increased complexity.
It also should be noted that the kernel follow_link() and put_link()
interfaces changes several times after 4.1 and but before 4.2. This
means there is a narrow range of kernel commits which never appear
in an official tag of the Linux kernel which ZoL will not build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3596
Linux 4.2 commit torvalds/linux@dac5621 renamed bio->bi_cnt to
bio->__bi_cnt. Because this value is only used once in a block of
debug code it simplest just to remove the PANIC. To my knowledge
this debugging has never been hit or proved useful so this is no
great loss.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#3596
Many key internal functions pass system return codes that are safe to
return to userland. In the case of ddi_copyin(9F), an error passes -1
and the documentation states very clearly that drivers should pass
EFAULT to userland when this happens.
http://illumos.org/man/9F/ddi_copyin
This does not happen in the ZFS source code. I believe it should be
changed to pass EFAULT. I caught this when writing man pages for the
libzfs_core API.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3575
Translate zio requests with ZIO_PRIORITY_SYNC_READ and
ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting
READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns
out to have a pronounced effect when writing to an SSD-based SLOG.
When WRITE_SYNC is not set (WRITE is set instead), the block trace
for a SLOG device looks as follows:
...
130,96 0 3 0.008968390 0 C W 830464 + 136 [0]
130,96 0 4 0.011999161 0 C W 830720 + 136 [0]
130,96 0 5 0.023955549 0 C W 831744 + 136 [0]
130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952
130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6]
130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6]
130,96 0 9 0.024343187 19775 P N [z_wr_iss/6]
130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6]
130,96 0 11 0.026784405 0 UT N [swapper] 1
130,96 0 12 0.026805339 202 U N [kblockd/0] 1
130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0]
130,96 0 14 0.026966948 0 C W 832000 + 136 [0]
130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904
130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19]
130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19]
130,96 3 4 0.000455956 19788 P N [z_wr_iss/19]
130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19]
130,96 3 6 0.002786349 0 UT N [swapper] 1
...
Here the 130,197 is the partition created on the log device when adding it
to the pool, whereas the base device is 130,96. As one can see, the writes
to the SLOG are not marked synchronous (the S is missing next to W), and
the queue unplugs occur based on the timer (UT event) resulting in slightly
over 2 msec latency of writes. This results in a sub-par performance of
single stream synchronous writes (limited by latency of the SLOG).
When the WRITE_SYNC is set, a similar trace looks as follows:
...
130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528
130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)]
130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)]
130,96 4 4 0.000003394 70714 P N [(null)]
130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)]
130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)]
130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784
130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)]
130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)]
130,96 5 4 0.000357635 70713 P N [(null)]
130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)]
130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)]
130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0]
130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040
130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15]
130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15]
130,96 16 88 0.000745817 70718 P N [z_wr_iss/15]
130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15]
130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15]
130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0]
130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0]
As one can see, all the writes are synchronous (WS), and I/O completions
(e.g. from issue I to completion C) take 160-250 usec, or about 10x faster.
Since WRITE_SYNC or READ_SYNC flags are among several factors that are
considered when processing bio requests, it seems prudent to mark all the
zio requests of synchronous priority with the READ/WRITE_SYNC flags to make
them eligible for consideration as such by the Linux block I/O layer.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3529
As of gcc version 5.1.1 a new warning has been added to detect the
use of a boolean in a switch statement (-Wswitch-bool). Resolve the
warning by explicitly casting the value to an integer type.
zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request':
error: switch condition has boolean value [-Werror=switch-bool]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://github.com/illumos/illumos-gate/commit/db1741fhttps://www.illumos.org/issues/5661
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3571
Reclaim during metaslab preloading can cause deadlocks involving znode
z_lock and ARC buffer header ht_lock.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3532.
5008 lock contention (rrw_exit) while running a read only load
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Porting notes:
This patch ported perfectly cleanly to ZoL. During testing 100% cached
small-block reads, extreme contention was noticed on rrl->rr_lock from
rrw_exit() due to the frequent entering and leaving ZPL. Illumos picked
up this patch from FreeBSD and it also helps under Linux.
On a 1-minute 4K cached read test with 10 fio processes pinned to a single
socket on a 4-socket (10 thread per socket) NUMA system, contentions on
rrl->rr_lock were reduced from 508799 to 43085.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3555
5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots
5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
References:
https://www.illumos.org/issues/5946https://www.illumos.org/issues/5945https://github.com/illumos/illumos-gate/commit/24218be
Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3552
5909 ensure that shared snap names don't become too long after promotion
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5909https://github.com/illumos/illumos-gate/commit/cb5842f
Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3550
5912 full stream can not be force-received into a dataset if it has a snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5912https://github.com/illumos/illumos-gate/commit/5bae108
Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3549
5175 implement dmu_read_uio_dbuf() to improve cached read performance
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5175https://github.com/illumos/illumos-gate/commit/f8554bb
Porting notes:
This patch doesn't include the changes for the COMSTAR (Common
Multiprotocol SCSI Target) - since it's not available for ZoL.
http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html
Ported by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3392
5368 ARC should cache more metadata
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5368https://github.com/illumos/illumos-gate/commit/3a5286a
Porting Notes:
The vast majority of this patch was already merged in the context
of the 06358ea changes. This is just a small hunk which was missed.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
5163 arc should reap range_seg_cache
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5163https://github.com/illumos/illumos-gate/commit/83803b5
Porting Notes:
Added umem_cache_reap_now() wrapped to suppress unused variable
warning for user space build in arc_kmem_reap_now().
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Over the years the default values for the taskqs used on Linux have
differed slightly from illumos. In the vast majority of cases this
was done to avoid creating an obnoxious number of idle threads which
would pollute the process listing.
With the addition of support for dynamic taskqs all multi-threaded
queues should be created as dynamic taskqs. This allows us to get
the best of both worlds.
* The illumos default values for the I/O pipeline can be restored.
These values are known to work well for most workloads. The only
exception is the zio write interrupt taskq which is changed to
ZTI_P(12, 8). At least under Linux more threads has been shown
to improve performance, see commit 7e55f4e.
* Reduces the number of idle threads on the system when it's not
under heavy load. The maximum number of threads will only be
created when they are required.
* Remove the vdev_file_taskq and rely on the system_taskq instead
which is now dynamic and may have up to 64-threads. Again this
brings us back inline with upstream.
* Tasks dispatched with taskq_dispatch_ent() are allowed to use
dynamic taskqs. The Linux taskq implementation supports this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3507
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.
The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a9.
The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.
Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size. The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.
The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size. This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.
Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size. The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a9
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).
Porting Notes:
Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.
References:
https://reviews.csiden.org/r/229/https://reviews.freebsd.org/D2764
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3400Closes#3433Closes#3451
Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs
to reduce the number of idle threads on the system. Additional
threads will be created on demand up to the previous maximum
thread counts. This should have minimal, if any, impact on
performance.
This makes the system taskq consistent with illumos which is
always created as a dynamic taskq with up to 64 threads.
The task limits for the kmem_cache have been increased to avoid
any unnessisary throttling and to keep a larger reserve of
task_t structures on the free list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#458
Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics. Initially only a single worker thread will be created
to service tasks dispatched to the queue. As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'. When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.
Due to the low cost of creating and destroying threads under Linux
by default new threads and spawned and terminated aggressively.
There are two modules options which can be tuned to adjust this
behavior if needed.
* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which needed to be handled by a worker
thread before a new worker thread is spawned. Default 4.
* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system. This is provided
for the purposes of debugging and troubleshooting. Default 1
(enabled).
This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#458
Unit testing at ClusterHQ found that passing an invalid file handle to
zfs_ioc_hold results in a NULL pointer dereference on a system without
assertions:
IP: [<ffffffffa0218aa0>] zfsdev_getminor+0x10/0x20 [zfs]
Call Trace:
[<ffffffffa021b4b0>] zfs_onexit_fd_hold+0x20/0x40 [zfs]
[<ffffffffa0214043>] zfs_ioc_hold+0x93/0xd0 [zfs]
[<ffffffffa0215890>] zfsdev_ioctl+0x200/0x500 [zfs]
An assertion would have caught this had they been enabled, but this is
something that the kernel module should handle without failing. We
resolve this by searching the linked list to ensure that the file
handle's private_data points to a valid zfsdev_state_t.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3506
This seems generally useful. metaslab_aliquot is the ZFS allocation
granularity, which is roughly equivalent to what is called the stripe
size in traditional RAID arrays. It seems relevant to performance
tuning.
Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The metaslab allocator device selection algorithm contains a bias
mechanism whose goal is to achieve roughly equal disk space usage across
all top-level vdevs.
It seems that the initial rationale for this code was to allow newly
added (empty) vdevs to "come up to speed" faster in an attempt to make
the pool quickly converge to a steady state where all vdevs are equally
utilized.
While the code seems to work reasonably well for this use case, there
is another scenario in which this algorithm fails miserably: the case
where top-level vdevs don't have the same sizes (capacities). ZFS
allows this, and it is a good feature to have, so that users who simply
want to build a pool with the disks they happen to have lying around can
do so even if the disks have heteregenous sizes.
Here's a script that simulates a pool with two vdevs, with one 4X larger
than the other:
dd if=/dev/zero of=/tmp/d1 bs=1 count=1 seek=134217728
dd if=/dev/zero of=/tmp/d2 bs=1 count=1 seek=536870912
zpool create testspace /tmp/d1 /tmp/d2
dd if=/dev/zero of=/testspace/foobar bs=1M count=256
zpool iostat -v testspace
Before this commit, the script would output the following:
capacity
pool alloc free
---------- ----- -----
testspace 252M 375M
/tmp/d1 104M 18.5M
/tmp/d2 148M 356M
---------- ----- -----
This demonstrates that the current code handles this situation very
poorly: d1 shows 85% usage despite the pool itself being only 40% full.
d1 is quite saturated at this point, and is slowing down the entire pool
due to saturation, fragmentation and the like.
In contrast, here's the result with the code in this commit:
capacity
pool alloc free
---------- ----- -----
testspace 252M 375M
/tmp/d1 56.7M 66.3M
/tmp/d2 195M 309M
---------- ----- ------
This looks much better. d1 is 46% used, which is close to the overall
pool utilization (40%). The code still doesn't result in perfectly
balanced allocation, probably because of the way mg_bias is applied
which does not guarantee perfect accuracy, but this is still much better
than before.
Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3389
For kernels which do not implement a per-suberblock shrinker,
those older than Linux 3.1, the shrink_dcache_parent() function
was used to attempt to reclaim dentries. This was found not be
entirely reliable and could lead to performance issues on older
kernels running meta-data heavy workloads.
To address this issue a zfs_sb_prune_aliases() function has been
added to implement this functionality. It relies on traversing
the list of znodes for a filesystem and adding them to a private
list with a reference held. The private list can then be safely
walked outside the z_znodes_lock to prune dentires and drop the
last reference so the inode can be freed.
This provides the same synchronous behavior as the per-filesystem
shrinker and has the advantage of depending on only long standing
interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3501
The number of threads in the iput taskq has been increased to speed
up the number of iputs which can be handled. This has been observed
to improve the meta data reclaim regardless of zfs_sb_prune()
implementation in use.
The taskq has also been renamed z_iput to for consistency with the
rest of the I/O pipeline taskqs which are all named z_*.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Linux 3.15 commit torvalds/linux@293bc98 introduced two new methods.
The ->read_iter() and ->write_iter() methods were designed to replace
the ->aio_read() and ->aio_write() interfaces. Both interfaces were
preserved for several kernel releases in order to migrate all existing
consumers to the new interfaces. But as of Linux 4.1 the legacy
interface has been retired and the ZFS code must be updated to use
the new interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3352
Kernels >= 3.12 have a NUMA-aware superblock shrinker which is used in
ZoL by zfs_sb_prune(). This patch calls the shrinker for each on-line
NUMA node in order that memory be freed for each one.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3495
The Linux kernel watchdog will automatically dump a backtrace for
any process while sleeps for over 120s in an uninterruptible state.
The solution is for the prefetch thread to sleep in an interruptible
state. The way the existing code was written this is safe because
when woken it will always reevaluate its conditional. As a general
rule it is preferable to sleep in an interruptible when possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3450Closes#3402
This is the counterpart to zfsonlinux/spl@2345368 which replaces the
cv_wait_interruptible() function with cv_wait_sig(). There is no
functional change to patch merely brings the function names in to
sync to maximize portability.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3450
Issue #3402
ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny
systems such as the raspberry pi, however, as of addition of large block
support, the arc_adapt() function depends on arc_c being >= 32MiB (2 *
SPA_MAXBLOCKSIZE).
This patch raises the minimum ARC size to 32MiB and adds a VERIFY test
to arc_adapt() for future-proofing.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock. This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous. Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.
To address this the arc_do_user_prune() function is has been reworked
in to two new functions as follows:
* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq. This makes it
suitable to use in the context of the arc_adapt_thread().
* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete. This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.
This patch additionally adds the zfs_arc_meta_strategy module option
while allows the meta reclaim strategy to be configured. It defaults
to a balanced strategy which has been proved to work well under Linux
but the illumos meta-only strategy can be enabled.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Replace taskq_wait() with taskq_wait_oustanding(). This way callers
will only block until previously submitted tasks have been completed.
This was the previous behavior of task_wait() prior to the introduction
of taskq_wait_outstanding() so this isn't really a functionalty change
for these callers.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Porting notes and other significant code changes:
The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.
Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.
The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.
Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:
302f753 - Integrate ARC more tightly with Linux
39e055c - Adjust arc_p based on "bytes" in arc_shrink
f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
77765b5 - Remove "arc_meta_used" from arc_adjust calculation
94520ca - Prune metadata from ghost lists in arc_adjust_meta
Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:
fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }
The following arcstats have been removed:
recycle_miss - Used by arcstat.py and arc_summary.py, both of which
have been updated appropriately.
l2_writes_hdr_miss
The following arcstats have been added:
evict_not_enough - Number of times arc_evict_state() was unable to
evict enough buffers to reach its target amount.
evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
because it was being written to the l2arc.
l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times
l2arc_write_done() failed to acquire hash_lock (and re-tries).
arc_meta_min - Shows the value of the zfs_arc_meta_min module
parameter (see below).
The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme. Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.
The following module parameters have been added:
zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
before moving on to the next sub-list.
zfs_arc_meta_min - Enforce a floor on the amount of metadata in
the ARC.
zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
ARC state.
zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
the target size to be considered "overflowing".
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Porting notes:
Due to the restructuring of the ARC-related structures, this
patch conflicts with at least the following existing ZoL commits:
6e1d7276c9
Fix inaccurate arcstat_l2_hdr_size calculations
The ARC_SPACE_HDRS constant no longer exists and has been
somewhat equivalently replaced by HDR_L2ONLY_SIZE.
e0b0ca983d
Add visibility in to cached dbufs
The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr
struct requires additional structure member names to be used
when referencing the inner items. Also, the presence of L1 or L2
inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR
macros.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Porting notes:
ZoL has moved some ARC definitions into arc_impl.h.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Tim Chase <tim@chase2k.com>
This reverts only the l2arc_hdr part of commit
ecf3d9b8e6 in preparation for the illumos
5497 "lock contention on arcs_mtx" patch which does the same thing
but uses the newer two-level ARC structure following the Illumos 5408
"managing ZFS cache devices requires lots of RAM" patch.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates
the need for this.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 037763e44e in
preparation for the illumos 5497 "lock contention on arcs_mtx" patch
which includes a fix for this very problem.
ZoL had picked up a subset of the illumos 5497 patch to deal with the
l2arc compression buffer leak.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 16fcdea363 in preparation
for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates
"marker" within the ARC code.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit c3520e7 restructured vdev_add_child() in such a way that
the spa variable was unused during non-debug builds. This is
consistent with the upstream illumos code but because ZoL, unlike
illumos, is built with all compiler warnings enabled this causes
a legitimate warning. Revert this hunk of the patch to keep the
build clean.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3432
Commit f752b46e added the cv_wait_interruptible() function to allow
condition variables to be woken by signals. This function and its
timed wait counterpart should have been named cv_wait_sig() to match
the illumos interface which provides the same functionality.
This patch renames the symbol but leaves a #define compatibility
wrapper in place until the ZFS code can be moved to the correct
name.
This patch also makes a small number of cosmetic changes to make
the condvar source and header cstyle clean.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#456
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Albert Lee <trisk@omniti.com>
References:
https://www.illumos.org/issues/5818https://github.com/illumos/illumos-gate/commit/81cd5c5
Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3432
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5269https://github.com/illumos/illumos-gate/commit/12380e1e
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3396
Under Illumos taskq_wait() returns when there are no more tasks
in the queue. This behavior differs from ZoL and FreeBSD where
taskq_wait() returns when all the tasks in the queue at the
beginning of the taskq_wait() call are complete. New tasks
added whilst taskq_wait() is running will be ignored.
This difference in semantics makes it possible that new subtle
issues could be introduced when porting changes from Illumos.
To avoid that possibility the taskq_wait() function is being
updated such that it blocks until the queue in empty.
The previous behavior remains available through the
taskq_wait_outstanding() interface. Note that this function
was previously called taskq_wait_all() but has been renamed
to avoid confusion.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#455
The function dmu_objset_userquota_get_ids() checks and uses dn->dn_bonus
outside of dn_struct_rwlock. If the dnode is being freed then the bonus
dbuf may be in the process of getting evicted. In this case there is a
race that may cause dmu_objset_userquota_get_ids() to access the dbuf
after it has been destroyed. To prevent this, ensure that when we are
using the bonus dbuf we are either holding a reference on it or have
taken dn_struct_rwlock.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3443
- Don't check db->bb_blkid, but use the blkid argument instead.
Checking db->db_blkid may be unsafe since we doesn't yet have a
hold on the dbuf so its validity is unknown.
- Call mutex_exit() on found_db, not db, since it's not certain that
they point to the same dbuf, and the mutex was taken on found_db.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
5243 zdb -b could be much faster
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5243https://github.com/illumos/illumos-gate/commit/f7950bf
Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3414
There seems to be a annoying problem using NFSv4 to access ZFS
file systems under certain circumstances. It's easily reproduced:
nfs_client1: mount server:/export /mnt
nfs_client1: cd /mnt
nfs_client1: echo foo >junk
nfs_client1: cat junk
foo
Now on a different NFSv4 client:
nfs_client2: mount server:/export /mnt
nfs_client2: cd /mnt
nfs_client2: vi junk
# Make some changes to /mnt/junk and save
# This change the inode associated with /mnt/junk
Now back to the original client:
nfs_client1: cat junk
cat: junk: No such file or directory
Admittedly NFSv4 is not advertised as a cluster file system that
maintains a completely coherent view of data across multiple nodes.
But it does have some mechanisms built in that try to deal with
situations like the above. Namely, it employs specialized file
handle lookup routines that return ESTALE when a file handle contains
a non-existant inode value. The ESTALE return triggers a return
full file path lookup from the client to determine if the file has
actually gone away or if the cached file handle is no longer valid.
ZFS behavior can be brought into line with other file systems
(e.g., ext4) by applying the following patch:
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3224
Per the documentation for dnode_next_offset in dnode.c, the "txg"
parameter specifies a lower bound on which transaction the dnode can
be found in. We are interested in all dnodes that are removed between
the first and last transaction in the snapshot. It doesn't need to be
created in that snapshot to correspond to a removed file.
In fact, the behavior of zfs diff in the test case exactly matches
this: the transaction that created the data that was deleted in snapshot
"2" was produced before, in snapshot "1", definitely predating the first
transaction in snapshot "2".
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <Tim Chase <tim@onlight.com>
Closes#2081
When ASSERTs are compiled out by using the --disable-debug configure
option. Then the local variable 'dsl_pool_t *dp' will be unused and
generate a compiler warning. Since this variable is only used once
in the ASSERT replace it with 'ds->ds_dir->dd_pool'.
This has the additional advantage of potentially saving a few bytes
on the stack depending on how gcc decides to compile the function.
This issue was not noticed immediately because the automated builders
use --enable-debug to make the testing as rigorous as possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes#3410
5765 add support for estimating send stream size with lzc_send_space when source is a bookmark
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
https://www.illumos.org/issues/5765https://github.com/illumos/illumos-gate/commit/643da460
Porting notes:
* Unused variable 'recordsize' in dmu_send_estimate() dropped
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3397
5810 zdb should print details of bpobj
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5810https://github.com/illumos/illumos-gate/commit/732885fc
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3387
5351 scrub goes for an extra second each txg
5352 scrub should pause when there is some dirty data
Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5351https://github.com/illumos/illumos-gate/commit/6f6a76a
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3383
Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5350https://github.com/illumos/illumos-gate/commit/e651831
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3382
Author: Alex Reece <alex@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5422https://github.com/illumos/illumos-gate/commit/a846f19
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3381
The security and ACL operations should all be performed atomically.
To accomplish this there would need to significant invasive changes
made to the common code base. For the moment it's desirable for
compatibility reasons to avoid this. Therefore the code has been
updated to attempt to unwind the operation in case of failure
rather than panic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2445
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5027https://github.com/illumos/illumos-gate/commit/b515258
Porting Notes:
* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.
* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes. Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.
* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize. This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.
* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX. This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M.
* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space. We should be able to relax
this one the ABD patches are merged.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#354
The metaslab_min_alloc_size option is no longer used in the code.
This functionality was removed by commit f3a7f66 and the module
options should have been dropped at that time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
With debugging enabled and depending on your kernel config, the size of
arc_buf_hdr_t can blow out the stack of arc_evict() and arc_evict_ghost()
to greater than 1024 bytes. Let's avoid this.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3377
5349 verify that block pointer is plausible before reading
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Xin Li <delphij@FreeBSD.org>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5349https://github.com/illumos/illumos-gate/commit/f63ab3d5
Porting notes:
* Several variable declarations were moved due to C style needs
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3373
By the time we're tearing down our superblock the VFS has started releasing
all our inodes/znodes. Some of this work may have been handed off to our
iput taskq so we need to wait for that work to complete. However the iput
from the taskq can itself result in additional work being added to the
taskq:
dsl_pool_iput_taskq
iput
iput_final
evict
destroy_inode
zpl_inode_destroy
zfs_inode_destroy
zfs_iput_async(ZTOI(zp->z_xattr_parent))
taskq_dispatch(dsl_pool_iput_taskq..., iput, ...)
Let's wait until all our znodes have been released.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3281
5348 zio_checksum_error() only fills in info if ECKSUM
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5348https://github.com/illumos/illumos-gate/commit/373dc1cf
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3372
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5808https://github.com/illumos/illumos-gate/commit/23367a2f
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3369
5814 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5814https://github.com/illumos/illumos-gate/commit/b67dde11
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3368
4951 ZFS administrative commands should use reserved space, not with ENOSPC
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4373https://github.com/illumos/illumos-gate/commit/7d46dc6
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
3654 zdb should print number of ganged blocks
3656 remove unused function zap_cursor_move_to_key()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3654https://www.illumos.org/issues/3656https://github.com/illumos/illumos-gate/commit/d5ee8a1
Porting Notes:
3655 and 3657 were part of this commit but those hunks were dropped
since they apply to mdb.
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
It's been reported that threads would loop infinitely inside zfs_zget. The
speculated cause for this is that if an inode is marked for evict, zfs_zget
would see that and loop. However, if the looping thread doesn't yield, the
inode may not have a chance to finish evict, thus causing a infinite loop.
This patch solve this issue by add cond_resched to zfs_zget, making the
looping thread to yield when needed.
Tested-by: jlavoy <jalavoy@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3349
5244 zio pipeline callers should explicitly invoke next stage
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5244https://github.com/illumos/illumos-gate/commit/738f37b
Porting Notes:
1. The unported "2932 support crash dumps to raidz, etc. pools"
caused a merge conflict due to a copyright difference in
module/zfs/vdev_raidz.c.
2. The unported "4128 disks in zpools never go away when pulled"
and additional Linux-specific changes caused merge conflicts in
module/zfs/vdev_disk.c.
Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2828
5812 assertion failed in zrl_tryenter(): zr_owner==NULL
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5812https://github.com/illumos/illumos-gate/commit/8df1730
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3357
5592 NULL pointer dereference in dsl_prop_notify_all_cb()
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5592https://github.com/illumos/illumos-gate/commit/9d47dec
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5531 NULL pointer dereference in dsl_prop_get_ds()
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5531https://github.com/illumos/illumos-gate/commit/e57a022
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5056 ZFS deadlock on db_mtx and dn_holds
Author: Justin Gibbs <justing@spectralogic.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5056https://github.com/illumos/illumos-gate/commit/bc9014e
Porting Notes:
sa_handle_get_from_db():
- the original patch includes an otherwise unmentioned fix for a
possible usage of an uninitialised variable
dmu_objset_open_impl():
- Under Illumos list_link_init() is the same as filling a list_node_t
with NULLs, so they don't notice if they miss doing list_link_init()
on a zero'd containing structure (e.g. allocated with kmem_zalloc as
here). Under Linux, not so much: an uninitialised list_node_t goes
"Boom!" some time later when it's used or destroyed.
dmu_objset_evict_dbufs():
- reduce stack usage using kmem_alloc()
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5095 panic when adding a duplicate dbuf to dn_dbufs
Author: Alex Reece <alex@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5095https://github.com/illumos/illumos-gate/commit/86bb58a
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
4873 zvol unmap calls can take a very long time for larger datasets
Author: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4873https://github.com/illumos/illumos-gate/commit/0f6d88a
Porting Notes:
dbuf_free_range():
- reduce stack usage using kmem_alloc()
- the sorted AVL tree will handle the spill block case correctly
without all the special handling in the for() loop
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3897 zfs filesystem and snapshot limits (fix leak)
Author: Alex Reece <alex.reece@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3897https://github.com/illumos/illumos-gate/commit/fb7001f
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3897 zfs filesystem and snapshot limits
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3897https://github.com/illumos/illumos-gate/commit/a2afb61
Porting Notes:
dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc().
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In traverse_visitbp(), the input argument dnp is modified in the middle to
point to a temporary buffer. Originally this doesn't matter, because no user
of TRAVERSE_POST dereferences it. However, in fbeddd6 a piece of code is added
dereferencing dnp after the modification, creating a possible bug.
We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case,
so we don't modify the input argument. Also we introduce different local
variables in the DMU_OT_OBJSET case to prevent confusion between the input
argument.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2060
This isn't required for the Linux port because the kernel tracks
if a module is busy. The prototype for spa_busy() is also removed
since its definition was already removed.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3262
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5313https://github.com/illumos/illumos-gate/commit/fe319232
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3280
The function spa_add_feature_stats() manipulates the shared nvlist
spa->spa_feat_stats in an unsafe concurrent manner. Add a mutex to
protect the list.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3335
The following panic would occur under certain heavy load:
[ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held
[ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P OE 3.18.10 #7
The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which
would purge all tsd data for the thread. However, ZFS_ENTER is designed to be
reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data.
Instead, we rely on the new behavior of tsd_set. When NULL is passed as the
new value to tsd_set, it will automatically remove the tsd entry specified the
the key for the current thread.
rrw_tsd_key and zfs_allow_log_key already calls tsd_set(key, NULL) when
they're done. The zfs_fsyncer_key relied on ZFS_EXIT(zsb) to call tsd_exit() to
do clean up. Now we explicitly call tsd_set(key, NULL) on them.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3247
This patch only addresses the issues identified by the style checker
in spl-tsd.c. It contains no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To prevent leaking tsd entries, we make tsd_set(key, NULL) remove the tsd
entry for the current thread. This is alright since tsd_get() returns NULL
when the entry doesn't exist.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#443
C type coercion rules require that negative numbers be converted into
positive numbers via wraparound such that a negative -1 becomes a
positive 1. This causes vn_getf to return a file handle when it should
return NULL whenever a positive file descriptor existed with the same
value. We should check for a negative file descriptor and return NULL
instead.
This was caught by ClusterHQ's unit testing.
Reference:
http://stackoverflow.com/questions/50605/signed-to-unsigned-conversion-in-c-is-it-always-safe
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#450
Additional testing has shown that the region covered by PF_FSTRANS
needs to be extended to cover the zpl_xattr_security_init() and
init_acl() functions. The zpl_mark_dirty() function can also recurse
and therefore must always be protected.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#3331
Prevent deadlocks by disabling direct reclaim during all NFS, xattr,
ctldir, and super function calls. This is related to 40d06e3.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3225
The Linux slab, in general, performs better than the SPl slab in cases
where a lot of objects are allocated and fragmentation is likely present.
This patch fixes pathologically bad behavior in cases where the ARC is
filled with mostly metadata and a user program needs to allocate and
dirty enough memory which would require an insignificant amount of the
ARC to be reclaimed.
If zfs_znode_cache is on the SPL slab, the system may spin for a very
long time trying to reclaim sufficient memory. If it is on the Linux
slab, the behavior has been observed to be much more predictible; the
memory is reclaimed more efficiently.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3283
The packed nvlist allocated in spa_config_write() may exceed the
warning threshold for large configurations. Use the vmem interfaces
for this short lived allocation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3251
When layered on XFS the following warning will be emitted under CentOS7
when entering vfs_fsync() with PF_FSTRANS already set. This is not an
issue for other stock Linux file systems and the warning was removed
for newer kernels. However, to avoid triggering this error PF_FSTRANS
is cleared and then reset in vn_fsync().
WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0
Call Trace:
[<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[<ffffffffa01706fb>] xfs_vm_writepage+0x5ab/0x5c0 [xfs]
[<ffffffff8114b833>] __writepage+0x13/0x50
[<ffffffff8114c341>] write_cache_pages+0x251/0x4d0
[<ffffffff8114c60d>] generic_writepages+0x4d/0x80
[<ffffffffa016fc93>] xfs_vm_writepages+0x43/0x50 [xfs]
[<ffffffff8114d68e>] do_writepages+0x1e/0x40
[<ffffffff81142bd5>] __filemap_fdatawrite_range+0x65/0x80
[<ffffffff81142cea>] filemap_write_and_wait_range+0x2a/0x70
[<ffffffffa017a5b6>] xfs_file_fsync+0x66/0x1f0 [xfs]
[<ffffffff811df54b>] vfs_fsync+0x2b/0x40
[<ffffffffa03a88bd>] vn_fsync+0x2d/0x90 [spl]
[<ffffffffa0520c33>] spa_config_sync+0x503/0x680 [zfs]
[<ffffffffa0520ee4>] spa_config_update+0x134/0x170 [zfs]
[<ffffffffa0520eba>] spa_config_update+0x10a/0x170 [zfs]
[<ffffffffa051c54f>] spa_import+0x5bf/0x7b0 [zfs]
[<ffffffffa055c754>] zfs_ioc_pool_import+0x104/0x150 [zfs]
[<ffffffffa056294f>] zfsdev_ioctl+0x4cf/0x5c0 [zfs]
[<ffffffffa0562480>] ? pool_status_check+0xf0/0xf0 [zfs]
[<ffffffff811c2c85>] do_vfs_ioctl+0x2e5/0x4c0
[<ffffffff811c2f01>] SyS_ioctl+0xa1/0xc0
[<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl
calls as well as the l2arc and adapt ARC threads.
This obviates the need for MUTEX_FSTRANS so its previous uses and
definition have been eliminated.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3225
Avoid deadlocks when entering the shrinker from a PF_FSTRANS context.
This patch also reverts commit d0d5dd7 which added MUTEX_FSTRANS. Its
use has been deprecated within ZFS as it was an ineffective mechanism
to eliminate deadlocks. Among other things, it introduced the need for
strict ordering of mutex locking and unlocking in order that the
PF_FSTRANS flag wouldn't set incorrectly.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#446
5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5693https://github.com/illumos/illumos-gate/commit/7f7ace3
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3231
5694 traverse_prefetcher does not prefetch enough
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5694https://github.com/illumos/illumos-gate/commit/34d7ce05
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3230
Align code in traverse_visitbp() with that in Illumos in preparation for
applying Illumos-5694.
No functional change: use a temporary variable pd to replace multiple
occurrences of td->td_pfd. This increases our stack use slightly more
then normal because the function is called recursively.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3230
5695 dmu_sync'ed holes do not retain birth time
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5695https://github.com/illumos/illumos-gate/commit/70163ac
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3229
When called to free a spill block from a dnode, dbuf_free_range() has a
bug that results in all dbufs for the dnode getting freed. A variety of
problems may result from this bug, but a common one was a zap lookup
tripping an ASSERT because the zap buffers had been zeroed out. This
could happen on a dataset with xattr=sa set when extended attributes are
written and removed on a directory concurrently with I/O to files in
that directory.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#3195Fixes#3204Fixes#3222
ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19, it
the kernel artifically capped it at 1024 (BLK_DEF_MAX_SECTORS).
This cap was removed in torvalds/linux@34b48db. This patch changes
it to DMU_MAX_ACCESS (in sectors) and also changes the ASSERT in
dmu_tx_hold_write() to allow the maximum transfer size.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3212
The zio_inject.c keeps zio_injection_enabled as a counter of
fault handlers, so it should not be exported to user space as
a module option.
Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c,
where the symbols are defined.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3199
zfs_sb_t has grown to the point where using kmem_zalloc() for allocations
is triggering the 32k warning threshold.
We can't safely convert this entire allocation to use vmem_alloc() instead
of kmem_alloc() because the backing_dev_info structure is embedded here.
It depends on the bit_waitqueue() function which won't behave properly
when given a virtual address.
Instead, use vmem_alloc() to allocate the z_hold_mtx array separately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#3178
The goal of this function is to evict enough meta data buffers from the
ARC in order to enforce the arc_meta_limit. Achieving this is slightly
more complicated than it appears because it is common for data buffers
to have holds on meta data buffers. In addition, dnode meta data buffers
will be held by the dnodes in the block preventing them from being freed.
This means we can't simply traverse the ARC and expect to always find
enough unheld meta data buffer to release.
Therefore, this function has been updated to make alternating passes
over the ARC releasing data buffers and then newly unheld meta data
buffers. This ensures forward progress is maintained and arc_meta_used
will decrease. Normally this is sufficient, but if required the ARC
will call the registered prune callbacks causing dentry and inodes to
be dropped from the VFS cache. This will make dnode meta data buffers
available for reclaim. The number of total restarts in limited by
zfs_arc_meta_adjust_restarts to prevent spinning in the rare case
where all meta data is pinned.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
Originally when the ARC prune callback was introduced the idea was
to register a single callback for the ZPL. The ARC could invoke this
call back if it needed the ZPL to drop dentries, inodes, or other
cache objects which might be pinning buffers in the ARC. The ZPL
would iterate over all ZFS super blocks and perform the reclaim.
For the most part this design has worked well but due to limitations
in 2.6.35 and earlier kernels there were some problems. This patch
is designed to address those issues.
1) iterate_supers_type() is not provided by all kernels which makes
it impossible to safely iterate over all zpl_fs_type filesystems in
a single callback. The most straight forward and portable way to
resolve this is to register a callback per-filesystem during mount.
The arc_*_prune_callback() functions have always supported multiple
callbacks so this is functionally a very small change.
2) Commit 050d22b removed the non-portable shrink_dcache_memory()
and shrink_icache_memory() functions and didn't replace them with
equivalent functionality. This meant that for Linux 3.1 and older
kernels the ARC had no mechanism to drop dentries and inodes from
the caches if needed. This patch adds that missing functionality
by calling shrink_dcache_parent() to release dentries which may be
pinning inodes. This will result in all unused cache entries being
dropped which is a bit heavy handed but it's the only interface
available for old kernels.
3) A zpl_drop_inode() callback is registered for kernels older than
2.6.35 which do not support the .evict_inode callback. This ensures
that when the last reference on an inode is dropped it is immediately
removed from the cache. If this isn't done than inode can end up on
the global unused LRU with no mechanism available to ZFS to drop them.
Since the ARC buffers are not dropped the hottest inodes can still
be recreated without performing disk IO.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
The arc_meta_max value should be increased when space it consumed not when
it is returned. This ensure's that arc_meta_max is always up to date.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
5630 stale bonus buffer in recycled dnode_t leads to data corruption
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5630https://github.com/illumos/illumos-gate/commit/cd485b4
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
5047 don't use atomic_*_nv if you discard the return value
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5047https://github.com/illumos/illumos-gate/commit/640c167
Porting Notes:
Several hunks from the original patch where not specific to ZFS
and thus were dropped.
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
Allowing direct reclaim to re-enter the VFS in the zfs_inactive()
call path has historically been problematic for ZoL. Therefore,
in order to avoid an entire class of current and future issues
caused by this PF_FSTRANS is set for all zfs_inactive() callers.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3163
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3082
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible. This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1445
The 'capabilities' argument which was passed to bdi_setup_and_register()
has been removed. File systems should no longer pass BDI_CAP_MAP_COPY.
For our purposes this means there are now three different interfaces
which must be handled. A zpl_bdi_setup_and_register() wrapper function
has been introduced to provide a single interface to the ZPL code.
* 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported.
* 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments.
* 4.0 - x.y, bdi_setup_and_register() takes 2 arguments.
I've also taken this opportunity to remove HAVE_BDI because kernels
older then 2.6.32 are no longer supported. All kernels newer than
this will have one of the above interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#3128
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held. The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.
1) It would require changes to a significant amount of the ZFS
code. This would complicate applying patches from upstream.
2) It would be easy to accidentally miss a critical region in
the initial patch or to have an future change introduce a
new one.
Both of these concerns can be addressed by using a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch
'kmem-rework'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3050Closes#3055Closes#3062Closes#3132Closes#3142Closes#2983
Slightly increasing the size of a kmutex_t has caused us to exceed
the stack frame warning size in splat_taskq_test2_impl(). To address
this the tq_args have been moved to the heap.
cc1: warnings being treated as errors
spl-0.6.3/module/splat/splat-taskq.c:358:
error: the frame size of 1040 bytes is larger than 1024 bytes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held. The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.
1) It would require changes to a significant amount of the ZFS
code. This would complicate applying patches from upstream.
2) It would be easy to accidentally miss a critical region in
the initial patch or to have an future change introduce a
new one.
Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit 9099312 - Merge branch 'kmem-rework'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
Pool reference count is NOT checked in spa_export_common()
if the pool has been imported readonly=on, i.e. spa->spa_sync_on
is FALSE. Then zpool export and zfs list may deadlock:
1. Pool A is imported readonly.
2. zpool export A and zfs list are run concurrently.
3. zfs command gets reference on the spa, which holds a dbuf on
on the MOS meta dnode.
4. zpool command grabs spa_namespace_lock, and tries to evict dbufs
of the MOS meta dnode. The dbuf held by zfs command can't be
evicted as its reference count is not 0.
5. zpool command blocks in dnode_special_close() waiting for the
MOS meta dnode reference count to drop to 0, with
spa_namespace_lock held.
6. zfs command tries to get the spa_namespace_lock with a reference
on the spa held, which holds a dbuf on the MOS meta dnode.
7. Now zpool command and zfs command deadlock each other.
Also any further zfs/zpool command will block on spa_namespace_lock
forever.
The fix is to always check pool reference count in spa_export_common(),
no matter whether the pool was imported readonly or not.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2034
Cleanly destroying or exporting a pool requires that the pool
not be suspended. Therefore, set the POOL_CHECK_SUSPENDED flag
for these ioctls so the utilities will output a descriptive
error message rather than block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2878
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup. This was done to abstract
away any compatibility code which might be needed for the SPL.
As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#2985
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup. This was done to abstract
away any compatibility code which might be needed for the SPL.
As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2985
As described in flags section of open(2):
O_APPEND:
The file is opened in append mode. Before each write(2), the
file offset is positioned at the end of the file, as if with
lseek(2). O_APPEND may lead to corrupted files on NFS filesys-
tems if more than one process appends data to a file at once.
This is because NFS does not support appending to a file, so the
client kernel has to simulate it, which can't be done without a
race condition.
This issue was originally overlooked because normally the generic
VFS code handles this for a filesystem. However, because ZFS explictly
registers a zpl_write() function it's responsible for the seek.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3124
When loading the ZFS kernel modules they should not populate the
spa namespace using the cache file. This behavior isn't consistent
with other Linux kernel modules and we need to move away from it.
Removing this makes the whole startup process predictable with four
basic steps which are driven by the init system.
1) modprobe
2) zpool import
3) zfs mount
4) zfs share
This change also helps lay the groundwork for eventually removing
the kobj_* compatibility code on the kernel side. It may need to
be preserved in userspace because libzfs_init() depends on it.
This is why the conditional must be wrapped with an #ifdef _KERNEL.
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2820
When a bad DVA is encountered in metaslab_free_dva() the system
should treat it as fatal. This indicates that somehow a damaged
DVA was written to disk and that should be impossible.
However, we have seen a handful of reports over the years of pools
somehow being damaged in this way. Since this damage can render
otherwise intact pools unimportable, and the consequence of skipping
the bad DVA is only leaked free space, it makes sense to provide
a mechanism to ignore the bad DVA. Setting the zfs_recover=1 module
option will cause the DVA to be ignored which may allow the pool to
be imported.
Since zfs_recover=0 by default any pool attempting to free a bad DVA
will treat it as a fatal error preserving the current behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3099
Issue #3090
Issue #2720
dmu_snapshot_list_next stores the index of the next snapshot entry to the offp
argument, which zpl_snapdir_iterate then uses for the dir_emit. This
result in an off-by-one error. Therefore a temporary variable should be
used.
This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2930
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code. They were introduced as a workaround
to avoid issue #2523. Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches. Today
this code works well and there's no compelling reason to keep
this functionality. In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
struct access f->f_dentry->d_inode was replaced by accessor function
file_inode(f)
Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3084
Several of the nvlist functions may perform allocations larger than
the 32k warning threshold. Convert them to use vmem_alloc() so the
best allocator is used.
Commit efcd79a retired KM_NODEBUG which was used to suppress large
allocation warnings. Concurrently the large allocation warning threshold
was increased from 8k to 32k. The goal was to identify the remaining
locations, such as this one, where the allocation can be larger than
32k. This patch is expected fine tuning resulting for the kmem-rework
changes, see commit 6e9710f.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3057Closes#3079Closes#3081
Normally when importing a pool the space maps for all top level
vdevs are read from disk. The space maps will be required latter
when an allocation is performed and free blocks need to be located.
However, if the pool is imported readonly then we are guaranteed
that no allocations can occur. In this case the space maps need
not be loaded.. A similar argument can be made for the DTLs
(dirty time logs).
Because a pool import will fail if the space maps cannot be read.
The ability to safely ignore them makes it more likely that a
damaged pool can be imported readonly to recover its contents.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2831
5311 traverse_dnode may report success when it should not
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://github.com/illumos/illumos-gate/commit/2a89c2chttps://www.illumos.org/issues/5311
Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2970
The functions sa_find_sizes() and sa_build_layout() fail to account
for the additional 2 bytes of SA header space when calculating whether
a variable size attribute might spill over. They may consequently
determine that an attribute will fit in the bonus buffer along with a
spill block pointer, when in reality the attribute would be partially
overwritten by the spill block pointer if spill over occurs. This also
causes an inconsistency between the SA header size and the number of
variable size attributes in the layout, tripping an assertion when
debugging is on. The following reproducer demonstrates the problem.
ln -s $(perl -e 'print "z" x 20') file
setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file
Even though sa_find_sizes() computes the index of the attribute where
spill-over will occur, sa_build_layouts() discards the result and
recomputes it itself. As it turns out, both functions get it wrong.
Since this computation is awkward and, as history has shown, easy to
screw up, let's just do it in one place. This patch fixes the bug in
sa_find_sizes() and updates sa_build_layout() to use the result
computed there.
Also improve the comments in sa_find_sizes().
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3070
When a dbuf is in the DB_EVICTING state it may no longer be on the
dn_dbufs list. In which case it's unsafe to call DB_DNODE_ENTER.
Therefore, any dbuf which is found in this safe must be skipped.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2553Closes#2495
Currently, spl_hostid module parameter doesn't do anything, because it will
always be overwritten when calling into hostid_read().
Instead, we should only call into hostid_read() when spl_hostid is not zero,
just as the comment describes.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#427
Commit 7b2d78a046 fixed some improper uses
of snprintf(), however, in __dbuf_stats_hash_table_data() the return
value of snprintf is propagated to the caller. This caused spurious
ENOMEM errors when reading the dbufs kstat.
This commit causes the actual number of characters written to be returned.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3072
Commit log from FreeBSD:
We have observed that arc_release() can be called concurrently with a
l2arc in-flight write. Also, we have observed that arc_hdr_destroy()
can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE
flag in similar circumstances.
Previously the l2arc headers would be freed while leaking their
associated compression buffers. Now the buffers are placed on
l2arc_free_on_write list for delayed freeing. This is similar to
what was already done to arc buffers that were supposed to be freed
concurrently with in-flight writes of those buffers.
In addition to fixing the discovered leaks this change also adds
some protective code to assert that a compression buffer associated
with a l2arc header is never leaked.
A new kstat l2_cdata_free_on_write is added. It keeps a count
of delayed compression buffer frees which previously would have
been leaks.
Tested by: Vitalij Satanivskij <satan@ukr.net> et al
Requested by: many
MFC after: 2 weeks
Sponsored by: HybridCluster / ClusterHQ
References:
https://illumos.org/issues/5222https://github.com/freebsd/freebsd/commit/b98f85dhttp://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781http://lists.open-zfs.org/pipermail/developer/2014-January/000455.htmlhttp://lists.open-zfs.org/pipermail/developer/2014-February/000523.html
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3029
The zil_itx_create() function uses the vmem_alloc() allocator for
its buffers because when logging a write that buffer may be as large
as 64K. This is non-optimal because we may need to allocate many of
of these buffers and this interface has the potential to be slow.
Instead, use zio_data_buf_alloc() which is specifically designed to
be able to efficiently allocate a wide range of buffer sizes.
In addition, do some cleanup and use the zil_itx_destroy() function
to always free an itx structure. This way we're always sure the
right allocation functions are used. Notice that in the current
code kmem_free() and vmem_free() were both used. This happened to
work because these wrappers map to the same internal SPL function.
This was identified as a potential problem when a low-end memory
constrained system began logging the following warnings. There
was no deadlock here just repeated allocation failures resulting
in increased latency.
Possible memory allocation deadlock: size=65792 lflags=0x42d0
Pid: 20118, comm: kvm Tainted: P O 3.2.0-0.bpo.4-amd64
Call Trace:
[<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl]
[<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl]
[<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs]
[<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs]
[<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs]
[<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs]
[<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde
[<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125
[<ffffffff81109055>] ? sys_pwritev+0x55/0x9a
[<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#3059
For performance reasons the reworked kmem code maps vmem_alloc() to
kmalloc_node() for allocations less than spa_kmem_alloc_max. This
allows for more concurrency in the system and less contention of
the virtual address space. Generally, this is a good thing.
However, in the case when the kmalloc_node() fails it makes little
sense to retry it using kmalloc_node() again. It will likely fail
in exactly the same way. A smarter strategy is to abandon this
optimization and retry using spl_vmalloc() which is very likely
to succeed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#428
Thank to commit a4430fce69 we're
now correctly returning EROFS when opening a zvol on a read-only
pool. Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().
In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1343
Rather than ASSERT when for some reason the readonly property of
a zvol can't be read cleanly handle the failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1343
The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR(). The write index, "j" is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch remove the increment for SA_REMOVE operations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3028
An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_asprintf(). It is not clear that switching to
kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_asprintf() is cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.
It also brings this function almost back in sync with Illumos. This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2791
Issue #2781
The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and vn_openat()
functions should all use the kmem_flags_convert() function to generate
the GFP_* flags. This ensures that they can be safely called in any
context and the correct flags will be used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#426
The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.
This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2975
The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations. Instead a small dedicated
cache of preallocated SA buffers was kept.
This solution was viable while the maximum block size was limited
to 128K. But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc(). However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.
Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.
This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit 86dd0fd added preallocated I/O buffers. This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos. A
deadlock in this situation is no longer possible.
However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky. Either case is acceptable
here because we can safely abort the aggregation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings
us back in line with upstream. In some cases this means simply
swapping the flags back. For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.
The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory. This is again the
same as upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate. The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.
A careful reader will notice that not all callers have been changed
to vmem_alloc(). Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k. This is desirable because it minimizes the need
for Linux specific code changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim. This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction. For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The __get_free_pages() function must be used in place of kmalloc()
to ensure the __GFP_COMP is strictly honored. This is due to
kmalloc() being layered on the generic Linux slab caches. It
wasn't until recently that all caches were created using __GFP_COMP.
This means that it is possible for a kmalloc() which passed the
__GFP_COMP flag to be returned a non-compound allocation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation. This is
done because large slabs must be allocated using vmalloc(). It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored. This can result in a deadlock.
Therefore, a deadlock detection strategy was implemented to deal with
this case. When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab. Then all callers
attempting to allocate from the cache fall back to using kmalloc()
which does honor all passed flags.
This logic was correct but an optimization in the code allowed for a
deadlock. Because only slabs backed by vmalloc() can deadlock in the
way described above. An optimization was made to only invoke this
deadlock detection code for vmalloc() backed caches. This had the
advantage of making it easy to distinguish these objects when they
were freed.
But this isn't strictly safe. If all the spl_kmem_cache threads end
up deadlocked than we can't grow any of the other caches either. This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.
The fix here is to remove the optimization which limits this fall back
allocation stratagy to vmalloc() backed caches. Doing this means we
may need to take the cache lock in spl_kmem_cache_free() call path.
But this small cost can be mitigated by ignoring objects with virtual
addresses.
For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4, and made tunable. This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked. However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread doesn't block allocations from
other caches until the timeout is reached.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is designed to improve the memory utilization of
slabs by more carefully setting their size. The way the code
currently works is problematic for slabs which contain large
objects (>1MB). This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.
The reason the existing code rounds up every slab is because it
assumes it will backed by the buddy allocator. Since the buddy
allocator can only performs power of two allocations this is
desirable because it avoids wasting any space. However, this
logic breaks down if slab is backed by vmalloc() which operates
at a page level granularity. In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).
Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.
This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches. But depending on the workload it may result in slabs
being allocated and freed more frequently. In practice, this
has been shown to be a better default for most workloads.
Also the maximum slab size has been reduced to 4MB on 32-bit
systems. Due to the limited virtual address space it's critical
the we be as frugal as possible. A limit of 4M still lets us
reasonably comfortably allocate a limited number of 1MB objects.
Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments. Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10. The reduced value is still several orders of
magnitude large enough to avoid being triggered incorrectly. By
reducing it we allow the system to resolve the issue more quickly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many people have noticed that the kmem cache implementation is slow
to release its memory. This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty. Unused
objects which are cached in the magazines will still prevent a slab
from being freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The comment above the Linux 3.16 kernel's clear_bit() states:
/**
* clear_bit - Clears a bit in memory
* @nr: Bit to clear
* @addr: Address to start counting from
*
* clear_bit() is atomic and may not be reordered. However, it does
* not contain a memory barrier, so if it is used for locking purposes,
* you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
* in order to ensure changes are visible on other processors.
*/
This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.
In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.
This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions. When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback. Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.
This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.
Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on
on any KM_PUSHPAGE or KM_NOSLEEP allocation. This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch achieves the following goals:
1. It replaces the preprocessor kmem flag to gfp flag mapping with
proper translation logic. This eliminates the potential for
surprises that were previously possible where kmem flags were
mapped to gfp flags.
2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
sized less than or equal to the newly-added spl_kmem_alloc_max
parameter. This ensures that small allocations will not contend
on a single global lock, large allocations can still be handled,
and potentially limited virtual address space will not be squandered.
This behavior is entirely different than under Illumos due to
different memory management strategies employed by the respective
kernels. However, this functionally provides the semantics required.
3. The --disable-debug-kmem, --enable-debug-kmem (default), and
--enable-debug-kmem-tracking allocators have been unified in to
a single spl_kmem_alloc_impl() allocation function. This was
done to simplify the code and make it more maintainable.
4. Improve portability by exposing an implementation of the memory
allocations functions that can be safely used in the same way
they are used on Illumos. Specifically, callers may safely
use KM_SLEEP in contexts which perform filesystem IO. This
allows us to eliminate an entire class of Linux specific changes
which were previously required to avoid deadlocking the system.
This change will be largely transparent to existing callers but there
are a few caveats:
1. Because the headers were refactored and extraneous includes removed
callers may find they need to explicitly add additional #includes.
In particular, kmem_cache.h must now be explicitly includes to
access the SPL's kmem cache implementation. This behavior is
different from Illumos but it was done to avoid always masking
the Linux slab functions when kmem.h is included.
2. Callers, like Lustre, which made assumptions about the definitions
of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
Other callers such as ZFS which did not will not require changes.
3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO. It retains
its original meaning of allowing allocations to access reserved
memory. KM_PUSHPAGE callers can be converted back to KM_SLEEP.
4. The KM_NODEBUG flags has been retired and the default warning
threshold increased to 32k.
5. The kmem_virt() functions has been removed. For callers which
need to distinguish between a physical and virtual address use
is_vmalloc_addr().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Address all cstyle issues in the kmem, vmem, and kmem_cache source
and headers. This will done to make it easier to review subsequent
changes which will rework the kmem/vmem implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change introduces no functional changes to the memory management
interfaces. It only restructures the existing codes by separating the
kmem, vmem, and kmem cache implementations in the separate source and
header files.
Splitting this functionality in to separate files required the addition
of spl_vmem_{init,fini}() and spl_kmem_cache_{initi,fini}() functions.
Additionally, several minor changes to the #include's were required to
accommodate the removal of extraneous header from kmem.h.
But again, while large this patch introduces no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range(). Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read(). The page must be unlocked before taking
the range lock. This patch corrects that issue.
In addition, because the locking rules here are subtle a block comment
has been added clearly explaining why the ordering here is critical.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976
Add a table describing the debugging flags that can be set in the zfs_flags
module parameter. Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.
zfs_flags (int)
Set additional debugging flags. The following flags may be
bitwise-or'd together.
+-------------------------------------------------------+
|Value Symbolic Name |
| Description |
+-------------------------------------------------------+
| 1 ZFS_DEBUG_DPRINTF |
| Enable dprintf entries in the debug log. |
+-------------------------------------------------------+
| 2 ZFS_DEBUG_DBUF_VERIFY * |
| Enable extra dbuf verifications. |
+-------------------------------------------------------+
| 4 ZFS_DEBUG_DNODE_VERIFY * |
| Enable extra dnode verifications. |
+-------------------------------------------------------+
| 8 ZFS_DEBUG_SNAPNAMES |
| Enable snapshot name verification. |
+-------------------------------------------------------+
| 16 ZFS_DEBUG_MODIFY |
| Check for illegally modified ARC buffers. |
+-------------------------------------------------------+
| 32 ZFS_DEBUG_SPA |
| Enable spa_dbgmsg entries in the debug log. |
+-------------------------------------------------------+
| 64 ZFS_DEBUG_ZIO_FREE |
| Enable verification of block frees. |
+-------------------------------------------------------+
| 128 ZFS_DEBUG_HISTOGRAM_VERIFY |
| Enable extra spacemap histogram verifications. |
+-------------------------------------------------------+
* Requires debug build.
Default value: 0.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2988
When running the SPLAT tests on a kernel with CONFIG_DEBUG_OBJECTS=y
enabled the following warning is generated.
ODEBUG: object is on stack, but not annotated
WARNING: at lib/debugobjects.c:300 __debug_object_init+0x221/0x480()
This is caused by the test cases placing a debug object on the stack
rather than the heap. This isn't harmful since they are small objects
but to make CONFIG_DEBUG_OBJECTS=y happy the objects have been relocated
to the heap. This impacted taskq tests 1, 3, and 7.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#424
Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate
typedef declarations with the same type. The trace.h header contains
some typedefs to avoid 'unknown type' errors for C files that haven't
declared the type in question. But this causes build failures for C
files that have already declared the type. Newer versions of GCC (e.g.
v4.6) allow duplicate typedefs with the same type unless pedantic error
checking is in force. To support the older versions we need to remove
the duplicate typedefs.
Removal of the typedefs means we can't built tracepoints code using
those types unless the required headers have been included. To
facilitate this, all tracepoint event declarations have been moved out
of trace.h into separate headers. Each new header is explicitly included
from the C file that uses the events defined therein. The trace.h header
is still indirectly included form zfs_context.h and provides the
implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces.
This makes those interfaces readily available throughout the code base.
The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also
still provided by trace.h, so it is a prerequisite for the other
trace_*.h headers.
These new Linux implementation-specific headers do introduce a small
divergence from upstream ZFS in several core C files, but this should
not present a significant maintenance burden.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
There exists a lock inversions involving the zfs range lock and the
individual page writeback bits which can result in a deadlock. To
prevent this we must always manipulate the writeback bit while
holding the range lock. The exact deadlock is as follows:
------ Process A ------ ------ Process B ------
zpl_writepages zpl_fallocate
write_cache_pages zpl_fallocate_common
zpl_putpage zfs_space
zfs_putpage (set bit) zfs_freesp
zfs_range_lock (wait on lock) zfs_free_range (take lock)
[has not yet initiated I/O, truncate_inode_pages_range
the bit will not be cleared] wait_on_page_writeback (wait on bit)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976
Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty and inode. This
will result in a write which will correctly cause a kernel panic
because these filesystem are (and must be) immutable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2812
Don't include the compatibility code in linux/*_compat.h in the public
header sys/types.h. This causes problems when an external code base
includes the ZFS headers and has its own conflicting compatibility code.
Lustre, in particular, defined SHRINK_STOP for compatibility with
pre-3.12 kernels in a way that conflicted with the SPL's definition.
Because Lustre ZFS OSD includes ZFS headers it fails to build due to a
'"SHRINK_STOP" redefined' compiler warning. To avoid such conflicts
only include the compat headers from .c files or private headers.
Also, for consistency, include sys/*.h before linux/*.h then sort by
header name.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#411
When the SPL was originally written Linux tracepoints were still
in their infancy. Therefore, an entire debugging subsystem was
added to facilite tracing which served us well for many years.
Now that Linux tracepoints have matured they provide all the
functionality of the previous tracing subsystem. Rather than
maintain parallel functionality it makes sense to fully adopt
tracepoints. Therefore, this patch retires the legacy debugging
infrastructure.
See zfsonlinux/zfs@bc9f413 for the tracepoint changes.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#408
Mark the error handling branch as unlikely() because the current
kernel interface can never return NULL. However, we want to keep
the error handling in case this behavior changes in the futre.
Plus fix a small style issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#2703
Inclusion of SPL compatibility headers was moved out of the public
header sys/types.h to avoid conflicts with external packages. Include a
few compatiblity headers explicitly to cope with that change. Also,
sort some linux-specific inclusions alphabetically.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2898
Fix a few cases where null-byte termination of strings was done
unnecessarily or incorrectly.
- The snprintf() function always produces a null-byte terminated string
for non-negative return values, so it is not necessary to write out a
null-byte as a separate step.
- Also, it is unsafe to use the return value of snprintf() as an offset
for placing a null-byte, because if the output was truncated the return
value is the number of bytes that _would_ have been written had enough
space been available. Therefore the return value may index beyond the
array boundaries.
- Finally, snprintf() accounts for the null-byte when limiting its output
size, so there is no need to pass it a size parameter that is one less
than the buffer size.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2875
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.
In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query
the pool status during the hung import would not return as the import
holds the pool lock.
The only way to import such a pool would be to specify -o readonly=on
to the zpool import.
zdb -bb <pool> can be used to check for "deferred free" size which is
where this lost space will be counted.
References:
https://github.com/freebsd/freebsd/commit/48431b7http://svnweb.freebsd.org/base?view=revision&revision=273158https://reviews.csiden.org/r/132/
Porting notes:
This issue was filed as illumos 5347 and a more comprehensive fix is
under review. Once that change is finalized it will be integrated, in
the meanwhile the FreeBSD fix has been merged to prevent the issue.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2896
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written. This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.
The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:
setfattr -n user.test -v very_very_very..._long_value <file>
setfattr -n user.test -v short_value <file>
The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2663Closes#2700Closes#2701Closes#2717Closes#2863Closes#2884
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also node the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth paramter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix a few dprintf format specifiers that disagreed with their argument
types. These came to light as compiler errors when converting dprintf
to use the Linux trace buffer. Previously this wasn't a problem,
presumably because the SPL debug logging uses vsnprintf which must
perform automatic type conversion.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a new file named arc_impl.h and move a few internal
ARC structure definitions into this file. This is
needed in order to allow the Linux tracepoint functions to grub
around in the internals of these structures.
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Due to evidence of contention both the buf_hash_table and the
dbuf_hash_table sizes have been increased from 256 to 8192.
This increase in hash table size adds approximating 0.5M to
our fixed memory footprint. This relatively small increase
is not expected to cause problems even on low memory machines.
This footprint will also become dynamic when the persistent
L2ARC support is finalized. In the meanwhile, this small
change significantly reduces contention for certain workloads.
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#1291
These symbols are needed by consumers (i.e. Lustre) who wish to
integrate with the ZIL. In addition the zil_rollback_destroy()
prototype was removed because the implementation of this function
was removed long ago.
Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2892
As long as we can fit a minimum of one object/slab there's no reason
to prevent the creation of the cache. This effectively pushes the
maximum object size up to 32MB. The splat cache tests were extended
accordingly to verify this functionality.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When building zfs modules with kernel, compiled from deb.src, the
packaging process ends up installing the modules in the wrong place.
Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#2822
When building zfs modules with kernel, compiled from deb.src, the
packaging process ends up installing the modules in the wrong place.
Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2822
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects. It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.
This patch adds support for the new @scan_objects return value semantics.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#2837
This has a few benefits. First, it fixes a regression that "Rework
generic memory allocation interfaces" appears to have triggered in
splat's slab_reap and slab_age tests. Second, it makes porting code from
Illumos to ZFSOnLinux easier. Third, it has the side effect of making
reclaim from slab caches that specify reclaim functions an order of
magnitude faster. The splat slab_reap test usually took 30 to 40
seconds. With this change, it takes 3 to 4.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #369
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects. It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.
This patch adds support for the new @scan_objects return value semantics
and updates the splat shrinker test case appropriately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#403
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
space map_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5164https://www.illumos.org/issues/5165https://github.com/illumos/illumos-gate/commit/b1be289
Porting Notes:
The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.
The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz. The upstream commit incorrectly
used space_map_blksize.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2697
4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4958https://github.com/illumos/illumos-gate/commit/2a104a5
Porting notes:
Keep the ZIO_FLAG_FASTWRITE define. This is for a feature present
in Linux but not yet in *BSD.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2697
The general strategy used by ZFS to verify that blocks are valid is
to checksum everything. This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block. If a blocks checksum is valid then its contents are
trusted by the higher layers.
This system works exceptionally well as long as bad data is never
written with a valid checksum. If this does somehow occur due to
a software bug or a memory bit-flip on a non-ECC system it may
result in kernel panic.
One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size causing a system panic.
To prevent this from happening the arc_read() function has been
updated to detect this specific case. If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2678
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp(). Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose. Rather than emulating those interfaces we remove the
redundant checks.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2804
5162 zfs recv should use loaned arc buffer to avoid copy
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5162https://github.com/illumos/illumos-gate/commit/8a90470
Porting notes:
Fix spelling error 's/arena/area/' in dmu.c.
In restore_write() declare bonus and abuf at the top of the function.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2696
When a clone is created of a snapshot that has been marked for
deferred destroy (with "zfs destroy -d"), the clone "inherits" the
defer_destroy flag from the origin, and any snapshots of the clone
"inherit" the defer_destroy flag from the clone. This causes a strange
situation where the clone's snapshots are marked for defer_destroy but
they have no holds or clones. If the clone's snapshot gets a hold or
clone, which is then deleted, we will honor the incorrectly-set
defer_destroy flag and delete the snapshot!
Steps to reproduce:
* zpool create test c1t1d0
* zfs create test/fs
* zfs snapshot test/fs@a
* zfs clone test/fs@a test/clone
* zfs destroy -d test/fs@a
* zfs clone test/fs@a test/clone2
* zfs snapshot test/clone2@a
* zfs hold hld test/clone2@a
* zfs release hld test/clone2@a
* zfs list -r -t all test
<test/clone2@a has been destroyed>
We noticed that this causes dcenter to get very confused, because it
treats snapshots that are marked defer_destroy as not existing. So it
won't see any snapshots of the clone that's marked defer_destroy.
5150 - zfs clone of a defer_destroy snapshot causes strangeness
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/projects/illumos-gate//issues/5150https://github.com/illumos/illumos-gate/commit/42fcb65
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2690
Restore_object should not use two transactions to restore an object:
* one transaction is used for dmu_object_claim
* another transaction is used to set compression, checksum and most
importantly bonus data
* furthermore dmu_object_reclaim internally uses multiple transactions
* dmu_free_long_range frees chunks in separate transactions
* dnode_reallocate is executed in a distinct transaction
The fact the dnode_allocate/dnode_reallocate are executed in one
transaction and bonus (re-)population is executed in a different
transaction may lead to violation of ZFS consistency assertions if the
transactions are assigned to different transaction groups. Also, if
the first transaction group is successfully written to a permanent
storage, but the second transaction is lost, then an invalid dnode may
be created on the stable storage.
3693 restore_object uses at least two transactions to restore an object
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Original authors: Matthew Ahrens and Andriy Gapon
References:
https://www.illumos.org/issues/3693https://github.com/illumos/illumos-gate/commit/e77d42e
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2689
In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to
create a traditional mode value based on an ACL. If no ACL exists, this
processing shouldn't be done. Problems caused by this were most evident
on version 4 filesystems which not only don't have system attributes,
and also frequently have empty ACLs. On such filesystems, performing a
chown() operation could have the effect of dirtying the mode bits in
memory but not on the file system as follows:
# create a file with typical mode of 664
echo test > test
chown anyuser test
ls -l test
and the mode will show up as all zeroes. Unmounting/mounting and/or
exporting/importing the filesystem will reveal the proper mode again.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1264
Reviewed by Matthew Ahrens <mahrens@delphix.com>
Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://github.com/illumos/illumos-gate/commit/b8289d2https://www.illumos.org/issues/3756
Porting notes:
The static function zfs_prop_activate_feature() was removed because
this change removes the only caller. The function was not removed
from Illumos but instead left as dead code. However, to keep gcc
happy it was removed from Linux and may be easily restored if needed.
Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1540
The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc()
to allocate enough memory to hold the vectorized IO. While this
allocation will be small it's been observed in practice to sometimes
slightly exceed the 8K warning threshold by a few kilobytes.
Therefore, the KM_NODEBUG flag has been added to suppress warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2774
The kern_path() function has been available since Linux 2.6.28.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kvasprintf() function has been available since Linux 2.6.22.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of Linux 2.6.32 the proc handlers where updated to expect only
five arguments. Therefore there is no longer a need to maintain
this compatibility code and this infrastructure can be simplified.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The groups_search() function was never exported by a mainline kernel
therefore we drop this compatibility code and always provide our own
implementation.
Additionally, the cred_t structure has been available since 2.6.29
so there is no longer a need to maintain compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This function has never been exported by any mainline and was only
briefly available under RHEL5. Therefore this check is being removed
and the code update to always use the wrapper function.
The next step will be to eliminate all this code. If ZFS were updated
not to assume that it's pwd was / there would be no need for this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The user_path_dir() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
After the removable of get_vmalloc_info(), the unused global memory
variables, and the optional dcache/icache shrinkers there is no
longer a need for the kallsyms compatibility code. This allows
us to eliminate another brittle area of the code by removing the
kernel upcall this functionality depended on for older kernels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is optional functionality which may or may not be useful to
ZFS when using older kernels. It is never a hard requirement.
Therefore this functionality is being removed from the SPL and
a simpler slimmed down version will be added to ZFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Platforms such as Illumos and FreeBSD have historically provided
global variables which summerize the memory state of a system.
Linux on the otherhand doesn't expose any of this information
to kernel modules and uses entirely different mechanisms for
memory management.
In order to simplify the original ZFS port to Linux these global
variables were emulated by the SPL for the benefit of ZFS. As ZoL
has matured over the years it has moved steadily away from these
interfaces and now no longer depends on them at all.
Therefore, this patch completely removes the global variables
availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree,
and swapfs_reserve. This greatly simplifies the memory management
code and eliminates a common area of confusion.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The get_vmalloc_info() function was used to back the vmem_size()
function. This was always problematic and resulted in brittle
code because the kernel never provided a clean interface for
modules.
However, it turns out that the only caller of this function in
ZFS uses it to determine the total virtual address space size.
This can be determined easily without get_vmalloc_info() so
vmem_size() has been updated to take this approach which allows
us to shed the get_vmalloc_info() dependency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The on_each_cpu() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The inode structure has used i_mutex as its internal locking
primitive since 2.6.16. The compatibility code to check for
the previous semaphore primitive has been removed. However,
the wrapper function itself is being kept because it's entirely
possible this primitive will change again to allow finer grained
locking.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many of the time functions had grown overly complex in order to
handle kernel compatibility issues. However, as of Linux 2.6.26
all the required functionality is available. This allows us to
retire numerous configure checks and greatly simplify the time
compatibility wrappers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The fls64() function has been available since Linux 2.6.16 and
it should be used to implemented highbit64(). This allows us
to provide an optimized implementation and simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Support for the CTL_UNNUMBERED sysctl interface was removed in
Linux 2.6.19. There is no longer any reason to maintain this
compatibility code. There also issue any reason to keep around
the CTL_NAME macro and helpers so they have been retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The register_sysctl() interface has been stable since Linux 2.6.21.
There is no longer a need to maintain compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There is no longer a need to wrap this because utsname() is provided
by the kernel and can be called directly. This will require a small
change in the ZFS code because utsname is expected to be a global
structure and not a function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The generic SPL cache shrinkers make the assumption that the
caches only contain VFS cache data and therefore should be scaled
based on vfs_cache_pressure. This is not strictly true and it
should not be assumed.
Removing this tuning should not have any impact on the stock
behavior because vfs_cache_pressure=100 by default. This means
that no scaling will take place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the SPL was originally written it was designed to use the
device_create() and device_destroy() functions. Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.
As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions. This interface
for registering character devices has remained stable, is simple,
and provides everything we need.
Therefore the code has been reworked to use this interface. The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For consistency throughout the code update the SPLAT infrastructure
to use the wrapped mutex interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When selecting a mirror child it's possible that map allocated by
vdev_mirror_map_allc() contains a NULL for the child vdev. In
this case the child should be skipped and the read issues to
another member of the mirror.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1744
Modify the code to use the utsname() kernel function rather than
a global variable. This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2). This means that it
will behave consistently in both contexts.
This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions. And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.
Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
This functionality is optional and until Linux 3.0, which
provided per-filesystem shinkers, they was never a reasonable
interface. Therefore, this functionality is being dropped
for earlier kernels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
When ZPIOS was originally written it was designed to use the
device_create() and device_destroy() functions. Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.
As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions. This interface
for registering character devices has remained stable, is simple,
and provides everything we need.
Therefore the code has been reworked to use this interface. The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
This is a debug patch designed to ensure an error code is logged
to the console when this VERIFY() is hit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #1440
Commit e022864 introduced a regression for kernels which are built
with CONFIG_DEBUG_PREEMPT. The use of CPU_SEQID in a preemptible
context causes zio_nowait() to trigger the BUG. Since CPU_SEQID
is simply being used as a random index the usage here is safe. To
resolve the issue preempt is disable while calling CPU_SEQID.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2769
5176 lock contention on godfather zio
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5176https://github.com/illumos/illumos-gate/commit/6f834bc
Porting notes:
Under Linux max_ncpus is defined as num_possible_cpus(). This is
largest number of cpu ids which might be available during the life
time of the system boot. This value can be larger than the number
of present cpus if CONFIG_HOTPLUG_CPU is defined.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2711
While running SPLAT on a kernel with CONFIG_DEBUG_ATOMIC_SLEEP
enabled the taskq:front was flagged as a test which might sleep
which in an unsafe context. Specifically, the splat_vprint()
function which internally takes a mutex was being called under
a spin lock. Moving the log function outside the spin lock
cleanly solves this issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.
26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.
This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
As an attempt to perform the page truncation more optimally, the
hole-punching support added in 223df0161f
truncated performed the operation in two steps: first, sub-page "stubs"
were zeroed under the range lock in zfs_free_range() using the new
zfs_zero_partial_page() function and then the whole pages were truncated
within zfs_freesp(). This left a window of opportunity during which
the full pages could be touched.
This patch closes the window by moving the whole-page truncation into
zfs_free_range() under the range lock.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2733
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5138https://github.com/illumos/illumos-gate/commit/af3465d
Porting notes:
Because support for exposing a uint64_t parameter wasn't added
until v3.17-rc1 the zfs_free_max_blocks variable has been declared
as a unsigned long. This is already far larger than required and
it allows us to avoid additional autoconf compatibility code.
The default value has been set to 100,000 on Linux instead of
ULONG_MAX which is used on Illumos. This was done to limit the
number of outstanding IOs in the system when snapshots are destroyed.
This helps ensure individual TXG sync times are kept reasonable and
memory isn't wasted managing a huge backlog of outstanding IOs.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2675Closes#2581
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4753https://github.com/illumos/illumos-gate/commit/73527f4
Comments by Matt Ahrens from the issue tracker:
When a sync task is waiting for a txg to complete, we should hurry
it along by increasing the number of outstanding async writes
(i.e. make vdev_queue_max_async_writes() return a larger number).
Initially we might just have a tunable for "minimum async writes
while a synctask is waiting" and set it to 3.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2716
5139 SEEK_HOLE failed to report a hole at end of file
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5139https://github.com/illumos/illumos-gate/commit/0fbc0cd
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2714
LLVM's static analyzer reported that we could pass an uninitialized
pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct.
An attempt to repurpose a spare or L2ARC drive from an exported pool
will cause the pool_guid passed to spa_by_guid() to be unintialized
information from the stack. This will cause non-deterministic behavior.
Since there is no reason why we cannot repurpose such disks, we modify
vdev_inuse() to avoid calling spa_by_guid() when they are detected.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
5161 add tunable for number of metaslabs per vdev
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5161https://github.com/illumos/illumos-gate/commit/bf3e216
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2698
The smp_mb__{before,after}_clear_bit functions have been renamed
smp_mb__{before,after}_atomic. Rather than adding a compatibility
function to handle this the code has been updated to use smp_wmb().
This has the advantage of being a stable functionally equivalent
interface. On many architectures smp_mb__after_clear_bit() expands
to smp_wmb(). Others might be able to do something slightly more
efficient but this will be safe and correct on all of them.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#386
5177 remove dead code from dsl_scan.c
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5177https://github.com/illumos/illumos-gate/commit/5f37736
Porting notes:
The local variable 'buf' was removed from dsl_scan_visitbp().
This wasn't part of the original patch but it should have been.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2712
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/projects/illumos-gate//issues/5140https://github.com/illumos/illumos-gate/commit/2243853
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2676
When rolling back a mounted filesystem zfs_suspend() is called
which acquires the z_teardown_inactive_lock. This lock can not
be dropped until the filesystem has been rolled back and resumed
in zfs_resume_fs().
Therefore, we must not call iput() under this lock because it
may result in the inode->evict() handler being called which also
takes this lock. Instead use zfs_iput_async() to ensure dropping
the last reference is deferred and runs in a safe context.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2670
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2). Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.
Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation. It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.
Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.
Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#2619
5117 space map reallocation can cause corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/projects/illumos-gate/issues/5117https://github.com/illumos/illumos-gate/commit/e503a68
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2662
If a non-ZAP object is passed to zap_lockdir() it will be treated
as a valid ZAP object. This can result in zap_lockdir() attempting
to read what it believes are leaf blocks from invalid disk locations.
The SCSI layer will eventually generate errors for these bogus IOs
but the caller will hang in zap_get_leaf_byblk().
The good news is that is a situation which can not occur unless the
pool has been damaged. The bad news is that there are reports from
both FreeBSD and Solaris of damaged pools. Specifically, there are
normal files in the filesystem which reference another normal file
as their parent.
Since pools like this are known to exist the zap_lockdir() function
has been updated to verify the type of the object. If a non-ZAP
object has been passed it EINVAL will be returned immediately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2597
Issue #2602
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.
ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.
One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:
1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.
2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.
3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.
4. Making an asynchronous write synchronous is non sequitur.
Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.
It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.
Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.
It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:
```
Which operations are cancelable is implementation-defined.
```
http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html
The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.
Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility. We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#223Closes#2373
This commit should prevent a deadlock on dp_config_rwlock when
running `zfs rename` by ensuring zvol_rename_minors() is not
called under this lock.
Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2652.
Closes#2525.
This gives a huge performance improvement in operations with deduped
datasets especially when the bottleneck is the amount of ram
available for zfs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2639
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.
Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4970https://www.illumos.org/issues/4971https://www.illumos.org/issues/4972https://www.illumos.org/issues/4973https://www.illumos.org/issues/4974https://github.com/illumos/illumos-gate/commit/e42d205
Notes:
This set of patches adds a set of tunable parameters for the
"extreme rewind" mode of pool import which allows control over
the traversal performed during such an import.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2598
5034 ARC's buf_hash_table is too small
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5034https://github.com/illumos/illumos-gate/commit/63e911b
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2615
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2158
4631 zvol_get_stats triggering too many reads
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4631https://github.com/illumos/illumos-gate/commit/bbfa8ea
Ported-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2612Closes#2480
Illumos 4982 added code to metaslab_fragmentation() to proactively update
space maps when the spacemap_histogram feature is enabled. This should
only happen when the pool is writeable.
References:
https://www.illumos.org/issues/4982https://github.com/illumos/illumos-gate/commit/2e4c998
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2595
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4976https://www.illumos.org/issues/4978https://www.illumos.org/issues/4979https://www.illumos.org/issues/4980https://www.illumos.org/issues/4981https://www.illumos.org/issues/4982https://www.illumos.org/issues/4983https://www.illumos.org/issues/4984https://github.com/illumos/illumos-gate/commit/2e4c998
Notes:
The "zdb -M" option has been re-tasked to display the new metaslab
fragmentation metric and the new "zdb -I" option is used to control
the maximum number of in-flight I/Os.
The new fragmentation metric is derived from the space map histogram
which has been rolled up to the vdev and pool level and is presented
to the user via "zpool list".
Add a number of module parameters related to the new metaslab weighting
logic.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2595
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.
Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#370
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations. This
was possible due to the following related changes.
e89260a Directory xattr znodes hold a reference on their parent
6f9548c Fix deadlock in zfs_zget()
0a50679 Add zfs_iput_async() interface
4dd1893 Avoid 128K kmem allocations in mzap_upgrade()
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457Closes#2058Closes#2128Closes#2240
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks. When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457
As originally implemented the mzap_upgrade() function will
perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc().
These large allocations can potentially block indefinitely
if contiguous memory is not available. Since this allocation
is done under the zap->zap_rwlock it can appear as if there is
a deadlock in zap_lockdir(). This is shown below.
The optimal fix for this would be to rework mzap_upgrade()
such that no large allocations are required. This could be
done but it would result in us diverging further from the other
implementations. Therefore I've opted against doing this
unless it becomes absolutely necessary.
Instead mzap_upgrade() has been updated to use zio_buf_alloc()
which can reliably provide buffers of up to SPA_MAXBLOCKSIZE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Close#2580
Linux kernel 3.17 removes the action function argument from
wait_on_bit(). Add autoconf test and compatibility macro to support
the new interface.
The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule(). This API change
was made to consolidate all of those redundant wait functions.
References: torvalds/linux@7431620
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#378
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage. Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.
To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure. All
access must be serialized through the vq_lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2572
For small objects the Linux slab allocator should be used to make the most
efficient use of the memory. However, large objects are not supported by
the Linux slab and therefore the SPL implementation is preferred. A cutoff
of 16K was determined to be optimal for architectures using 4K pages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #356Closes#379
Reinstate the correct default behavior of returning the number of objects
in the cache for reclaim. This behavior was disabled in recent releases
to do occasional reports of spinning in shrink_slabs(). Those issues have
been resolved and can no longer can be reproduced. See commit 376dc35.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #358Closes#379
The dsl_dataset_rollback_check() function is executed in the
txg_sync context. To prevent a potential deadlock due to direct
memory reclaim it must use KM_PUSHPAGE. This was introduced by
the recent 'zfs bookmark' features, commit da53684.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Dillmann <eric@jave.fr>
Closes#2569
4914 zfs on-disk bookmark structure should be named *_phys_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4914https://github.com/illumos/illumos-gate/commit/7802d7b
Porting notes:
There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated. This should reduce the likelihood of further
problems like issue #2094 from occurring.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2558
4881 zfs send performance degradation when embedded block pointers
are encountered
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4881https://github.com/illumos/illumos-gate/commit/06315b7
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2547
4897 Space accounting mismatch in L2ARC/zpool
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
From the illumos issue tracker:
L2ARC vdev space usage statistics are calculated as the delta
between the maximum and minimum vdev offset ever written to
by the L2ARC fill thread, but do not inform the user of how
much space in between these two offsets is actually taken up by
cached buffers. This fix changes that so that vdev space usage
stats on L2ARC devices accurately track the volume of buffers
stored on them, allowing users to see the exact L2ARC usage in
"zpool iostat -v".
References:
https://www.illumos.org/issues/4897https://github.com/illumos/illumos-gate/commit/3038a2b
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2555
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4390https://github.com/illumos/illumos-gate/commit/7fd05ac
Porting notes:
Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code. This patch should reduce its
stack footprint a bit more.
The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.
The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global. Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2545
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4757https://www.illumos.org/issues/4913https://github.com/illumos/illumos-gate/commit/5d7b4d4
Porting notes:
For compatibility with the fastpath code the zio_done() function
needed to be updated. Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2544
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Description from Matt Ahrens's bug report at Delphix:
Add a new zfs property, "redundant_metadata" which can have values
"all" or "most". The default will be "all", which is the current
behavior. Setting to "most" will cause us to only store 1 copy of
level-1 indirect blocks of user data files.
Additional notes:
The new man page section for this property states
"The exact behavior of which metadata blocks
are stored redundantly may change in future releases."
and:
"When set to most, ZFS stores an extra copy of most types of
metadata. This can improve performance of random writes,
because less metadata must be written."
The current implementation is as described above in Matt's blog.
It is controlled by a new global integer
"zfs_redundant_metadata_most_ditto_level", currently initialized
to 2. When "redundant_metadata" is set to "most", only indirect
blocks of the specified level and higher will have additional ditto
blocks created.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2542
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4754https://www.illumos.org/issues/4755https://github.com/illumos/illumos-gate/commit/b6240e8
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2533
4374 dn_free_ranges should use range_tree_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4374https://github.com/illumos/illumos-gate/commit/bf16b11
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2531
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a
References:
https://www.illumos.org/issues/4370https://www.illumos.org/issues/4371https://github.com/illumos/illumos-gate/commit/43466aa
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2529
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a
References:
https://www.illumos.org/issues/4171https://www.illumos.org/issues/4172https://github.com/illumos/illumos-gate/commit/2acef22
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2528
Added highbit64() and howmany() which are used in recent upstream
code. Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#363
There have been issues in the past where excessive debug logging
to the console has resulted in significant performance impacts.
In the vast majority of these cases only a few stack traces are
required to diagnose the issue. Therefore, stack traces dumped to
the console will now we limited to 5 every 60s.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#374
As part of the write throttle & i/o schedule performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2488
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.
References:
https://www.illumos.org/issues/4756https://github.com/illumos/illumos-gate/commit/30beaff
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4730https://github.com/illumos/illumos-gate/commit/be08211
Porting notes:
Under ZFSonlinux, one of the effects of not destroying the taskq is that
zdb would never exit (due to the SPL taskq implementation).
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
This changes moves the called to metaslab_group_alloc_update() to the
metaslab_sync_reassess() function. The original placement of the call
within metaslab_sync_done() appears to have been a simple mistake,
introduced by ac72fac3ea.
This aligns us more closely to the upstream illumos code base.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot. If they do somehow get
dirtied an attempt will make made to write them to disk. In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2405
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4168https://www.illumos.org/issues/4169https://www.illumos.org/issues/4170https://github.com/illumos/illumos-gate/commit/7fdd916
Porting notes:
Of particular interest when troubleshooting corrupted pools, the
commonly-used "zdb -e" operation may perform verbatim imports and
furthermore, it will soon have direct support for verbatim imports via
a new "-V" option. The 4169 fix eliminates a common segfault case in
which spa_history_log_version() tries to access an un-opened dsl_pool_t.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2451Closes#2283Closes#2467
The parameter was added as illumos issue 4081 which was committed to
zfsonlinux in ac72fac3ea. This patch
documents the parameter and allows for it to be set as a module parameter.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2483
If the smaller of 2 different sized child vdev's of a mirrored vdev is
detached, and the pool has the autoexpand property set to off, as the
remaining larger vdev is promoted to a top level vdev it fails to retain
the asize of the original top level mirror vdev and therefore partially
autoexpands.
This partially autoexpanded state leaves the new vdev too large to
re-mirror by adding the smaller vdev back in, and the pool fails to
utilize the space until next imported.
If the autoexpand property is set to on, the child vdev grows
in size after it has been promoted to a top level vdev as expected.
This commit causes the remaining child mirror to retain the asize of its
old parent mirror vdev if the autoexpand property is set to off,
this allows the smaller vdev to be re-added if required the vdev
can then be told to expand if required by the usual using zpool online -e.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#1208
Spl's debugging and assertion macros macro used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.
The following macros have been converted to not use do/while(0):
PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL
PANIC has been converted to a wrapper around the new spl_PANIC() function.
The other macros have been converted to use the "&&" operator for the
branch-predicition conditional and also to use spl_PANIC().
The __ASSERT() macro was not touched. It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
4936 lz4 could theoretically overflow a pointer with a certain input
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported by: Tim Chase <tim@chase2k.com>
References:
https://illumos.org/issues/4936https://github.com/illumos/illumos-gate/commit/58d0718
Porting notes:
This fixes the widely-reported "20-year-old vulnerability" in
LZO/LZ4 implementations which inherited said bug from the reference
implementation.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2429
Added several comments regarding the removal of real_LZ4_uncompress()
which exists in the reference implementation but has been removed here
since it's not used.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 91579709fc which
limited the asynchronous dispatch to kernel space. We want to do
this for two reasons:
1) While we have slightly more headroom in user space excessively
deep stacks have been observed while running ztest, see #2293.
2) Removing this conditional makes the pipeline behave consistently
regardless of if it's executing in kernel space or user space.
This way we're more likely to uncover subtle issues with ztest.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2384
We should not override the default memory type of the kmem cache. This
was done previously to force certain objects which were slightly over
object size limit cut off in to KMC_KMEM caches for better performance.
The zfsonlinux/spl#356 patch slightly increases the default cut off
from 511 bytes 1024 bytes for x86_64. This means there is long longer
a need to override the default for the caches. And since the default
values are now being used the new spl_kmem_cache_slab_limit and
spl_kmem_cache_kmem_limit tunables will apply to all kmem caches.
The following is a list of caches that will be impacted:
| object size | forced type | default type
----------------- | ------------- | ------------- | --------------
dnode_t | 936 bytes | KMC_KMEM | KMC_KMEM
zio_cache | 1104 bytes | *KMC_KMEM | *KMC_VMEM
zio_link_cache | 48 bytes | KMC_KMEM | KMC_KMEM
zio_vdev_cache | 131088 bytes | KMC_VMEM | KMC_VMEM
zio_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM
zio_data_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM
zio_buf_1024 | 1024 bytes | KMC_KMEM | KMC_KMEM
zio_data_buf_1024 | 1024 bytes | +KMC_VMEM | +KMC_KMEM
* Cache memory type will change from KMC_KMEM to KMC_VMEM.
+ Cache memory type will change from KMC_VMEM to KMC_KMEM.
This patch removes another slight point of divergence between ZoL
and Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#2337
The correct behavior for all registered shrinkers is to return the
number of objects in their cache. In theory this allows the Linux
VM to balance memory reclaim across all registered caches.
In commit b9b3715 this behavior was disabled in favor of returning
-1 which notifies the VM that no additional objects are available
for reclaim. This was done as a workaround to resolve thrashing
in shrink_slabs() which could occur when memory was low and numerous
core where in reclaim. Unfortunately, this has been observed to
increase the likelihood of OOM events when SPL slab consumers are
responsible for consuming the majority of memory.
Therefore, this patch makes this behavior tunable. Setting the
spl_kmem_cache_reclaim module option to 0x1 will result in the
shrinker only being called once. This is the default behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#358
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL. These include:
1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.
Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator. This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on. However, there are some things we need to be careful of:
1) The Linux slab allocator was never designed to work well with
large objects. Because the SPL slab must still handle this use
case a cut off limit was added to transition from Linux slab
backed objects to kmem or vmem backed slabs.
spl_kmem_cache_slab_limit - Objects less than or equal to this
size in bytes will be backed by the Linux slab. By default
this value is zero which disables the Linux slab functionality.
Reasonable values for this cut off limit are in the range of
4096-16386 bytes.
spl_kmem_cache_kmem_limit - Objects less than or equal to this
size in bytes will be backed by a kmem slab. Objects over this
size will be vmem backed instead. This value defaults to
1/8 a page, or 512 bytes on an x86_64 architecture.
2) Be aware that using the Linux slab may inadvertently introduce
new deadlocks. Care has been taken previously to ensure that
all allocations which occur in the write path use GFP_NOIO.
However, there may be internal allocations performed in the
Linux slab which do not honor these flags. If this is the case
a deadlock may occur.
The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us. And ideally
need to move completely away from using the SPLs slab for large
memory allocations. This patch is a first step.
NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
backed caches but it is not supported for kmem/vmem backed caches.
2) Regardless of the spl_kmem_cache_*_limit settings a cache may
be explicitly set to a given type by passed the KMC_KMEM,
KMC_VMEM, or KMC_SLAB flags during cache creation.
3) The constructors, destructors, and reclaim callbacks are all
functional and will be called regardless of the cache type.
4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
the issues involved in presenting correct object accounting.
Instead they will appear in /proc/slabinfo under the same names.
5) Several kmem SPLAT tests needed to be fixed because they relied
incorrectly on internal kmem slab accounting. With the updated
test cases all the SPLAT tests pass as expected.
6) An autoconf test was added to ensure that the __GFP_COMP flag
was correctly added to the default flags used when allocating
a slab. This is required to ensure all pages in higher order
slabs are properly refcounted, see ae16ed9.
7) When using the SLUB allocator there is no need to attempt to
set the __GFP_COMP flag. This has been the default behavior
for the SLUB since Linux 2.6.25.
8) When using the SLUB it may be desirable to set the slub_nomerge
kernel parameter to prevent caches from being merged.
Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#356
For consistency with disk vdevs honor the zfs_nocacheflush tunable.
This setting is available primarily for debugging and performance
analysis.
Signed-off-by: HC <mmttdebbcc@yahoo.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2336
In the case where a variable-sized SA overlaps the spill block pointer and
a new variable-sized SA is being added, the header size was improperly
calculated to include the to-be-moved SA. This problem could be
reproduced when xattr=sa enabled as follows:
ln -s $(perl -e 'print "x" x 120') blah
setfattr -n security.selinux -v blahblah -h blah
The symlink is large enough to interfere with the spill block pointer and
has a typical SA registration as follows (shown in modified "zdb -dddd"
<SA attr layout obj> format):
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ]
Adding the SA xattr will attempt to extend the registration to:
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ]
but since the ZPL_SYMLINK SA interferes with the spill block pointer, it
must also be moved to the spill block which will have a registration of:
[ ZPL_SYMLINK ZPL_DXATTR ]
This commit updates extra_hdrsize when this condition occurs, allowing
hdrsize to be subsequently decreased appropriately.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2214
Issue #2228
Issue #2316
Issue #2343
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.
This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.
The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.
Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2301
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.
We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2270
Detect the updated vfs_rename() interface and call it with an
extra flags argument.
References:
torvalds/linux@520c8b1
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail. However, the NULL dereference at
clearly shows that under certain circumstances it is possible. Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.
BUG: unable to handle kernel NULL pointer dereference at 00000570
crash> struct -o vdev
struct vdev {
[0] uint64_t vdev_id;
... ...
[1376] uint64_t vdev_deflate_ratio;
Given that this can happen this patch add the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com>
Closes#1707Closes#1987Closes#1891
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When fetching property values of snapshots, a check against the head
dataset type must be performed. Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".
This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes. It also did not allow for the display of
"volsize" of a snapshot of a volume.
This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly. This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2265
A minor style issue was accidentally introduced by aa7d06a.
This change resolves that style problem.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Today the metaslab_debug logic performs two tasks:
- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync
This change provides knobs for each of these independently.
References:
https://illumos.org/issues/4101https://github.com/illumos/illumos-gate/commit/0713e23
Notes:
1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2227
When the system attributes (SAs) for an object exceed what can
can be stored in the bonus area of a dnode a spill block is
allocated. These spill blocks are currently considered data
blocks. However, they should be accounted for as meta data
because they are effectively an extension of the dnode.
While this may seem like a minor accounting issue it has broader
implications. The key thing to be aware of is that each spill
block will hold a reference on its parent dnode. The dnode in
turn holds a reference on its dbuf in the dnode object. This
means that a single 512 byte data buffer for a spill block can
pin over 16k of meta data. This is analogous to the small file
situation described in 2b13331 where a relatively small number
of data buffer can cause the ARC to exceed the meta limit.
However, unlike the small file case a spill block can legitimately
be considered meta data. By changing the spill block to meta data
they will now be dropped from the cache when the meta limit is
reached. This then allows the dnodes and dbufs which the spill
block was pinning to be released.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#2294
As described in the comment above dmu_tx_assign() this function must
only fail if the pool is out of space. If for some other reason the
TX cannot be assigned (such as memory pressure) ERESTART must be
returned. Alternately, EAGAIN could be returned to inject a delay
but that isn't required because the caller will block on the condition
variable waiting for the next TXG.
/*
* Assign tx to a transaction group. txg_how can be one of:
*
* (1) TXG_WAIT. If the current open txg is full, waits until there's
* a new one. This should be used when you're not holding locks.
* It will only fail if we're truly out of space (or over quota).
* ...
*/
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2287
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().
Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method. That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.
References:
https://bugs.gentoo.org/show_bug.cgi?id=483516
Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1691
Some network related block device uses tcp_sendpage, which doesn't
behave well when using 0-count page. Add assertion to catch them.
This has a runtime dependency on:
zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2277
The problem is described in commit aeeb4e0c0a.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.
The implementation based on assumption that task queues initialization
performed in serial manner.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#336
When using __get_free_pages to get high order memory, only the first page's
_count will set to 1, other's will be 0. When an internal page get passed into
rbd, it will eventully go into tcp_sendpage. There, it will be called with
get_page and put_page, and get freed erroneously when _count jump back to 0.
The solution to this problem is to use compound page. All pages in a
high order compound page share a single _count. So get_page and put_page in
tcp_sendpage will not cause _count jump to 0.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#251
Endianness detection in LZ4 is broken in user-space builds. This
bug corrupts compressed data and manifests itself in several ztest
failures. When LZ4 was originally ported to Illumos ZFS, the proper
checks for Linux were stripped out. The Linux port then inherited
the remaining detection code that works on Illumos but not on Linux.
The current LZ4 endianness check misuses the condition
defined(__BIG_ENDIAN) to indicate a big-endian system. On Linux
__BIG_ENDIAN is defined uncondtionally in the user-space header
/usr/include/endian.h, regardless of the endianness of the system.
The kernel does not use this header, so only user-space builds are
affected.
While we could fix this by restoring the upstream LZ4 endianness
detection code, reliable checks already exist in
libspl/include/sys/isa_defs.h. This change uses the libspl results
to replace the word-size and endianness checks in LZ4, simplifying
the code and reducing duplication.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#1963Fixes#1964Fixes#1965
Due to an asymmetry in the kmem accounting a memory leak was being
reported when it was only an accounting issue. All memory allocated
with kmem_alloc() must be released with kmem_free() or it will not
be properly accounted for.
In this case the code used strfree() to release the memory allocated
by kmem_alloc(). Presumably this was done because the size of the
memory region wasn't available when the memory needed to be freed.
To resolve this issue the code has been updated to use strdup() instead
of kmem_alloc() to allocate the memory. Like strfree(), strdup() is
not integrated with the memory accounting. This means we can use
strfree() to release it like Illumos.
SPL: kmem leaked 10/4368729 bytes
address size data func:line
ffff880067e9aa40 10 ZZZZZZZZZZ zfsdev_ioctl:5655
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#2262
This behavior is more consistent with the way memory reclaim
is expected to work under Linux.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#349
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2142
By default maximal number of objects in slab can't exceed (16*2 - 1) and slab
size can't exceed 32M.
Today's high end servers having couple hundreds of RAM available for ARC may
run into a trouble with virtual memory because of the restriction mentioned
above.
Problem:
Reasons for very high number of virtual memory allocations:
* Real slab size very small relative to the size of the entire RAM
* Slabs allocated on virtual memory and fill entire ARC
The result is very high number of allocated virtual memory ranges (hundreds of
ranges). When virtual memory subsystem manages high number of ranges its
performance become so poor that it freezes from time to time.
Solution:
Number of objects per slab should be increased taking into account maximal
slab size which can also be increased if needed.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#337
When comparing times gotten from ddi_get_lbolt, we have to take account of
wrap around of jiffies. Therefore, we cannot use 't1 < t2'. Instead we should
use 't1 - t2 < 0'.
This patch add ddi_time_after and friends to address this issue. They have
strict type restriction, clock_t for vanilla and int64_t for 64 version, to
prevent type conversion from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#335
There is plenty of compatibility code for a hw_hostid
that isn't used by anything. At the same time, there are apparently
issues with the current hostid logic. coredumb in #zfsonlinux on
freenode reported that Fedora 17 changes its hostid on every boot, which
required force importing his pool. A suggestion by wca was to adopt
FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does
not exist
Adopting FreeBSD's behavior permits us to eliminate plenty of code,
including a userland helper that invokes the system's hostid as a
fallback.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#224
The kernel's kthread_create() function is defined as "..." and there is
no va_list variant at the moment. The task name is pre-formatted into
a local buffer and passed to kthread_create() with fixed arguments.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#347
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
After the dmu_req_copy change, bi_io_vecs are not touched, so this is no
longer needed.
This reverts commit e26ade5101.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it
can continue in subsequent passes. However, after the immutable biovec changes
in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how
many bytes are already copied and it will skip to the right spot accordingly.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
The function was defined as a static inline with variable arguments
which causes gcc to generate errors on some distros.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#346
Provide spl_kthread_create() as a wrapper to the kernel's kthread_create()
to provide pre-3.13 semantics. Re-try if the call is interrupted or if it
would have returned -ENOMEM. Otherwise return NULL.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#339
Due to certain assumptions made in the the cred:groupmember test it
could result in false positives when run on specific distributions.
This was solely a bug in the test case and not in the groupmember()
function which the test case was validating.
To prevent future false positives the test case has been rewritten
to be both more rigerous and to make fewer assumptions about the
system.
Minor style cleanup was done to cr_groups_search() and groupmember()
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By setting __GFP_NORETRY the kernel memory reclaim logic was allowed to
abort early and dump a falled allocation stack to the console. Since
this was done in a tight loop to fill memory it could result in a large
number of stacks being dumped to the console. This in turn slowed down
the test sufficiently so it exceeded the time limit and failed.
To resolve this issue the __GFP_NORETRY flag is being removed. This is
how it should have been called originally to ensure we're simulating
the behavior of most callers which will use the GFP_KERNEL flag.
In addition, the reclaim granularity of 1000 objects was far to coarse
for this to be a realistic test. For kmem:slab_reclaim there might
only be a few thousand objects total in the cache. Therefore, the
SPLAT_KMEM_OBJ_RECLAIM constant for these tests was lowered. This
will cause the reclaim callback to run more frequently which makes
for a better test case.
The frequency of the cache reaping in kmem:slab_reap was increased
to accommodate the reduced number of objects released during the
reclaim.
These changes only impact the test cases and were done to remove
false positives caused by the test case itself.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted. If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources. Unfortunately, this introduced another deadlock.
INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
Tainted: P O 3.13.6 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u24:6 D ffff88107fcd2e80 0 891 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
Call Trace:
[<ffffffff813dd1d4>] schedule+0x24/0x70
[<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[<ffffffff811608c5>] ilookup+0x65/0xd0
[<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[<ffffffff811071ec>] do_writepages+0x1c/0x40
[<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[<ffffffff8106072c>] process_one_work+0x16c/0x490
[<ffffffff810613f3>] worker_thread+0x113/0x390
[<ffffffff81066edf>] kthread+0xdf/0x100
This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks. Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used. This gives us the information that ilookup() provided without
the risk of a deadlock.
Alternately, this race could be closed by registering an sops->drop_inode()
callback. The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.
To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev. This covers both io and checksum zevents but
may be used but other scripts.
In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:
vdev_spare_paths - List of vdev paths for all hot spares.
vdev_spare_guids - List of vdev guids for all hot spares.
vdev_read_errors - Read errors for the problematic vdev
vdev_write_errors - Write errors for the problematic vdev
vdev_cksum_errors - Checksum errors for the problematic vdev.
By default the required hot spare scripts are installed but this
functionality is disabled. To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.
These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system. It also requires that the
autoexpand policy be ported from Illumos, see:
https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c
Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location. When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor. Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor. When the file descriptor is
closed the kernel state tracking the cursor is destroyed.
This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID. When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents. By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2189
Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,
This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66, which
ported the Illumos 3744 changes.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2156
4088 use after free in arc_release()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4088illumos/illumos-gate@ccc22e1304
From the illumos issue:
A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2162
The spa_label_features nvlist is used in the sync context during pool
version upgrade.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2168
These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context. Both a blocking and non-blocking
version are available to callers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign(). This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.
One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded. However, this case should not often arise in practice,
because most callers currently use one of these forms:
dmu_tx_assign(tx, TXG_WAIT);
dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed. Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.
Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram. Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1915
The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007.
Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead. To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io(). This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io(). To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().
Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696
A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.
- The sync task restructuring commit
13fe019870 removed the code
that bumpted dmu_tx_quota. The counter is now bumped in two
cases, instead of just the one case as before (after the result
of dsl_dataset_check_quota call). The second case is where
we check the requested reservation against the actual pool size,
as this is an implicit quota of sorts.
- The write throttle restructuring commit
e8b96c6007 makes dmu_tx_how and
dmu_tx_inflight obsolete, so they are removed.
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1914
Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.
A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.
This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.
Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2150
4574 get_clones_stat does not call zap_count in non-debug kernel
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/4574illumos/illumos-gate@03d1795fa6
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2147
The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2141
Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2069
Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..
Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).
This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.
If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.
So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.
To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).
Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:
- 16K for the dnode object's block - [ 16384 bytes]
- N * sizeof(dnode_t) -------------- [ N * 928 bytes]
- (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes]
- (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
- (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]
To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:
Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
Pinned "arc_meta_used" bytes = 17000 + N * 1544
This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.
Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:
Total = 17000 + N * 1544 + S * N
^^^^^^^^^^^^^^^^ ^^^^^
metadata data
So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).
File Sizes (S)
| 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
---+----------+----------+----------+----------+----------+----------+
1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
N 4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |
As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.
This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.
To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.
This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".
Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.
In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
This reverts commit c11a12bc3b.
Out of memory events were fixed by reverting this patch.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).
What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:
commit 302f753f16
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Tue Mar 13 14:29:16 2012 -0700
Integrate ARC more tightly with Linux
Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.
This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.
In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Decrease the mimimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.
1) Large systems with over a 1TB of memory are being deployed
and reserving 1/32 of this memory (32GB) as the mimimum
requirement is overkill.
2) Tiny systems like the raspberry pi may only have 256MB of
memory in which case 64MB is far too large.
The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose. If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit. Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.
The additional field has been removed by the previous commit. However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied. This will return the pools to a state in which they
may be imported by other implementations.
The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted. A message similar to one of the following means your
pool must be scrubbed to restore compatibility.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #1 detected.
action: The pool can be imported using its name or numeric identifier,
however there is a compatibility issue which should be corrected
by running 'zpool scrub'
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
$ zpool status
pool: zol-0.6.2-173
state: ONLINE
scan: pool compatibility issue detected.
see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...
If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported. Further advice on how to proceed will be
provided by the error message as follows.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #2 detected.
action: The pool can not be imported with this version of ZFS due to an
active asynchronous destroy. Revert to an earlier version and
allow the destroy to complete before updating.
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:
zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w
Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.
The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys. Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.
This patch does have one limitation that should be mentioned. It will not
detect errata #2 on a pool unless errata #1 is also present. It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.
End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL. The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root. This will display a non-zero
value for any pool with an active asynchronous destroy.
Lastly, it is expected that no user data has been lost as a result of
this erratum.
Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool. These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate. The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation. Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.
To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.
To add a new errata the following changes must be made:
* A new errata identifier must be assigned by adding a new enum value
to the zpool_errata_t type. New enums must be added to the end to
preserve the existing ordering.
* Code must be added to detect the issue. This does not strictly
need to be done at pool import time but doing so will make the
errata visible in 'zpool import' as well as 'zpool status'. Once
detected the spa->spa_errata member should be set to the new enum.
* If possible code should be added to clear the spa->spa_errata member
once the errata has been resolved.
* The show_import() and status_callback() functions must be updated
to include an informational message describing the errata. This
should include an action message describing what an administrator
should do to address the errata.
* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
updated to describe the errata. This space can be used to provide
as much additional information as needed to fully describe the errata.
A link to this documentation will be automatically generated in the
output of 'zpool import' and 'zpool status'.
Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
Commit 1421c89142 added a field to
zbookmark_t that unintentinoally caused a disk format change. This
negatively affected backward compatibility and platform portability.
Therefore, this field is being removed.
The function that field permitted is left unimplemented until a later
patch that will reimplement the field in a way that does not affect the
disk format.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
Various errors can occur when registering property callbacks. As the
author's comments indicate, the code is very paranoid about preserving
the first-seen error when registering callbacks. This patch causes an
error seen while registering the "relatime" callback to not clobber a
previously-seen error.
Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2117
Commit e0b0ca9 accidentally corrupted the l2_asize displayed in
arcstats. This was caused by changing the l2arc_buf_hdr.b_asize
member from an int to uint32_t type. There are places in the
code where this field is cast to a uint64_t resulting in the
b_hits member being treated as part of b_asize.
To resolve the issue the type has been changed to a uint64_t,
and the b_hits member is placed after the enum to prevent the
size of the structure from increasing.
This is a good example of exactly why it's a bad idea to use
ambiguous types (int) in these structures.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1990
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Refences:
https://www.illumos.org/issues/4188illumos/illumos-gate@bb411a08b0
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2091
Add the "relatime" property. When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access. This behavior
is compatible with the Linux "relatime" mount option.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2064Closes#1917
When transitioning current open TXG into QUIESCE state and opening
a new one txg_quiesce() calls gethrtime():
- to mark the birth time of the new TXG
- to record the SPA txg history kstat
- implicitely inside spa_txg_history_add()
These timestamps are practically the same, so that the first one
can be used instead of the other two. The only visible difference
is that inside spa_txg_history_add() the time spent in kmem_zalloc()
will be counted towards the opened TXG.
Since at this point the new TXG already exists (tx->tx_open_txg
has been already incremented) it is actually a correct accounting.
In any case this extra work is only happening when spa_txg_history
kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect
the normal processing in any way.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Issue #2075
In several cases when digging into kstats we can found two txgs
in SYNC state, e.g.
txg birth state nreserved nread nwritten ...
985452 258127184872561 C 0 373948416 2376272384 ...
985453 258129016180616 C 0 378173440 28793344 ...
985454 258129016271523 S 0 0 0 ...
985455 258130864245986 S 0 0 0 ...
985456 258130867458851 O 0 0 0 ...
However only first txg (985454) is really syncing at this moment.
The other one (985455) marked as SYNCED is actually in a post-QUIESCED
state and waiting to start sync. So, the new TXG_STATE_WAIT_FOR_SYNC
state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to
reveal this situation.
txg birth state nreserved nread nwritten ...
1086896 235261068743969 C 0 163577856 8437248 ...
1086897 235262870830801 C 0 280625152 822594048 ...
1086898 235264172219064 S 0 0 0 ...
1086899 235264936134407 W 0 0 0 ...
1086900 235264936296156 O 0 0 0 ...
Signed-off-by: Igor Lvovsky <ilvovsky@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2075
From the comment in the commit:
Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES
SA in which case ENOENT is returned from zfs_acl_node_read() when the
SA can't be located. Allow chown/chgrp to succeed in these cases rather
than returning an error that makes no sense in the context of the caller.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfs-osx/zfs#86
Closes#1911Closes#2029
The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on. In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1928
Under Linux the zvol_set_volsize() function was originally written
to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the
dmu_objset_own()/dmu_objset_disown() interfaces were added but
the ZVOL code wasn't updated to take advantage of them.
This was never an issue but after the dsl_pool_config changes
the code now takes the config lock twice. The cleanest solution
is to shift to using dmu_objset_own() which takes a long hold
on the dataset and does not hold the dsl pool lock.
This patch also slightly restructures the existing code such
that it more closely resembles the upstream Illumos code.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2039
It's unsafe to drain the iput taskq while holding the z_teardown_lock
as a writer. This is because when the last reference on an inode is
dropped it may still have pages which need to be written to disk.
This will be done through zpl_writepages which will acquire the
z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're
holding the lock as a writer in zfs_sb_teardown the unmount will
deadlock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#1988
This change was proposed for Sparc but it's not clear to me
why it's required. Proper support exists in the lz4 code to
detect the endianness and the required builtins are available
for gcc. Still I'm including the patch because it will only
impact Sparc and it may resolve a case which hasn't occured
to me.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
On Sparc sp->blksize will be a 64-bit value which is then cast
incorrectly to a 32-bit value. For big endian systems this
results in an incorrect value for sp->blksize. To resolve the
problem local variables of the correct size are used and then
assigned to sp->blksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
The mis-aligned memory accesses in nvpair_native_embedded() and
nvpair_native_embedded_array() will cause a 'Bus Error' for
architectures such as Sparc which not fully byte addressible.
To avoid this issue care is taken to avoid dereferencing the
potentially mis-aligned packed nvlist_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
When accessing the zp->z_mode through the SA bulk interface we
expect that 64-bits are available to hold the result. However,
on 32-bit platforms mode_t will only be 32-bits so we cannot
pass it to SA_ADD_BULK_ATTR(). Instead a local uint64_t variable
must be used and the result assigned to zp->z_mode.
This went unnoticed on 32-bit little endian platforms because
the bytes happen to end up in the correct 32-bits. But on big
endian platforms like Sparc the zp->z_mode will always end up
set to zero.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
When this code was written it appears to have been assumed that
every taskq would have a large number of threads. In this case
it would make sense to attempt to evenly bind the threads over
all available CPUs. However, it failed to consider that creating
taskqs with a small number of threads will cause the CPUs with
lower ids become over-subscribed.
For this reason the kthread_bind() call is being removed and
we're leaving the kernel to schedule these threads as it sees fit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#325
Back the allocations for ddt tables+entries and l2arc headers with
kmem caches. This will reduce the cost of allocating these commonly
used structures and allow for greater visibility of them through the
/proc/spl/kmem/slab interface.
Signed-off-by: John Layman <jlayman@sagecloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1893
Move the libzfs_fini() after the zpool_log_history() call so the
ZPOOL_HIST_CMD entry can get written.
Fix the handling of saved_poolname in zfsdev_ioctl()
which was broken as part of the stack-reduction work in
a168788053.
Since ZoL destroys the TSD data in which the previously successful
ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY
ioctl has a very important restriction: it can only successfully write
a long entry following a successful ioctl() if no intervening vops have
been performed. Some of zfs subcommands do perform intervening vops and
to do the logging themselves. At the moment, the "create" and "clone"
subcommands have been modified appropriately.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1998
During the update process in sa_modify_attrs(), the sizes of existing
variably-sized SA entries are obtained from sa_lengths[]. The case where
a variably-sized SA was being replaced neglected to increment the index
into sa_lengths[], so subsequent variable-length SAs would be rewritten
with the wrong length. This patch adds the missing increment operation
so all variably-sized SA entries are stored with their correct lengths.
Previously, a size-changing update of a variably-sized SA that occurred
when there were other variably-sized SAs in the bonus buffer would cause
the subsequent SAs to be corrupted. The most common case in which this
would occur is when a mode change caused the ZPL_DACL_ACES entry to
change size when a ZPL_DXATTR (SA xattr) entry already existed.
The following sequence would have caused a failure when xattr=sa was in
force and would corrupt the bonus buffer:
open(filename, O_WRONLY | O_CREAT, 0600);
...
lsetxattr(filename, ...); /* create xattr SA */
chmod(filename, 0650); /* enlarges the ACL */
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1978
This change is related to commit 81eaf15 which ensured the correct
allocation handlers were installed for nvlist_alloc(). The nvlist
functions nvlist_dup(), nvlist_pack(), and nvlist_unpack() suffer
from the same issue and have been updated accordingly.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1937
Four new dataset properties have been added to support SELinux. They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8). When one of
these properties is set to something other than 'none'. That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.
For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:
$ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media
This will ensure the filesystem is automatically mounted with that
rootcontext. It is equivalent to manually specifying the rootcontext
with the -o option like this:
$ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media
By default all four contexts are set to 'none'. Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#1504
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written. Others were
caused when the common code was slightly adjusted for Linux.
This patch contains no functional changes. It only refreshes
the code to conform to style guide.
Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request. The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1821
The comment in zfs_close states that "Under Linux the zfs_close() hook
is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close
is associated with every successful struct file creation/deletion, which
should always be balanced.
Here is an example of what's wrong:
Process A B
open(O_SYNC)
z_sync_cnt = 1
open(O_SYNC)
z_sync_cnt = 2
close()
z_sync_cnt = 0
So z_sync_cnt is 0 even if B still has the file with O_SYNC.
Also moves the generic_file_open call before zfs_open to ensure that in
the case generic_file_open fails z_sync_cnt is not incremented. This
is safe because generic_file_open has no side effects.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1962
Update zvol.c to conform to the style guidelines, verified by
running cstyle.pl on the source file. This patch contains
no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #1821
In order to minimize any future disruption caused by the addition
and removal /dev/zfs ioctls this patch makes the following changes.
1) Sync ZoL's ioctl ordering such that it matches Illumos. For
historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID
ioctls were out of order.
2) Move Linux and FreeBSD specific ioctls in to their own reserved
ranges. This allows us to preserve the existing ordering when
new ioctls are added by either Illumos or FreeBSD. When an
ioctl is no longer needed it should be retired in place.
This change alters the ZFS user/kernel ABI so make sure you rebuild
both your user and kernel modules. However, it should allow for a
much stabler interface going forward.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1973
Early versions of ZFS coordinated the creation and destruction
of device minors from userspace. This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel. This significantly simplified the code.
However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux. This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel. Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes. This
will minimize, but not eliminate, the disruption to end users.
Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code. While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.
NOTES:
1) This patch introduces one subtle change in behavior which
could not be easily avoided. Prior to this change callers
of 'zfs create -V ...' were guaranteed that upon exit the
/dev/zvol/ block device link would be created or an error
returned. That's no longer the case. The utilities will no
longer block waiting for the symlink to be created. Callers
are now responsible for blocking, this is why a 'udev_wait'
call was added to the 'label' function in scripts/common.sh.
2) The read-only behavior of a ZVOL now solely depends on if
the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant
policy setting in the gendisk structure was removed. This
both simplifies the code and allows us to safely leverage
set_disk_ro() to issue a KOBJ_CHANGE uevent. See the
comment in the code for futher details on this.
3) Because __zvol_create_minor() and zvol_alloc() may now be
called in a sync task they must use KM_PUSHPAGE.
References:
illumos/illumos-gate@681d9761e8
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#1969
4121 vdev_label_init should treat request as succeeded when pool
is read only
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/4121illumos/illumos-gate@973c78e94b
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1863
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime. However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.
This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.
It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.
The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949
The MAX when initializing arc_c_max doesn't make any sense because
it hasn't been set anywhere before. Though, arc_c_max should be
implicitly set to zero when initializing arc_stats, so the MAX
doesn't make any difference.
The MAX was mistakenly left if place when the Illumos default
values were changed for Linux.
Signed-off-by: david.chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1941
This reverts commit 6a7c0ccca4.
A proper fix for Issue #1648 was landed under Issue #1890, so this is no
longer needed.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1648
Under the right conditions sa_find_sizes() will compute an incorrect
size of the system attribute (SA) header. This causes a failed assertion
when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead
to corruption of SA data.
The bug presents itself when there are more than two variable-length SAs
of just the right size to fit in the bonus buffer of a dnode. The
existing logic fails to account for the SA header space needed to store
the sizes of all the variable-length SAs.
A reproducer was possible on Linux by setting the xattr=sa dataset
property and storing xattrs on symbolic links (Issue #1648). Note the
corrupt link target name:
$ zfs set xattr=sa tank/fish
$ cd /tank/fish
$ ln -fs 12345678901234567 link
$ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
$ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
$ ls -l link
lrwxrwxrwx 1 root root 17 Dec 6 15:40 link -> 90123456701234567
Commit 6a7c0ccca4 worked around this bug
by forcing xattr's on symlinks to be stored in directory format. This
change implements a proper fix, so the workaround can now be reverted.
The reference link below contains a reproducer for FreeBSD.
References:
http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1890
Use the standard Linux MODULE_VERSION macro to expose the installed
spl and splat module versions. This will also automatically add a
checksum of the .c files and headers in "srcversion". See:
/sys/module/spl/version
/sys/module/spl/srcversion
/sys/module/splat/version
/sys/module/splat/srcversion
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#1923
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating a dataset with ZoL a zsb->z_shares_dir ZAP object
will not be created because shares are unimplemented. Instead ZoL
just sets zsb->z_shares_dir to zero to indicate there are no shares.
However, if you import a pool which was created with a different
ZFS implementation then the shares ZAP object may exist. Code was
added to handle this case but it clearly wasn't sufficiently tested
with other ZFS pools.
There was a bug in the zpl_shares_getattr() function which passed
the wrong inode to zfs_getattr_fast() for the case where are shares
ZAP object does exist. This causes an EIO to be returned to stat64()
which in turn causes 'zfs diff' to fail.
This fix is the pass the correct inode after a sucessful zfs_zget().
Additionally, only put away the references if we were able to get one.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Booker <https://github.com/gbooker>
Signed-off-by: timemaster67 <https://github.com/timemaster67>
Closes#1426Closes#481
Use the standard Linux MODULE_VERSION macro to expose the installed
zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions.
This will also automatically add a checksum of the .c files and
headers in "srcversion". See:
/sys/module/zavl/version
/sys/module/zavl/srcversion
/sys/module/znvpair/version
/sys/module/znvpair/srcversion
/sys/module/zunicode/version
/sys/module/zunicode/srcversion
/sys/module/zcommon/version
/sys/module/zcommon/srcversion
/sys/module/zfs/version
/sys/module/zfs/srcversion
/sys/module/zpios/version
/sys/module/zpios/srcversion
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1923
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045illumos/illumos-gate@69962b5647
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1913
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.
Porting Notes:
zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4347illumos/illumos-gate@e722410c49
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As part of zfs_sb_teardown() there is an assertion that all inodes
which are part of the zsb->z_all_znodes list have at least one
reference on them. This is always true for the standard unmount
case but there are two other cases where it is not strictly true.
* zfs_ioc_rollback() - This is the most common case and it results
from the fact that we aren't unmounting the filesystem. During a
normal unmount the MS_ACTIVE flag will be cleared on the super block
causing iput_final() to evict the inode when its reference count
drops to zero. However, during a rollback MS_ACTIVE remains set
since we're rolling back a live filesystem and need to preserve the
existing super block. This allows inodes with a zero reference count
to stay in the cache thereby violating the assertion.
* destroy_inode() / zfs_sb_teardown() - There exists a small race
between dropping the last reference on an inode and removing it from
the zsb->z_all_znodes list. This is unlikely to occur but could also
trigger the assertion which is incorrect. The inode may safely have
a zero reference count in this case.
Since allowing a zero reference count on the inode is expected and
safe for both of these cases the simplest thing to do is remove the
ASSERT. This code is only enabled for default builds so removing
this entirely is a very safe change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#1417Closes#1536
This should hopefully catch the rest of the allocations in the
user hold/release processing that were missed by commit
65c67ea86e.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1852Closes#1855
This check was originally added for SLES10, a093c6a, to check for
a 'struct vfsmount *' argument which they added. However, since
SLES10 is based on a 2.6.16 kernel which is no longer supported
this functionality was dropped. The checks were refactored to
support Linux 3.13 without concern for historical versions.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#312
GCC 4.8.1 complained about an unused flags variable when building
against Linux 2.6.26.8:
/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:
In function ‘__cv_init’:
/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:39:6:
error: variable ‘flags’ set but not used
[-Werror=unused-but-set-variable]
int flags = KM_SLEEP;
^
cc1: all warnings being treated as errors
Additionally, the superfluous code uses a preempt_count variable that is
no longer available on Linux 3.13. Deleting the unnecessary code fixes a
Linux 3.13 compatibility issue.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#312
Currently, using msync() results in the following code path:
sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage
In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.
This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.
The solution implemented here is composed of two parts:
- I added a new callback system to the ZIL, which allows the caller to
be notified when its ITX gets written to stable storage. One nice
thing is that the callback is called not only in zil_commit() but
in zil_sync() as well, which means that the caller doesn't have to
care whether the write ended up in the ZIL or the DMU: it will get
notified as soon as it's safe, period. This is an improvement over
dmu_tx_callback_register() that was used previously, which only
supports DMU writes. The rationale for this change is to allow
zpl_putpage() to be notified when a ZIL commit is completed without
having to block on zil_commit() itself.
- zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
from issuing ZIL commits. zpl_writepages() will issue the commit
itself instead of relying on zpl_putpage() to do it, thus nicely
batching the writes. Note, however, that we still have to call
write_cache_pages() again in SYNC mode because there is an edge case
documented in the implementation of write_cache_pages() whereas it
will not give us all dirty pages when running in non-SYNC mode. Thus
we need to run it at least once in SYNC mode to make sure we honor
persistency guarantees. This only happens when the pages are
modified at the same time msync() is running, which should be rare.
In most cases there won't be any additional pages and this second
call will do nothing.
Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1849Closes#907
Because ZFS bypasses the page cache we don't inherit per-task I/O
accounting for free. However, the Linux kernel does provide helper
functions allow us to perform our own accounting. These are most
commonly used for direct IO which also bypasses the page cache, but
they can be used for the common read/write call paths as well.
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#313Closes#1275
Under Linux this restriction does not apply because we have access
to all the required devices.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1631
Properly initialize SELinux xattrs for all inode types. The
initial implementation accidentally only did this for files.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1832
During pool import stack overflows may still occur due to the
potentially deep recursion of traverse_visitbp(). This is most
likely to occur when additional layers are added to the block
device stack such as DM multipath. To minimize the stack usage
for this call path the following changes were made:
1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions
to prevent them from being inlined by gcc. This reduced the
stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and
vdev_mirror_io_start() from 144 to 128 bytes.
2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved
from the stack to the heap. This reduced the stack usage of
zfsdev_ioctl() from 368 to 112 bytes.
3) The major saving came from slimming down traverse_visitbp() from
from 224 to 144 bytes. Since this function is called recursively
the 80 bytes saved per invokation adds up. The following changes
were made:
a) The 'hard' local variable was replaced by a TD_HARD() macro.
b) The 'pd' local variable was replaced by 'td->td_pfd' references.
c) The zbookmark_t was moved to the heap. This does cost us an
additional memory allocation per recursion by that cost should
still be minimal. The cost could be further reduced by adding
a dedicated zbookmark_t slab cache.
d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were
restructured to use the minimum amount of stack. This includes
removing the 'cbp' local variable.
Overall for the offending use case roughly 1584 of total stack space
has been saved. This is enough to avoid overflowing the stack on
stock kernels with 8k stacks. See #1778 for additional details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1778
Commit 95fd54a1c5 restructured the
hold/release processing and moved some of the work into the sync task.
A number of nvlist allocations now need to use KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1852Closes#1855
The Illumos #3875 patch reverted a part of ZoL's 7b3e34b which added
special-case error handling for zfs_rezget(). The error handling dealt
with the case in which an all-ones object number ended up being passed
to dnode_hold() and causing an EINVAL to be returned from zfs_rezget().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1859Closes#1861
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently. Only one of the mounts will
succeed and the other will fail. The failed mounts will cause an EREMOTE
to be propagated back to the application.
This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY. The zfs code detects this condition and
treats it as if the mount had succeeded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1819
The required Posix ACL interfaces are only available for kernels
with CONFIG_FS_POSIX_ACL defined. Therefore, only enable Posix
ACL support for these kernels. All major distribution kernels
enable CONFIG_FS_POSIX_ACL by default.
If your kernel does not support Posix ACLs the following warning
will be printed at ZFS module load time.
"ZFS: Posix ACLs disabled by kernel"
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1825
A couple of kmem_alloc() allocations were using KM_SLEEP in
the sync thread context. These were accidentally introduced
by the recent set of Illumos patches. The solution is to
switch to KM_PUSHPAGE.
dsl_dataset_promote_sync() -> promote_hold() -> snaplist_make() ->
kmem_alloc(sizeof (*snap), KM_SLEEP);
dsl_dataset_user_hold_sync() -> dsl_onexit_hold_cleanup() ->
kmem_alloc(sizeof (*ca), KM_SLEEP)
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3995 Memory leak of compressed buffers in l2arc_write_done
References:
https://illumos.org/issues/3995
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1688
Issue #1775
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4168https://www.illumos.org/issues/4169https://www.illumos.org/issues/4170illumos/illumos-gate@7fdd916c47
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/4082illumos/illumos-gate@5253393b09
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
4046 dsl_dataset_t ds_dir->dd_lock is highly contended
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4046illumos/illumos-gate@b62969f868
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. This commit removed dsl_dataset_namelen in Illumos, but that
appears to have been removed from ZFSOnLinux in an earlier commit.