Starting from Linux 4.7, get_acl will set acl cache pointer to temporary
sentinel value before calling i_op->get_acl. Therefore we can't compare
against ACL_NOT_CACHED and return.
Since from Linux 3.14, get_acl already check the cache for us, so we
disable this in zpl_get_acl.
Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4944Closes#4946
The posix_acl_valid() function has been updated to require a
user namespace. Filesystem callers should normally provide the
user_ns from the super block associcated with the ACL; the
zpl_posix_acl_valid() wrapper has been added for this purpose.
See https://github.com/torvalds/linux/commit/0d4d717f for
complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4922
Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING
configure checks, all supported kernel provide this functionality.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4922
Kernel 4.8 paved the way to enabling mounting a file system inside a
non-init user namespace. To facilitate this a s_user_ns member was
added holding the userns in which the filesystem's instance was
mounted. This enables doing the uid/gid translation relative to
this particular username space and not the default init_user_ns.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4928
Recent 4.X kernels prior to 4.6 require #include of spinlock.h in
order to get the definition of __ARCH_SPIN_LOCK_UNLOCKED which is
used by DEFINE_MUTEX().
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#566
Kernel 4.7 added the option to trim the unused exported symbols. In
my testing this showed to be problematic since the PDE_DATA function
was considered unused and as such was trimmed. This in turn caused the
respective test during spl's configure stage to falsely detect that
PDE_DATA is not defined, which in turn caused build failures later.
Handle this situation by adding detection whether CONFIG_TRIM_UNUSED_KSYMS
is enabled and refuse to build against a kernel which has it enabled
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#565
The rw argument has been removed from submit_bio/submit_bio_wait.
Callers are now expected to set bio->bi_rw instead of passing it
in. See https://github.com/torvalds/linux/commit/4e49ea4a for
complete details.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4892
Issue #4899
For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type. A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro. See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#563
Prior to b39c22b, which was first generally available in the 0.6.5
release as b39c22b, ZoL never actually submitted synchronous read or write
requests to the Linux block layer. This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.
In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b. In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.
The original rationale for introducing synchronous operations in b39c22b
was to hurry certains requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency. This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.
To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.
For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
ise used for the same purpose.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4858
Since the concept of a kuid and the need to translate from it to
ordinary integer type was added in kernel version 3.5 implement necessary
plumbing to be able to detect this condition during compile time. If
the kernel doesn't support the kuid then just fall back to directly
accessing the respective struct inode's members
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f
Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4717
Issue #4665
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then
does rw_enter(RW_READER) if it fails. This violate the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.
This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes#554
Register iterate_shared if it exists so the kernel will used shared
lock and allowing concurrent readdir.
Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE
to allow concurrent seeking.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4664Closes#4665
The GPL test for posix_acl_release() didn't include <linux/module.h>.
Also run this test only when posix_acl_release() exists.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4665
Linux 4.7 changes i_mutex to i_rwsem, and we should used inode_lock and
inode_lock_shared to do exclusive and shared lock respectively.
We use spl_inode_lock{,_shared}() to hide the difference. Note that on older
kernel you'll always take an exclusive lock.
We also add all other inode_lock friends. And nested users now should
explicitly call spl_inode_lock_nested with correct subclass.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4665
Closes#549
While OpenSolaris libc and glibc both include XDR support, the musl libc
does not in favor of depending on the BSD-licensed libtirpc library.
Adding support is a simple matter of detecting the library, including
the headers and linking against it. By default libtirpc will be checked
for and if available used. Otherwise, configure will fall back to using
the xdr implementation provided by libc if available. The options
--with-tirpc/--without-tirpc can be used to disable this checking.
In addition, the xdr_control() function has been simplied to only
handle ZFSs specific use case.
Original-patch-by: stf <s@ctrlc.hu>
Original-patch-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Closes#2254Closes#4559
To reduce mutex footprint, we detect the existence of owner in kernel mutex,
and rely on it if it exists.
Note that before Linux 3.0, mutex owner is of type thread_info. Also note
that, in Linux 3.18, the condition for owner is changed from
CONFIG_DEBUG_MUTEXES || CONFIG_SMP to
CONFIG_DEBUG_MUTEXES || CONFIG_MUTEX_SPIN_ON_OWNER
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#540
Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to
whole name rather than prefix should use "name" instead of "prefix".
Otherwise, kernel will return with EINVAL when it tries to resolve handlers.
Also, we remove the strcmp checks when xattr_handler has name, because
xattr_resolve_name will do the check for us.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4549Closes#4537
Accidentally introduced by commit e4023e4. The AM_CONDITIONAL
cannot be located where it can be invoked conditionally, as in
the `--with-config=user` case. Relocate it to the top level
ZFS_AC_CONFIG macro along with the other AM_CONDITIONALs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4416
Accidentally introduced by commit 39fc0cb. The devname2devid utility
which depends on libudev must only be built when libudev headers are
available. This is accomplished through an AM_CONDITIONAL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4416
This is foundational work for ZED.
Updates a leaf vdev's persistent device strings on Linux platform
* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES
Some examples:
path: '/dev/sdb1'
devid: 'scsi-350000394a8ca4fbc-part1'
phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'
path: '/dev/mapper/mpatha'
devid: 'dm-uuid-mpath-35000c5006304de3f'
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2856Closes#3978Closes#4416
This is initial support for x86 vectorized implementations of ZFS parity
and checksum algorithms.
For the compilation phase, configure step checks if toolchain supports relevant
instruction sets. Each implementation must ensure that the code is not passed
to compiler if relevant instruction set is not supported. For this purpose,
following new defines are provided if instruction set is supported:
- HAVE_SSE,
- HAVE_SSE2,
- HAVE_SSE3,
- HAVE_SSSE3,
- HAVE_SSE4_1,
- HAVE_SSE4_2,
- HAVE_AVX,
- HAVE_AVX2.
For detecting if an instruction set can be used in runtime, following functions
are provided in (include/linux/simd_x86.h):
- zfs_sse_available()
- zfs_sse2_available()
- zfs_sse3_available()
- zfs_ssse3_available()
- zfs_sse4_1_available()
- zfs_sse4_2_available()
- zfs_avx_available()
- zfs_avx2_available()
- zfs_bmi1_available()
- zfs_bmi2_available()
These function should be called once, on module load, or initialization.
They are safe to use from user and kernel space.
If an implementation is using more than single instruction set, both compiler
and runtime support for all relevant instruction sets should be checked.
Kernel fpu methods:
- kfpu_begin()
- kfpu_end()
Use __get_cpuid_max and __cpuid_count from <cpuid.h>
Both gcc and clang have support for these. They also handle ebx register
in case it is used for PIC code.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4381
I noticed during code review of zfsonlinux/zfs#4385 that the author of a
commit had peppered the various Makefile.am files with `$(TIRPC_LIBS)`
when putting it into `lib/libspl/Makefile.am` should have sufficed. Upon
further examination, it seems that he had copied what we do with
`$(ZLIB)`. We also have a bit of that with `-ldl` too. Unfortunately,
what we do is wrong, so lets fix it to set a good example for future
contributors.
In addition, we have multiple `-lz` and `-luuid` passed to the compiler
because each `AC_CHECK_LIB` adds it to `$LIBS`. That is somewhat
annoying to see, so we switch to `AC_SEARCH_LIBS` to avoid it. This is
consistent with the recommendation to use `AC_SEARCH_LIBS` over
`AC_CHECK_LIB` by autotools upstream:
https://www.gnu.org/software/autoconf/manual/autoconf-2.66/html_node/Libraries.html
In an ideal world, this would translate into improvements in ELF's
`DT_NEEDED` entries, but that is not the case because of a couple of
bugs in libtool.
The first bug causes libtool to overlink by using static link
dependencies for dynamic linking:
https://wiki.mageia.org/en/Overlinking_issues_in_packaging#libtool_issues
The workaround for this should be to pass `-Wl,--as-needed` in
`LDFLAGS`. That leads us to the second bug, where libtool passes
`LDFLAGS` after the libraries are specified and `ld` will only honor
`--as-needed` on libraries specified before it:
https://sigquit.wordpress.com/2011/02/16/why-asneeded-doesnt-work-as-expected-for-your-libraries-on-your-autotools-project/
There are a few possible workarounds for the second bug. One is to
either patch the compiler spec file to specify `-Wl,--as-needed` or pass
`-Wl,--as-needed` via `CC` like `CC='gcc -Wl,--as-needed'` so that it is
specified early. Another is to patch ltmain.sh like Gentoo does:
https://gitweb.gentoo.org/repo/gentoo.git/tree/eclass/ELT-patches/as-needed
Without one of those workarounds, this cleanup provides no benefit in
terms of `DT_NEEDED` entry generation. It should still be an improvement
because it nicely simplifies the code while encouraging good habits when
patching autotools scripts.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4426
Add the ZFS Test Suite and test-runner framework from illumos.
This is a continuation of the work done by Turbo Fredriksson to
port the ZFS Test Suite to Linux. While this work was originally
conceived as a stand alone project integrating it directly with
the ZoL source tree has several advantages:
* Allows the ZFS Test Suite to be packaged in zfs-test package.
* Facilitates easy integration with the CI testing.
* Users can locally run the ZFS Test Suite to validate ZFS.
This testing should ONLY be done on a dedicated test system
because the ZFS Test Suite in its current form is destructive.
* Allows the ZFS Test Suite to be run directly in the ZoL source
tree enabled developers to iterate quickly during development.
* Developers can easily add/modify tests in the framework as
features are added or functionality is changed. The tests
will then always be in sync with the implementation.
Full documentation for how to run the ZFS Test Suite is available
in the tests/README.md file.
Warning: This test suite is designed to be run on a dedicated test
system. It will make modifications to the system including, but
not limited to, the following.
* Adding new users
* Adding new groups
* Modifying the following /proc files:
* /proc/sys/kernel/core_pattern
* /proc/sys/kernel/core_uses_pid
* Creating directories under /
Notes:
* Not all of the test cases are expected to pass and by default
these test cases are disabled. The failures are primarily due
to assumption made for illumos which are invalid under Linux.
* When updating these test cases it should be done in as generic
a way as possible so the patch can be submitted back upstream.
Most existing library functions have been updated to be Linux
aware, and the following functions and variables have been added.
* Functions:
* is_linux - Used to wrap a Linux specific section.
* block_device_wait - Waits for block devices to be added to /dev/.
* Variables: Linux Illumos
* ZVOL_DEVDIR "/dev/zvol" "/dev/zvol/dsk"
* ZVOL_RDEVDIR "/dev/zvol" "/dev/zvol/rdsk"
* DEV_DSKDIR "/dev" "/dev/dsk"
* DEV_RDSKDIR "/dev" "/dev/rdsk"
* NEWFS_DEFAULT_FS "ext2" "ufs"
* Many of the disabled test cases fail because 'zfs/zpool destroy'
returns EBUSY. This is largely causes by the asynchronous nature
of device handling on Linux and is expected, the impacted test
cases will need to be updated to handle this.
* There are several test cases which have been disabled because
they can trigger a deadlock. A primary example of this is to
recursively create zpools within zpools. These tests have been
disabled until the root issue can be addressed.
* Illumos specific utilities such as (mkfile) should be added to
the tests/zfs-tests/cmd/ directory. Custom programs required by
the test scripts can also be added here.
* SELinux should be either is permissive mode or disabled when
running the tests. The test cases should be updated to conform
to a standard policy.
* Redundant test functionality has been removed (zfault.sh).
* Existing test scripts (zconfig.sh) should be migrated to use
the framework for consistency and ease of testing.
* The DISKS environment variable currently only supports loopback
devices because of how the ZFS Test Suite expects partitions to
be named (p1, p2, etc). Support must be added to generate the
correct partition name based on the device location and name.
* The ZFS Test Suite is part of the illumos code base at:
https://github.com/illumos/illumos-gate/tree/master/usr/src/test
Original-patch-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6Closes#1534
Historically libblkid support was detected as part of configure
and optionally enabled. This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.
This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools. For distributions
which include a modern version of libblkid there is no change in
behavior. Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.
Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2448
Both Alpine Linux and Gentoo use OpenRC so we share its logic
Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4386
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4251
The registered xattr .list handler was simplified in the 4.5 kernel
to only perform a permission check. Given a dentry for the file it
must return a boolean indicating if the name is visible. This
differs slightly from the previous APIs which also required the
function to copy the name in to the provided list and return its
size. That is now all the responsibility of the caller.
This should be straight forward change to make to ZoL since we've
always required the caller to make the copy. However, this was
slightly complicated by the need to support 3 older APIs. Yes,
between 2.6.32 and 4.5 there are 4 versions of this interface!
Therefore, while the functional change in this patch is small it
includes significant cleanup to make the code understandable and
maintainable. These changes include:
- Improved configure checks for .list, .get, and .set interfaces.
- Interfaces checked from newest to oldest.
- Strict checking for each possible known interface.
- Configure fails when no known interface is available.
- HAVE_*_XATTR_LIST renamed HAVE_XATTR_LIST_* for consistency
with similar iops and fops configure checks.
- POSIX_ACL_XATTR_{DEFAULT|ACCESS} were removed forcing callers to
move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT|ACCESS}.
Compatibility wrapper were added for old kernels.
- ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing
ZPL_XATTR_{GET|SET} WRAPPERs. Only the inode is guaranteed to be
a valid pointer, passing NULL for the 'list' and 'name' variables
is allowed and must be checked for. All .list functions were
updated to use the wrapper to aid readability.
- zpl_xattr_filldir() updated to use the .list function for its
permission check which is consistent with the updated Linux 4.5
interface. If a .list function is registered it should return 0
to indicate a name should be skipped, if there is no registered
function the name will be added.
- Additional documentation from xattr(7) describing the correct
behavior for each namespace was added before the relevant handlers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
The follow_link() interface was retired in favor of get_link().
In the process of phasing in get_link() the Linux kernel went
through two different versions. The first of which depended
on put_link() and the final version on a delayed done function.
- Improved configure checks for .follow_link, .get_link, .put_link.
- Interfaces checked from newest to oldest.
- Strict checking for each possible known interface.
- Configure fails when no known interface is available.
- Both versions .get_link are detected and supported as well
two previous versions of .follow_link.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
3557 dumpvp_size is not updated correctly when a dump zvol's size is changed
3558 setting the volsize on a dump device does not return back ENOSPC
3559 setting a volsize larger than the space available sometimes succeeds
3560 dumpadm should be able to remove a dump device
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
https://www.illumos.org/issues/3559https://github.com/illumos/illumos-gate/commit/c61ea56
Porting notes:
- Internal zvol.c changes not applied due to implementation differences.
The external interface and behavior was already consistent with the
latest upstream code.
- Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check. All
supported kernels (2.6.32 and newer) provide this interface.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4217
This reverts commit 61bbbd9a77 because
older versions of autoconf (2.63) do not support the cross-compile
argument to AC_RUN_IFELSE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #507
While stack size will vary by architecture it has historically defaulted to
8K on x86_64 systems. However, as of Linux 3.15 the default thread stack
size was increased to 16K. These kernels are now the default in most non-
enterprise distributions which means we no longer need to assume 8K stacks.
This patch takes advantage of that fact by appropriately reverting stack
conservation changes which were made to ensure stability. Changes which
may have had a negative impact on performance for certain workloads. This
also has the side effect of bringing the code slightly more in line with
upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4059
The xattr_hander->{list,get,set} were changed to take a xattr_handler,
and handler_flags argument was removed and should be accessed by
handler->flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in. This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.
This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in a RCU stalls when a disk
was removed from the system. Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.
This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.
Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
Commit torvalds/linux@4246a0b63b
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.
bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592 ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Issue #3799
zfsonlinux/zfs@e20cd6f7a8 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503b provided
suitable symbols for restoring this functionality 4 months later. We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3741
This autotools check was never needed because we can check for the
existence of QUEUE_FLAG_NONROT in the kernel headers.
Also, the comment in config/kernel-blk-queue-nonrot.m4 is incorrect.
This was a Linux 2.6.28 API change, not a Linux 2.6.27 API change.
Signed-off-by: Richard Yao <ryao@gentoo.org>
This autotools check was never needed because we can check for the
existence of QUEUE_FLAG_DISCARD in the kernel headers.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Signed-off-by: Richard Yao <ryao@gentoo.org>
This is needed for supporting kernels earlier than 2.6.30. Support for
those kernels was dropped, so we can safely remove this check.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is needed for supporting kernels earlier than 2.6.30. Support for
those kernels was dropped, so we can safely remove this check.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels. And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients. This patch makes
the following core improvements.
* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.
* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process. This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.
* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.
* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds. This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.
* Snapshots in active use will not be automatically unmounted. As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.
* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated. This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots. This patch modifies the
behavior such that only negative dentries are invalidated. This solves
the issue and may result in a performance improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3589Closes#3344Closes#3295Closes#3257Closes#3243Closes#3030Closes#2841
Starting from linux-2.6.37, {kmap,kunmap}_atomic takes 1 argument instead of 2.
We use zfs_{kmap,kunmap}_atomic as wrappers and always take 2 argument, but
ignore the 2nd for newer kernel.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Pull struct backing_dev_info off the stack: by linux-4.1 it's grown
past our 1024 byte stack frame warning limit resulting in an incorrect
configure result.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#3671
This is some minor fixes to commits 2cac7f5f11
and 2a34db1bdb.
* Make sure to alien'ate the new initramfs rpm package as well!
The rpm package is build correctly, but alien isn't run on it to
create the deb.
* Before copying file from COPY_FILE_LIST, make sure the DESTDIR/dir exists.
* Include /lib/udev/vdev_id file in the initrd.
* Because the initrd needs to use '/sbin/modprobe' instead of 'modprobe',
we need to use this in load_module() as well.
* Make sure that load_module() can be used more globaly, instead of
calling '/sbin/modprobe' all over the place.
* Make sure that check_module_loaded() have a parameter - module to
check.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3626
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:
Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,
Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issues #463
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure \
--with-spl=$HOME/src/git/spl/ \
--with-spl-obj=$HOME/src/git/spl/build
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1082
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082
As of Linux 4.2 the kernel has completely retired the nameidata
structure. One of the few remaining consumers of this interface
were the follow_link() and put_link() callbacks.
This patch adds the required checks to configure to detect the
interface change and updates the functions accordingly. Migrating
to the simple_follow_link() interface was considered but was decided
against ironically due to the increased complexity.
It also should be noted that the kernel follow_link() and put_link()
interfaces changes several times after 4.1 and but before 4.2. This
means there is a narrow range of kernel commits which never appear
in an official tag of the Linux kernel which ZoL will not build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3596
As of gcc version 5.1.1 a new boolean comparison warning has been
introduced. This warning is harmless but is triggered several places
in the ZFS code base. Because warnings are promoted to errors when
building with debugging enabled it is necessary to disable the warning
when using versions of gcc which automatically enabling this check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
* Supports booting of a ZFS snapshot.
Do this by cloning the snapshot into a dataset. If this, the resulting
dataset, already exists, destroy it. Then mount it on root.
* If snapshot does not exist, use base dataset (the part before '@')
as boot filesystem instead.
* If no snapshot is specified on the 'root=' kernel command line, but there
is an '@', then get a list of snapshots below that filesystem and ask the
user which to use.
* Clone with 'mountpoint=none' and 'canmount=noauto' - we mount manually
and explicitly.
* For sub-filesystems, that doesn't have a mountpoint property set, we use
the 'org.zol:mountpoint' to keep track of it's mountpoint.
* Allow rollback of snapshots instead of clone it and boot from the clone.
* Allow mounting a root- and subfs with mountpoint=legacy set
* Allow mounting a filesystem which is using nativ encryption.
* Support all currently used kernel command line arguments
All the different distributions have their own standard on what to specify
on the kernel command line to boot of a ZFS filesystem.
* Extra options:
* zfsdebug=(on,yes,1) Show extra debugging information
* zfsforce=(on,yes,1) Force import the pool
* rollback=(on,yes,1) Rollback (instead of clone) the snapshot
* Only try to import pool if it haven't already been imported
* This will negate the need to force import a pool that have not been exported cleanly.
* Support exclusion of pools to import by setting ZFS_POOL_EXCEPTIONS in /etc/default/zfs.
* Support additional configuration variable ZFS_INITRD_ADDITIONAL_DATASETS
to mount additional filesystems not located under your root dataset.
* Include /etc/modprobe.d/{zfs,spl}.conf in the initrd if it/they exist.
* Include the udev rule to use by-vdev for pool imports.
* Include the /etc/default/zfs file to the initrd.
* Only try /dev/disk/by-* in the initrd if USE_DISK_BY_ID is set.
* Use /dev/disk/by-vdev before anything.
* Add /dev as a last ditch attempt.
* Fallback to using the cache file if that exist if nothing else worked.
* Use /sbin/modprobe instead of built-in (BusyBox) modprobe.
This gets rid of the message "modprobe: can't load module zcommon".
Thanx to pcoultha for finding this.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2116Closes#2114
For kernels which do not implement a per-suberblock shrinker,
those older than Linux 3.1, the shrink_dcache_parent() function
was used to attempt to reclaim dentries. This was found not be
entirely reliable and could lead to performance issues on older
kernels running meta-data heavy workloads.
To address this issue a zfs_sb_prune_aliases() function has been
added to implement this functionality. It relies on traversing
the list of znodes for a filesystem and adding them to a private
list with a reference held. The private list can then be safely
walked outside the z_znodes_lock to prune dentires and drop the
last reference so the inode can be freed.
This provides the same synchronous behavior as the per-filesystem
shrinker and has the advantage of depending on only long standing
interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3501
Linux 3.15 commit torvalds/linux@293bc98 introduced two new methods.
The ->read_iter() and ->write_iter() methods were designed to replace
the ->aio_read() and ->aio_write() interfaces. Both interfaces were
preserved for several kernel releases in order to migrate all existing
consumers to the new interfaces. But as of Linux 4.1 the legacy
interface has been retired and the ZFS code must be updated to use
the new interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3352
Kernels >= 3.12 have a NUMA-aware superblock shrinker which is used in
ZoL by zfs_sb_prune(). This patch calls the shrinker for each on-line
NUMA node in order that memory be freed for each one.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3495
Stock Linux 2.6.32 and earlier kernels contained a broken version of
rwsem_is_locked() which could return an incorrect value. Because of
this compatibility code was added to detect the broken implementation
and replace it with our own if needed.
The fix for this issue was merged in to the mainline Linux kernel as
of 2.6.33 and the major enterprise distributions based on 2.6.32 have
all backported the fix. Therefore there is no longer a need to carry
this code and it can be removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#454
* Based on the init scripts included with Debian GNU/Linux, then take code
from the already existing ones, trying to merge them into one set of
scripts that will work for 'everyone' for better maintainability.
* Add configurable variables to control the workings of the init scripts:
* ZFS_INITRD_PRE_MOUNTROOT_SLEEP
Set a sleep time before we load the module (used primarily by initrd
scripts to allow for slower media (such as USB devices etc) to be
availible before we load the zfs module).
* ZFS_INITRD_POST_MODPROBE_SLEEP
Set a timed sleep in the initrd to after the load of the zfs module.
* ZFS_INITRD_ADDITIONAL_DATASETS
To allow for mounting additional datasets in the initrd. Primarily used
in initrd scripts to allow for when filesystem needed to boot (such as
/usr, /opt, /var etc) isn't directly under the root dataset.
* ZFS_POOL_EXCEPTIONS
Exclude pools from being imported (in the initrd and/or init scripts).
* ZFS_DKMS_ENABLE_DEBUG, ZFS_DKMS_ENABLE_DEBUG_DMU_TX, ZFS_DKMS_DISABLE_STRIP
Set to control how dkms should build the dkms packages.
* ZPOOL_IMPORT_PATH
Set path(s) where "zpool import" should import pools from.
This was previously the job of "USE_DISK_BY_ID" (which is still used
for backwards compatibility) but was renamed to allow for better
control of import path(s).
* If old USE_DISK_BY_ID is set, but not new ZPOOL_IMPORT_PATH, then we
set ZPOOL_IMPORT_PATH to sane defaults just to be on the safe side.
* ZED_ARGS
To allow for local options to zed without having to change the init script.
* The import function, do_import(), imports pools by name instead of '-a'
for better control of pools to import and from where.
* If USE_DISK_BY_ID is set (for backwards compatibility), but isn't 'yes'
then ignore it.
* If pool(s) isn't found with a simple "zpool import" (seen it happen),
try looking for them in /dev/disk/by-id (if it exists). Any duplicates
(pools found with both commands) is filtered out.
* IF we have found extra pool(s) this way, we must force USE_DISK_BY_ID
so that the first, simple "zpool import $pool" is able to find it.
* Fallback on importing the pool using the cache file (if it exists) only
if 'simple' import (either with ZPOOL_IMPORT_PATH or the 'built in'
defaults) didn't work.
* The export function, do_export(), will export all pools imported, EXCEPT
the root pool (if there is one).
* ZED script from the Debian GNU/Linux packages added.
* Refreshed ZED init script from behlendorf@5e7a660 to be portable so it
may be used on both LSB and Redhat style systems.
* If there is no pool(s) imported and zed successfully shut down, we will
unload the zfs modules.
* The function library file for the ZoL init script is installed as
/etc/init.d/zfs-functions.
* The four init scripts, the /etc/{defaults,sysconfig,conf.d}/zfs config file
as well as the common function library is tagged as '%config(noreplace)' in
the rpm rules file to make sure they are not replaced automatically if locally
modifed.
* Pitfals and workarounds:
* If we're running from init, remove stale /etc/dfs/sharetab before importing
pools in the zfs-import init script.
* On Debian GNU/Linux, there's a 'sendsigs' script that will kill basically
everything quite early in the shutdown phase and zed is/should be stopped
much later than that. We don't want zed to be among the ones killed, so add
the zed pid to list of pids for 'sendsigs' to ignore.
* CentOS uses echo_success() and echo_failure() to print out status of
command. These in turn uses "echo -n \0xx[etc]" to move cursor and choose
colour etc. This doesn't work with the modified IFS variable we need to
use in zfs-import for some reason, so work around that when we define
zfs_log_{end,failure}_msg() for RedHat and derivative distributions.
* All scripts passes ShellCheck (with one false positive in do_mount()).
Signed-off-by: Turbo Fredriksson turbo@bayour.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Chris Dunlap <cdunlap@llnl.gov>
Closes#2974Closes#2107
Commit 60e9f69 added the --with-mounthelperdir option for Gentoo
and in the process accidentally modified the default installation
location. For security reasons mount(8) expects it to only be
installed under /sbin.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3426
Commit f4af6bb783 which added support
for REQ_FAILFAST_MASK but the new autoconf test didn't use the same
preprocessor macro name as the code did.
The effect is that FAILFAST mode has not been enabled for ZoL in any
post-2.6.35 kernel.
Retire the HAVE_BIO_RW_FAILFAST interface used in pre-2.6.28 kernels.
Raise an error condition if the FAILFAST interface can't be detected.
Signed-off-by: Tim Chase <tim@onlight.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3386
Provide a Redhat specific spl-kmod.spec file which uses the old style
kmods (not kmods2) packaging. By using the provided kmodtool script
packages can be built which support weak modules. This allows for the
kernel to be updated without having to rebuild the SPL kernel modules.
Packages for RHEL/Centos/SL/TOSS which use this spec file can by built
as follows:
$ ./configure --with-spec=redhat
$ make rpms
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Provide a Redhat specific zfs-kmod.spec file which uses the old style
kmods (not kmods2) packaging. By using the provided kmodtool script
packages can be built which support weak modules. This allows for the
kernel to be updated without having to rebuild the ZFS kernel modules.
Packages for RHEL/Centos/SL/TOSS which use this spec file can by built
as follows:
$ ./configure --with-spec=redhat
$ make rpms
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Originally it was thought that custom spec files might be required
for Fedora. Happily that has turns out not to be the case. Since
this directory just contains symlinks to the generic spec files it
can be removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Originally it was thought that custom spec files might be required
for Fedora. Happily that has turns out not to be the case. Since
this directory just contains symlinks to the generic spec files it
can be removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
If kernel lock debugging is enabled, the fs_struct structure exceeds the
typical 1024 byte limit of CONFIG_FRAME_WARN and isn't enabled when it
otherwise should be.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#440
Explicitly disable the unused by variable warnings by setting
__attribute__((unused)) for bdi_setup_and_register(). This is
required because the function is defined with the __must_check
attribute.
Signed-off-by: Bill McGonigle <bill-github.com-public1@bfccomputing.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3141
The 'capabilities' argument which was passed to bdi_setup_and_register()
has been removed. File systems should no longer pass BDI_CAP_MAP_COPY.
For our purposes this means there are now three different interfaces
which must be handled. A zpl_bdi_setup_and_register() wrapper function
has been introduced to provide a single interface to the ZPL code.
* 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported.
* 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments.
* 4.0 - x.y, bdi_setup_and_register() takes 2 arguments.
I've also taken this opportunity to remove HAVE_BDI because kernels
older then 2.6.32 are no longer supported. All kernels newer than
this will have one of the above interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#3128
To minimize the size of a kmutex_t a MUTEX_OWNER check was added.
It allowed the kmutex_t wrapper to leverage the mutex owner which was
already stored in the mutex for certain kernel configurations.
The upside to this was that it reduced the size of the kmutex_t wrapper
structure by the size of a task_struct pointer (4/8 bytes). The
downside was that two mutex implementations needed to be maintained.
Depending on your exact kernel configuration the correct one would
be selected.
Over the years this solution worked but it could be fragile since it
depending heavily on assumed kernel mutex implementation details. For
example the SPL_AC_MUTEX_OWNER_TASK_STRUCT configure check needed to
be added when the kernel changed how the owner was stored. It also
made the code more complicated than it needed to be.
Therefore, in the name of simplicity and portability this optimization
is being retired. It will slightly increase the memory requirements
for a kmutex_t but only very slightly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
struct access f->f_dentry->d_inode was replaced by accessor function
file_inode(f)
Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3084
Using AC_LANG_SOURCE with some versions of autoconf is problematic if
the given source is to be written to a header file. Such versions assume
the contents are to be written to conftest.c and generate shell code to
that effect. The contents of the test program to detect support for
Linux tracepoints were consequently malformed (containing the source for
conftest.h) so the build system incorrectly disabled tracepoints
support. Fix this in ZFS_LINUX_TRY_COMPILE_HEADER by passing the header
source directly to ZFS_LINUX_COMPILE_IFELSE.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
When the SPL was originally written Linux tracepoints were still
in their infancy. Therefore, an entire debugging subsystem was
added to facilite tracing which served us well for many years.
Now that Linux tracepoints have matured they provide all the
functionality of the previous tracing subsystem. Rather than
maintain parallel functionality it makes sense to fully adopt
tracepoints. Therefore, this patch retires the legacy debugging
infrastructure.
See zfsonlinux/zfs@bc9f413 for the tracepoint changes.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#408
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also node the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth paramter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This file may be added by automake and therefore should be added
to config/.gitignore. For the full list of possible auxiliary
programs see the full automake documentation.
http://www.gnu.org/software/automake/manual/automake.html#Auxiliary-Programs
Signed-off-by: Marcel Wysocki <maci.stgn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This file may be added by automake and therefore should be added
to config/.gitignore. For the full list of possible auxiliary
programs see the full automake documentation.
http://www.gnu.org/software/automake/manual/automake.html#Auxiliary-Programs
Signed-off-by: Marcel Wysocki <maci.stgn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2848
Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. We could handle this by disabling systemd support on
prefix because systemd does not check these paths, but the Gentoo
Council decided that small files such as these should be installed.
That means disabling systemd support on prefix is not an acceptable
workaround. As a consequence, we need some way of control the directory
into which these files are installed.
Making this configurable increases our compliance with the
freedesktop.org specification, which allows these files to be installed
into /etc/modules-load.d:
http://www.freedesktop.org/software/systemd/man/modules-load.d.html
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. I could script a workaround inside the ebuild, but it
seemed to make more sense to make this more configurable.
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
Since we changed the default location for the kernel headers to respect
--prefix in the SPL, we must search that location to prevent user builds
from breaking.
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
The vfs_fsync() function has been available since Linux 2.6.29.
There is no longer a need to maintain this compatibility code.
However, the HAVE_2ARGS_VFS_FSYNC check was left in place
since that change occured after 2.6.32.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kern_path() function has been available since Linux 2.6.28.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kvasprintf() function has been available since Linux 2.6.22.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of Linux 2.6.32 the proc handlers where updated to expect only
five arguments. Therefore there is no longer a need to maintain
this compatibility code and this infrastructure can be simplified.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The groups_search() function was never exported by a mainline kernel
therefore we drop this compatibility code and always provide our own
implementation.
Additionally, the cred_t structure has been available since 2.6.29
so there is no longer a need to maintain compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Just for consistency with the other autoconf checks a small comment
block was added before these checks.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This function has never been exported by any mainline and was only
briefly available under RHEL5. Therefore this check is being removed
and the code update to always use the wrapper function.
The next step will be to eliminate all this code. If ZFS were updated
not to assume that it's pwd was / there would be no need for this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The user_path_dir() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
After the removable of get_vmalloc_info(), the unused global memory
variables, and the optional dcache/icache shrinkers there is no
longer a need for the kallsyms compatibility code. This allows
us to eliminate another brittle area of the code by removing the
kernel upcall this functionality depended on for older kernels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is optional functionality which may or may not be useful to
ZFS when using older kernels. It is never a hard requirement.
Therefore this functionality is being removed from the SPL and
a simpler slimmed down version will be added to ZFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Platforms such as Illumos and FreeBSD have historically provided
global variables which summerize the memory state of a system.
Linux on the otherhand doesn't expose any of this information
to kernel modules and uses entirely different mechanisms for
memory management.
In order to simplify the original ZFS port to Linux these global
variables were emulated by the SPL for the benefit of ZFS. As ZoL
has matured over the years it has moved steadily away from these
interfaces and now no longer depends on them at all.
Therefore, this patch completely removes the global variables
availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree,
and swapfs_reserve. This greatly simplifies the memory management
code and eliminates a common area of confusion.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The get_vmalloc_info() function was used to back the vmem_size()
function. This was always problematic and resulted in brittle
code because the kernel never provided a clean interface for
modules.
However, it turns out that the only caller of this function in
ZFS uses it to determine the total virtual address space size.
This can be determined easily without get_vmalloc_info() so
vmem_size() has been updated to take this approach which allows
us to shed the get_vmalloc_info() dependency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The on_each_cpu() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The mutex_lock_nested() function has been available since Linux 2.6.18.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The inode structure has used i_mutex as its internal locking
primitive since 2.6.16. The compatibility code to check for
the previous semaphore primitive has been removed. However,
the wrapper function itself is being kept because it's entirely
possible this primitive will change again to allow finer grained
locking.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kmalloc_node() function has been available since Linux 2.6.12.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The uaccess header has been available in the same location since
Linux 2.6.18. There is no longer a need to maintain this
compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The uintptr_t typedef has been available since Linux 2.6.24.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The atomic64_xchg() and atomic64_cmpxchg() functions have been
available since Linux 2.6.24. There is no longer a need to
maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many of the time functions had grown overly complex in order to
handle kernel compatibility issues. However, as of Linux 2.6.26
all the required functionality is available. This allows us to
retire numerous configure checks and greatly simplify the time
compatibility wrappers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The fls64() function has been available since Linux 2.6.16 and
it should be used to implemented highbit64(). This allows us
to provide an optimized implementation and simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Support for the CTL_UNNUMBERED sysctl interface was removed in
Linux 2.6.19. There is no longer any reason to maintain this
compatibility code. There also issue any reason to keep around
the CTL_NAME macro and helpers so they have been retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The register_sysctl() interface has been stable since Linux 2.6.21.
There is no longer a need to maintain compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There is no longer a need to wrap this because utsname() is provided
by the kernel and can be called directly. This will require a small
change in the ZFS code because utsname is expected to be a global
structure and not a function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the SPL was originally written it was designed to use the
device_create() and device_destroy() functions. Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.
As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions. This interface
for registering character devices has remained stable, is simple,
and provides everything we need.
Therefore the code has been reworked to use this interface. The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
The HAVE_IOCTL_* configure checks were originally added for
compatibility with an ancient version of glibc. This support
and additional complexity is no longer needed and is therefore
being removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#585
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#370
Linux kernel 3.17 removes the action function argument from
wait_on_bit(). Add autoconf test and compatibility macro to support
the new interface.
The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule(). This API change
was made to consolidate all of those redundant wait functions.
References: torvalds/linux@7431620
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#378
There are two common locations where udev and dracut components are
commonly installed. When building packages using the 'make rpm|deb'
targets check those common locations and pass them to rpmbuild. For
non-standard configurations these values can be provided by the
the following configure options:
--with-udevdir=DIR install udev helpers [default=check]
--with-udevruledir=DIR install udev rules [[UDEVDIR/rules.d]]
--with-dracutdir=DIR install dracut helpers [default=check]
When rebuilding using the source packages the per-distribution
default values specified in the spec file will be used. This is
the preferred way to build packages for a distribution but the
ability to override the defaults is provided as a convenience.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2310Closes#1680
Set LANG=C before calling 'rpmbuild' to avoid rpmbuild failing on
the translated date string in the changelog.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: zfsonlinux/spl#306
Set LANG=C before calling 'rpmbuild' to avoid rpmbuild failing on
the translated date string in the changelog.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#306
This adds ability to set the location of the kernel via defines
when building from the spec files. This is useful when building
against a kernel installed in a non-standard location.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1874
From day one the various ZFS libraries should have been placed in their
own sub-packages. Primarily this allows for multiple major versions of
the libraries to be concurrently installed. It also facilitates a
smaller build environment by minimizing the required dependencies.
The specific changes required to split the libraries from the utilities
are as follows:
* libzpool2, libnvpair1, libuutil1, and libzfs2 packages were added
and contain the versioned shared libraries. The Fedora packaging
guidelines discourage providing static libraries so they are not
included in the packages.
http://fedoraproject.org/wiki/Packaging:Guidelines#Packaging_Static_Libraries
* The zfs-devel package was renamed libzfs2-devel and the new package
obsoletes the old zfs-devel package. This package includes all
the required headers for the libzpool2, libnvpair1, libuutil1, and
libzfs2 libraries and their respective unversioned shared libraries.
This package should eventually be split in to individual lib*-devel
packages but it will still take some work to cleanly separate them.
Therefore the libzfs2-devel package provides the expected lib*-devel
packages so the all proper dependencies can still be created.
http://fedoraproject.org/wiki/Packaging:Guidelines#Devel_Packages
* Moved '/sbin/ldconfig' execution from the zfs packge to each of the
new library packages as described by the packaging guidelines.
http://fedoraproject.org/wiki/Packaging:Guidelines#Shared_Libraries
* The /usr/share/doc/ files were moved in to the libzfs2-devel package.
* Updated config/deb.am to be aware of the packaging changes. This
ensures that 'deb-utils' make target converts all the resulting
packages generated by the 'rpm-utils' target.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2329Closes: #2341
Issue: #2145
When creating packages in a git repository the release number
can be automatically set by 'git describe'. This normally works
well but if your repository has newer tags which match the form
NAME-VERSION* the release may be incorrectly calculated. To
prevent this the match patten has been restricted to the contents
of the META file, NAME-VERSION.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating packages in a git repository the release number
can be automatically set by 'git describe'. This normally works
well but if your repository has newer tags which match the form
NAME-VERSION* the release may be incorrectly calculated. To
prevent this the match patten has been restricted to NAME-VERSION.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL. These include:
1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.
Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator. This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on. However, there are some things we need to be careful of:
1) The Linux slab allocator was never designed to work well with
large objects. Because the SPL slab must still handle this use
case a cut off limit was added to transition from Linux slab
backed objects to kmem or vmem backed slabs.
spl_kmem_cache_slab_limit - Objects less than or equal to this
size in bytes will be backed by the Linux slab. By default
this value is zero which disables the Linux slab functionality.
Reasonable values for this cut off limit are in the range of
4096-16386 bytes.
spl_kmem_cache_kmem_limit - Objects less than or equal to this
size in bytes will be backed by a kmem slab. Objects over this
size will be vmem backed instead. This value defaults to
1/8 a page, or 512 bytes on an x86_64 architecture.
2) Be aware that using the Linux slab may inadvertently introduce
new deadlocks. Care has been taken previously to ensure that
all allocations which occur in the write path use GFP_NOIO.
However, there may be internal allocations performed in the
Linux slab which do not honor these flags. If this is the case
a deadlock may occur.
The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us. And ideally
need to move completely away from using the SPLs slab for large
memory allocations. This patch is a first step.
NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
backed caches but it is not supported for kmem/vmem backed caches.
2) Regardless of the spl_kmem_cache_*_limit settings a cache may
be explicitly set to a given type by passed the KMC_KMEM,
KMC_VMEM, or KMC_SLAB flags during cache creation.
3) The constructors, destructors, and reclaim callbacks are all
functional and will be called regardless of the cache type.
4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
the issues involved in presenting correct object accounting.
Instead they will appear in /proc/slabinfo under the same names.
5) Several kmem SPLAT tests needed to be fixed because they relied
incorrectly on internal kmem slab accounting. With the updated
test cases all the SPLAT tests pass as expected.
6) An autoconf test was added to ensure that the __GFP_COMP flag
was correctly added to the default flags used when allocating
a slab. This is required to ensure all pages in higher order
slabs are properly refcounted, see ae16ed9.
7) When using the SLUB allocator there is no need to attempt to
set the __GFP_COMP flag. This has been the default behavior
for the SLUB since Linux 2.6.25.
8) When using the SLUB it may be desirable to set the slub_nomerge
kernel parameter to prevent caches from being merged.
Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#356
Detect the updated vfs_rename() interface and call it with an
extra flags argument.
References:
torvalds/linux@520c8b1
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
We need inode_owner_or_capable() for ZFS file attributes in addition to
xattrs, so it should go into its own file. This moves it into its own
file and changes it to be more comprehensive. It will now fail if no
known good API is detected.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1691
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
zed supports a '-M' cmdline opt to lock all pages in memory via
mlockall(). The _POSIX_MEMLOCK define is checked to determine whether
this function is supported. The current test assumes mlockall()
is supported if _POSIX_MEMLOCK is non-zero. However, this test is
insufficient according to mlock(2) and sysconf(3). If _POSIX_MEMLOCK
is -1, mlockall() is not supported; but if _POSIX_MEMLOCK is 0,
availability must be checked at runtime.
This commit adds an autoconf check for mlockall() to user.m4. The zed
code block for mlockall() is now guarded with a test for HAVE_MLOCKALL.
If defined, mlockall() will be called and its runtime availability
checked via its return value.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
Add macro definitions to AM_CPPFLAGS to propagate makefile installation
directory variables for libexecdir, runstatedir, sbindir, and
sysconfdir.
https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Installation-Directory-Variables.html
A corollary is that you should not use these variables except
in makefiles. For instance, instead of trying to evaluate
datadir in configure and hard-coding it in makefiles using e.g.,
'AC_DEFINE_UNQUOTED([DATADIR], ["$datadir"], [Data directory.])',
you should add -DDATADIR='$(datadir)' to your makefile's definition
of CPPFLAGS (AM_CPPFLAGS if you are also using Automake).
The runstatedir directory is for "installing data files which the
programs modify while they run, that pertain to one specific machine,
and which need not persist longer than the execution of the program".
https://www.gnu.org/prep/standards/html_node/Directory-Variables.html
It will be defined by autoconf 2.70 or later, and default to
"$(localstatedir)/run".
http://git.savannah.gnu.org/gitweb/?p=autoconf.git;a=commit;h=a197431414088a417b407b9b20583b2e8f7363bd
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
torvalds/linux@8077c0d983 added a
__must_check to the bdi_setup_and_register(), which caused our autotools
check to break. zfsonlinux/zfs@729210564a
was intended to correct that, but it depended on -Wno-unused-result,
which is unrecognized in older GCC versions. That commit has been
reverted in favor of a solution that does not require
-Wno-unused-result.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2102Closes#2135
Older GCC versions do not obey -Wno-unused-result. This reverts commit
729210564a in favor of a solution that
does not require -Wno-unused-result.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1906
When testing compiler flags, we only need to do compile test. Otherwise,
configure will fail with "configure: error: cannot run test program while
cross compiling" when cross compiling.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2191
This adds systemd unit files replacing the functionality offered by
the SysV init script found in etc/init.d.
It has been developed and tested on Fedora 19, Fedora 20
and openSuSE 13.1.
Four unit files and one target are offered.
zfs-import-cache.service:
Import pools from /etc/zfs/zpool.cache. This unit will wait for
udev to settle.
zfs-import-scan.service:
Import pools by scanning /dev/disk/by-id for zvols. This unit will
only run if /etc/zfs/zpool.cache is not present. This unit will wait
for udev to settle
zfs-mount.service:
Mount ZFS native filesystems. It contains a dependency to be loaded
before local-fs.target.
zfs-share.service:
Share NFS/SMB filesystems. This unit contains a dependency that
will cause it to be restarted whenever the smb or nfs-server unit
is restarted, restoring the shares added.
zfs.target:
This target pulls in the other units in order to start ZFS. It's
the only unit that can be enabled/disabled, all other services
are static and pulled in by dependencies. It will honour zfs=off
and zfs=no options on the kernel command line.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2108
GCC >+ 4.8's aggressive loop optimization breaks some of the iterators
over the dn_blkptr[] pseudo-array in dnode_phys. Since dn_blkptr[] is
defined as a single-element array, GCC believes an iterator can only
access index 0 and will unroll the loop into a single iteration.
One way to resolve the issue would be to cast the array to a pointer
and fix all the iterators that might break. The only loop where it
is known to cause a problem is this loop in dmu_objset_write_ready():
for (i = 0; i < dnp->dn_nblkptr; i++)
bp->blk_fill += dnp->dn_blkptr[i].blk_fill;
In the common case where dn_nblkptr is 3, the loop is only executed a
single time and "i" is equal to 1 following the loop.
The specific breakage caused by this problem is that the blk_fill of
root block pointers wouldn't be set properly when more than one blkptr
is in use (when no indrect blocks are needed).
The simple reproducing sequence is:
zpool create tank /tank.img
zdb -ddddd tank 0
Notice that "fill=31", however, there are two L0 indirect blocks with
"F=31" and "F=5". The fill count should be 36 rather than 31. This
problem causes an assert to be hit in a simple "zdb tank" when built
with --enable-debug.
However, this approach was not taken because we need to be absolutely
sure we catch all instances of this unwanted optimization. Therefore,
the build system has been updated to detect if GCC supports the
aggressive loop optimization. If it does the optimization will be
explicitly disabled using the -fno-aggressive-loop-optimization option.
Original-fix-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2010Closes#2051
Four new dataset properties have been added to support SELinux. They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8). When one of
these properties is set to something other than 'none'. That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.
For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:
$ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media
This will ensure the filesystem is automatically mounted with that
rootcontext. It is equivalent to manually specifying the rootcontext
with the -o option like this:
$ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media
By default all four contexts are set to 'none'. Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#1504
This broke compilation against Linux 3.13 and GCC 4.7.3.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1906
This check was originally added for SLES10, a093c6a, to check for
a 'struct vfsmount *' argument which they added. However, since
SLES10 is based on a 2.6.16 kernel which is no longer supported
this functionality was dropped. The checks were refactored to
support Linux 3.13 without concern for historical versions.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#312
There's a missing semicolon and equals sign in the first hunk of this
commit in config/kernel-bdi.m4. This results in the test always
failing. The effects were noticed when rrdtool, a tool which modifies
files by mmap() and msync(), would have data never get saved to disk
in spite of the files working while the mounted filesystem remains
mounted.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#1889
torvalds/linux@24f7c6 introduced a new shrinker API while
torvalds/linux@a0b021 dropped support for the old shrinker API.
This patch adds support for the new shrinker API by wrapping
the old one with the new one.
This change also reorganizes the autotools checks on the shrinker
API such that the configure script will fail early if an unknown
API is encountered in the future.
Support for the set_shrinker() API which was used by Linux 2.6.22
and older has been dropped. As a general rule compatibility is
only maintained back to Linux 2.6.26.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1732
Closes zfsonlinux/zfs#1822
Closes#293Closes#307
Needed for Illumos #3582. This interface is supposed to support
a variable-resolution timeout with nanosecond granularity. This
implementation rounds up to microsecond resolution, as nanosecond-
precision timing is rarely needed for real-world performance
tuning and may incur unnecessary busy-waiting. usleep_range() is
used if available, otherwise udelay() or msleep() are used
depending on the length of the delay interval.
Add flags from sys/callo.h as these are used to control the behavior of
cv_timedwait_hires(). Specifically,
CALLOUT_FLAG_ABSOLUTE
Normally, the expiration passed to the timeout API functions is
an expiration interval. If this flag is specified, then it is
interpreted as the expiration time itself.
CALLOUT_FLAG_ROUNDUP
Roundup the expiration time to the next resolution boundary. If this
flag is not specified, the expiration time is rounded down.
References:
https://www.illumos.org/issues/3582illumos/illumos-gate@0689f76
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#304
This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems. Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set. The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.
By default Posix ACLs are disabled but they may be enabled with
the new 'aclmode=noacl|posixacl' property. Set the property to
'posixacl' to enable them. If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.
This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).
http://www.tuxera.com/community/posix-test-suite/
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#170
libblkid support is dormant because the autotools check is broken and
liblkid identifies ZFS vdevs as "zfs_member", not "zfs". We fix that
with a few changes:
First, we fix the libblkid autotools check to do a few things:
1. Make a 64MB file, which is the minimum size ZFS permits.
2. Make 4 fake uberblock entries to make libblkid's check succeed.
3. Return 0 upon success to make autotools use the success case.
4. Include stdlib.h to avoid implicit declration of free().
5. Check for "zfs_member", not "zfs"
6. Make --with-blkid disable autotools check (avoids Gentoo sandbox violation)
7. Pass '-lblkid' correctly using LIBS not LDFLAGS.
Second, we change the libblkid support to scan for "zfs_member", not
"zfs".
This makes --with-blkid work on Gentoo.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
c38367c73f was meant to eliminate runtime
function pointer modifications in autotools checks because they were
prone to false negatives on kernels hardened by the PaX project.
Unfortunately, I missed the xattr_handler and super_block->s_bdi
autotools checks. Recent changes to PaX constified
xattr_handler->get/set, which lead me to discover this oversight.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1433
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.
To handle this the code was reworked to use the new ->iterate
interface. Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.
Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions. Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.
Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function. The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names. Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1653
Issue #1591
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.
We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.
Original-patch-by: DHE
Rewrite-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#260
The iterate_supers_type() function which was introduced in the
3.0 kernel was supposed to provide a safe way to call an arbitrary
function on all super blocks of a specific type. Unfortunately,
because a list_head was used a bug was introduced which made it
possible for iterate_supers_type() to get stuck spinning on a
super block which was just deactivated.
This can occur because when the list head is removed from the
fs_supers list it is reinitialized to point to itself. If the
iterate_supers_type() function happened to be processing the
removed list_head it will get stuck spinning on that list_head.
The bug was fixed in the 3.3 kernel by converting the list_head
to an hlist_node. However, to resolve the issue for existing
3.0 - 3.2 kernels we detect when a list_head is used. Then to
prevent the spinning from occurring the .next pointer is set to
the fs_supers list_head which ensures the iterate_supers_type()
function will always terminate.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1045Closes#861Closes#790
Linux kernel commit torvalds/linux@db2a144 changed the return type
of block_device_operations->release() to void. Detect the expected
prototype and defined our callout accordingly.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1494
Linux kernel commit torvalds/linux@d9dda78b renamed PDE() to
PDE_DATA(). To handle this detect the prefered interface
and define a PDE_DATA() wrapper for consistency.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
Linux kernel commmit torvalds/linux@db3808c1 moved the
vmalloc_info structure from a private to a public header.
Now that it's available for kernel modules use it.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.
Tested with xfstests 285 and 286. All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.
Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.
An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1384
The previous code was only waiting for the symver file. But the
postinst target of the DKMS script for SPL will not only create
the symvers file, but also the header spl_config.h.
If we are waiting in the configure script of ZFS for the SPL
symvers file, then we also need to wait for spl_config.h.
Otherwise the configure script will abort because the spl_config.h
is not yet available.
On top of that, the function ZFS_AC_SPL_MODULE_SYMVERS is moved
to the end of the function ZFS_AC_SPL to allow both checks share
the with-spl-timeout parameter.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1431
When the kmod packaging was introduced the ability to pass the
--enable-debug and --enable-dmu-tx options from configure all
the way through to `make rpm|deb` was accidenally lost. Update
ZFS_AC_RPM to explicitlu set RPM_DEFINE_COMMON with these
rpmbuild defines.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1402
Preserve the release field when creating Debian packages. The
--keep-version option was not used because it results in a failure
when the git '<commit>_<hash>' syntax is used for the release.
The '_' is a valid character for RPM packages but not for DEBs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #1402
Issue #928
When building a custom release in a git tree provide the ability
to prevent the release field from being overwritten by the
`git describe` output.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1402
Preserve the release field when creating Debian packages. The
--keep-version option was not used because it results in a failure
when the git '<commit>_<hash>' syntax is used for the release.
The '_' is a valid character for RPM packages but not for DEBs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue zfsonlinux/zfs#1402
Issue zfsonlinux/zfs#928
When building a custom release in a git tree provide the ability
to prevent the release field from being overwritten by the
`git describe` output.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1402
The only remaining perl dependency is part of the ZFS_AC_META macro.
By eliminating this and replacing it with awk we can avoid the need
to pull in perl to rebuild the packages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1380
The only remaining perl dependency is part of the SPL_AC_META macro.
By eliminating this and replacing it with awk we can avoid the need
to pull in perl to rebuild the packages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1380
Rationale see section 3.5 "Using `autoreconf' to Update `configure'
Scripts" of the autoconf manual.
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
-D and -I are preprocessor flags, so should preferably be in the
appropriate variable.
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
-D and -I are preprocessor flags, so should preferably be in the
appropriate variable.
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When building from an arbitrary commit in the git tree it's useful
for the resulting packages to be uniquely identifiable. Therefore,
the build system has been updated to detect if your compiling in
git tree.
If you are building in a git tree, and there are commits after the
last annotated tag. Then the <id>-<hash> component of 'git describe'
will be used to overwrite the 'Release:' field in the META file.
The only tricky part is that to ensure the 'make dist' tarball is
built using the correct release. A dist-hook was added to the top
level make file to rewrite the META file using the correct release.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#195
Issue #111
When building from an arbitrary commit in the git tree it's useful
for the resulting packages to be uniquely identifiable. Therefore,
the build system has been updated to detect if your compiling in
git tree.
If you are building in a git tree, and there are commits after the
last annotated tag. Then the <id>-<hash> component of 'git describe'
will be used to overwrite the 'Release:' field in the META file.
The only tricky part is that to ensure the 'make dist' tarball is
built using the correct release. A dist-hook was added to the top
level make file to rewrite the META file using the correct release.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Refresh the existing RPM packaging to conform to the 'Fedora
Packaging Guidelines'. This includes adopting the kmods2
packaging standard which is used fod kmods distributed by
rpmfusion for Fedora/RHEL.
http://fedoraproject.org/wiki/Packaging:Guidelineshttp://rpmfusion.org/Packaging/KernelModules/Kmods2
While the spec files have been entirely rewritten from a
user perspective the only major changes are:
* The Fedora packages now have a build dependency on the
rpmfusion repositories. The generic kmod packages also
have a new dependency on kmodtool-1.22 but it is bundled
with the source rpm so no additional packages are needed.
* The kernel binary module packages have been renamed from
zfs-modules-* to kmod-zfs-* as specificed by kmods2.
* The is now a common kmod-zfs-devel-* package in addition
to the per-kernel devel packages. The common package
contains the development headers while the per-kernel
package contains kernel specific build products.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1341
Refresh the existing RPM packaging to conform to the 'Fedora
Packaging Guidelines'. This includes adopting the kmods2
packaging standard which is used fod kmods distributed by
rpmfusion for Fedora/RHEL.
http://fedoraproject.org/wiki/Packaging:Guidelineshttp://rpmfusion.org/Packaging/KernelModules/Kmods2
While the spec files have been entirely rewritten from a
user perspective the only major changes are:
* The Fedora packages now have a build dependency on the
rpmfusion repositories. The generic kmod packages also
have a new dependency on kmodtool-1.22 but it is bundled
with the source rpm so no additional packages are needed.
* The kernel binary module packages have been renamed from
spl-modules-* to kmod-spl-* as specificed by kmods2.
* The is now a common kmod-spl-devel-* package in addition
to the per-kernel devel packages. The common package
contains the development headers while the per-kernel
package contains kernel specific build products.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#222
This was a suggestion that Brian Behlendorf made when reviewing an early
pull request for Linux 3.9 support. This commit was made intentionally
easy to revert should we ever have a reason to reintroduce support for
older kernels.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
torvalds/linux@dcf787f391 enforces
const-correctness in passing struct path *.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The function prototype of vfs_getattr previoulsy took struct vfsmount *
and struct dentry * as arguments. These would always be defined together
in a struct path *.
torvalds/linux@3dadecce20 modified
vfs_getattr to take struct path * is taken as an argument instead.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Linux 3.9 reorganized sched.h, splitting it into numerous files.
torvalds/linux@8bd75c77b7 moved MAX_PRIO
and MAX_RT_PRIO to linux/sched/rt.h.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Because the install location for the spl/zfs-devel headers was
changed we need to refresh the auto-detect code. Note that
for packaging which already explicitly calls --with-spl{-obj}
nothing has changed.
The updated code is now structured like that in ZFS_AC_KERNEL
and should be cleaner and easier to maintain. In addition,
it's stricter about detecting a valid source and object
directory. It requires:
* The source directory contains the file 'spl.release'
* The object directory contains the file 'spl_config.h'
* The following paths will be checked. Notice the /var/lib/
and /usr/src paths require that the spl and zfs version be
matched. This is done to prevent accidentally mixing releases.
dnl # 1) /var/lib/dkms/spl/<version>/build
dnl # 2) /usr/src/spl-<version>/<kernel-version>
dnl # 3) /usr/src/spl-<version>
dnl # 4) ../spl
dnl # 5) /usr/src/kernels/<kernel-version>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kernel modules are now available in the Arch User Repository
(AUR) via zfs. Since their packaging is maintained and superior
to ours it is being removed from the tree.
https://wiki.archlinux.org/index.php/ZFS
Now that various distributions are picking up the packages we
should eventually be able to remove most of this infrastructure.
Packaging belongs with the distributions not upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The standard dracut directory has moved from /usr/share/dracut to
/usr/lib/dracut. To ensure the dracut modules get installed in
the correct location provide a --with-dracutdir configure option
to set the path.
The default install location has been updated to /usr/lib/dracut
which is used by more current versions of Fedora. However, this
default is overriden by the RPM packaging for consistency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kernel modules are now available in the Arch User Repository
(AUR) via zfs. Since their packaging is maintained and superior
to ours it is being removed from the tree.
https://wiki.archlinux.org/index.php/ZFS
Now that various distributions are picking up the packages we
should eventually be able to remove most of this infrastructure.
Packaging belongs with the distributions not upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
PaX/GrSecurity patched kernels implement a dialect of C that relies on a
GCC plugin for enforcement. A basic idea in this dialect is that
function pointers in structures should not change during runtime.
This causes code that modifies function pointers at runtime to fail to
compile in many instances. The autotools checks rely on whether or
not small test cases compile against a given kernel. Some
autotools checks assume some default case if other cases fail. When one
of these autotools checks tests a PaX/GrSecurity patched kernel by
modifying a function pointer at runtime, the default case will be used.
Early detection of such situations is possible by relying on compiler
warnings, which are compiler errors when --enable-debug is used.
Unfortunately, very few people build ZFS with --enable-debug. The more
common situation is that these issues manifest themselves as runtime
failures in the form of NULL pointer exceptions.
Previous patches that addressed such issues with PaX/GrSecurity
compatibility largely relied on rewriting autotools checks to avoid
runtime function pointer modification or the addition of PaX/GrSecurity
specific checks. This patch takes the previous work to its logical
conclusion by eliminating the use of runtime function pointer
modification. This permits the removal of PaX-specific autotools checks
in favor of ones that work across all supported kernels.
This should resolve issues that were reported to occur with
PaX/GrSecurity-patched Linux 3.7.5 kernels on Gentoo Linux.
https://bugs.gentoo.org/show_bug.cgi?id=457176
We should be able to prevent future regressions in PaX/GrSecurity
compatibility by ensuring that all changes to ZFSOnLinux avoid runtime
function pointer modification. At the same time, this does not solve the
issue of silent failures triggering default cases in the autotools
check, which is what permitted these regressions to become runtime
failures in the first place. This will need to be addressed in a future
patch.
Reported-by: Marcin Mirosław <bug@mejor.pl>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1300
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.
Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.
As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.
This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.
Thanks to Richard Kojedzinszky for catching this nasty bug.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1318
As a matter of fact, we're already using -Werror for most tests because
of a bug in kernel-bio-empty-barrier.m4 which sets -Werror without
reverting it afterwards. This meant that all tests which ran after this
one was using -Werror.
This patch simply makes it clear that we're using -Werror and makes
the code more readable and more predictable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1317
The HAVE_MUTEX_OWNER_TASK_STRUCT fails on PPC64 with the following
error:
error: 'current' undeclared (first use in this function)
We include linux/sched.h to ensure that current is available.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The SPL_AC_ATOMIC_SPINLOCK, SPL_AC_TYPE_ATOMIC64_CMPXCHG, and
SPL_AC_TYPE_ATOMIC64_XCHG were all directly including the
'asm/atomic.h' header. As of Linux 3.4 this header was removed
which results in a build failure.
The right thing to do is include 'linux/atomic.h' however we
can't safely do this because it doesn't exist in 2.6.26 kernels.
Therefore, we include 'linux/fs.h' which in turn includes the
correct atomic header regardless of the kernel version.
When these incorrect APIs are used in ZFS the following build
failure results.
arc.c:791:80: warning: '__ret' may be used uninitialized
in this function [-Wuninitialized]
arc.c:791:1875: error: call to '__cmpxchg_wrong_size'
declared with attribute error: Bad argument size for cmpxchg
Since this is all Linux 2.6.24 compatibility code there's
an argument to be made that it should be removed because
kernels this old are not supported. However, because we're
so close to a release I'm going to leave it in place for now.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#814Closeszfsonlinux/zfs#1254
Check at ./configure time that the kernel was built with kallsyms
support. If the kernel doesn't have CONFIG_KALLSYMS defined the
modules will still compile cleanly but will not be loadable. So
we really want to catch this early during ./configure. Note that
we do not require CONFIG_KALLSYMS_ALL but it may be safely defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6
Commit 4b2f65b253 increased the user
space stack by 4x to resolve certain stack overflows. As such it
no longer makes sense to worry about a single extra page which
might or might not be part of the process stack. There is now
ample headroom for normal usage.
By eliminating this configure check we are also resolving the
following segfault which intentionally occurs at configure time
and may be logged in dmesg.
conftest[22156]: segfault at 7fbf18a47e48 ip 00000000004007fe
sp 00007fbf18a4be50 error 6 in conftest[400000+1000]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
It's doubtful many people were impacted by this but commit 6c28567
accidentally broke ZFS builds for 2.6.26 and earlier kernels. This
commit depends on the lookup_bdev() function which exists in 2.6.26
but wasn't exported until 2.6.27.
The availability of the function isn't critical so a wrapper is
introduced which returns ERR_PTR(-ENOTSUP) when the function isn't
defined. This will have the effect of causing zvol_is_zvol() to
always fail for 2.6.26 kernels. This in turn means vdevs will
always get opened concurrently which is good for normal usage.
This will only become an issue if your using a zvol as a vdev in
another pool. In which case you really should be using a newer
kernel anyway.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1205
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1223
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
Lookups in the snapshot control directory for an existing snapshot
fail with ENOENT if an earlier lookup failed before the snapshot was
created. This is because the earlier lookup causes a negative dentry
to be cached which is never invalidated.
The bug can be reproduced as follows (the second ls should succeed):
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
$ zfs snap tank@s
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
To remedy this, always invalidate cached dentries in the snapshot
control directory. Since these entries never exist on disk there is
no significant performance penalty for the extra lookups.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1192
Certain versions of gcc generate an 'unrecognized command
line option' error message when -Wunused-but-set-variable
is used unconditionally. This in turn can cause several
of the autoconf tests to misdetect an interface.
Now, the use of -Wunused-but-set-variable in the autoconf
tests was introduced by commit b9c59ec8 to address a gcc
4.6 compatibility problem. So we really only need to pass
this option for version of gcc which are known to support it.
Therefore, the tests have been updated to use the result of
the existing ZFS_AC_CONFIG_ALWAYS_NO_UNUSED_BUT_SET_VARIABLE
which determines if gcc supports this option.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1004
Check at ./configure time that the kernel was built with zlib
support enabled. This support may either be configured as a
module or builtin to the kernel. But if it's missing the build
will fail so it's best to catch this early.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#582
Some kernels require that we include the 'linux/irqflags.h'
header for the SPL_AC_3ARGS_ON_EACH_CPU check. Otherwise,
the functions local_irq_enable()/local_irq_disable() will not
be defined and the prototype will be misdetected as the four
argument version.
This change actually include 'linux/interrupt.h' which in turn
includes 'linux/irqflags.h' to be as generic as possible.
Additionally, passing NULL as the function can result in a
gcc error because the on_each_cpu() macro executes it
unconditionally. To make the test more robust we pass the
dummy function on_each_cpu_func().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#204
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure. Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().
Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.
To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed. That change
accidentally resulted in unexpected ctime updates which upset tools
like git. That was always a horrible hack and I'm happy it will
never make it in to a tagged release.
The right fix is something I initially resisted doing because I
was worried about the additional overhead. However, in hindsight
the overhead isn't as bad as I feared.
This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied. We leverage this
callback to keep the znode SAs strictly in sync with the inode.
However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime. This will cover the callpath of most concern to us.
->filemap_page_mkwrite->file_update_time->update_time->
mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764Closes#1140
All consumers of the kernel delayed work queues have been shifted
over to rely on the taskq implementation. This compatibility code
can now be removed. Any new callers which need this functionality
should use the taskq interfaces for delayed work items.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Previously this check was only performed when ./configure was
attempting to autodetect your kernel source directory. But we
should also handle the case where --with-linux was provided
and is obviously wrong. This way we catch the error before
invoking make and compiling the source with an incorrect
autoconf results.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/spl#162
Previously this check was only performed when ./configure was
attempting to autodetect your kernel source directory. But we
should also handle the case where --with-linux was provided
and is obviously wrong. This way we catch the error before
invoking make and compiling the source with an incorrect
autoconf results.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#162
Use the bdev_physical_block_size() interface to determine the
minimize write size which can be issued without incurring a
read-modify-write operation. This is used to set the ashift
correctly to prevent a performance penalty when using AF hard
disks.
Unfortunately, this interface isn't entirely reliable because
it's not uncommon for disks to misreport this value. For this
reason you may still need to manually set your ashift with:
zpool create -o ashift=12 ...
The solution to this in the upstream Illumos source was to add
a white list of known offending drives. Maintaining such a list
will be a burden, but it still may be worth doing if we can
detect a large number of these drives. This should be considered
as future work.
Reported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#916
Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec()
function out of include/linux/fdtable.h and in to fs/file.c
making it unavailable to the SPL.
Now as it turns out we only used this function to tear down
some test infrastructure for the vn_getf()/vn_releasef() SPLAT
regression tests. Rather than implement even more autoconf
compatibilty code to handle this we just remove the test case.
This also allows us to drop three existing autoconf tests.
This does mean the SPLAT tests will no longer verify these
functions but historically they have never been a problem.
And if we feel we absolutely need this test coverage I'm
sure a more portable version of the test case could be added.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#183
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.
This is good news for the vn implementation because it removes the
need for us to handle the locking. However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.
Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels. This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.
Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen util we either abondon the zpool.cache file
or implement alternate infrastructure to update is correctly in
user space.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#154
Use .mkdir instead of .create in 3.3 compatibility check. Linux 3.6
modifies inode_operations->create's function prototype. This causes
an autotools Linux 3.3. compatibility check for a function prototype
change in create, mkdir and mknode to fail. Since mkdir and mknode
are unchanged, we modify the check to examine it instead.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create. Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup. Instead
only the inamedata->flags are passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.
ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.
This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.
This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.
On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().
This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#168
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.
Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#906
In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook. All it takes there
is to make sure changes are committed to ZIL. Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#969
This reverts commit 395350c85d which
accidentally introduced issue #955.
Pools using AF drives which were originally created with a sector
size of 512 bytes will now be correctly detected to have physical
sector size of 4096. This is desirable for a new pool, however for
an existing pool abruptly changing the sector size causes problems.
For this reason, this change is being reverted until the additional
logic can be added to detect the existing pool case. Existing
pools must use the ashift size stored in the label regardless of
what the disk reports. This is critical for compatibility.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #955
Use the bdev_physical_block_size() interface to determine the
minimize write size which can be issued without incurring a
read-modify-write operation. This is used to set the ashift
correctly to prevent a performance penalty when using AF hard
disks.
Unfortunately, this interface isn't entirely reliable because
it's not uncommon for disks to misreport this value. For this
reason you may still need to manually set your ashift with:
zpool create -o ashift=12 ...
The solution to this in the upstream Illumos source was to add
a while list of known offending drives. Maintaining such a list
will be a burden, but it still may be worth doing if we can
detect a large number of these drives. This should be considered
as future work.
Reported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#916
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address. Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since removing the check for CONFIG_PREEMPT, there are no consumers of
the SPL_LINUX_CONFIG macro. As such, there is no reason to keep it
around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#164
The autoconf macro which failed if CONFIG_PREEMPT was set in the kernel
config was removed. With the inclusion of a few previous patches
targeting support for preempt enabled kernels, it is now safe to run
with this kernel config option enabled.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#83
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#718
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#718
ZFS fails to build when SPL is built into the kernel on unless
--with-spl=/path/to/kernel/sources is specified. We fallback to the
kernel sources directory when SPL is not found elsewhere to resolve
that.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closed#896
./configure erroneously detects absence of dops->d_automount
when built against a GrSecurity patched kernel.
Summerized error message found in config.log:
checking whether dops->d_automount() exists
...
In function 'main': ... error: constified variable 'dops'
cannot be local
The "dops" variable cannot be a local variable, so it's
moved to the global scope.
This test also fails if the prototype of the dops->d_automount
function pointer is changed.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes#884
This commit adds support for building a zfs-modules-dkms sub package
built around Dynamic Kernel Module Support. This is to allow building
packages using the DKMS infrastructure which is intended to ease the
burden of kernel version changes, upgrades, etc.
By default zfs-modules-dkms-* sub package will be built as part of
the 'make rpm' target. Alternately, you can build only the DKMS
module package using the 'make rpm-dkms' target.
Examples:
# To build packaged binaries as well as a dkms packages
$ ./configure && make rpm
# To build only the packaged binary utilities and dkms packages
$ ./configure && make rpm-utils rpm-dkms
Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are
supported for building the dkms sub package.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #535
When checking for the SPL Module.symvers file, a timeout can now be
passed in which will pause the configure step while it waits for this
file to be generated. By default, the configure behavior is unchanged as
a timeout of 0 is used. If a positive number of seconds is passed,
configure will wait that number of seconds for the Module.symvers file
before moving on.
The main motivation for this change was to support parallel execution of
'./configure && make' for the SPL and ZFS packages in preparation of
supporting DKMS based packages.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit adds support for building a spl-modules-dkms sub package
built around Dynamic Kernel Module Support. This is to allow building
packages using the DKMS infrastructure which is intended to ease the
burden of kernel version changes, upgrades, etc.
By default spl-modules-dkms-* sub package will be built as part of
the 'make rpm' target. Alternately, you can build only the DKMS
module package using the 'make rpm-dkms' target.
Examples:
# To build packaged binaries as well as a dkms packages
$ ./configure && make rpm
# To build only the packaged binary utilities and dkms packages
$ ./configure && make rpm-utils rpm-dkms
Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are
supported for building the dkms sub package.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#535
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.
In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).
With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#862
In the comments of commit 723aa3b0c2,
mmatuska reported that the test for invalidate_inodes_check() is broken
if invalidate_inodes() takes two arguments.
This patch fixes the issue by resorting to another approach for
detecting invalidate_inodes_check(): is simply checks if
invalidate_inodes is defined as a macro. If it is, then it concludes
that invalidate_inodes_check() is available. This will continue to work
even if the prototype of invalidate_inodes_check() changes over time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#148
This patch adds a new autoconf function: SPL_LINUX_TRY_COMPILE_SYMBOL.
This new function does the following:
- Call LINUX_TRY_COMPILE with the specified parameters.
- If unsuccessful, return false.
- If successful and we're configuring with --enable-linux-builtin,
return true.
- Else, call CHECK_SYMBOL_EXPORT with the specified parameters and
return the result.
All calls to CHECK_SYMBOL_EXPORT are converted to
LINUX_TRY_COMPILE_SYMBOL so that the tests work even when configuring
for builtin on a kernel which doesn't have loadable module support, or
hasn't been built yet.
The only exception are:
- AC_GET_VMALLOC_INFO, because we don't even have a public header to
include in the test case, but that's okay considering this symbol can
be ignored just fine.
- SPL_AC_DEVICE_CREATE, which is legacy API for 2.6.18 kernels. Since
kernels this old are no longer supported it should arguably just be
removed entirely from the build system.
Note that we're also checking for the correct prototype with an actual
call, which was not the case with CHECK_SYMBOL_EXPORT. However, for
"complicated" test cases like with multiple symbol versions (e.g.
vfs_fsync), we stick with the original behavior and only check for the
function's existence.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
Currently, when building a test case, we're compiling an entire Linux
module from beginning to end. This includes the MODPOST stage, which
generates a "conftest.mod.c" file with some boilerplate module
declaration code.
This poses a problem when configuring for built-in on kernels which have
loadable module support disabled. In this case conftest.mod.c is
referencing disabled code, resulting in a compilation failure, thus
breaking the tests.
This patch fixes the issue by faking the modpost stage when the
--enable-linux-builtin option is provided. It does so by forcing the
modpost command to be /bin/true, and using an empty conftest.mod.c file.
The test module still compiles fine, although the result isn't loadable,
but we don't really care at this point.
Note it is important to preserve the modpost stage when building out of
tree. This allows for the posibility of configure checks to leverage
this phase to identify GPL-only symbols.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
This patch adds a new option to configure: --enable-linux-builtin. When
this option is used, the following happens:
- Compilation of kernel modules is disabled.
- A failure to find UTS_RELEASE is followed by a suggestion to run
"make prepare" on the kernel source tree.
This patch also adds a new test which tries to compile an empty module
as a basic toolchain sanity test. If it fails and the option was
specified, the error is followed by a suggestion to run "make scripts"
on the kernel source tree.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
Currently, when configure --with-config is used, selective compilation
is only effective for the simple "make" case. Package builders (e.g.
make rpm) still build everything (utils and modules). This patch fixes
that.
This patch also drops the duplicate rpm-modules build target.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue zfsonlinux/zfs#851
This patch adds a new autoconf function: ZFS_LINUX_TRY_COMPILE_SYMBOL.
This new function does the following:
- Call LINUX_TRY_COMPILE with the specified parameters.
- If unsuccessful, return false.
- If successful and we're configuring with --enable-linux-builtin,
return true.
- Else, call CHECK_SYMBOL_EXPORT with the specified parameters and
return the result.
All calls to CHECK_SYMBOL_EXPORT are converted to
LINUX_TRY_COMPILE_SYMBOL so that the tests work even when configuring
for builtin on a kernel which doesn't have loadable module support, or
hasn't been built yet.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
Currently, when building a test case, we're compiling an entire Linux
module from beginning to end. This includes the MODPOST stage, which
generates a "conftest.mod.c" file with some boilerplate module
declaration code.
This poses a problem when configuring for built-in on kernels which have
loadable module support disabled. In this case conftest.mod.c is
referencing disabled code, resulting in a compilation failure, thus
breaking the tests.
This patch fixes the issue by faking the modpost stage when the
--enable-linux-builtin option is provided. It does so by forcing the
modpost command to be /bin/true, and using an empty conftest.mod.c file.
The test module still compiles fine, although the result isn't loadable,
but we don't really care at this point.
Note it is important to preserve the modpost stage when building out of
tree. The ZFS_AC_KERNEL_BLK_END_REQUEST, ZFS_AC_KERNEL_BLK_QUEUE_FLUSH,
and ZFS_AC_KERNEL_BLK_RQ_BYTES configure checks all depend on it to
identify GPL-only symbols.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
This patch adds a new option to configure: --enable-linux-builtin. When
this option is used, the following happens:
- Compilation of kernel modules is disabled.
- A failure to find UTS_RELEASE is followed by a suggestion to run
"make prepare" on the kernel source tree.
This patch also adds a new test which tries to compile an empty module
as a basic toolchain sanity test. If it fails and the option was
specified, the error is followed by a suggestion to run "make scripts"
on the kernel source tree.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
Currently, when configure --with-config is used, selective compilation
is only effective for the simple "make" case. Package builders (e.g.
make rpm) still build everything (utils and modules). This patch fixes
that.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict(). This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.
However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics. This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#784
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes. This interface used to
take the child dentry and a bool describing if the parent is needed.
NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
Support for PaX/GRSecurity patched kernels was developed against Linux
3.2. Unfortunately, an autotools check introduced for a Linux 3.3 API
fails on PaX/GRSecurity patched kernels. This causes the module to be
built against the Linux 3.2 ABI, which results in a NULL pointer
dereference at runtime.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes#794Closes#809
Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a
dialect of C that relies on a GCC plugin. In particular, struct
file_operations has been marked do_const in the PaX/GRSecurity dialect,
which causes GCC to consider all instances of it as const. This caused
failures in the autotools checks and the ZFS source code.
To address this, we modify the autotools checks to take into account
differences between the PaX C dialect and the regular C dialect. We also
modify struct zfs_acl's z_ops member to be a pointer to a function
pointer table. Lastly, we modify zpl_put_link() to address a PaX change
to the function prototype of nd_get_link(). This avoids compiler errors
in the PaX/GRSecurity dialect.
Note that the change in zpl_put_link() causes a warning that becomes a
build failure when debugging is enabled. Fixing that warning requires
ryao/spl@5ca50ef459.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#484
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.
This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.
Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.
This change also depends on API changes which are available in 2.6.37
and latter kernels. The build system has been updated to detect this,
but there is no compatibility mode for older kernels. This means that
online expansion will NOT be available in older kernels. However, it
will still be possible to expand the vdev offline.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#808
zfsonlinux/spl@2092cf68d8 used
PF_MEMALLOC to workaround a bug in the Linux kernel where
allocations did not honor the gfp flags passed to vmalloc().
Unfortunately, PF_MEMALLOC has the side effect of permitting
allocations to allocate pages outside of ZONE_NORMAL. This
has been observed to result in the depletion of ZONE_DMA32.
A kernel patch is available in the Gentoo bug tracker for
this issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
This negates any benefit PF_MEMALLOC provides, so we introduce
an autotools check to disable the use of PF_MEMALLOC on
systems with patched kernels.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#126
torvalds/linux@1dce27c5aa introduced
__clear_close_on_exec() as a replacement for FD_CLR. Further commits
appear to have removed FD_CLR from the Linux source tree. This
causes the following failure:
error: implicit declaration of function '__FD_CLR'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current
__clear_close_on_exec() interface for readability. Then we introduce
an autotools check to determine if __clear_close_on_exec() is available.
If it isn't then we define some compatibility logic which used the older
FD_CLR() interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#124
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:
error: implicit declaration of function 'd_alloc_root'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current d_make_root()
interface for readability. Then we introduce an autotools check
to determine if d_make_root() is available. If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#776
The configure script error message for kernels built with
CONFIG_DEBUG_LOCK_ALLOC may give the impression that the issue is
strictly with license compliance. To avoid confusion add some words
indicating that the linking stage will fail if the build continues.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#773
The CONFIG_DEBUG_LOCK_ALLOC check at configure time was added to
detect when mutex_lock() is defined as a GPL-only symbol. However,
the check as written only inferred this from this configuration
setting, it never actually checked. This change introduces that
missing check to prevent false positives.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#701
Caught by lint, this permission change was accidentally introduced
by commit 42cb3819f1. Restore the
correct permissions and while I'm at it add a missing whack-bang
to config/ltmain.sh.
lint: executable-not-elf-or-script: zpool_main.c zfs_main.c
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#620
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging. When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.
This checking is particularly helpful when adding new dmu consumers
like Lustre. However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.
--enable-debug-dmu-tx - Enable/disable validation of each tx as
--disable-debug-dmu-tx it is constructed. By default validation
is disabled due to performance concerns.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add support for the .zfs control directory. This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required. The bulk of the core
functionality is now all there with the following limitations.
*) The .zfs/snapshot directory automount support requires a 2.6.37
or newer kernel. The exception is RHEL6.2 which has backported
the d_automount patches.
*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
in the .zfs/snapshot directory works as expected. However,
this functionality is only available to root until zfs
delegations are finished.
* mkdir - create a snapshot
* rmdir - destroy a snapshot
* mv - rename a snapshot
The following issues are known defeciences, but we expect them to
be addressed by future commits.
*) Add automount support for kernels older the 2.6.37. This should
be possible using follow_link() which is what Linux did before.
*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
The majority of the ground work for this is complete. However,
finishing this work will require resolving some lingering
integration issues with the Linux NFS kernel server.
*) The .zfs/shares directory exists but no futher smb functionality
has yet been implemented.
Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#173
Improve the distribution detection by moving the tests for
distribution specific files first. The Ubuntu and Debian
checks are left for last because they are the least likely
to be unique. This is particularly true in the case of Debian
since so many distributions are based on Debian.
Since this is currently only used to identify the correct
packaging method for this system the result in many instances
is simply cosmetic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Improve the distribution detection by moving the tests for
distribution specific files first. The Ubuntu and Debian
checks are left for last because they are the least likely
to be unique. This is particularly true in the case of Debian
since so many distributions are based on Debian.
Since this is currently only used to identify the correct
packaging method for this system the result in many instances
is simply cosmetic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Allow a source rpm to be rebuilt with debugging enabled. This
avoids the need to have to manually modify the spec file. By
default debugging is still largely disabled. To enable specific
debugging features use the following options with rpmbuild.
'--with debug' - Enables ASSERTs
'--with debug-log' - Enables the internal debug log
'--with debug-kmem' - Enables basic memory accounting
'--with debug-kmem-tracking' - Enables detailed memory tracking
# For example:
$ rpmbuild --rebuild --with debug spl-modules-0.6.0-rc6.src.rpm
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Allow a source rpm to be rebuilt with debugging enabled. This
avoids the need to have to manually modify the spec file. By
default debugging is still largely disabled. To enable specific
debugging features use the following options with rpmbuild.
'--with debug' - Enables ASSERTs
# For example:
$ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm
Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers. This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.
We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.
Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.
To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.
Detailed rationale:
- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
handle writes of any size, so there's no reason to impose a limit. Let the
upper layer decide.
- max_segments and max_segment_size are set to unlimited. zvol_write() will
copy the requests' contents into a dbuf anyway, so the number and size of
the segments are irrelevant. Let the upper layer decide.
- physical_block_size and io_opt are set to the ZVOL's block size. This
has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
the upper layers that writes smaller than the volume's block size will be
slow.
- The NONROT flag is set to indicate this isn't a rotational device.
Although the backing zpool might be composed of rotational devices, the
resulting ZVOL often doesn't exhibit the same behavior due to the COW
mechanisms used by ZFS. Setting this flag will prevent upper layers from
making useless decisions (such as reordering writes) based on incorrect
assumptions about the behavior of the ZVOL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:
WRITE: A normal async write. Device will be plugged.
WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down
the hint that someone will be waiting on this IO
shortly.
WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on
non-volatile media on completion.
In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.
The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.
Also, we must inform the block layer that we actually support FLUSH and FUA.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check
to detect the API change and then conditionally define the expected
interface. In either case we are only interested in the zfs_sb_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#549
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s. This patch clarifies things
by cleanly breaking the two subsystem apart. The result of this
is the following behavior.
--enable-debug - Enable/disable code wrapped in ASSERT()s.
--disable-debug ASSERT()s are used to check invariants and
are never required for correct operation.
They are disabled by default because they
may impact performance.
--enable-debug-log - Enable/disable the debug log infrastructure.
--disable-debug-log This infrastructure allows the spl code and
its consumer to log messages to an in-kernel
log. The granularity of the logging can be
controlled by a debug mask. By default the
mask disables most debug messages resulting
in a negligible performance impact. Because
of this the debug log is enabled by default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the original build system code was added the release
component was accidentally omited from the development header
install path. This patch adds the missing path component so
it's always clear exactly what release your compiling against.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Unfortunately, Arch's package manager `pacman` shares it's name with a
popular arcade video game. Thus, in order to refrain from executing the
video game when we mean to execute the package manager, SPL_AC_PACMAN is
now only run when $VENDOR is determined to be "arch".
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#517
Unfortunately, Arch's package manager `pacman` shares it's name with a
popular arcade video game. Thus, in order to refrain from executing the
video game when we mean to execute the package manager, ZFS_AC_PACMAN is
now only run when $VENDOR is determined to be "arch".
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#517
The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.
This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.
In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigerous by passing the expected
types. This is done to ensure we always properly the detect the
correct form for the security_inode_init_security() API.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#516
The wait_lock member of the rw_semaphore struct became a raw_spinlock_t
in Linux 3.2 at torvalds/linux@ddb6c9b58a.
Wrap spin_lock_* function calls in a new spl_rwsem_* interface to
ensure type safety if raw_spinlock_t becomes architecture specific,
and to satisfy these compiler warnings:
warning: passing argument 1 of ‘spinlock_check’
from incompatible pointer type [enabled by default]
note: expected ‘struct spinlock_t *’
but argument is of type ‘struct raw_spinlock_t *’
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #76Closes: zfsonlinux/zfs#463
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block. Prior
to this change there was one shared global shrinker.
The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded. This would cause the VFS to drop
references on a fraction of the dentries in the dcache. The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit. Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.
This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit. The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning. Thus we
can minimize any impact on the caching of other filesystems.
In the context of making this change several other important
issues related to managing the ARC were addressed, they include:
* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback. The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code. This callback can also be used by
other ARC consumers such as Lustre.
arc_add_prune_callback()
arc_remove_prune_callback()
* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity. The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.
* Less aggressively invoke the prune callback. We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes. It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.
* More promptly manage exceeding the arc_meta_limit. When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.
* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache. Remember
this will only occur when the ARC has no other choice. If it
can evict buffers safely without invoking the prune callback
it will.
* This change is also expected to resolve the unexpect collapses
of the ARC cache. This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink(). This effectively shrunk the entire cache
when really we just needed to reclaim meta data.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#466Closes#292
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check(). In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers. Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#58
If the lsb-release package is installed on an Arch Linux distribution,
the configure step will incorrectly detect the running distribution as
Ubuntu. This is a result of both distributions providing an
/etc/lsb-release file, and the Ubuntu VENDOR check being performed
first.
Since the Arch Linux test check's for a file more specific to the Arch
Linux distribution, moving Arch Linux's VENDOR check above Unbuntu's
check provides a quick and easy solution.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
If the lsb-release package is installed on an Arch Linux distribution,
the configure step will incorrectly detect the running distribution as
Ubuntu. This is a result of both distributions providing an
/etc/lsb-release file, and the Ubuntu VENDOR check being performed
first.
Since the Arch Linux test check's for a file more specific to the Arch
Linux distribution, moving Arch Linux's VENDOR check above Unbuntu's
check provides a quick and easy solution.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#72
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit
SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170
Use the new set_nlink() kernel function instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on an Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the spl userland utilties and
another for the spl kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be built using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #68
As of GCC 4.6, specific kernel 2.6.32 header files do not compile
cleanly without warnings. One specific example of this is the
arch/x86/include/asm/percpu.h file. Thus, a few of the configure tests
were getting hung up on this and the '-Wno-unsued-but-set-variables'
compile option had to be introduced.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#459
This change updates the AC_LANG_PROGRAM autoconf macro invocations to be
wrapped in quotes. As of autoconf version 2.68, the quotes are necessary
to prevent warnings from appearing. Specifically, the autoconf v2.68
Forward Porting Notes specifies:
It is important to note that you need to ensure that the call to
AC_LANG_SOURCE is quoted and not expanded, otherwise that will
cause the warning to appear nonetheless.
Finally, because of the additional quoting we can drop the extra
quotas used by the ZFS_AC_CONFIG_USER_STACK_GUARD autoconf check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#464
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback. In
addition imutex is no longer held by the caller and the callback
is responsible for taking the lock if required.
This commit updates the code to provide a zpl_fsync() function
for the updated API. Implementations for the previous two APIs
are also maintained for compatibility.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#445
Preferentially use the vfs_fsync() function. This function was
initially introduced in 2.6.29 and took three arguments. As
of 2.6.35 the dentry argument was dropped from the function.
For older kernels fall back to using file_fsync() which also
took three arguments including the dentry.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules. As of Linux 3.1 it is now longer easily
available. To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Only under Ubuntu Lucid the rpm packaging step mistakenly adds
the following files twice to the package because of the /lib
naming convention. This is harmless but results in a warning
which the buildot flags as a failure. Suppress this warning.
warning: File listed twice: /lib/udev/rules.d
warning: File listed twice: /lib/udev/rules.d/60-zpool.rules
warning: File listed twice: /lib/udev/rules.d/60-zvol.rules
warning: File listed twice: /lib/udev/rules.d/90-zfs.rules
warning: File listed twice: /lib/udev/sas_switch_id
warning: File listed twice: /lib/udev/zpool_id
warning: File listed twice: /lib/udev/zvol_id
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Older versions of gcc (gcc-4.1.2) will treat an 'incompatible
pointer type' as a warning instead of an error. This results
in HAVE_FS_STRUCT_SPINLOCK being defined incorrectly. This
failure mode was observed when using a RHEL6 2.6.32 based kernel
under RHEL5.5 which contains the old version of gcc. To resolve
the issue the warning is explicitly promoted to an error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The 'if' statements found in kernel.m4 were converted to use the
portable alternative provided by autoconf, the AS_IF macro.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
A few of the autoconf error messages were inconsistent with the rest of
the build system. To be specific, the inconsistencies addressed by this
commit are the following:
* The second line of the error message for the CONFIG_PREEMPT check
was missing it's third asterisk.
* A few of the error messages were prefixed by two tabs, whereas the
majority of error messages are only prefixed by a single tab.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The warnings listed in the suppression file will be suppressed
and not flagged during regular buildbot builds. These warnings
are expected, harmless, and can obscure real issues unless they
are suppressed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The hardened gentoo kernel defines all of the super block
operation callbacks as const. This prevents the autoconf test
from assigning the callback and results in a false negative.
By moving the assignment in to the declaration we can avoid
this issue and get a correct result for this patched kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#296
This change moves the default install location for the zfs udev
rules from /etc/udev/ to /lib/udev/. The correct convention is
for rules provided by a package to be installed in /lib/udev/.
The /etc/udev/ directory is reserved for custom rules or local
overrides.
Additionally, this patch cleans up some abuse of the bindir install
location by adding a udevdir and udevruledir install directories.
This allows us to revert to the default bin install location. The
udev install directories can be set with the following new options.
--with-udevdir=DIR install udev helpers [EPREFIX/lib/udev]
--with-udevruledir=DIR install udev rules [UDEVDIR/rules.d]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#356
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
The latest kernels no longer define AUTOCONF_INCLUDED which was
being used to detect the new style autoconf.h kernel configure
options. This results in the CONFIG_* checks always failing
incorrectly for newer kernels.
The fix for this is a simplification of the testing method.
Rather than attempting to explicitly include to renamed config
header. It is simpler to unconditionally include <linux/module.h>
which must pick up the correctly named header.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#320
The latest kernels no longer define AUTOCONF_INCLUDED which was
being used to detect the new style autoconf.h kernel configure
options. This results in the CONFIG_* checks always failing
incorrectly for newer kernels.
The fix for this is a simplification of the testing method.
Rather than attempting to explicitly include to renamed config
header. It is simpler to unconditionally include <linux/module.h>
which must pick up the correctly named header.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#320
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
RPM version 4.9.0 has been observed to generate extra debug
messages in certain cases. These debug messages prevent us
from cleanly acquiring the architecture. This is clearly
an upstream RPM bug which will get fixed. But until then
a safe solution is to pipe the result through 'tail -1'
to just grab the architecture bit we care about.
Example 'rpm -qp spl-0.6.0-rc4.src.rpm --qf %{arch}' output:
Freeing read locks for locker 0x166: 28031/47480843735008
Freeing read locks for locker 0x168: 28031/47480843735008
x86_64
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined
the kernel stored a thread_info pointer as the mutex owner.
From this you could get the pointer of the current task_struct
to compare with get_current().
As of Linux 2.6.39 this behavior has changed and now the mutex
stores a pointer to the task_struct. This commit detects the
type of pointer stored in the mutex and adjusts the mutex_owner()
and mutex_owned() functions to perform the correct comparision.
Update the the wrapper macros for the memory shrinker to handle
this 4th API change. The callback function now takes a
shrink_control structure. This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
RPM version 4.9.0 has been observed to generate extra debug
messages in certain cases. These debug messages prevent us
from cleanly acquiring the architecture. This is clearly
an upstream RPM bug which will get fixed. But until then
a safe solution is to pipe the result through 'tail -1'
to just grab the architecture bit we care about.
Example 'rpm -qp spl-0.6.0-rc4.src.rpm --qf %{arch}' output:
Freeing read locks for locker 0x166: 28031/47480843735008
Freeing read locks for locker 0x168: 28031/47480843735008
x86_64
The previous commit 8a7e1ceefa wasn't
quite right. This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.
For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable. This can lead to build failures
on older Linux platforms such as Debian Lenny. Since this is
an optional build argument this changes add a new autoconf check
for the option. If it is supported by the installed version of
gcc then it is used otherwise it is omited.
See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
Every distribution has slightly different requirements for their
init scripts. Because of this the zfs package contains several
init scripts for various distributions. These scripts have been
contributed by, and are supported by, the larger zfs community.
Init scripts for Gentoo/Lunar/Redhat have been contributed by:
Gentoo - devsk <devsku@gmail.com>
Lunar - Jean-Michel Bruenn <jean.bruenn@ip-minds.de>
Redhat - Fajar A. Nugraha <list@fajar.net>
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes#68Closes#84
This reverts commit 1814251453.
Demote the gawk call back to awk and ensure that stderr is attached. GNU gawk
tolerates a missing stderr handle, but many utilities do not, which could be
why a regular awk call was unexplainably failing on some systems.
Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The function zlib_deflate_workspacesize() now take 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K). The caller can now specific the windowBits and
memLevel compression parameters to get a smaller workspace.
For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments. When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through. When it's not we assume the worst case and
a maximally sized workspace is used.
The path_lookup() function has been renamed to kern_path_parent()
and the flags argument has been removed. The only behavior now
offered is that of LOOKUP_PARENT. The spl already always passed
this flag so dropping the flag does not impact us.
As of gcc-4.6 the option -Wunused-but-set-variable is enabled by
default. While this is a useful warning there are numerous places
in the ZFS code when a variable is set and then only checked in an
ASSERT(). To avoid having to update every instance of this in the
code we now set -Wno-unused-but-set-variable to suppress the warning.
Additionally, when building with --enable-debug and -Werror set these
warning also become fatal. We can reevaluate the suppression of these
error at a later time if it becomes an issue. For now we are basically
just reverting to the previous gcc behavior.
Newer versions of gcc are getting smart enough to detect the sloppy
syntax used for the autoconf tests. It is now generating warnings
for unused/undeclared variables. Newer version of gcc even have
the -Wunused-but-set-variable option set by default. This isn't a
problem except when -Werror is set and they get promoted to an error.
In this case the autoconf test will return an incorrect result which
will result in a build failure latter on.
To handle this I'm tightening up many of the autoconf tests to
explicitly mark variables as unused to suppress the gcc warning.
Remember, all of the autoconf code can never actually be run we
just want to get a clean build error to detect which APIs are
available. Never using a variable is absolutely fine for this.
Closes#176
Newer versions of gcc are getting smart enough to detect the sloppy
syntax used for the autoconf tests. It is now generating warnings
for unused/undeclared variables. Newer version of gcc even have
the -Wunused-but-set-variable option set by default. This isn't a
problem except when -Werror is set and they get promoted to an error.
In this case the autoconf test will return an incorrect result which
will result in a build failure latter on.
To handle this I'm tightening up many of the autoconf tests to
explicitly mark variables as unused to suppress the gcc warning.
Remember, all of the autoconf code can never actually be run we
just want to get a clean build error to detect which APIs are
available. Never using a variable is absolutely fine for this.
To resolve a potiential filesystem corruption issue a second
argument was added to invalidate_inodes(). This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block. When only the legacy API is available the second
argument will be dropped for compatibility.
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache. After the entries are
pruned any slabs which they may have been using are reaped.
Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count. However, in practice this doesn't need to be
exactly correct. We simply need to reclaim some useful fraction
(but not all) of the cache. The caller can determine if more
needs to be done.
Added insert_inode_locked() helper function, prior to this most callers
used insert_inode_hash(). The older method doesn't check for collisions
in the inode_hashtable but it still acceptible for use. Fallback to
using insert_inode_hash() when insert_inode_locked() is unavailable.
To simplify the process of using zfs as your root filesystem a
zfs-drucat sub-package has been added. This sub-package adds a zfs
dracut module which allows your initramfs to be rebuilt with zfs
support. The process for doing this is still complicated but there
is clearly interest from the community about getting this working
well and documented. This should help lay some of the groundwork.
Longer term these changes should be pushed in the upstream dracut
package. Once that occurs this subpackage will no longer be
required for new systems, however we may want to conditionally
build this package in the future for systems running older
dracut versions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed. Unfortunately, every distribution
has their own idea of the _right_ way to do things. Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway. I have
instead added support to provide multiple distribution specific
init scripts.
The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.
Currently, there is zfs.fedora and a more generic zfs.lsb init
script. Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.
This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
Detect early on in configure if the Modules.symvers file is missing.
Without this file there will be build failures later and it's best
to catch this early and provide a useful error. In this case the
most likely problem is the kernel-devel packages are not installed.
It may also be possible that they are using an unbuilt custom kernel
in which case they must build the kernel first.
Closes#127
Detect early on in configure if the Modules.symvers file is missing.
Without this file there will be build failures later and it's best
to catch this early and provide a useful error. In this case the
most likely problem is the kernel-devel packages are not installed.
It may also be possible that they are using an unbuilt custom kernel
in which case they must build the kernel first.
Until support is added for preemptible kernels detect this at
configure time and make it fatal. Otherwise, it is possible to
have a successful build and kernel modules with flakey behavior.
Until support is added for preemptible kernels detect this at
configure time and make it fatal. Otherwise, it is possible to
have a successful build and kernel modules with flakey behavior.
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules. This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.
Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address. The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn. Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address. All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
Long term we should try to get this, or another, interface made
available to modules again.
The open_bdev_exclusive() function has been replaced (again) by the
more generic blkdev_get_by_path() function. Additionally, the
counterpart function close_bdev_exclusive() has been replaced by
blkdev_put(). Because these functions are more generic versions
of the functions they replaced the compatibility macro must add
the FMODE_EXCL mask to ensure they are exclusive.
Closes#114
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback. It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods. The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added. The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.
The fsync() callback in the file_operations structure used to take
3 arguments. The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers. To
handle this a compatibility prototype was added to ensure the right
prototype is used. Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure. To handle this we define an
appropriate xattr_handler_t typedef which can be used. This was
the preferred solution because it keeps the code clean and readable.
Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links. Only if neither of these
links exist fallback to alternate methods for deducing which
kernel to build with. This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.
Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links. Only if neither of these
links exist fallback to alternate methods for deducing which
kernel to build with. This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.
ZFS even under Solaris does not strictly require libshare to be
available. The current implementation attempts to dlopen() the
library to access the needed symbols. If this fails libshare
support is simply disabled.
This means that on Linux we only need the most minimal libshare
implementation. In fact just enough to prevent the build from
failing. Longer term we can decide if we want to implement a
libshare library like Solaris. At best this would be an abstraction
layer between ZFS and NFS/SMB. Alternately, we can drop libshare
entirely and directly integrate ZFS with Linux's NFS/SMB.
Finally the bare bones user-libshare.m4 test was dropped. If we
do decide to implement libshare at some point it will surely be
as part of this package so the check is not needed.
If libselinux is detected on your system at configure time link
against it. This allows us to use a library call to detect if
selinux is enabled and if it is to pass the mount option:
"context=\"system_u:object_r:file_t:s0"
For now this is required because none of the existing selinux
policies are aware of the zfs filesystem type. Because of this
they do not properly enable xattr based labeling even though
zfs supports all of the required hooks.
Until distro's add zfs as a known xattr friendly fs type we
must use mntpoint labeling. Alternately, end users could modify
their existing selinux policy with a little guidance.
Refresh the autogen.sh products based on the versions which are
installed by default in the GA RHEL6.0 release.
autoconf (GNU Autoconf) 2.63
automake (GNU automake) 1.11.1
ltmain.sh (GNU libtool) 2.2.6b
Refresh the autogen.sh products based on the versions which are
installed by default in the GA RHEL6.0 release.
autoconf (GNU Autoconf) 2.63
automake (GNU automake) 1.11.1
ltmain.sh (GNU libtool) 2.2.6b
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags. The new flag is called REQ_SYNC. To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function. Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.
Preferred interface for flagging a synchronous bio:
2.6.12-2.6.29: BIO_RW_SYNC
2.6.30-2.6.35: BIO_RW_SYNCIO
2.6.36-2.6.xx: REQ_SYNC
As of linux-2.6.36 the BIO_RW_FAILFAST and REQ_FAILFAST flags
have been unified under the REQ_* names. These flags always had
to be kept in-sync so this is a nice step forward, unfortunately
it means we need to be careful to only use the new unified flags
when the BIO_RW_* flags are not defined. Additional autoconf
checks were added for this and if it is ever unclear which method
to use no flags are set. This is safe but may result in longer
delays before a disk is failed.
Perferred interface for setting FAILFAST on a bio:
2.6.12-2.6.27: BIO_RW_FAILFAST
2.6.28-2.6.35: BIO_RW_FAILFAST_{DEV|TRANSPORT|DRIVER}
2.6.36-2.6.xx: REQ_FAILFAST_{DEV|TRANSPORT|DRIVER}
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't. So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
The spl_config.h file is checked to determine the spl version.
However, the zfs code was looking for it in the source directory
and not the build directory.
This check was previously done with a hack in config.guess.
However, since a new config.guess is copied in to place when
forcing a full autoreconf this change was easily lost and
never a good idea. This commit also updates all of the
autoconf style support scripts in config.
Add the initial products from autogen.sh. These products will
be updated incrementally after this point as development occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add autoconf style build infrastructure to the ZFS tree. This
includes autogen.sh, configure.ac, m4 macros, some scripts/*,
and makefiles for all the core ZFS components.
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5. Details can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=526092
The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked(). The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.
This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present. The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers. Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.
Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros. We only only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.
Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner(). This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.
The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit. However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.
The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization. Since it is no longer used, I removed it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The RHEL5 2.6.18-194.7.1.el5 kernel added atomic64_cmpxchg to
asm-x86_64/atomic.h. That macro is defined in terms of cmpxchg which
is provided by asm/system.h. However, asm/system.h is not #included by
atomic.h in this kernel nor by the autoconf test for atomic64_cmpxchg, so
the test failed with "implicit declaration of function 'cmpxchg'". This
leads the build system to erroneously conclude that the kernel does not
define atomic64_cmpxchg and enable the built-in definition. This in
turn produces a '"atomic64_cmpxchg" redefined' build warning which is fatal
when building with --enable-debug. This commit fixes this by including
asm/system.h in the autoconf test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The long term fix for Debian and Slackware style packaging is
to add native support for building these packages. Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.
As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages. The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:
make rpm: Create .rpm packages
make deb: Create .deb packages
make tgz: Create .tgz packages
make pkg: Create the right package type for your distribution
The solution comes with lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:
1) Will not have the correct dependency information.
2) Will not not include the kernel version in the release.
3) Will not handle all differences between distributions.
But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h. This will simplify
handling any further API changes in the future.
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this. That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.
Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests. Much to my surprise the unsigned
64-bit division regression tests failed.
This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel. This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed. After some investigation this turned out to be exactly
the case.
Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl. There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.
The update implementation now passed all the unsigned and signed
regression tests. This should be functional, but not fast, which is
good enough for out purposes. If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform. I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
As of autoconf-2.65 the AC_LANG_SOURCE source macro no longer
includes the confdef.h results when expanded. To handle this
simply explicitly include confdef.h in conftest.c. This will
cause two copies to of confdef.h to be added to the test for
earlier autoconf versions but this is not harmful.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For some reason when awk invoked by the usermode helper the command
always fails. Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distro I tested
with all had gawk installed instead of awk. Anyway, the simplest
thing to do here is to just make gawk mandatory. I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.
I didn't notice at the time but user_path_dir() was not introduced
at the same time as set_fs_pwd() change. I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25. This means
builds against 2.6.25-2.6.26 kernels were broken.
To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.
Whoops, I momentarilly forgot I had explicitly set these as CC
options so dependent packages which need to include spl_config.h
would not end up having these defined which can result in
accidentally hanging debug enabled at best, or a build failure
at worst.
While in theory I like the idea of compiler warnings always being
fatal. In practice this causes problems when small harmless errors
cause build failures for end users. To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure. This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed. The upstream code has been updated
to depend entirely on the the procname member. To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels. Older kernels are
supported by having it expand to .ctl_name = X just as before.
It seems the upstream community moved the definition of UTS_RELEASE
yet again as of linux-2.6.33. Update the build system to check in
all three possible locations where your kernel version may be defined.
$kernelbuild/include/linux/version.h
$kernelbuild/include/linux/utsrelease.h
$kernelbuild/include/generated/utsrelease.h
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
The cleanest way to do this is to set AM_LIBTOOLFLAGS = --silent. However,
AM_LIBTOOLFLAGS is not honored by automake-1.9.6-2.1 which is what I have
been using. To cleanly handle this I am updating to automake-1.11-3 which
is why it looks like there is a lot of churn in the Makefiles.
We need dependent packages to be able to include spl_config.h to
build properly. This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed. This solution
works as long as the spl_config.h is included before your projects
config.h. That turns out to be easier said than done. In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.
To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command. This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file. This eliminates
the problem entirely and makes header safe for inclusion.
Also in this change I have removed the few places in the code where
spl_config.h is included. It is now added to the gcc compile line
to ensure the config results are always available.
Finally, I have also disabled the verbose kernel builds. If you
want them back you can always build with 'make V=1'. Since things
are working now they don't need to be on by default.
/lib/modules/$(uname -r)/source. This will likely fail when building
under a mock (http://fedoraproject.org/wiki/Projects/Mock) chroot
environment since `uname -r` will report the running kernel which
likely is not the kernel in your chroot. To cleanly handle this
we fallback to using the first kernel in your chroot.
The kernel-devel package which contains all the kernel headers and
a few build products such as Module.symver{s} is all the is required.
Full source is not needed.
As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handle() prototype. It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.
I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions. It's not pretty but API compat changes rarely are.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing. This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.
Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations. Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total. To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.
The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available. On 32-bit archs this may
not be true, and actually that's just fine. In that case the kernel will
will never be able to allocate more the 32-bits worth anyway. So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.
min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels. For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it. To summerize:
1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly. This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.
2) --enable-debug-kmem=yes by default. This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload. Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.
3) --enable-debug-kmem-tracking=no by default. This option was added
to provide a configure option to enable to detailed memory allocation
tracking. This support was always there but you had to know where to
turn it on. By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.
4) --enable-debug-kstat removed. After further reflection I can't see
why you would ever really want to turn this support off. It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c. We can now always assume the top level directory
will be there.
5) --enable-debug-callb removed. This never really did anything, it was
put in provisionally because it might have been needed. It turns out
it was not so I am just removing it to prevent confusion.
These functions didn't exist for all archs prior to 2.6.24. This
patch addes an autoconf test to detect this and add them when needed.
The autoconf check is needed instead of just an #ifndef because in
the most modern kernels atomic64_{cmp}xchg are implemented as in
inline function and not a #define.
Previously Solaris style atomic primitives were implemented simply by
wrapping the desired operation in a global spinlock. This was easy to
implement at the time when I wasn't 100% sure I could safely layer the
Solaris atomic primatives on the Linux counterparts. It however was
likely not good for performance.
After more investigation however it does appear the Solaris primitives
can be layered on Linux's fairly safely. The Linux atomic_t type really
just wraps a long so we can simply cast the Solaris unsigned value to
either a atomic_t or atomic64_t. The only lingering problem for both
implementations is that Solaris provides no atomic read function. This
means reading a 64-bit value on a 32-bit arch can (and will) result in
word breaking. I was very concerned about this initially, but upon
further reflection it is a limitation of the Solaris API. So really
we are just being bug-for-bug compatible here.
With this change the default implementation is layered on top of Linux
atomic types. However, because we're assuming a lot about the internal
implementation of those types I've made it easy to fall-back to the
generic approach. Simply build with --enable-atomic_spinlocks if
issues are encountered with the new implementation.
Ricardo has pointed out that under Solaris the cwd is set to '/'
during module load, while under Linux it is set to the callers cwd.
To handle this cleanly I've reworked the module *_init()/_exit()
macros so they call a *_setup()/_cleanup() function when any SPL
dependent module is loaded or unloaded. This gives us a chance to
perform any needed modification of the process, in this case changing
the cwd. It also handily provides a way to avoid creating wrapper
init()/exit() functions because the Solaris and Linux prototypes
differ slightly. All dependent modules should now call the spl
helper macros spl_module_{init,exit}() instead of the native linux
versions.
Unfortunately, it appears that under Linux there has been no consistent
API in the kernel to set the cwd in a module. Because of this I have
had to add more autoconf magic than I'd like. However, what I have
done is correct and has been tested on RHEL5, SLES11, FC11, and CHAOS
kernels.
In addition, I have change the rootdir type from a 'void *' to the
correct 'vnode_t *' type. And I've set rootdir to a non-NULL value.
For a generic explanation of why mutexs needed to be reimplemented
to work with the kernel lock profiling see commits:
e811949a57 and
d28db80fd0
The specific changes made to the mutex implemetation are as follows.
The Linux mutex structure is now directly embedded in the kmutex_t.
This allows a kmutex_t to be directly case to a mutex struct and
passed directly to the Linux primative.
Just like with the rwlocks it is critical that these functions be
implemented as '#defines to ensure the location information is
preserved. The preprocessor can then do a direct replacement of
the Solaris primative with the linux primative.
Just as with the rwlocks we need to track the lock owner. Here
things get a little more interesting because depending on your
kernel version, and how you've built your kernel Linux may already
do this for you. If your running a 2.6.29 or newer kernel on a
SMP system the lock owner will be tracked. This was added to Linux
to support adaptive mutexs, more on that shortly. Alternately, your
kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES
in the kernel build. If neither of the above things is true for
your kernel the kmutex_t type will include and track the lock owner
to ensure correct behavior. This is all handled by a new autoconf
check called SPL_AC_MUTEX_OWNER.
Concerning adaptive mutexs these are a very recent development and
they did not make it in to either the latest FC11 of SLES11 kernels.
Ideally, I'd love to see this kernel change appear in one of these
distros because it does help performance. From Linux kernel commit:
0d66bf6d3514b35eb6897629059443132992dbd7
"Testing with Ingo's test-mutex application...
gave a 345% boost for VFS scalability on my testbox"
However, if you don't want to backport this change yourself you
can still simply export the task_curr() symbol. The kmutex_t
implementation will use this symbol when it's available to
provide it's own adaptive mutexs.
Finally, DEBUG_MUTEX support was removed including the proc handlers.
This was done because now that we are cleanly integrated with the
kernel profiling all this information and much much more is available
in debug kernel builds. This code was now redundant.
Update mutexs validated on:
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools. This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.
The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named locked. Since the
rwsem was embedded in a wrapper structure the name was always the
same. The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.
The solution was to directly map a krwlock to a Linux rwsem using
a typedef there by eliminating the wrapper structure. This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible. It can only be done by directly accessing some of
the internal data member of the rwsem structure.
For example, the Linux API provides a different function for dropping
a reader vs writer lock. Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is. This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock. Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extracted this
information directly from arch specific lock implementation. This is
all do able, and what I did, but it does complicate things considerably.
The good news is that in addition to the profiling benefits of this
change. We may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.
The only function I was forced to sacrafice was rw_owner() because this
information is simply not stored anywhere in the rwsem. Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.
In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests. Each class of test was extended to provide
more API coverage and to be more rigerous in checking for misbehavior.
This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing. The full
SPLAT regression test suite was run numberous times on all of the following
platforms. This includes various kernels ranging from 2.6.16 to 2.6.29.
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
Basically everything we need to monitor the global memory state of
the system is now cleanly available via global_page_state(). The
problem is that this interface is still fairly recent, and there
has been one change in the page state enum which we need to handle.
These changes basically boil down to the following:
- If global_page_state() is available we should use it. Several
autoconf checks have been added to detect the correct enum names.
- If global_page_state() is not available check to see if
get_zone_counts() symbol is available and use that.
- If the get_zone_counts() symbol is not exported we have no choice
be to dynamically aquire it at load time. This is an absolute
last resort for old kernel which we don't want to patch to
cleanly export the symbol.
The previous credential implementation simply provided the needed types and
a couple of dummy functions needed. This update correctly ties the basic
Solaris credential API in to one of two Linux kernel APIs.
Prior to 2.6.29 the linux kernel embeded all credentials in the task
structure. For these kernels, we pass around the entire task struct as if
it were the credential, then we use the helper functions to extract the
credential related bits.
As of 2.6.29 a new credential type was added which we can and do fairly
cleanly layer on top of. Once again the helper functions nicely hide
the implementation details from all callers.
Three tests were added to the splat test framework to verify basic
correctness. They should be extended as needed when need credential
functions are added.
Modern kernel build systems at least post 2.6.16 will set this properly
so we should not. In fact post 2.6.28 the include headers have moved
under arch so the guess we make here is completely wrong. Letting
the kernel build system set this ensure it will be correct.
rpms. These should not be fatal because we actually don't need them
until we build the source rpm. When doing mock builds this is
important because these dependent rpms will only be installed if
they are specificed in the source rpms spec file.
- Allow checking for exported symbols in both Module.symvers
and Module.symvers. My stock SLES kernel ships an objects
directory with Module.symvers, yet produces a Module.symvers
in the local build directory.
- Properly honor --prefix in build system and rpm spec file.
- Add '--define require_kdir' to spec file to support building
rpms against kernel sources installed in non-default locations.
- Add '--define require_kobj' to spec file to support building
rpms against kernel object installed in non-default locations.
- Stop suppressing errors in autogen.sh script.
- Improved logic to detect missing kernel objects when they are
not located with the source. This is the common case for SLES
as well as in-tree chaos kernel builds and is done to simply
support for multiple arches.
- Moved spl-devel build products to /usr/src/spl-<version>, a
spl symlink is created to reference the last installed version.
- Prior to 2.6.17 there were no *_pgdat helper functions in mm/mmzone.c.
Instead for_each_zone() operated directly on pgdat_list which may or
may not have been exported depending on how your kernel was compiled.
Now new configure checks determine if you have the helpers or not, and
if the needed symbols are exported. If they are not exported then they
are dynamically aquired at runtime by kallsyms_lookup_name().
- Configure check for SLES specific API change to vfs_unlink()
and vfs_rename() which added a 'struct vfsmount *' argument.
This was for something called the linux-security-module, but
it appears that it was never adopted upstream.
- Configure check for mutex_lock_nested(). This function was introduced
as part of the mutex validator in 2.6.18, but if it's unavailable then
it's safe to fallback to a plain mutex_lock().
- Configure check, the div64_64() function was renamed to
div64_u64() as of 2.6.26.
- Configure check, the global_page_state() fuction was introduced
in 2.6.18 kernels. The earlier 2.6.16 based SLES10 must not try
and use it, thankfully get_zone_counts() is still available.
- To simplify debugging poison all symbols aquired dynamically
using spl_kallsyms_lookup_name() with SYMBOL_POISON.
- Add console messages when the user mode helpers fail.
- spl_kmem_init_globals() use bit shifts instead of division.
- When the monotonic clock is unavailable __gethrtime() must perform
the HZ division as an 'unsigned long long' because the SPL only
implements __udivdi3(), and not __divdi3() for 'long long' division
on 32-bit arches.
- Exclude -obj when detecting installed kernel source.
- Detect -obj directory for out of tree kernel builds.
- Allow kernel build system to set CC to ensure -m64 is set properly.
This is an issue on 64-bit SLES systems which by default always
build 32-bit binaries (unlike RHEL/Fedora which default to 64-bit)
We need dependent packages to be able to include spl_config.h so they
can leverage the configure checks the SPL has done. This is important
because several of the spl headers need the results of these checks to
work properly. Unfortunately, the autoheader build product is always
private to a particular build and defined certain common things.
(PACKAGE, VERSION, etc). This prevents other packages which also use
autoheader from being include because the definitions conflict. To
avoid this problem the SPL build system leverage AH_BOTTOM to include
a spl_unconfig.h at the botton of the autoheader build product. This
custom include undefs all known shared symbols to prevent the confict.
This does however mean that those definition are also not availble
to the SPL package either. The SPL package therefore uses the
equivilant SPL_META_* definitions.
In the interests of portability I have added a FC10/i686 box to
my list of development platforms. The hope is this will allow me
to keep current with upstream kernel API changes, and at the same
time ensure I don't accidentally break x86 support. This patch
resolves all remaining issues observed under that environment.
1) SPL_AC_ZONE_STAT_ITEM_FIA autoconf check added. As of 2.6.21
the kernel added a clean API for modules to get the global count
for free, inactive, and active pages. The SPL attempts to detect
if this API is available and directly map spl_global_page_state()
to global_page_state(). If the full API is not available then
spl_global_page_state() is implemented as a thin layer to get
these values via get_zone_counts() if that symbol is available.
2) New kmem:vmem_size regression test added to validate correct
vmem_size() functionality. The test case acquires the current
global vmem state, allocates from the vmem region, then verifies
the allocation is correctly reflected in the vmem_size() stats.
3) Change splat_kmem_cache_thread_test() to always use KMC_KMEM
based memory. On x86 systems with limited virtual address space
failures resulted due to exhaustig the address space. The tests
really need to problem exhausting all memory on the system thus
we need to use the physical address space.
4) Change kmem:slab_lock to cap it's memory usage at availrmem
instead of using the native linux nr_free_pages(). This provides
additional test coverage of the SPL Linux VM integration.
5) Change kmem:slab_overcommit to perform allocation of 256K
instead of 1M. On x86 based systems it is not possible to create
a kmem backed slab with entires of that size. To compensate for
this the number of allocations performed in increased by 4x.
6) Additional autoconf documentation for proposed upstream API
changes to make additional symbols available to modules.
7) Console error messages added when spl_kallsyms_lookup_name()
fails to locate an expected symbol. This causes the module to fail
to load and we need to know exactly which symbol was not available.
As of 2.6.27 kernels the device_create() API changed to include
a private data argument. This check detects which version of
device_create() function the kernel has and properly defines
spl_device_create() to use the correct prototype.
Update the method used for determining which kernel to build against
when not specified. Previous 'uname -r' was used but this makes the
assumption that the running kernel is the one you want to use, this is
often not the case. It is better to examine the usual kernel-devel
install locations and select one of the installed kernels.
An update to the build system to properly support all commonly
used Makefile targets these include:
make all # Build everything
make install # Install everything
make clean # Clean up build products
make distclean # Clean up everything
make dist # Create package tarball
make srpm # Create package source RPM
make rpm # Create package binary RPMs
make tags # Create ctags and etags for everything
Extra care was taken to ensure that the source RPMs are fully
rebuildable against Fedora/RHEL/Chaos kernels. To build binary
RPMs from the source RPM for your system simply run:
rpmbuild --rebuild spl-x.y.z-1.src.rpm
This will produce two binary RPMs with correct 'requires'
dependencies for your kernel. One will contain all spl modules
and support utilities, the other is a devel package for compiling
additional kernel modules which are dependant on the spl.
spl-x.y.z-1_<kernel version>.x86_64.rpm
spl-devel-x.y.2-1_<kernel version>.x86_64.rpm
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address. The function name itself can them become a
define which calls a function pointer. This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel. This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.
This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name(). There are
autoconf checks to detect if the symbol is exported and if so to
use it directly. We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported. Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement. The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.
Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
of $LINUX because Module.symvers is a build product. When
$LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
$LINUX/include search paths to allow proper compilation when
the kernel target build directory is not the source directory.
Added support for Solaris swapfs_minfree, and swapfs_reserve tunables.
In additional availrmem is now available and return a reasonable value
which is reasonably analogous to the Solaris meaning. On linux we
return the sun of free and inactive pages since these are all easily
reclaimable.
All tunables are available in /proc/sys/kernel/spl/vm/* and they may
need a little adjusting once we observe the real behavior. Some of
the defaults are mapped to similar linux counterparts, others are
straight from the OpenSolaris defaults.
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree. These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.
When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered. When a minor is unregistered we use the vnode
interface to unlink the special file.
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
if the older 4 argument version of on_each_cpu() should be
used or the new 3 argument version. The retry argument was
dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers. The old way this
worked was to pass a data point when initialized the workqueue.
The new API assumed the work item is embedding in a structure
and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
type checked in the bit operations. This silences the warnings.
- Updated autogen products and splat tests accordingly
of the kernel specific build info in to config/kernel,
likewise and user specific build flags should go in
config/user. This seems like a reasonable way to go.
* configure.ac : Use AC_CONFIG_AUX_DIR to put autoconf products
in ./auotconf.
* autogen.sh : Use --copy to avoid symlinks, remove error
redirection, run aclocal before libtoolize.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@180 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c