Commit Graph

1651 Commits

Author SHA1 Message Date
Brian Behlendorf 5c7afad448 Fix cstyle issue from c66989b
Commit c66989b accidentally introduced a cstyle issue which went
unnoticed.  This tiny patch corrects that oversight.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-12-19 12:00:14 -08:00
Boris Protopopov 9063f65476 Correct error returns to unify cross-pool operation error handling
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2911
2014-12-19 10:51:24 -08:00
Andy Bakun c0ba93dee6 Fix typo in %post scriptlet lines
Missing space made the %post directive be part of the package
%description and not have a %post scriptlet defined.

Signed-off-by: Andy Bakun <github@thwartedefforts.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2961
2014-12-18 18:38:33 -08:00
Jacek Fefliński c66989baae zpool upgrade return errors to stderr instead of stdout
Signed-off-by: Jacek Feflinski <feflik@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2955
2014-12-18 17:38:18 -08:00
Dan Swartzendruber 1b95fd5d70 Improve systemd script to not leave stale sharetab
The systemd script zfs-share.service does 'zfs share -a' to share
any required datasets.  Unfortunately, /etc/dfs/sharetab is stale
from the previous boot.  Delete it before we share.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2883
2014-12-18 09:54:56 -08:00
Brian Behlendorf c944be5d7e Fix snapshots with dirty inodes
Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty and inode.  This
will result in a write which will correctly cause a kernel panic
because these filesystem are (and must be) immutable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2812
2014-11-20 10:38:16 -08:00
Dan Swartzendruber 80c50365c2 Fix systemd config for zfs-share.service
The zfs-share.service rule needs to be modified to ensure that it
does not execute before zfs-mount.service.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ralf Ertzinger <ralf@skytale.net>
Closes #2893
2014-11-19 10:33:07 -08:00
Isaac Huang 29b763cd2c bio_alloc() with __GFP_WAIT never returns NULL
Mark the error handling branch as unlikely() because the current
kernel interface can never return NULL.  However, we want to keep
the error handling in case this behavior changes in the futre.

Plus fix a small style issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2703
2014-11-19 12:50:49 -05:00
Ned Bass aaed7c408c Explicitly include SPL compat headers
Inclusion of SPL compatibility headers was moved out of the public
header sys/types.h to avoid conflicts with external packages.  Include a
few compatiblity headers explicitly to cope with that change.  Also,
sort some linux-specific inclusions alphabetically.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2898
2014-11-19 12:30:39 -05:00
Ned Bass 7b2d78a046 Fix improper null-byte termination handling
Fix a few cases where null-byte termination of strings was done
unnecessarily or incorrectly.

- The snprintf() function always produces a null-byte terminated string
  for non-negative return values, so it is not necessary to write out a
  null-byte as a separate step.

- Also, it is unsafe to use the return value of snprintf() as an offset
  for placing a null-byte, because if the output was truncated the return
  value is the number of bytes that _would_ have been written had enough
  space been available. Therefore the return value may index beyond the
  array boundaries.

- Finally, snprintf() accounts for the null-byte when limiting its output
  size, so there is no need to pass it a size parameter that is one less
  than the buffer size.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2875
2014-11-17 15:28:59 -08:00
smh 89b1cd6581 Prevent ZFS leaking pool free space
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.

In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query
the pool status during the hung import would not return as the import
holds the pool lock.

The only way to import such a pool would be to specify -o readonly=on
to the zpool import.

zdb -bb <pool> can be used to check for "deferred free" size which is
where this lost space will be counted.

References:
  https://github.com/freebsd/freebsd/commit/48431b7
  http://svnweb.freebsd.org/base?view=revision&revision=273158
  https://reviews.csiden.org/r/132/

Porting notes:

This issue was filed as illumos 5347 and a more comprehensive fix is
under review.  Once that change is finalized it will be integrated, in
the meanwhile the FreeBSD fix has been merged to prevent the issue.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2896
2014-11-17 11:35:38 -08:00
Tim Chase 4254acb057 Undirty freed spill blocks.
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

	setfattr -n user.test -v very_very_very..._long_value  <file>
	setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884
2014-11-17 11:25:48 -08:00
Brian Behlendorf bc9f4131a1 Merge branch 'b_tracepoints'
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2874
2014-11-17 11:14:29 -08:00
Prakash Surya 0b39b9f96f Swap DTRACE_PROBE* with Linux tracepoints
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.

The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:

    * A number of external tools can make use of our tracepoints
      "automatically" (e.g. perf, systemtap)
    * Tracepoints are designed to be extremely cheap when disabled
    * It's one of the "accepted" ways to export this kind of
      information; many other kernel subsystems use tracepoints too.

Unfortunately, though, there are a few caveats as well:

    * Linux tracepoints appear to only be available to GPL licensed
      modules due to the way certain kernel functions are exported.
      Thus, to actually make use of the tracepoints introduced by this
      patch, one might have to patch and re-compile the kernel;
      exporting the necessary functions to non-GPL modules.

    * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
      tracepoints are not available for unsigned kernel modules
      (tracepoints will get disabled due to the module's 'F' taint).
      Thus, one either has to sign the zfs kernel module prior to
      loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
      newer.

Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):

    # list all zfs tracepoints available

    $ ls /sys/kernel/debug/tracing/events/zfs
    enable              filter              zfs_arc__delete
    zfs_arc__evict      zfs_arc__hit        zfs_arc__miss
    zfs_l2arc__evict    zfs_l2arc__hit      zfs_l2arc__iodone
    zfs_l2arc__miss     zfs_l2arc__read     zfs_l2arc__write
    zfs_new_state__mfu  zfs_new_state__mru

    # enable all zfs tracepoints, clear the tracepoint ring buffer

    $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
    $ echo 0 > /sys/kernel/debug/tracing/trace

    # import zpool called 'tank', inspect tracepoint data (each line was
    # truncated, they're too long for a commit message otherwise)

    $ zpool import tank
    $ cat /sys/kernel/debug/tracing/trace | head -n35
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 1219/1219   #P:8
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
            lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
          z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
          z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
          z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
          z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
          z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
          z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
          z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...

To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:

    lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
        hdr {
            dva 0x1:0x40082
            birth 15491
            cksum0 0x163edbff3a
            flags 0x640
            datacnt 1
            type 1
            size 2048
            spa 3133524293419867460
            state_type 0
            access 0
            mru_hits 0
            mru_ghost_hits 0
            mfu_hits 0
            mfu_ghost_hits 0
            l2_hits 0
            refcount 1
        } bp {
            dva0 0x1:0x40082
            dva1 0x1:0x3000e5
            dva2 0x1:0x5a006e
            cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
            lsize 2048
        } zb {
            objset 0
            object 0
            level -1
            blkid 0
        }

For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.

For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:

    * http://lwn.net/Articles/379903/
    * http://lwn.net/Articles/381064/
    * http://lwn.net/Articles/383362/

I should also node the more "boring" aspects of this patch:

    * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
       support a sixth paramter. This parameter is used to populate the
       contents of the new conftest.h file. If no sixth parameter is
       provided, conftest.h will be empty.

    * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
      This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
      except it has support for a fifth option that is then passed as
      the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.

These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).

    * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
      is to determine if we can safely enable the Linux tracepoint
      functionality. We need to selectively disable the tracepoint code
      due to the kernel exporting certain functions as GPL only. Without
      this check, the build process will fail at link time.

In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:55 -08:00
Ned Bass 5024046763 cstyle: allow right paren on its own line
Make the style checker script accept right parentheses on their own
lines. This is motivated by the Linux tracepoints macro
DECLARE_EVENT_CLASS.

The code within TP_fast_assign() (a parameter of DECLARE_EVENT_CLASS)
is normal C assignments terminated by semicolons.  But the style
checker forbids us from following a semicolon with a non-blank and
from preceding a right parenthesis with white space.  Therefore the
closing parenthesis must go on the next line, yet the style checker
foribs us from indenting it for readability.  Relaxing the
no-non-blank-after-semicolon rule would open the door to too many bad
style practices. So instead we relax the
no-white-space-before-right-paren rule if the parenthesis is on its
own line.  The relaxation is overriden with the -p option so we still
have a way to catch misuse of this style.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:50 -08:00
Ned Bass 29e57d15c8 Fix dprintf format specifiers
Fix a few dprintf format specifiers that disagreed with their argument
types.  These came to light as compiler errors when converting dprintf
to use the Linux trace buffer.  Previously this wasn't a problem,
presumably because the SPL debug logging uses vsnprintf which must
perform automatic type conversion.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:45 -08:00
Ned Bass 59ec819a0c Move a few internal ARC strucutres to arc_impl.h
Add a new file named arc_impl.h and move a few internal
ARC structure definitions into this file. This is
needed in order to allow the Linux tracepoint functions to grub
around in the internals of these structures.

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:27 -08:00
Randall Mason 33c0819425 Fix small spelling mistake
recieve becomes receive

Signed-off-by: Randall Mason <ClashTheBunny@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2877
2014-11-14 15:48:51 -08:00
Prakash Surya fb42a49328 Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO
5213 panic in metaslab_init due to space_map_open returning ENXIO
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com

References:
  https://www.illumos.org/issues/5213
  https://reviews.csiden.org/r/110

Porting notes:

For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2745
2014-11-14 15:37:45 -08:00
Isaac Huang a82db4e15f Print header properly when terminal resizes
Added a handler for SIGWINCH, so that one header
is printed per screen even when the terminal resizes.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2847
2014-11-14 15:32:30 -08:00
Isaac Huang 1c49ac575d Fix inaccurate field descriptions
The field descriptions from arcstat.py -v for the demand accesses
are inaccurate. They all begin with "Demand Data" yet the fields
actually covered both demand data and demand meta-data accesses.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2842
2014-11-14 15:09:09 -08:00
Chris Wedgwood b31d8ea77c Reduce buf/dbuf mutex contention
Due to evidence of contention both the buf_hash_table and the
dbuf_hash_table sizes have been increased from 256 to 8192.

This increase in hash table size adds approximating 0.5M to
our fixed memory footprint.  This relatively small increase
is not expected to cause problems even on low memory machines.
This footprint will also become dynamic when the persistent
L2ARC support is finalized.  In the meanwhile, this small
change significantly reduces contention for certain workloads.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #1291
2014-11-14 14:59:21 -08:00
Alex Zhuravlev 0f69910833 Export symbols for ZIL interface
These symbols are needed by consumers (i.e. Lustre) who wish to
integrate with the ZIL.  In addition the zil_rollback_destroy()
prototype was removed because the implementation of this function
was removed long ago.

Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2892
2014-11-14 14:39:43 -08:00
Dan Swartzendruber 5f91bd3dea Improve zvol symlink handling.
Change the zvol helper program to replace any embedded spaces
in the pool or dataset names with '+' to ensure we have valid
symlinks.

The '+' character was choosen because it is not a valid character
for a dataset name but it is allowed by udev.  This ensures that
all dataset names with an embedded space will be translated to
a unique /dev/zvol/ symlink.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Darik Horn <dajhorn@vanadac.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2834
2014-11-06 11:13:18 -08:00
Marcel Wysocki 11662bf969 Add config/compile to config/.gitignore
This file may be added by automake and therefore should be added
to config/.gitignore.  For the full list of possible auxiliary
programs see the full automake documentation.

http://www.gnu.org/software/automake/manual/automake.html#Auxiliary-Programs

Signed-off-by: Marcel Wysocki <maci.stgn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2848
2014-10-31 16:25:34 -07:00
Alexander Pyhalov bb9d808c5a Fix modules installation directory
When building zfs modules with kernel, compiled from deb.src, the
packaging process ends up installing the modules in the wrong place.

Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2822
2014-10-28 09:46:14 -07:00
Richard Yao b76707027c Make systemd-modules-load.service file directory configurable
Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. We could handle this by disabling systemd support on
prefix because systemd does not check these paths, but the Gentoo
Council decided that small files such as these should be installed.
That means disabling systemd support on prefix is not an acceptable
workaround. As a consequence, we need some way of control the directory
into which these files are installed.

Making this configurable increases our compliance with the
freedesktop.org specification, which allows these files to be installed
into /etc/modules-load.d:

http://www.freedesktop.org/software/systemd/man/modules-load.d.html

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
2014-10-28 09:41:14 -07:00
Richard Yao 60e9f69c97 Make directory into which mount.zfs is installed configurable
Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. I could script a workaround inside the ebuild, but it
seemed to make more sense to make this more configurable.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
2014-10-28 09:40:59 -07:00
Richard Yao d8d7826721 Search /usr/local/src for SPL Object Directory
Since we changed the default location for the kernel headers to respect
--prefix in the SPL, we must search that location to prevent user builds
from breaking.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
2014-10-28 09:37:23 -07:00
Richard Yao 3cd33ffc3b Kernel header installation should respect --prefix
This is the upstream component of work that enables preliminary support
for building Gentoo's ZFS packaging on other Linux systems via Gentoo
Prefix.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
2014-10-28 09:37:06 -07:00
Tim Chase ed6e9cc235 Linux 3.12 compat: shrinker semantics
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects.  It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.

This patch adds support for the new @scan_objects return value semantics.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2837
2014-10-28 09:34:51 -07:00
Matthew Ahrens 9635861742 Illumos 5164-5165 - space map fixes
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
     space map_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5164
  https://www.illumos.org/issues/5165
  https://github.com/illumos/illumos-gate/commit/b1be289

Porting Notes:

The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.

The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz.  The upstream commit incorrectly
used space_map_blksize.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697
2014-10-23 15:30:32 -07:00
Alex Reece b02fe35d37 Illumos 4958 zdb trips assert on pools with ashift >= 0xe
4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4958
  https://github.com/illumos/illumos-gate/commit/2a104a5

Porting notes:

Keep the ZIO_FLAG_FASTWRITE define.  This is for a feature present
in Linux but not yet in *BSD.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697
2014-10-23 15:30:32 -07:00
Brian Behlendorf adc90e9d94 Fix zdb segfault
On 32-bit systems setting 'zfs_arc_max = 256M' in zdb results in the
following segmentation fault.  Rather than reverting 0ec0724 which
introduced this flaw this code is only used for 64-bit builds.

Segmentation fault (core dumped)
ztest: '/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 139
child exited with code 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-23 15:30:32 -07:00
Matthew Ahrens 0ec072487b Illumos 5169-5171 - zdb fixes
5169 zdb should limit its ARC size
5170 zdb -c should create more scrub i/os by default
5171 zdb should print status while loading metaslabs for leak detection
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5169
  https://www.illumos.org/issues/5170
  https://www.illumos.org/issues/5171
  https://github.com/illumos/illumos-gate/commit/06be980

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2707
2014-10-23 10:27:55 -07:00
Matthew Ahrens acf58e706c Illumos 5178 - zdb -vvvvv on old-format pool fails in dump_deadlist()
5178 zdb -vvvvv on old-format pool fails in dump_deadlist()
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5178
  https://github.com/illumos/illumos-gate/commit/90c76c6

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2713
2014-10-23 10:18:36 -07:00
ilovezfs 023bbe6f01 Fix zpool create -t ENOENT bug.
In userland we need to switch over to the temporary name once the
pool has been created, otherwise the root dataset won't mount
and the error "cannot open 'the_real_name': dataset does not exist"
is printed.

Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2760
2014-10-23 10:00:35 -07:00
Brian Behlendorf 5f6d0b6f5a Handle block pointers with a corrupt logical size
The general strategy used by ZFS to verify that blocks are valid is
to checksum everything.  This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block.  If a blocks checksum is valid then its contents are
trusted by the higher layers.

This system works exceptionally well as long as bad data is never
written with a valid checksum.  If this does somehow occur due to
a software bug or a memory bit-flip on a non-ECC system it may
result in kernel panic.

One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size causing a system panic.

To prevent this from happening the arc_read() function has been
updated to detect this specific case.  If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2678
2014-10-23 09:20:52 -07:00
Ned Bass bc151f7b31 Remove checks for mandatory locks
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp().  Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose.  Rather than emulating those interfaces we remove the
redundant checks.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2804
2014-10-22 11:06:53 -07:00
Matthew Ahrens 88904bb3e3 Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy
5162 zfs recv should use loaned arc buffer to avoid copy
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5162
  https://github.com/illumos/illumos-gate/commit/8a90470

Porting notes:
  Fix spelling error 's/arena/area/' in dmu.c.
  In restore_write() declare bonus and abuf at the top of the function.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2696
2014-10-21 16:32:11 -07:00
Matthew Ahrens 4b20a6f509 Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness
When a clone is created of a snapshot that has been marked for
deferred destroy (with "zfs destroy -d"), the clone "inherits" the
defer_destroy flag from the origin, and any snapshots of the clone
"inherit" the defer_destroy flag from the clone. This causes a strange
situation where the clone's snapshots are marked for defer_destroy but
they have no holds or clones. If the clone's snapshot gets a hold or
clone, which is then deleted, we will honor the incorrectly-set
defer_destroy flag and delete the snapshot!

Steps to reproduce:

  * zpool create test c1t1d0
  * zfs create test/fs
  * zfs snapshot test/fs@a
  * zfs clone test/fs@a test/clone
  * zfs destroy -d test/fs@a
  * zfs clone test/fs@a test/clone2
  * zfs snapshot test/clone2@a
  * zfs hold hld test/clone2@a
  * zfs release hld test/clone2@a
  * zfs list -r -t all test

  <test/clone2@a has been destroyed>

We noticed that this causes dcenter to get very confused, because it
treats snapshots that are marked defer_destroy as not existing. So it
won't see any snapshots of the clone that's marked defer_destroy.

5150 - zfs clone of a defer_destroy snapshot causes strangeness
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/projects/illumos-gate//issues/5150
  https://github.com/illumos/illumos-gate/commit/42fcb65

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2690
2014-10-21 15:26:58 -07:00
Matthew Ahrens 6c59307a3c Illumos 3693 - restore_object uses at least two transactions to restore an object
Restore_object should not use two transactions to restore an object:
  * one transaction is used for dmu_object_claim
  * another transaction is used to set compression, checksum and most
    importantly bonus data
  * furthermore dmu_object_reclaim internally uses multiple transactions
  * dmu_free_long_range frees chunks in separate transactions
  * dnode_reallocate is executed in a distinct transaction

The fact the dnode_allocate/dnode_reallocate are executed in one
transaction and bonus (re-)population is executed in a different
transaction may lead to violation of ZFS consistency assertions if the
transactions are assigned to different transaction groups.  Also, if
the first transaction group is successfully written to a permanent
storage, but the second transaction is lost, then an invalid dnode may
be created on the stable storage.

3693 restore_object uses at least two transactions to restore an object
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Original authors: Matthew Ahrens and Andriy Gapon

References:
  https://www.illumos.org/issues/3693
  https://github.com/illumos/illumos-gate/commit/e77d42e

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2689
2014-10-21 15:26:50 -07:00
Tim Chase 356d9ed4c8 Don't perform ACL-to-mode translation on empty ACL
In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to
create a traditional mode value based on an ACL.  If no ACL exists, this
processing shouldn't be done.  Problems caused by this were most evident
on version 4 filesystems which not only don't have system attributes,
and also frequently have empty ACLs. On such filesystems, performing a
chown() operation could have the effect of dirtying the mode bits in
memory but not on the file system as follows:

	# create a file with typical mode of 664
	echo test > test
	chown anyuser test
	ls -l test

and the mode will show up as all zeroes.  Unmounting/mounting and/or
exporting/importing the filesystem will reveal the proper mode again.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1264
2014-10-21 09:23:27 -07:00
Daniil Lunev 62bdd5eb7a Illumos 4924 - LZ4 Compression for metadata
Reviewed by Matthew Ahrens <mahrens@delphix.com>
Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://github.com/illumos/illumos-gate/commit/b8289d2
  https://www.illumos.org/issues/3756

Porting notes:

The static function zfs_prop_activate_feature() was removed because
this change removes the only caller.  The function was not removed
from Illumos but instead left as dead code.  However, to keep gcc
happy it was removed from Linux and may be easily restored if needed.

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1540
2014-10-20 16:17:49 -07:00
Brian Behlendorf ba232d8aea Suppress AIO kmem warnings
The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc()
to allocate enough memory to hold the vectorized IO.  While this
allocation will be small it's been observed in practice to sometimes
slightly exceed the 8K warning threshold by a few kilobytes.
Therefore, the KM_NODEBUG flag has been added to suppress warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2774
2014-10-20 16:10:25 -07:00
Darik Horn d94fd5f0de Let `zpool import` ignore a missing hostid record.
Change the zpool program to skip its hostid mismatch check in the
same way that libzfs already does.

Invoked imports fail if the ZPOOL_CONFIG_HOSTID nvpair is missing in
the /etc/zfs/zpool.cache file, which can happen as of the /etc/hostid
deprecation in commit zfsonlinux/spl@acf0ade362.

Signed-off-by: Darik Horn <dajhorn@vanadac.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2794
2014-10-17 14:59:05 -07:00
Brian Behlendorf 33074f2254 Handle NULL mirror child vdev
When selecting a mirror child it's possible that map allocated by
vdev_mirror_map_allc() contains a NULL for the child vdev.  In
this case the child should be skipped and the read issues to
another member of the mirror.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1744
2014-10-17 14:59:05 -07:00
Brian Behlendorf f0e324f25d Update utsname support
Modify the code to use the utsname() kernel function rather than
a global variable.  This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2).  This means that it
will behave consistently in both contexts.

This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions.  And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.

Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
2014-10-17 14:58:57 -07:00
Brian Behlendorf 050d22b068 Remove shrink_dcache_memory() and shrink_icache_memory()
This functionality is optional and until Linux 3.0, which
provided per-filesystem shinkers, they was never a reasonable
interface.  Therefore, this functionality is being dropped
for earlier kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
2014-10-17 14:58:50 -07:00
Brian Behlendorf 60bba62814 Update code to use misc_register()/misc_deregister()
When ZPIOS was originally written it was designed to use the
device_create() and device_destroy() functions.  Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.

As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions.  This interface
for registering character devices has remained stable, is simple,
and provides everything we need.

Therefore the code has been reworked to use this interface.  The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
2014-10-17 14:58:44 -07:00