Commit Graph

1013 Commits

Author SHA1 Message Date
Tim Chase fd4f76160c Handle concurrent snapshot automounts failing due to EBUSY.
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently.  Only one of the mounts will
succeed and the other will fail.  The failed mounts will cause an EREMOTE
to be propagated back to the application.

This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY.  The zfs code detects this condition and
treats it as if the mount had succeeded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1819
2013-11-08 10:45:14 -08:00
Massimo Maggi b695c34ea4 Honor CONFIG_FS_POSIX_ACL kernel option
The required Posix ACL interfaces are only available for kernels
with CONFIG_FS_POSIX_ACL defined.  Therefore, only enable Posix
ACL support for these kernels.  All major distribution kernels
enable CONFIG_FS_POSIX_ACL by default.

If your kernel does not support Posix ACLs the following warning
will be printed at ZFS module load time.

  "ZFS: Posix ACLs disabled by kernel"

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1825
2013-11-05 16:22:05 -08:00
Matthew Ahrens 78e2739d3c 26126 panic system rather than corrupting pool if we hit bug 26100
References:
  delphix/delphix-os@931c8aaab7

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1650
2013-11-05 13:18:26 -08:00
Brian Behlendorf 2517c8ee08 Switch allocations from KM_SLEEP to KM_PUSHPAGE
A couple of kmem_alloc() allocations were using KM_SLEEP in
the sync thread context.  These were accidentally introduced
by the recent set of Illumos patches.  The solution is to
switch to KM_PUSHPAGE.

dsl_dataset_promote_sync() -> promote_hold() -> snaplist_make() ->
kmem_alloc(sizeof (*snap), KM_SLEEP);

dsl_dataset_user_hold_sync() -> dsl_onexit_hold_cleanup() ->
kmem_alloc(sizeof (*ca), KM_SLEEP)

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:26:14 -08:00
Saso Kiselkov 1ca546b338 Illumos #3995
3995 Memory leak of compressed buffers in l2arc_write_done

References:
  https://illumos.org/issues/3995

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1688
Issue #1775
2013-11-05 12:26:00 -08:00
George Wilson 43a696ed38 Illumos #4168, #4169, #4170
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4168
  https://www.illumos.org/issues/4169
  https://www.illumos.org/issues/4170
  illumos/illumos-gate@7fdd916c47

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:25:44 -08:00
Matthew Ahrens 92bc214c2e Illumos #4082
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/4082
  illumos/illumos-gate@5253393b09

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:25:26 -08:00
George Wilson ac72fac3ea Illumos #3954, #4080, #4081
3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3954
  https://www.illumos.org/issues/4080
  https://www.illumos.org/issues/4081
  illumos/illumos-gate@22e30981d8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:25:01 -08:00
Matthew Ahrens a169a625a6 Illumos #4046
4046 dsl_dataset_t ds_dir->dd_lock is highly contended
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4046
  illumos/illumos-gate@b62969f868

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. This commit removed dsl_dataset_namelen in Illumos, but that
   appears to have been removed from ZFSOnLinux in an earlier commit.
2013-11-05 12:24:24 -08:00
Matthew Ahrens b663a23d36 Illumos #4047
4047 panic from dbuf_free_range() from dmu_free_object() while
     doing zfs receive
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4047
  illumos/illumos-gate@713d6c2088

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The exported symbol dmu_free_object() was renamed to
   dmu_free_long_object() in Illumos.
2013-11-05 12:23:35 -08:00
Matthew Ahrens 46ba1e59d3 Illumos #3996
3996 want a libzfs_core API to rollback to latest snapshot
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3996
  illumos/illumos-gate@a7027df17f

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:23:11 -08:00
George Wilson 5d1f7fb647 Illumos #3956, #3957, #3958, #3959, #3960, #3961, #3962
3956 ::vdev -r should work with pipelines
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3956
  https://www.illumos.org/issues/3957
  https://www.illumos.org/issues/3958
  https://www.illumos.org/issues/3959
  https://www.illumos.org/issues/3960
  https://www.illumos.org/issues/3961
  https://www.illumos.org/issues/3962
  illumos/illumos-gate@b4952e17e8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting notes:

1. zfs_dbgmsg_print() is only used in userland. Since we do not have
   mdb on Linux, it does not make sense to make it available in the
   kernel. This means that a build failure will occur if any future
   kernel patch depends on it. However, that is unlikely given that
   this functionality was added to support zdb.

2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels.
   This preserves the existing behavior of minimal noise when running
   with -V, and -VV.

3. In vdev_config_generate() the call to nvlist_alloc() was not
   changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in
   the txg_sync context.
2013-11-05 12:23:05 -08:00
George Wilson 621dd7bb2c Illumos #3949, #3950, #3952, #3953
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3949
  https://www.illumos.org/issues/3950
  https://www.illumos.org/issues/3951
  https://www.illumos.org/issues/3952
  illumos/illumos-gate@2c1e2b4414

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The deadman thread was removed from ztest during the original
   port because it depended on Solaris thr_create() interface.
   This functionality should be reintroduced using the more
   portable pthreads.
2013-11-05 12:17:07 -08:00
Matthew Ahrens 383fc4a997 Illumos #3955
3955 ztest failure: assertion refcount_count(&tx->tx_space_written) +
     delta <= tx->tx_space_towrite
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3955
  illumos/illumos-gate@be9000cc67

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:16:14 -08:00
Steven Hartland 9554185d90 Illumos #3973
3973 zfs_ioc_rename alters passed in zc->zc_name
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3973
  illumos/illumos-gate@a0c1127b14

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:15:50 -08:00
Matthew Ahrens ea97f8ce35 Illumos #3834
3834 incremental replication of 'holey' file systems is slow
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3834
  illumos/illumos-gate@ca48f36f20

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:15:00 -08:00
Matthew Ahrens 2883cad5b7 Illumos #3836
3836 zio_free() can be processed immediately in the common case
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3836
  illumos/illumos-gate@9cb154a3c9

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:14:56 -08:00
Matthew Ahrens 498877baf5 Illumos #3112, #3113, #3114
3112 ztest does not honor ZFS_DEBUG
3113 ztest should use watchpoints to protect frozen arc bufs
3114 some leaked nvlists in zfsdev_ioctl

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Amdur <Matt.Amdur@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/3112
  https://www.illumos.org/issues/3113
  https://www.illumos.org/issues/3114
  illumos/illumos-gate@cd1c8b85eb

The /proc/self/cmd watchpoint interface is specific to Solaris.
Therefore, the #3113 implementation was reworked to use the more
portable mprotect(2) system call.  When the pages are watched they
are marked read-only for protection.  Any write to the protected
address range immediately trigger a SIGSEGV.  The pages are marked
writable again when they are unwatched.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
2013-11-05 12:14:48 -08:00
George Wilson 03c6040bee Illumos #3236
3236 zio nop-write
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@80901aea8e
  https://www.illumos.org/issues/3236

Porting Notes

1. This patch is being merged dispite an increased instance of
   https://www.illumos.org/issues/3113 being triggered by ztest.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
2013-11-05 12:14:21 -08:00
Keith M Wesolowski 831baf06ef Illumos #3875
3875 panic in zfs_root() after failed rollback
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/3875
  illumos/illumos-gate@91948b51b8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 11:27:41 -08:00
Matthew Ahrens 1958067629 Illumos #3888
3888 zfs recv -F should destroy any snapshots created since
     the incremental source
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3888
  illumos/illumos-gate@34f2f8cf94

Porting notes:

1. Commit 1fde1e3720 wrapped a
   declaration in dsl_dataset_modified_since_lastsnap in ASSERTV().
   The ASSERTV() and local variable have been removed to avoid an
   unused variable warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Richard Yao <ryao@gentoo.org>
Issue #1775
2013-11-04 11:18:14 -08:00
Keith M Wesolowski 96c2e96193 Illumos #3894
3894 zfs should not allow snapshot of inconsistent dataset
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/3894
  illumos/illumos-gate@ca48f36f20

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 11:18:14 -08:00
Matthew Ahrens 1a077756e8 Illumos #3829
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3829
  illumos/illumos-gate@bb6e70758d

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 11:18:14 -08:00
Steven Hartland 95fd54a1c5 Illumos #3740
3740 Poor ZFS send / receive performance due to snapshot
     hold / release processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3740
  illumos/illumos-gate@a7a845e4bf

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. 13fe019870 introduced a merge conflict
   in dsl_dataset_user_release_tmp where some variables were moved
   outside of the preprocessor directive.

2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
   conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
   because this commit refactors the code, adding a new KM_SLEEP
   allocation. It is not clear to me whether this should be converted
   to KM_PUSHPAGE.

3. We had a merge conflict in libzfs_sendrecv.c because of copyright
   notices.

4. Several small C99 compatibility fixed were made.
2013-11-04 11:17:48 -08:00
Will Andrews d09f25dc66 Illumos #3744
3744 zfs shouldn't ignore errors unmounting snapshots
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3744
  illumos/illumos-gate@fc7a6e3fef

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. There is no clear way to distinguish between a failure when we
   tried to unmount the snapdir of a zvol (which does not exist)
   and the failure when we try to unmount a snapdir of a dataset,
   so the changes to zfs_unmount_snap() were dropped in favor of
   an altered Linux function that unconditionally returns 0.
2013-11-04 10:55:25 -08:00
Will Andrews 3a84951d7d Illumos #3743
3743 zfs needs a refcount audit
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3743
  illumos/illumos-gate@b287be1ba8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Will Andrews d3cc8b152e Illumos #3742
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3742
  illumos/illumos-gate@f717074149

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The change to zfs_vfsops.c was dropped because it involves
   zfs_mount_label_policy, which does not exist in the Linux port.
2013-11-04 10:55:25 -08:00
Will Andrews e49f1e20a0 Illumos #3741
3741 zfs needs better comments
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3741
  illumos/illumos-gate@3e30c24aee

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Martin Matuska b1118acbb1 Illumos #3699, #3739
3699 zfs hold or release of a non-existent snapshot does not output error
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Shrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3699
  https://www.illumos.org/issues/3739
  illumos/illumos-gate@013023d4ed

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Adam Leventhal 63fd3c6cfd Illumos #3582, #3584
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
    https://www.illumos.org/issues/3582
    illumos/illumos-gate@0689f76

Ported by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Mark Shellenbaum c1fabe7961 6977619 NULL pointer deference in sa_handle_get_from_db()
References:
  illumos/illumos-gate@44bffe012c

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:54:48 -08:00
Mark Shellenbaum c0ebc844c7 6939941 problem with moving files in zfs
References:
  illumos/illumos-gate@d39ee142a9

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. This commit was so old that only two lines applied to the modern
   code base.
2013-11-04 10:53:18 -08:00
George Wilson 2696dfafd9 Illumos #3642, #3643
3642 dsl_scan_active() should not issue I/O to determine if async
     destroying is active
3643 txg_delay should not hold the tc_lock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/3642
  https://www.illumos.org/issues/3643
  illumos/illumos-gate@4a92375985

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting Notes:

1. The alignment assumptions for the tx_cpu structure assume that
   a kmutex_t is 8 bytes.  This isn't true under Linux but tc_pad[]
   was adjusted anyway for consistency since this structure was
   never carefully aligned in ZoL.  If careful alignment does impact
   performance significantly this should be reworked to be portable.
2013-11-01 08:55:12 -07:00
Matthew Ahrens 7ec09286b7 Illumos #3645, #3692
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3645
  https://www.illumos.org/issues/3692
  illumos/illumos-gate@de8d9cff56

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1792
Issue #1775
2013-10-31 14:58:09 -07:00
Matthew Ahrens 2e528b49f8 Illumos #3598
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3598
  illumos/illumos-gate@be6fd75a69

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. include/sys/zfs_context.h has been modified to render some new
   macros inert until dtrace is available on Linux.

2. Linux-specific changes have been adapted to use SET_ERROR().

3. I'm NOT happy about this change.  It does nothing but ugly
   up the code under Linux.  Unfortunately we need to take it to
   avoid more merge conflicts in the future.  -Brian
2013-10-31 14:58:04 -07:00
Yuri Pankov 7011fb6004 Illumos #3517
3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3517
  illumos/illumos-gate@efb4a871d8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-31 14:57:59 -07:00
Matthew Ahrens d1fada1e6d Illumos #3603, #3604: bobj improvements
3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely
3871 GCC 4.5.3 does not like issue 3604 patch
Reviewed by: Henrik Mattson <henrik.mattson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3603
  https://www.illumos.org/issues/3604
  https://www.illumos.org/issues/3871
  illumos/illumos-gate@d04756377d

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Note that the patch from Illumos issue 3871 is not accepted into Illumos
at the time of this writing. It is something that I wrote when porting
this. Documentation is in the Illumos issue.
2013-10-31 14:57:51 -07:00
Matthew Ahrens 24a64651b4 Illumos #3588
3588 provide zfs properties for logical (uncompressed) space
     used and referenced
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3588
  illumos/illumos-gate@77372cb0f3

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-31 10:16:11 -07:00
George Wilson c2e42f9d53 Illumos #3578, #3579
3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3578
  https://www.illumos.org/issues/3579
  illumos/illumos-gate@9eb57f7f3f

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-31 09:23:40 -07:00
George Wilson 23c0a1333c Illumos #3561, #3116
3561 arc_meta_limit should be exposed via kstats
3116 zpool reguid may log negative guids to internal SPA history
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3561
  https://www.illumos.org/issues/3116
  illumos/illumos-gate@20128a0826

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:

1. The spa change was accidentally included in the libzfs_core merge.

2. "Add missing arcstats" (1834f2d8b7)
   already implemented these kstats a few years ago.
2013-10-31 09:23:40 -07:00
Matthew Ahrens 330847ff36 Illumos #3537
3537 want pool io kstats

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Sa?o Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  http://www.illumos.org/issues/3537
  illumos/illumos-gate@c3a6601

Ported by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:

1. The patch was restructured to take advantage of the existing
   spa statistics infrastructure.  To accomplish this the kstat
   was moved in to spa->io_stats and the init/destroy code moved
   to spa_stats.c.

2. The I/O kstat was simply named <pool> which conflicted with the
   pool directory we had already created.  Therefore it was renamed
   to <pool>/io

3. An update handler was added to allow the kstat to be zeroed.
2013-10-31 09:16:03 -07:00
George Wilson a117a6d66e Illumos #3522
3522 zfs module should not allow uninitialized variables
Reviewed by: Sebastien Roy <seb@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3522
  illumos/illumos-gate@d5285cae91

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting notes:

1. ZFSOnLinux had already addressed many of these issues because of
   its use of -Wall. However, the manner in which they were addressed
   differed. The illumos fixes replace the ones previously made in
   ZFSOnLinux to reduce code differences.

2. Part of the upstream patch made a small change to arc.c that might
   address zfsonlinux/zfs#1334.

3. The initialization of aclsize in zfs_log_create() differs because
   vsecp is a NULL pointer on ZFSOnLinux.

4. The changes to zfs_register_callbacks() were dropped because it
   has diverged and needs to be resynced.
2013-10-30 14:51:27 -07:00
Richard Yao 495b25a91a Add missing code to zfs_debug.{c,h}
This is required to make Illumos 3962 merge.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2013-10-29 15:06:18 -07:00
Richard Yao 632a242e83 Add missing copyright notices from Illumos
This resolves merge conflicts when merging Illumos #3588 and Illumos #4047.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Richard Yao 20f04f08aa Fix incorrect usage of strdup() in zfs_unmount_snap()
Modifying the length of a string returned by strdup() is incorrect
because strfree() is allowed to use strlen() to determine which slab
cache was used to do the allocation.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Richard Yao 8c8417933f Fix order of function calls in zio_free_sync()
The resolution of a merge conflict when merging Illumos #3464 caused us
to invert the order couple of function calls in zio_free_sync() versus
what they are in Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Richard Yao 9cac042cfe Reintroduce uio_prefaultpages()
This was accidentally removed by overzealous commenting.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Massimo Maggi 023699cd62 Posix ACL Support
This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems.  Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set.  The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.

By default Posix ACLs are disabled but they may be enabled with
the new 'aclmode=noacl|posixacl' property.  Set the property to
'posixacl' to enable them.  If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.

This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).

  http://www.tuxera.com/community/posix-test-suite/

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #170
2013-10-29 14:54:26 -07:00
Brian Behlendorf fc9e0530c9 Prevent xattr remove from creating xattr directory
Attempting to remove an xattr from a file which does not contain
any directory based xattrs would result in the xattr directory
being created.  This behavior is non-optimal because it results
in write operations to the pool in addition to the expected error
being returned.

To prevent this the CREATE_XATTR_DIR flag is only passed in
zpl_xattr_set_dir() when setting a non-NULL xattr value.  In
addition, zpl_xattr_set() is updated similarly such that it will
return immediately if passed an xattr name which doesn't exist
and a NULL value.

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #170
2013-10-29 13:23:53 -07:00
Richard Yao c12e3a594a Restructure zfs_readdir() to fix regressions
This does the following:

1. It creates a uint8_t type value, which is initialized to DT_DIR on
dot directories and ZFS_DIRENT_TYPE(zap.za_first_integer) otherwise.
This resolves a regression where we return unintialized values as the
directory entry type on dot directories. This was accidentally
introduced by commit 8170d28126.

2. It restructures zfs_readdir() code to use `uint64_t offset` like
Illumos instead of `loff_t *pos`. This resolves a regression where
negative ZAP cursors were treated as if they were dot directories.

3. It restructures the function to more closely match the structure of
zfs_readdir() on Illumos and removes the unused variable outcount, which
was only used on Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1750
2013-10-29 09:51:59 -07:00
Brian Behlendorf e0b0ca983d Add visibility in to cached dbufs
Currently there is no mechanism to inspect which dbufs are being
cached by the system.  There are some coarse counters in arcstats
by they only give a rough idea of what's being cached.  This patch
aims to improve the current situation by adding a new dbufs kstat.

When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash.  For each dbuf it will dump detailed information
about the buffer.  It will also dump additional information about
the referenced arc buffer and its related dnode.  This provides a
more complete view in to exactly what is being cached.

With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working.  For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming.  Or a
similar list could be generated based on dnode type.  Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
2013-10-25 13:59:40 -07:00
Brian Behlendorf 2d37239a28 Add visibility in to dmu_tx_assign times
This change adds a new kstat to gain some visibility into the
amount of time spent in each call to dmu_tx_assign. A histogram
is exported via the new dmu_tx_assign file. The information
contained in this histogram is the frequency dmu_tx_assign
took to complete given an interval range.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Brian Behlendorf 0b1401ee91 Add visibility in to txg sync behavior
This change is an attempt to add visibility in to how txgs are being
formed on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to txg. These entries are then exported through the kstat
interface, which can then be interpreted in userspace.

For each txg, the following information is exported:

 * Unique txg number (uint64_t)
 * The time the txd was born (hrtime_t)
   (*not* wall clock time; relative to the other entries on the list)
 * The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted)
 * The number of reserved bytes for the txg (uint64_t)
 * The number of bytes read during the txg (uint64_t)
 * The number of bytes written during the txg (uint64_t)
 * The number of read operations during the txg (uint64_t)
 * The number of write operations during the txg (uint64_t)
 * The time the txg was closed (hrtime_t)
 * The time the txg was quiesced (hrtime_t)
 * The time the txg was synced (hrtime_t)

Note that while the raw kstat now stores relative hrtimes for the
open, quiesce, and sync times.  Those relative times are used to
calculate how long each state took and these deltas and printed by
output handlers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Prakash Surya 1421c89142 Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.

For each arc_read call, the following information is exported:

 * A unique identifier (uint64_t)
 * The time the entry was added to the list (hrtime_t)
   (*not* wall clock time; relative to the other entries on the list)
 * The objset ID (uint64_t)
 * The object number (uint64_t)
 * The indirection level (uint64_t)
 * The block ID (uint64_t)
 * The name of the function originating the arc_read call (char[24])
 * The arc_flags from the arc_read call (uint32_t)
 * The PID of the reading thread (pid_t)
 * The command or name of thread originating read (char[16])

From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.

There is still some work to be done, but this should serve as a good
starting point.

Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Brian Behlendorf 76463d4026 Revert "Add txgs-<pool> kstat file"
This reverts commit e95853a331.
2013-10-25 13:57:25 -07:00
Brian Behlendorf 98ab38d109 Revert "Add new kstat for monitoring time in dmu_tx_assign"
This reverts commit 92334b14ec.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Richard Yao b3c49d3df8 Linux 3.11 compat: Rename LZ4 symbols
Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict
whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the
kernel's .config. We rename the symbols to avoid the conflict.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1789
2013-10-22 10:12:39 -07:00
Tim Chase fbcb768c8f Add missing dsl pool configuration lock
The semantics introduced by the restructured sync task of illumos
3464 require this lock when calling dmu_snapshot_list_next().
The pool is locked/unlocked for each iteration to reduce the
chance of long-running locks.

This was accidentally missed when doing the original port because
ZoL's control directory code is Linux-specific and is in a
different file than in illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1785
2013-10-22 08:31:20 -07:00
George Wilson 7a61440761 Illumos #3552
3552 condensing one space map burns 3 seconds of CPU in spa_sync()
     thread (fix race condition)

References:
  https://www.illumos.org/issues/3552
  illumos/illumos-gate@03f8c36688

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting notes:

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@e51be06697, which
ported the Illumos 3552 changes. This fix was added to upstream
rather quickly, but at the time of the port, no one spotted it and
the race was rare enough that it passed our regression tests. I
discovered this when comparing our metaslab.c to the illumos
metaslab.c.

Without this change it is possible for metaslab_group_alloc() to
consume a large amount of cpu time.  Since this occurs under a
mutex in a rcu critical section the kernel will log this to the
console as a self-detected cpu stall as follows:

  INFO: rcu_sched self-detected stall on CPU { 0}
  (t=60000 jiffies g=11431890 c=11431889 q=18271)

Closes #1687
Closes #1720
Closes #1731
Closes #1747
2013-10-18 14:34:01 -07:00
Ned Bass 40a806df25 Export symbols dsl_pool_config_{enter,exit}
These are needed by consumers (i.e. Lustre) who wish to use the
dsl_prop_register() interface to register callbacks when pool
properties of interest change.  This interface requires that the
DSL pool configuration lock is held when called.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1762
2013-10-10 16:56:51 -07:00
Brian Behlendorf 222b948059 Fix memory leak false positive in log_internal()
When building the spl with --enable-debug-kmem-tracking a memory
leak is detected in log_internal().  This happens to be a false
positive because the memory was freed using strfree() instead of
kmem_free().  All kmem_alloc()'s must be released with kmem_free()
to ensure correct accounting.

  SPL: kmem leaked 135/5641311 bytes
  address          size  data             func:line
  ffff8800cba7cd80 135   ZZZZZZZZZZZZZZZZ log_internal:456

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-09 09:16:36 -07:00
Brian Behlendorf 36342b13d9 Export addition dsl_prop_* symbols
The recent sync task restructuring in 13fe019 introduced several
new symbols which should be exported for use by consumers such
as Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-25 15:44:22 -07:00
Tim Chase 8769db3966 Allocate the ioctl "output" nvlist with KM_PUSHPAGE.
Some ZFS errors such as certain snapshot failures can occur in
the sync task context.  Because they may require additional memory
allocations, the initial nvlist must be allocated with KM_PUSHPAGE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737
2013-09-25 15:44:22 -07:00
Tim Chase c5322236ec Fix several new KM_SLEEP warnings
A handful of allocations now occur in the sync path and need
to use KM_PUSHPAGE.  These were introduced by commit 13fe019.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737
2013-09-25 15:44:22 -07:00
Brian Behlendorf cbfa294de4 Fix spa_deadman() TQ_SLEEP warning
The spa_deadman() and spa_sync() functions can both be run in the
spa_sync context and therefore should use TQ_PUSHPAGE instead of
TQ_SLEEP.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1734
Closes #1749
2013-09-25 15:38:44 -07:00
GregorKopka f9f3f1ef98 Removing unneeded mutex for reading vq_pending_tree size
Locking mutex &vq->vq_lock in vdev_mirror_pending is unneeded:

* no data is modified
* only vq_pending_tree is read
* in case garbage is returned (eg. vq_pending_tree being updated
  while the read is made) the worst case would be that a single
  read could be queued on a mirror side which more busy than thought

The benefit of this change is streamlining of the code path since
it is taken for *every* mirror member on *every* read.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1739
2013-09-25 15:29:45 -07:00
Kohsuke Kawaguchi 77831e1738 Reduce the stack usage of dsl_dataset_remove_clones_key
dataset_remove_clones_key does recursion, so if the recursion goes
deep it can overrun the linux kernel stack size of 8KB. I have seen
this happen in the actual deployment, and subsequently confirmed it by
running a test workload on a custom-built kernel that uses 32KB stack.

See the following stack trace as an example of the case where it would
have run over the 8KB stack kernel:

        Depth    Size   Location    (42 entries)
        -----    ----   --------
  0)    11192      72   __kmalloc+0x2e/0x240
  1)    11120     144   kmem_alloc_debug+0x20e/0x500
  2)    10976      72   dbuf_hold_impl+0x4a/0xa0
  3)    10904     120   dbuf_prefetch+0xd3/0x280
  4)    10784      80   dmu_zfetch_dofetch.isra.5+0x10f/0x180
  5)    10704     240   dmu_zfetch+0x5f7/0x10e0
  6)    10464     168   dbuf_read+0x71e/0x8f0
  7)    10296     104   dnode_hold_impl+0x1ee/0x620
  8)    10192      16   dnode_hold+0x19/0x20
  9)    10176      88   dmu_buf_hold+0x42/0x1b0
 10)    10088     144   zap_lockdir+0x48/0x730
 11)     9944     128   zap_cursor_retrieve+0x1c4/0x2f0
 12)     9816     392   dsl_dataset_remove_clones_key.isra.14+0xab/0x190
 13)     9424     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 14)     9032     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 15)     8640     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 16)     8248     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 17)     7856     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 18)     7464     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 19)     7072     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 20)     6680     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 21)     6288     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 22)     5896     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 23)     5504     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 24)     5112     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 25)     4720     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 26)     4328     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 27)     3936     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 28)     3544     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 29)     3152     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 30)     2760     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 31)     2368     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 32)     1976     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 33)     1584     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 34)     1192     232   dsl_dataset_destroy_sync+0x311/0xf60
 35)      960      72   dsl_sync_task_group_sync+0x12f/0x230
 36)      888     168   dsl_pool_sync+0x48b/0x5c0
 37)      720     184   spa_sync+0x417/0xb00
 38)      536     184   txg_sync_thread+0x325/0x5b0
 39)      352      48   thread_generic_wrapper+0x7a/0x90
 40)      304     128   kthread+0xc0/0xd0
 41)      176     176   ret_from_fork+0x7c/0xb0

This change reduces the stack usage in dsl_dataset_remove_clones_key
by allocating structures in heap, not in stack.  This is not a fundamental
fix, as one can create an arbitrary large data set that runs over any
fixed size stack, but this will make the problem far less likely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Closes #1726
2013-09-25 15:18:32 -07:00
Brian Behlendorf 34d5a5fd03 Fix zpl_mknod() return values
The zpl_mknod() function was incorrectly negating its return value.
This doesn't cause any problems in the success case, but it does
prevent us from returning the correct error code for a failure.
The implementation of this function is now consistent with all
the other zpl_* functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1717
2013-09-13 13:31:24 -07:00
Brian Behlendorf 17897ce2c8 Fix uninitialized variables
When compiling on an ARM device using gcc 4.7.3 several variables
in the zfs_obj_to_path_impl() function were flagged as uninitialized.
To resolve the warnings explicitly initialize them to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1716
2013-09-13 13:31:24 -07:00
Tim Chase 4cf652e5d4 Fix dmu_objset_find_dp() KM_SLEEP warning
After the restructuring in 13fe019 The 'zfs rename' command will
result in a KM_SLEEP being called in the sync context.  This may
deadlock due to reclaim so it was changed to KM_PUSHPAGE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1711
2013-09-11 11:49:32 -07:00
Matthew Ahrens 13fe019870 Illumos #3464
3464 zfs synctask code needs restructuring
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3464
  illumos/illumos-gate@3b2aab1880

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1495
2013-09-04 16:01:24 -07:00
Matthew Ahrens 6f1ffb0665 Illumos #2882, #2883, #2900
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2882
  https://www.illumos.org/issues/2883
  https://www.illumos.org/issues/2900
  illumos/illumos-gate@4445fffbbb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1293

Porting notes:

WARNING: This patch changes the user/kernel ABI.  That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules.  Ensure you load the matching kernel
modules from master after updating the utilities.  Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:

  $ zpool list
  failed to read pool configuration: bad address
  no pools available

  $ zfs list
  no datasets available

Add zvol minor device creation to the new zfs_snapshot_nvl function.

Remove the logging of the "release" operation in
dsl_dataset_user_release_sync().  The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos.  This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).

Squash some "may be used uninitialized" warning/erorrs.

Fix some printf format warnings for %lld and %llu.

Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.

Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
2013-09-04 15:49:00 -07:00
Brian Behlendorf 6a7c0ccca4 Use directory xattrs for symlinks
There is currently a subtle bug in the SA implementation which
can crop up which prevents us from safely using multiple variable
length SAs in one object.

Fortunately, the only existing use case for this are symlinks with
SA based xattrs.  Therefore, until the root cause in the SA code
can be identified and fixed we prevent adding SA xattrs to symlinks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1468
2013-08-22 13:30:44 -07:00
Brian Behlendorf c273d60d80 Revert "Evict meta data from ghost lists + l2arc headers"
This reverts commit fadd0c4da1 which
introduced a regression in honoring the meta limit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Close #1660
2013-08-22 12:15:37 -07:00
Richard Yao 0f37d0c8be Linux 3.11 compat: fops->iterate()
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.

To handle this the code was reworked to use the new ->iterate
interface.  Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.

Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions.  Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.

Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function.  The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names.  Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1653
Issue #1591
2013-08-15 16:19:07 -07:00
Brian Behlendorf 34e143323e Fix z_wr_iss_h zio_execute() import hang
Because we need to be more frugal about our stack usage under
Linux.  The __zio_execute() function was modified to re-dispatch
zios to a ZIO_TASKQ_ISSUE thread when we're in a context which
is known to be stack heavy.  Those two contexts are the sync
thread and what ever thread is performing spa initialization.

Unfortunately, this change introduced an unlikely bug which can
result in a zio being re-dispatched indefinitely and never being
executed.  If during spa initialization we handle a zio with
ZIO_PRIORITY_NOW it will be moved to the high priority queue.
When __zio_execute() is called again for the zio it will mis-
interpret the context and re-dispatch it again.  The system
will get stuck spinning re-dispatching the zio and making no
forward progress.

To fix this rare issue __zio_execute() has been updated not
to re-dispatch zios on either the ZIO_TASKQ_ISSUE or
ZIO_TASKQ_ISSUE_HIGH task queues.

In practice this issue was rarely reported and can usually
be fixed by rebooting the system and importing the pool again.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1455
2013-08-15 15:20:36 -07:00
Matthew Ahrens cb682a173a Illumos #3618 ::zio dcmd does not show timestamp data
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  http://www.illumos.org/issues/3618
  illumos/illumos-gate@c55e05cb35

Notes on porting to ZFS on Linux:

The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.

One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:

    zfs_vdev_time_shift = 6
        to
    zfs_vdev_time_shift = 29

The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.

(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)

Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1643
2013-08-12 16:46:50 -07:00
Richard Yao 570d6edf1d Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKS
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1589
2013-08-09 15:31:52 -07:00
Brian Behlendorf fadd0c4da1 Evict meta data from ghost lists + l2arc headers
When the meta limit is exceeded the ARC evicts some meta data
buffers from the mfu+mru lists.  Unfortunately, for meta data
heavy workloads it's possible for these buffers to accumulate
on the ghost lists if arc_c doesn't exceed arc_size.

To handle this case arc_adjust_meta() has been entended to
explicitly evict meta data buffers from the ghost lists in
proportion to what was evicted from the mfu+mru lists.

If this is insufficient we request that the VFS release
some inodes and dentries.  This will result in the release
of some dnodes which are counted as 'other' metadata.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-09 10:06:12 -07:00
Brian Behlendorf 68121a03da Allow arc_evict_ghost() to only evict meta data
The default behavior of arc_evict_ghost() is to start by evicting
data buffers.  Then only if the requested number of bytes to evict
cannot be satisfied by data buffers move on to meta data buffers.

This is ideal for honoring arc_c since it's preferable to keep the
meta data cached.  However, if we're trying to free memory from the
arc to honor the meta limit it's a problem because we will need to
discard all the data to get to the meta data.

To avoid this issue the arc_evict_ghost() is now passed a fourth
argumented describing which buffer type to start with.  The
arc_evict() function already behaves exactly like this for a
same reason so this is consistent with the existing code.

All existing callers have been updated to pass ARC_BUFC_DATA so
this patch introduces no functional change.  New callers may
pass ARC_BUFC_METADATA to skip immediately to evicting meta
data leaving the normal data untouched.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-09 10:06:08 -07:00
Saso Kiselkov 3a17a7a99a Illumos #3137 L2ARC compression
3137 L2ARC compression
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@aad02571bc
  https://www.illumos.org/issues/3137
  http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set.  This allows the legacy behavior
to be preserved.

Ported by: James H <james@kagisoft.co.uk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1379
2013-08-08 13:27:21 -07:00
Richard Yao c11a12bc3b Return -1 from arc_shrinker_func()
This is analogous to SPL commit zfsonlinux/spl@b9b3715.  While
we don't have clear evidence of systems getting caught here
indefinately like in the SPL this ensures that it will never
happen.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1579
2013-08-08 09:20:56 -07:00
Richard Yao 8170d28126 Return correct type and offset from zfs_readdir
zfs_readdir() is used by getdents(), which provides a list of all files
in directory, their types and an offset that be used by llseek() to seek
to the next directory entry.

On Solaris, the first two directory entries "." and ".." respectively
have offsets 1 and 2 on ZFS while the other files have rather large
numbers. Currently, ZFSOnLinux is  giving "." offset 0 and all other
entries large numbers. The first entry's next entry offset points to
itself, which causes software that uses llseek() in conjunction with
getdents() for filesystem navigation to enter an infinite loop.  The
offsets used for each directory entry are filesystem specific on all
platforms, so we can fix this by adopting the Solaris behavior.

Also, we currently report each directory entry as having type 0 (???).
This is not wrong, but we can do better. getdents() on Solaris does not
appear to provide this information, but it does on Linux and Mac OS X
do. ZFS provides easy access to type information in zfs_readdir(), so
this patch provides this as well.

Reported-by: Andrey <andrey@kudinov.su>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1624
2013-08-07 16:16:43 -07:00
George Wilson c61f97f426 Illumos #3639 zpool.cache should skip over readonly pools
3639 zpool.cache should skip over readonly pools
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  illumos/illumos-gate@fb02ae0252
  https://www.illumos.org/issues/3639

Normally we don't list pools that are imported read-only in the cache
file, however you can accidentally get one into the cache file by
importing and exporting a read-write pool while a read-only pool is
imported:

$ zpool import -o readonly test1
$ zpool import test2
$ zpool export test2
$ zdb -C

This is a problem because if the machine reboots we import all pools in
the cache file as read-write.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-07 16:13:56 -07:00
Brian Behlendorf 78d7a5d780 Write dirty inodes on close
When the property atime=on is set operations which only access
and inode do cause an atime update.  However, it turns out that
dirty inodes with updated atimes are only written to disk when
the inodes get evicted from the cache.  Somewhat surprisingly
the source suggests that this isn't a ZoL specific issue.

This behavior may in part explain why zfs's reclaim logic has
been observed to be slow.  When reclaiming inodes its likely
that they have a dirty atime which will force a write to disk.

Obviously we don't want to force a write to disk for every
atime update, these needs to be batched.  The right way to
do this is to fully implement the .dirty_inode and .write_inode
callbacks.  However, to do that right requires proper unification
of some fields in the znode/inode.  Then we could just mark the
inode dirty and leave it to the VFS to call .write_inode
periodically.

Until that work gets done we have to settle for some middle
ground.  The simplest and safest thing we can do for now is
to write the dirty inode on last close.  This should prevent
the majority of inodes in the cache from having dirty atimes
and not drastically increase the number of writes.

Some rudimentally testing to show how long it takes to drop
500,000 inodes from the cache shows promising results.  This
is as expected because we're no longer do lots of IO as part
of the eviction, it was done earlier during the close.

w/out patch: ~30s to drop 500,000 inodes with drop_caches.
with patch:  ~3s to drop 500,000 inodes with drop_caches.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-07 16:11:19 -07:00
Brian Behlendorf 57b650b86f Export additional dmu symbols
The dmu_prefetch, dmu_free_long_range, dmu_free_object,
dmu_prealloc, dmu_write_policy, and dmu_sync symbols have
been exported so they may be used by other modules.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-01 09:48:07 -07:00
Nathaniel Clark 7d63721118 dmu_tx: Fix possible NULL pointer dereference
dmu_tx_hold_object_impl can return NULL on error.  Check for this
condition prior to dereferencing pointer.  This can only occur if
the passed object was invalid or unallocated.

Signed-off-by: Nathaniel Clark <Nathaniel.Clark@misrule.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1610
2013-08-01 09:48:07 -07:00
Richard Yao cb543e6b5e Remove b_thawed from arc_buf_hdr_t
The code involving b_thawed appears to be dead, so lets discard it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
2013-08-01 09:48:07 -07:00
Richard Yao 3f4058cd15 Remove arc_data_buf_alloc()/arc_data_buf_free()
These functions are used in neither Illumos nor ZFSOnLinux. They appear
to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove
them.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
2013-08-01 09:48:07 -07:00
Richard Yao 4edbd2f79a Remove zio_alloc_arena
We declare zio_alloc_arena using extern, but it does not appear to exist
anywhere in the code. This permits undefined behavior, so lets remove
it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
2013-08-01 09:48:06 -07:00
Brian Behlendorf bce45ec9fb Make arc+l2arc module options writable
The l2arc module options can be made safely writable.  This allows
the options to be changed without unloading/loading the modules.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-30 15:40:20 -07:00
Brian Behlendorf c93504f03a Change l2arc_norw default to zero
These days modern SSDs can efficiently service concurrent reads
and writes.  When this flag was added that wasn't really the
case for a variety of SSD controllers.  But now we can set the
default value to take advantage of this parallelism and only
disable this as needed for specific troublesome hardware.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-29 22:05:32 -07:00
Ying Zhu 6e1d7276c9 Fix inaccurate arcstat_l2_hdr_size calculations
Based on the comments in arc.c we know that buffers can exist both
in arc and l2arc, under this circumstance both arc_buf_hdr_t and
l2arc_buf_hdr_t will be allocated. However the current logic only
cares for memory that l2arc_buf_hdr takes up when the buffer's
state transfers from or to arc_l2c_only. This will cause obvious
deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr)
is larger than ZOL's. We can implement the calcuation in the
following simple way:

1. When allocate a l2arc_buf_hdr_t we add its memory consumption
   instantly and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
   the buffer only stays in l2arc we should also add the memory
   its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to
   arcstat_l2_hdr_size since we already concerned with L2HDR_SIZE
   in step 1 and the same for transfering arc bufs from l2arc only
   state.

The testbox has 2 4-core Intel Xeon CPUs(2.13GHz), with 16GB memory
and tests were set upped in the following way:

1. Fdisked a SATA disk into two partitions, one partition for zpool
   storage and the other one was used as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
   prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc related statistics
   in /proc/spl/kstat/zfs/arcstats. After the reading ended the
   l2_hdr_size and l2_size were shown like this:

      l2_size             4       4403780608
      l2_hdr_size         4       0

   which was weird.

4. After applying this patch and reran step 1-3, the results were
   as following:

      l2_size             4       4306443264
      l2_hdr_size         4       535600

   these numbers made sense, on 64-bit systems the
   sizeof(l2arc_buf_hdr_t) is 16 bytes.  Assue all blocks cached by
   l2arc are 128KB, so 535600/16*128*1024=4387635200, since not all
   blocks are equal-sized, the theoretical result will be a little
   bigger, as we can see.

Since I'm familiar with systemtap instrumentation tool I used it to
examine what had happened. The script looked like this:

probe module("zfs").function("arc_chage_state")
{
	if ($new_state == $arc_l2_only)
		printf("change arc buf to arc_l2_only\n")
}

It will print out some information each time we call funciton
arc_chage_state if the argument new_state is arc_l2_only.  I
gathered the trace logs and found that none of the arc bufs ran
into arc state arc_l2_only when the tests was running, this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2_only when the pool or the filesystem was offlined.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-29 22:05:26 -07:00
Brian Behlendorf dba1d70566 Fix arc_adapt() spinning in iterate_supers_type()
The iterate_supers_type() function which was introduced in the
3.0 kernel was supposed to provide a safe way to call an arbitrary
function on all super blocks of a specific type.  Unfortunately,
because a list_head was used a bug was introduced which made it
possible for iterate_supers_type() to get stuck spinning on a
super block which was just deactivated.

This can occur because when the list head is removed from the
fs_supers list it is reinitialized to point to itself.  If the
iterate_supers_type() function happened to be processing the
removed list_head it will get stuck spinning on that list_head.

The bug was fixed in the 3.3 kernel by converting the list_head
to an hlist_node.  However, to resolve the issue for existing
3.0 - 3.2 kernels we detect when a list_head is used.  Then to
prevent the spinning from occurring the .next pointer is set to
the fs_supers list_head which ensures the iterate_supers_type()
function will always terminate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1045
Closes #861
Closes #790
2013-07-17 09:28:06 -07:00
Brian Behlendorf c9ada6d5a0 Fix read-only pool hang on unmount
During mount a filesystem dataset would have the MS_RDONLY bit
incorrectly cleared even if the entire pool was read-only.
There is existing to code to handle this case but it was being run
before the property callbacks were registered.  To resolve the
issue we move this read-only code after the callback registration.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1338
2013-07-17 09:22:23 -07:00
Brian Behlendorf 76351672c2 Fix zfsctl_expire_snapshot() deadlock
It is possible for an automounted snapshot which is expiring to
deadlock with a manual unmount of the snapshot.  This can occur
because taskq_cancel_id() will block if the task is currently
executing until it completes.  But it will never complete because
zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which
zfsctl_expire_snapshot() must acquire.

---------------------- z_unmount/0:2153 ---------------------
  mutex_lock                <blocking on zsb->z_ctldir_lock>
  zfsctl_unmount_snapshot
  zfsctl_expire_snapshot
  taskq_thread

------------------------- zfs:10690 -------------------------
  taskq_wait_id             <waiting for z_unmount to exit>
  taskq_cancel_id
  __zfsctl_unmount_snapshot
  zfsctl_unmount_snapshot   <takes zsb->z_ctldir_lock>
  zfs_unmount_snap
  zfs_ioc_destroy_snaps_nvl
  zfsdev_ioctl
  do_vfs_ioctl

We resolve the deadlock by dropping the zsb->z_ctldir_lock before
calling __zfsctl_unmount_snapshot().  The lock is only there to
prevent concurrent modification to the zsb->z_ctldir_snaps AVL
tree.  Moreover, we're careful to remove the zfs_snapentry_t from
the AVL tree before dropping the lock which ensures no other tasks
can find it.  On failure it's added back to the tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes #1527
2013-07-12 10:06:53 -07:00
Brian Behlendorf 556011dbec Improve N-way mirror performance
The read bandwidth of an N-way mirror can by increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.

The existing algorthm selects a perferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror.  It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them.  In practice this results in the
leaf vdevs being under utilized.

Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO.  This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror.  Faster vdevs will be sent more work and
the mirror performance will not be limitted by the slowest drive.

In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching stratagy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T microseconds.  Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.

The testing results show a significant performance improvement
for all four workloads tested.  The workloads were generated
using the fio benchmark and are as follows.

1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).

               | Pristine              |  With 1461             |
               | Sequential  Random    |  Sequential  Random    |
               | 1MB  4KB    1MB  4KB  |  1MB  4KB    1MB  4KB  |
               | MB/s MB/s   IO/s IO/s |  MB/s MB/s   IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped      | 226  243     11  304  |  222  255     11  299  |
2 2-Way Mirror | 302  324     16  534  |  433  448     23  571  |
2 3-Way Mirror | 429  458     24  714  |  648  648     41  808  |
2 4-Way Mirror | 562  601     36  849  |  816  828     82  926  |

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1461
2013-07-11 13:53:50 -07:00
Prakash Surya 92334b14ec Add new kstat for monitoring time in dmu_tx_assign
This change adds a new kstat to gain some visibility into the amount of
time spent in each call to dmu_tx_assign. A histogram is exported via
a new dmu_tx_assign_histogram-$POOLNAME file. The information contained
in this histogram is the frequency dmu_tx_assign took to complete given
an interval range. For example, given the below histogram file:

    $ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank
    12 1 0x01 32 1536 19792068076691 20516481514522
    name                            type data
    1 us                            4    859
    2 us                            4    252
    4 us                            4    171
    8 us                            4    2
    16 us                           4    0
    32 us                           4    2
    64 us                           4    0
    128 us                          4    0
    256 us                          4    0
    512 us                          4    0
    1024 us                         4    0
    2048 us                         4    0
    4096 us                         4    0
    8192 us                         4    0
    16384 us                        4    0
    32768 us                        4    1
    65536 us                        4    1
    131072 us                       4    1
    262144 us                       4    4
    524288 us                       4    0
    1048576 us                      4    0
    2097152 us                      4    0
    4194304 us                      4    0
    8388608 us                      4    0
    16777216 us                     4    0
    33554432 us                     4    0
    67108864 us                     4    0
    134217728 us                    4    0
    268435456 us                    4    0
    536870912 us                    4    0
    1073741824 us                   4    0
    2147483648 us                   4    0

one can see most calls to dmu_tx_assign completed in 32us or less, but a
few outliers did not. Specifically, 4 of the calls took between 262144us
and 131072us. This information is difficult, if not impossible, to gather
without this change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1584
2013-07-11 13:53:44 -07:00
Brian Behlendorf bf89c19914 Log pool suspension warnings to the console
In the event that a pool gets suspended log this information to
the console.  This is critical information and we want to make
sure it gets logged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1555
2013-07-10 15:15:52 -07:00
Brian Behlendorf abc41ac7c7 Use GFP_NOIO in vdev_disk_io_flush()
To avoid a potential deadlock when using a zvol as a swap
device prevent vdev_disk_io_flush() from performing IO during
the bio_alloc().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1508
2013-07-10 14:12:21 -07:00
Ying Zhu b4f7f10527 Improve code in arc_buf_remove_ref
When we remove references of arc bufs in the arc_anon state we
needn't take its header's hash_lock, so postpone it to where we
really need it to avoid unnecessary invocations of function buf_hash.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1557
2013-07-09 11:53:28 -07:00
Shen Yan 8e07b99b2f Update zio.c
The cv_wait_io is used to account io time instead of cv_wait.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1566
2013-07-09 10:41:46 -07:00
Brian Behlendorf 31455ab130 Add zfs_autoimport_disable tunable
There are times when it is desirable for zfs to not automatically
populate the spa namespace at module load time using the pools
in the /etc/zfs/zpool.cache file.  The zfs_autoimport_disable
module option has been added to control this behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #330
2013-07-09 10:11:19 -07:00
Chris Dunlop a1d9543a39 3.10 API change: block_device_operations->release() returns void
Linux kernel commit torvalds/linux@db2a144 changed the return type
of block_device_operations->release() to void.  Detect the expected
prototype and defined our callout accordingly.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1494
2013-07-08 15:41:57 -07:00
Brian Behlendorf 91604b298c Open pools asynchronously after module load
One of the side effects of calling zvol_create_minors() in
zvol_init() is that all pools listed in the cache file will
be opened.  Depending on the state and contents of your pool
this operation can take a considerable length of time.

Doing this at load time is undesirable because the kernel
is holding a global module lock.  This prevents other modules
from loading and can serialize an otherwise parallel boot
process.  Doing this after module inititialization also
reduces the chances of accidentally introducing a race
during module init.

To ensure that /dev/zvol/<pool>/<dataset> devices are
still automatically created after the module load completes
a udev rules has been added.  When udev notices that the
/dev/zfs device has been create the 'zpool list' command
will be run.  This then will cause all the pools listed
in the zpool.cache file to be opened.

Because this process in now driven asynchronously by udev
there is the risk of problems in downstream distributions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #756
Issue #1020
Issue #1234
2013-07-03 09:24:38 -07:00
Richard Yao 2a3871d4bc Cleanup zvol initialization code
The following error will occur on some (possibly all) kernels
because blk_init_queue() will try to take the spinlock before
we initialize it.

  BUG: spinlock bad magic on CPU#0, zpool/4054
   lock: 0xffff88021a73de60, .magic: 00000000,
   .owner: <none>/-1, .owner_cpu: 0
  Pid: 4054, comm: zpool Not tainted 3.9.3 #11
  Call Trace:
   [<ffffffff81478ef8>] spin_dump+0x8c/0x91
   [<ffffffff81478f1e>] spin_bug+0x21/0x26
   [<ffffffff812da097>] do_raw_spin_lock+0x127/0x130
   [<ffffffff8147d851>] _raw_spin_lock_irq+0x21/0x30
   [<ffffffff812c2c1e>] cfq_init_queue+0x1fe/0x350
   [<ffffffff812aacb8>] elevator_init+0x78/0x140
   [<ffffffff812b2677>] blk_init_allocated_queue+0x87/0xb0
   [<ffffffff812b26d5>] blk_init_queue_node+0x35/0x70
   [<ffffffff812b271e>] blk_init_queue+0xe/0x10
   [<ffffffff8125211b>] __zvol_create_minor+0x24b/0x620
   [<ffffffff81253264>] zvol_create_minors_cb+0x24/0x30
   [<ffffffff811bd9ca>] dmu_objset_find_spa+0xea/0x510
   [<ffffffff811bda71>] dmu_objset_find_spa+0x191/0x510
   [<ffffffff81253ea2>] zvol_create_minors+0x92/0x180
   [<ffffffff811f8d80>] spa_open_common+0x250/0x380
   [<ffffffff811f8ece>] spa_open+0xe/0x10
   [<ffffffff8122817e>] pool_status_check.part.22+0x1e/0x80
   [<ffffffff81228a55>] zfsdev_ioctl+0x155/0x190
   [<ffffffff8116a695>] do_vfs_ioctl+0x325/0x5a0
   [<ffffffff8116a950>] sys_ioctl+0x40/0x80
   [<ffffffff814812c9>] ? do_page_fault+0x9/0x10
   [<ffffffff81483929>] system_call_fastpath+0x16/0x1b
   zd0: unknown partition table

We fix this by calling spin_lock_init before blk_init_queue.

The manner in which zvol_init() initializes structures is
suspectible to a race between initialization and a probe on
a zvol. We reorganize zvol_init() to prevent that.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-03 09:23:35 -07:00
Pawel Jakub Dawidek 526af78550 Call zvol_create_minors() in spa_open_common() when initializing pool
There is an extremely odd bug that causes zvols to fail to appear on
some systems, but not others. Recently, I was able to consistently
reproduce this issue over a period of 1 month. The issue disappeared
after I applied this change from FreeBSD.

This is from FreeBSD's pool version 28 import, which occurred in
revision 219089.

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #441
Issue #599
2013-07-03 09:22:44 -07:00
George Wilson 294f68063b Illumos #3498 panic in arc_read()
3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@1b912ec710
  https://www.illumos.org/issues/3498

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1249
2013-07-02 13:34:31 -07:00
Matthew Ahrens 96b89346c0 Illumos #3122 zfs destroy filesystem should prefetch blocks
3122 zfs destroy filesystem should prefetch blocks
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@b4709335aa
  https://www.illumos.org/issues/3122

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1565
2013-07-02 13:34:02 -07:00
Cyril Plisko 29dee3ee9a Add zfs_sync_pass_* tunable parameters
Commit 55d85d5a8c (backport of
the upstream changes) replaced three hardcoded constants:

    #define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass */
    #define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
    #define SYNC_PASS_REWRITE       1 /* rewrite new bps after this pass */

with a tunable parameters:

    int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */
    int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
    int zfs_sync_pass_rewrite = 2;       /* rewrite new bps starting in this pass */

This commit makes these tunables available as module parameters
in Linux.  They should only be used for performance analysis
because changing them can result in subtle and pathological
performance problems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1562
2013-07-02 09:34:18 -07:00
Li Dongyang 802e7b5feb Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.

Tested with xfstests 285 and 286.  All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.

Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.

An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1384
2013-07-02 09:24:43 -07:00
Matthew Ahrens cf91b2b6b2 Readd zfs_holey() from OpenSolaris
This patch restores the zfs_holey() function from OpenSolaris.
This was removed by commit 3558fd7 because it wasn't clear we
had a use for it in ZoL.  However, this functionality is a
prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #1384
2013-07-02 09:24:18 -07:00
shenyan1 0a6bef26ec kmem_zalloc(..., KM_SLEEP) will never fail
By definitition these allocations will never fail.  For
consistency with the rest of the code remove this dead error
handling code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1558
2013-07-01 14:51:48 -07:00
Tim Chase ab68b6e5db Fix zfs_sb_teardown/zfs_resume_fs NULL dereference
Fix a pair of conditions in which a concurrent umount can cause
NULL pointer dereferences:

* zfs_sb_teardown - prevent a NULL dereference by not calling
                    dmu_objset_pool with a null z_os.

* zfs_resume_fs - don't try to unmount with a null z_os.  This
                  change makes the ZoL code more consistent
                  with both Illumos and FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1543
2013-07-01 14:51:45 -07:00
Ying Zhu c12936b141 Fix module probe failure on 32-bit systems
Previous commit 7ef5e54e2e caused
module probe failure on 32-bit systems, dmesg showed

  Unknown symbol __moddi3

This was caused by the modulo operation 'gethrtime() % tqs->stqs_count'
in the committed code.  Instead of implementing __moddi3 for all 32-bit
systems, Behlendorf advised we can just cast the return value of
gethrtime() into a uint64_t, since gethrtime does not return negative
value on all circumstances we need not care about the potential overflow.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1551
2013-06-27 10:01:25 -07:00
Brian Behlendorf 88c283952f Return -EOPNOTSUPP for ZFS_IOC_{GET|SET}FLAGS
Until these hooks are fully implemented return the expected
-EOPNOTSUPP error to indicate they are not functional.  This
allows test suites such as xfstests to cleanly skip testing
this functionality until it's implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #229
2013-06-26 15:20:13 -07:00
Brian Behlendorf 81eaf15107 Register correct handlers in nvlist_alloc()
The non-blocking allocation handlers in nvlist_alloc() would be
mistakenly assigned if any flags other than KM_SLEEP were passed.
This meant that nvlists allocated with KM_PUSHPUSH or other KM_*
debug flags were effectively always using atomic allocations.

While these failures were unlikely it could lead to assertions
because KM_PUSHPAGE allocations in particular are guaranteed to
succeed or block.  They must never fail.

Since the existing API does not allow us to pass allocation
flags to the private allocators the cleanest thing to do is to
add a KM_PUSHPAGE allocator.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#249
2013-06-20 09:58:15 -07:00
Matthew Ahrens df4474f92d Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@6e6d5868f5
  https://www.illumos.org/issues/3805

ZFS should proactively evict freed blocks from the cache.

On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk.  We were wasting about half the
system's RAM (252GB) on blocks that have been freed.

Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:

1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.

2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)

The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself).  The secondary control is to
evict data before evicting metadata.

Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks.  Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).

On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB.  However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless.  Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.

The problem of cache segmentation is a larger problem that needs more
investigation.  However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-20 09:55:52 -07:00
George Wilson e51be06697 Illumos #3552, #3564
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@16a4a80742
  https://www.illumos.org/issues/3552
  https://www.illumos.org/issues/3564

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1513
2013-06-19 16:22:39 -07:00
Madhav Suresh c99c90015e Illumos #3006
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
     argument is zero

Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@fb09f5aad4
  https://illumos.org/issues/3006

Requires:
  zfsonlinux/spl@1c6d149feb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1509
2013-06-19 15:14:10 -07:00
Brian Behlendorf 0377189b88 Only check directory xattr on ENOENT
When SA xattrs are enabled only fallback to checking the directory
xattrs when the name is not found as a SA xattr.  Otherwise, the SA
error which should be returned to the caller is overwritten by the
directory xattr errors.  Positive return values indicating success
will also be immediately returned.

In the case of #1437 the ERANGE error was being correctly returned
by zpl_xattr_get_sa() only to be overridden with ENOENT which was
returned by the subsequent unnessisary call to zpl_xattr_get_dir().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1437
2013-05-10 12:24:56 -07:00
Cyril Plisko 4f34b3bdf4 zfs_scrub_limit tunable is not used anywhere
As a part of scrub/resilver tuning zfs_scrub_limit fell out of use,
but the definition of the variable remained in place.
Moreover various guides still (misleadingly) mention it as a way
to influence resilver/scrub behavior.
This commit removes its finally.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1444
2013-05-06 14:14:06 -07:00
Ying Zhu ee664d4631 Fix incorrect assertions in ddt_phys_decref and ddt_sync_entry
The assertions in ddt_phys_decref and ddt_sync_entry cast ddp->ddp_refcnt
from uint64_t to int64_t, with a reference count bigger than 2^63, e.g. the
reference count of zero blocks commonly available in spare files, we may
mistakenly hit these assertations, so drop the type conversions here.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1436
2013-05-06 14:10:55 -07:00
Brian Behlendorf 044baf009a Use taskq for dump_bytes()
The vn_rdwr() function performs I/O by calling the vfs_write() or
vfs_read() functions.  These functions reside just below the system
call layer and the expectation is they have almost the entire 8k of
stack space to work with.  In fact, certain layered configurations
such as ext+lvm+md+multipath require the majority of this stack to
avoid stack overflows.

To avoid this posibility the vn_rdwr() call in dump_bytes() has been
moved to the ZIO_TYPE_FREE, taskq.  This ensures that all I/O will be
performed with the majority of the stack space available.  This ends
up being very similiar to as if the I/O were issued via sys_write()
or sys_read().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1399
Closes #1423
2013-05-06 14:05:42 -07:00
Adam Leventhal 7ef5e54e2e Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contention
3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@ec94d32
  https://illumos.org/issues/3581

Notes for Linux port:

Earlier commit 08d08eb reduced contention on this taskq lock by simply
reducing the number of z_fr_iss threads from 100 to one-per-CPU.  We
also optimized the taskq implementation in zfsonlinux/spl@3c6ed54.
These changes significantly improved unlink performance to acceptable
levels.

This patch further reduces time spent spinning on this lock by
randomly dispatching the work items over multiple independent task
queues.  The Illumos ZFS developers stated that this lock contention
only arose after "3329 spa_sync() spends 10-20% of its time in
spa_free_sync_cb()" was landed.  It's not clear if 3329 affects the
Linux port or not.  I didn't see spa_free_sync_cb() show up in
oprofile sessions while unlinking large files, but I may just not
have used the right test case.

I tested unlinking a 1 TB of data with and without the patch and
didn't observe a meaningful difference in elapsed time.  However,
oprofile showed that the percent time spent in taskq_thread() was
reduced from about 16% to about 5%.  Aside from a possible slight
performance benefit this may be worth landing if only for the sake of
maintaining consistency with upstream.

Ported-by: Ned Bass <bass6@llnl.gov>
Closes #1327
2013-05-06 14:05:37 -07:00
George Wilson 55d85d5a8c Illumos #3329, #3330, #3331, #3335
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@01f55e48fb
  https://www.illumos.org/issues/3329
  https://www.illumos.org/issues/3330
  https://www.illumos.org/issues/3331
  https://www.illumos.org/issues/3335

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-06 12:39:34 -07:00
George Wilson 5853fe790d Illumos #3306, #3321
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man
     page and help

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@31d7e8fa33
  https://www.illumos.org/issues/3306
  https://www.illumos.org/issues/3321

The vdev_file.c implementation in this patch diverges significantly
from the upstream version.  For consistenty with the vdev_disk.c
code the upstream version leverages the Illumos bio interfaces.
This makes sense for Illumos but not for ZoL for two reasons.

1) The vdev_disk.c code in ZoL has been rewritten to use the
   Linux block device interfaces which differ significantly
   from those in Illumos.  Therefore, updating the vdev_file.c
   to use the Illumos interfaces doesn't get you consistency
   with vdev_disk.c.

2) Using the upstream patch as is would requiring implementing
   compatibility code for those Solaris block device interfaces
   in user and kernel space.  That additional complexity could
   lead to confusion and doesn't buy us anything.

For these reasons I've opted to simply move the existing vn_rdwr()
as is in to the taskq function.  This has the advantage of being
low risk and easy to understand.  Moving the vn_rdwr() function
in to its own taskq thread also neatly avoids the possibility of
a stack overflow.

Finally, because of the additional work which is being handled by
the free taskq the number of threads has been increased.  The
thread count under Illumos defaults to 100 but was decreased to 2
in commit 08d08e due to contention.  We increase it to 8 until
the contention can be address by porting Illumos #3581.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1354
2013-05-03 16:53:52 -07:00
George.Wilson cc92e9d0c3 3246 ZFS I/O deadman thread
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation

*) Usage of the cyclic interface was replaced by the delayed taskq
   interface.  This avoids the need to implement new compatibility
   code and allows us to rely on the existing taskq implementation.

*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
   because declaring externs in source files as was done in the
   original patch is just plain wrong.

*) Instead of panicing the system when the deadman triggers a
   zevent describing the blocked vdev and the first pending I/O
   is posted.  If the panic behavior is desired Linux provides
   other generic methods to panic the system when threads are
   observed to hang.

*) For reference, to delay zios by 30 seconds for testing you can
   use zinject as follows: 'zinject -d <vdev> -D30 <pool>'

References:
  illumos/illumos-gate@283b84606b
  https://www.illumos.org/issues/3246

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1396
2013-05-01 17:05:52 -07:00
Brian Behlendorf 57f5a2008e Fix txg_quiesce thread deadlock
A deadlock was accidentally introduced by commit e95853a which
can occur when the system is under memory pressure.  What happens
is that while the txg_quiesce thread is holding the tx->tx_cpu
locks it enters memory reclaim.  In the context of this memory
reclaim it then issues synchronous I/O to a ZVOL swap device.
Because the txg_quiesce thread is holding the tx->tx_cpu locks
a new txg cannot be opened to handle the I/O.  Deadlock.

The fix is straight forward.  Move the memory allocation outside
the critical region where the tx->tx_cpu locks are held.  And for
good measure change the offending allocation to KM_PUSHPAGE to
ensure it never attempts to issue I/O during reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1274
2013-04-26 14:42:36 -07:00
Brian Behlendorf f706421173 Correctly return ERANGE in getxattr(2)
According to the getxattr(2) man page the ERANGE errno should be
returned when the size of the value buffer is to small to hold the
result.  Prior to this patch the implementation would just truncate
the value to size bytes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1408
2013-04-24 12:35:04 -07:00
Chris Dunlop 254255f735 Trivial spelling fix
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1411
2013-04-19 15:43:16 -07:00
Caleb James DeLisle 8f1e11b610 Remove .readdir from zpl_file_operations table
The zpl_readdir() function shouldn't be registered as part of
the zpl_file_operations table, it must only be part of the
zpl_dir_file_operations table.  By removing this callback
the VFS will now correctly return ENOTDIR when calling
getdents() on a file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1404
2013-04-19 15:36:47 -07:00
Martin Matuska b28e57cb82 Allow setting a lower ashift with -o ashift
Previous patches have allowed you to set an increased ashift to
avoid doing 512b IO with 4k sector devices.  However, it was not
possible to set the ashift lower than the reported physical sector
size even when a smaller logical size was supported.  In practice,
there are several cases where settong a lower ashift is useful:

* Most modern drives now correctly report their physical sector
  size as 4k.  This causes zfs to correctly default to using a 4k
  sector size (ashift=12).  However, for some usage models this
  new default ashift value causes an unacceptable increase in
  space usage.  Filesystems with many small files may see the
  total available space reduced to 30-40% which is unacceptable.

* When replacing a drive in an existing pool which was created
  with ashift=9 a modern 4k sector drive cannot be used.  The
  'zpool replace' command will issue an error that the new drive
  has an 'incompatible sector alignment'.  However, by allowing
  the ashift to be manual specified as smaller, non-optimal,
  value the device may still be safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1381
Closes #1328
Issue #967
Issue #548
2013-04-12 10:50:46 -07:00
George Wilson 295304bed6 Illumos #3422, #3425
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@bda8819455
  https://www.illumos.org/issues/3422
  https://www.illumos.org/issues/3425

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1390
2013-04-12 09:01:36 -07:00
Jan Engelhardt ea0fcfc875 gitignore: anchor entries at their respective directory
.ko is specific to module, .m4 to config, etc.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 10:50:17 -07:00
Jan Engelhardt 4e95cc99b0 build: resolve orthographic and other grammatical errors
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 10:44:52 -07:00
Brian Behlendorf 5dc6af0eec Add zio_ddt_free()+ddt_phys_decref() error handling
The assumption in zio_ddt_free() is that ddt_phys_select() must
always find a match.  However, if that fails due to a damaged
DDT or some other reason the code will NULL dereference in
ddt_phys_decref().

While this should never happen it has been observed on various
platforms.  The result is that unless your willing to patch the
ZFS code the pool is inaccessible.  Therefore, we're choosing
to more gracefully handle this case rather than leave it fatal.

http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1308
2013-03-19 13:01:01 -07:00
Brian Behlendorf 30b92c1de6 Add metaslab_debug option
Enabling metaslab debugging will prevent space maps from being
automatically unloaded.  This can significantly increase the
memory footprint but being able to dynamically control this is
helpful for debugging and certain performance testing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-18 16:47:43 -07:00
Richard Yao 1c24b699b0 Linux 3.9 compat: Undefine GCC_VERSION
The mainline kernel started defining GCC_VERSION with commit
torvalds/linux@3f3f8d2f48.
Unfortunately, LZ4 also defines this macro, but the two
defintions are incompatible. We undefine GCC_VERSION in lz4.c
to handle this.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1339
2013-03-06 15:48:48 -08:00
Ned Bass 92db59ca3b Refresh links to web site
A few files still refer to @behlendorf's private fork on
github.  Use the primary web site URL instead.  Two typos
are also corrected.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-06 15:46:41 -08:00
Brian Behlendorf d09f98a9a6 Add KMODDIR to install target
Provide a mechanism to control the directory name the modules
are installed in.  The kernel privdes INSTALL_MOD_DIR for
this but it was hardcoded to be 'addon/zfs'.

Add a KMODDIR variable which can be passed to 'make install'
to override the default directory name.  While we're here
change the default from 'addon/zfs' to 'extra' which is the
kernel.org default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-06 15:46:40 -08:00
Eric Dillmann 0b4d1b5853 Add snapdev=[hidden|visible] dataset property
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices.  By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/.  When set to
'visible' all zvol snapshots for the dataset will be visible.

This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created.  When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions.  This leads
to a variety of issues:

1) The zvol partition tables will be read in the context of
   the `modprobe zfs` for automatically imported pools.  This
   is undesirable and should be done asynchronously, but for
   now reducing the number of visible devices helps.

2) Udev expects to be able to complete its work for a new
   block devices fairly quickly.  When many zvol devices are
   added at the same time this is no longer be true.  It can
   lead to udev timeouts and missing /dev/zvol links.

3) Simply having lots of devices in /dev/ can be aukward from
   a management standpoint.  Hidding the devices your unlikely
   to ever use helps with this.  Any snapshot device which is
   needed can be made visible by changing the snapdev property.

NOTE: This patch changes the default behavior for zvols which
      was effectively 'snapdev=visible'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1235
Closes #945
Issue #956
Issue #756
2013-03-05 12:37:54 -08:00
George Wilson a4430fce69 Merge zvol.c changes from PSARC 2010/306 Read-only ZFS pools
The changes to zvol.c were never merged from the last onnv_147
bulk update.  This was because zvol.c was largely rewritten
for Linux making it fairly easy to miss these sorts of changes.

This causes a regression when importing a zpool with zvols
read-only.  This does not impact pool which only contain
filesystem datasets.

References:
  illumos/illumos-gate@f9af39b

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1332
Closes #1333
2013-03-04 09:56:13 -08:00
Richard Yao b01615d5ac Constify structures containing function pointers
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.

Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:

vdev_opts_t             52
zil_replay_func_t       24
zio_compress_info_t     24
zio_checksum_info_t     9
space_map_ops_t         7
arc_byteswap_func_t     5

The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1300
2013-03-04 08:49:32 -08:00
Brian Behlendorf 8128bd89fb Fix hot spares
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL).  On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.

This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable.  It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.

To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.

1) Functions which open the device but only read had the
   O_EXCL flag removed and were updated to use O_RDONLY.

2) A common holder was added to the vdev disk code.  This
   allow the ZFS code to internally open the device multiple
   times but non-ZFS callers may not.

3) An exception was added to make_disks() for hot spare when
   creating partition tables.  For hot spare devices which
   are already opened exclusively we skip creating the partition
   table because this must already have been done when the disk
   was originally added as a hot spare.

Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix.  And is_spare() was moved
above make_disks() to avoid adding a forward reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #250
2013-03-01 13:31:02 -08:00
Brian Behlendorf bd99a7584a Remove wholedisk check from vdev_disk_open()
As described by the comment and enforced the by assertion the
v->vdev_wholedisk will never be -1.  The wholedisk handling
is performed by the user space utilities.  To prevent confusion
this dead code is being removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-28 12:02:59 -08:00
Brian Behlendorf 0d8103d956 Leaf vdevs should not be reopened
When vdev_disk.c was implemented for Linux we failed to handle the
reopen case.  According to the vdev_reopen() comment leaf vdevs should
not be closed or opened when v->vdev_reopening is set.  Under Linux
we would always close and open the device.

This issue was only noticed when a 'zpool scrub' command was run while
the leaf vdev device names in /dev/disk/by-vdev were missing.  The
scrub command calls vdev_reopen() which caused the vdevs to be closed
but they couldn't be reopened due to the missing links.  The result
was that all the vdevs were marked unavailable and the pool was
halted due to failmode=wait.

This patch adds the missing functionality in a similiar fashion to
to the Illumos code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-28 12:02:59 -08:00
Etienne Dechamps d9b0ebbe82 Remove the bio_empty_barrier() check.
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.

Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.

As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.

This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.

Thanks to Richard Kojedzinszky for catching this nasty bug.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1318
2013-02-24 10:22:34 -08:00
Brian Behlendorf 546c978bbd Enable zfs_arc_memory_throttle_disable by default
The zfs_arc_memory_throttle_disable module option was introduced
by commit 0c5493d470 to resolve a
memory miscalculation which could result in the txg_sync thread
spinning.

When this was first introduced the default behavior was left
unchanged until enough real world usage confirmed there were no
unexpected issues.  We've now reached that point.  Linux's
direct reclaim is working as expected so we're enabling this
behavior by default.

This helps pave the way to retire the spl_kmem_availrmem()
functionality in the SPL layer.  This was the only caller.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #938
2013-02-21 13:38:24 -08:00
Richard Yao 8dca0a9a38 Make spa.c assertions catch unsupported pre-feature flag pool versions
A couple of assertions in spa.c were designed to prevent the use of
invalid pool versions. They were written under the assumption
that all valid pools are less than SPA_VERSION. Since feature flags
jumped from 28 to 5000, any numbers in the range 28 to 5000
non-inclusive will fail to trigger them.  We switch to the new
SPA_VERSION_IS_SUPPORTED macro to correct this.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1282
2013-02-12 10:27:44 -08:00
Brian Behlendorf 9878a89d7a Add explicit MAXNAMELEN check
It turns out that the Linux VFS doesn't strictly handle all cases
where a component path name exceeds MAXNAMELEN.  It does however
appear to correctly handle MAXPATHLEN for us.

The right way to handle this appears to be to add an explicit
check to the zpl_lookup() function.  Several in-tree filesystems
handle this case the same way.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1279
2013-02-12 10:27:39 -08:00
Ned Bass ed2e157605 Switch KM_SLEEP to KM_PUSHPAGE
Two more locations where KM_SLEEP was used in a call which must
use KM_PUSHPAGE were found while using the zpool upgrade command.
See commit b8d06fc for additional details.

Also make a small correction to the comment block above
dsl_dir_open_spa().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1268
2013-02-06 11:19:58 -08:00
Brian Behlendorf dd26aa535b Cast 'zfs bad bloc' to ULL for x86
Explicitly case this value to an unsigned long long for 32-bit
systems to inform the compiler that a long type should not be
used.  Otherwise we get the following compiler error:

  dmu_send.c:376: error: integer constant is too large for
  ‘long’ type

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-04 16:39:08 -08:00
Brian Behlendorf 0c5493d470 Add zfs_arc_memory_throttle_disable module option
The way in which virtual box ab(uses) memory can throw off the
free memory calculation in arc_memory_throttle().  The result is
the txg_sync thread will effectively spin waiting for memory to
be released even though there's lots of memory on the system.

To handle this case I'm adding a zfs_arc_memory_throttle_disable
module option largely for virtual box users.  Setting this option
disables free memory checks which allows the txg_sync thread to
make progress.

By default this option is disabled to preserve the current
behavior.  However, because Linux supports direct memory reclaim
it's doubtful throttling due to perceived memory pressure is ever
a good idea.  We should enable this option by default once we've
done enough real world testing to convince ourselve there aren't
any unexpected side effects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #938
2013-02-01 11:17:14 -08:00
Brian Behlendorf 1f7c30df8f Add zfs_disable_dup_eviction module option
Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable.
It should have been made available as a module option in the
original patch but was overlooked.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-01 09:57:57 -08:00
Ned Bass 36f86f73f6 Fix mismatch between SA header size and layout
When a system attribute layout is created an inconsistency may occur
between the system attribute header (sa_hdr_phys_t) size and the
variable-sized attribute count stored in the layout.  The inconsistency
results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT
returns false:

SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab())
ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr,
tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) &&
hdr->sa_layout_info == 0)) failed

The bug originates in this snippet from sa_find_sizes().

    if (is_var_sz && var_size > 1) {
            if (P2ROUNDUP(hdrsize + sizeof (uint16_t),
                *total < full_space) {
                    hdrsize += sizeof (uint16_t);

This assumes that the current variable-sized attribute will be stored in
the current buffer and accounts for the space needed to store its size
in the sa_hdr_phys_t. However if the next attribute spills over we need
to store a blkptr_t at the end of the bonus buffer to point to the spill
block. If the current attribute is in the way of the blkptr_t then it
too will be relocated into the spill block. But since we've already
accounted for it in the header size we get the inconsistency described
above.

To avoid this, record the index of the last variable-sized attribute
that prompted a hdrsize increase, and reverse the increase if we later
determine that that attribute will be relocated to the spill block.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1250
2013-01-31 10:31:19 -08:00
Ned Bass 67629d0f08 Fix rounding discrepancy in sa_find_sizes()
A rounding discrepancy exists between how sa_build_layouts() and
sa_find_sizes() calculate when the spill block needs to be kicked in.
This results in a narrow size range where sa_build_layouts() believes
there must be a spill block allocated but due to the discrepancy there
isn't.  A panic then occurs when the hdl->sa_spill NULL pointer is
dereferenced.

The following reproducer for this bug was isolated:

    truncate -s 128m /tmp/tank
    zpool create tank /tmp/tank
    zfs create -o xattr=sa tank/fish
    ln -s `perl -e 'print "z" x 41'` /tank/fish/z
    setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z

This test results in roughly the following system attribute (SA)
layout:

  176 bytes - "standard" SA's
   41 bytes - name of symbolic link target
  100 bytes - XDR encoded nvlist for xattr
  ---
  317 bytes - total

Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes()
decides no spill block is needed. But sa_build_layouts() rounds 41 up
to 48 when computing the space requirements so it tries to switch to
the spill block.

Note that we were only able to reproduce this bug using a combination
of symbolic links and the Linux-specific xattr=sa dataset property.
So while this issue is not technically Linux-specific, it may be
difficult or impossible to hit the narrow size range needed to
reproduce it on other platforms.

To fix the discrepancy, round the running total in sa_find_sizes() up
to an 8-byte boundary before accounting for each SA, since this is how
they will be stored in the bonus and (possibly) spill buffers.

To make the intent of the code more clear, explicitly assert key
assumptions about expected alignment of data and whether spill-over
will occur.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1240
2013-01-31 10:31:13 -08:00
Adam H. Leventhal 89103a2643 Illumos #3447 improve the comment in txg.c
3447 improve the comment in txg.c

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@adbbcfface
  https://www.illumos.org/issues/3447

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-30 08:55:20 -08:00
Eric Dillmann 9759c60f1a Illumos #3035 LZ4 compression support in ZFS and GRUB
3035 LZ4 compression support in ZFS and GRUB

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>

References:
  illumos/illumos-gate@a6f561b4ae
  https://www.illumos.org/issues/3035
  http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS

This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux.  Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.

Support for GRUB has been dropped from this patch.  That code
is available but those changes will need to made to the upstream
GRUB package.

Lastly, several hunks of dead code were dropped for clarity.  They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.

Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1217
2013-01-29 09:28:20 -08:00
Chris Wedgwood ddc07fa57a Avoid gcc -Werror=maybe-uninitialized warnings
Explicitly set acl details to zero to silence gcc (zfs_acl_node_read
can't be sure zfs_acl_znode_info will set acl_count and aclsize).
Normally suppressing these warnings by setting this to zero at
declaration time is a bad idea but in this instance it's hard to
avoid and should be fairly safe.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1244
2013-01-28 09:10:29 -08:00
Brian Behlendorf 6772fb679a Use dsl_dataset_snap_lookup()
Retire the dmu_snapshot_id() function which was introduced in the
initial .zfs control directory implementation.  There is already
an existing dsl_dataset_snap_lookup() which does exactly what we
need, and the dmu_snapshot_id() function as implemented is racy.

https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1238
2013-01-25 15:07:40 -08:00
Brian Behlendorf bf01b5e616 Add d_clear_d_op() compatibility
Added d_clear_d_op() helper function which clears some flags and the
registered dentry->d_op table.  This is required because d_set_d_op()
issues a warning when the dentry operations table is already set.
For the .zfs control directory to work properly we must be able to
override the default operations table and register custom .d_automount
and .d_revalidate callbacks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1230
2013-01-23 16:33:29 -08:00
Ned Bass 1305d33a4b fzap_cursor_move_to_key() should drop l_rwlock
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock
since that function returns with the lock held on success.  All other
callers drop the lock correctly but it seems fzap_cursor_move_to_key()
does not.  This may block writers or cause VERIFY failures when the
lock is freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1215
Closes zfsonlinux/spl#143
Closes zfsonlinux/spl#97
2013-01-23 16:31:16 -08:00
Brian Behlendorf 09a661e960 Fix zpl_revalidate() NULL deref
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter.  In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.

Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1226
2013-01-22 09:38:17 -08:00
Brian Behlendorf ee93035378 Use sb->s_d_op default dentry operations
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1223
2013-01-18 15:04:23 -08:00
Massimo Maggi babf3f9b6d Fix zpool on zvol deadlock
Commit 65d56083b4 fixes the lock
inversion between spa_namespace_lock and bdev->bd_mutex but only
for the first user of spa_namespace_lock: dmu_objset_own().
Later spa_namespace_lock gets acquired by dsl_prop_get_integer()
though dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()->
spa_open()->spa_open_common() without this "protection".  By
moving the mutex release after this second use, even this
acquisition of the lock is "protected" by the ERESTARTSYS trick.

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1220
2013-01-18 09:44:55 -08:00
Brian Behlendorf 7973e464de Revert "Revert "Fix unlink/xattr deadlock""
This reverts commit 53c7411919
effectively reinstating the asynchronous xattr cleanup code.

These Linux changes were reverted because after testing
and careful contemplation I was convinced that due to the
89260a1c8851ce05ea04b23606ba438b271d890 commit they were no
longer required.

Unfortunately, the deadlock described in #1176  was a case
which wasn't considered.  At mount zfs_unlinked_drain() can
occur which will unlink a list of znodes in effectively a
random order which isn't safe.  The only reason it was safe
to originally revert this change was the we could guarantee
that the VFS would always prune the xattr leaves before the
parents.

Therefore, until we can cleanly resolve this deadlock for
all cases we need to keep this change in spite of the xattr
unlink performance penalty associated with it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1176
Issue #457
2013-01-17 11:24:20 -08:00
Brian Behlendorf 7b3e34ba5a Fix 'zfs rollback' on mounted file systems
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL.  The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.

Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back.  Then a new inode with the same inode number would
be create and collide with the existing cached inode.  Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.

Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.

The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly.  If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.

This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled.  The inode
will then only be accessible from the cached dentries.  Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated.  Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.

Special care is taken for negative dentries since they do not
reference any inode.  These dentires will be invalidate based
on when they were added to the dentry cache.  Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.

Two nice side effects of this fix are:

* Removes the dependency on spl_invalidate_inodes(), it can now
  be safely removed from the SPL when we choose to do so.

* zfs_znode_alloc() no longer requires a dentry to be passed.
  This effectively reverts this portition of the code to its
  upstream counterpart.  The dentry is not instantiated more
  correctly in the Linux ZPL layer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #795
2013-01-17 09:51:20 -08:00
Ned Bass f1a05fa114 Fix false ENOENT on snapshot control dentries
Lookups in the snapshot control directory for an existing snapshot
fail with ENOENT if an earlier lookup failed before the snapshot was
created.  This is because the earlier lookup causes a negative dentry
to be cached which is never invalidated.

The bug can be reproduced as follows (the second ls should succeed):

 $ ls /tank/.zfs/snapshot/s
 ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
 $ zfs snap tank@s
 $ ls /tank/.zfs/snapshot/s
 ls: cannot access /tank/.zfs/snapshot/s: No such file or directory

To remedy this, always invalidate cached dentries in the snapshot
control directory.  Since these entries never exist on disk there is
no significant performance penalty for the extra lookups.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1192
2013-01-16 16:28:54 -08:00
Ned Bass 94a9bb4709 Fix quoting error in unmount command
A misplaced single quote caused the umount command to fail with a
syntax error when unmounting snapshots under the .zfs/snapshot
control directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1210
2013-01-16 15:30:47 -08:00
Christopher Siden b077fd4c4e Illumos #3189 kernel panic in test hotspare_onoffline_004_neg
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@8f0b538d1d
  changeset: 13818:e9ad0a945d45
  https://www.illumos.org/issues/3189

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-14 10:34:53 -08:00
Arne Jansen ff80d9b142 Illumos #1862 incremental zfs receive fails for sparse file > 8PB
1862 incremental zfs receive fails for sparse file > 8PB

Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Simon Klinkert <klinkert@webgods.de>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@31495a1e56
  illumos changeset: 13789:f0c17d471b7a
  https://www.illumos.org/issues/1862

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-14 10:34:41 -08:00
Matthew Ahrens a94addd974 Illumos #3208 cross-endian incorrect user/group accounting
3208 moving zpool cross-endian results in incorrect user/group
accounting

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@e828a46d29
  illumos changeset: 13835:eea81edc4f14
  https://www.illumos.org/issues/3208

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #627
Closes #1136
2013-01-14 09:32:22 -08:00
Bart Coddens 5c83989071 Illumos #2618 arc.c mistypes in the comments
2618 arc.c mistypes in the comments

Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@fc98fea58e
  illumos changeset: 13721:5b51a16a186f
  https://www.illumos.org/issues/2618

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-11 09:16:59 -08:00
Ned Bass 761394b3af call_usermodehelper() should wait for process
As of Linux 3.4 the UMH_WAIT_* constants were renumbered.  In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process).  A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with this
change.

One visible consequence of this change was that processes accessing
automounted snapshots received an ELOOP error because they failed to
wait for zfs.mount to complete.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
2013-01-09 16:54:52 -08:00
Brian Behlendorf 1c50c992ba Revert "Avoid ELOOP on auto-mounted snapshots"
This reverts commit 7afcf5b1da which
accidentally introduced a regression with the .zfs snapshot directory.
While the updated code still does correctly mount the requested
snapshot.  It updates the vfsmount such that it references the
original dataset vfsmount.  The result is that the snapshot itself
isn't visible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #816
2013-01-09 11:24:47 -08:00
Brian Behlendorf 4cec9b2dc7 Only reduce __zio_execute() stack usage in kernel space
Related to 91579709fc we need to
be very careful about not overrunning the stack in kernel space.
However, in user space we're already allowing slightly larger
stacks so this stack usage optimization is not required there.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-09 10:34:35 -08:00
George Wilson 1eb5bfa3dc Illumos #3145, #3212
3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos-gate/commit/9253d63df408bb48584e0b1abfcc24ef2472382e
  illumos changeset: 13840:97fd5cdf328a
  https://www.illumos.org/issues/3145
  https://www.illumos.org/issues/3212

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #989
Closes #1137
2013-01-08 10:35:44 -08:00
Matthew Ahrens 753c38392d Illumos #3104: eliminate empty bpobjs
3104 eliminate empty bpobjs
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@f174573681
  illumos changeset: 13782:8f78aae28a63
  https://www.illumos.org/issues/3104

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Brian Behlendorf 91579709fc Fix __zio_execute() asynchronous dispatch
To save valuable stack all zio's were made asynchronous when in the
tgx_sync_thread context or during pool initialization.  See commit
2fac4c2 for the original patch and motivation.

Unfortuantely, the changes to dsl_pool_sync_context() made by the
feature flags broke this logic causing in __zio_execute() to dispatch
itself infinitely when called during pool initialization.  This
commit refines the existing logic to specificly target only the two
cases we care about.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson ea0b2538cd Illumos #3349: zpool upgrade -V bumps the on disk version number
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@25345e4666
  https://www.illumos.org/issues/3349

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Matthew Ahrens 29809a6cba Illumos #3086: unnecessarily setting DS_FLAG_INCONSISTENT on async
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async
destroyed datasets
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@ce636f8b38
  illumos changeset: 13776:cd512c80fd75
  https://www.illumos.org/issues/3086

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Christopher Siden b9b24bb4ca Illumos #2762: zpool command should have better support for feature flags
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@57221772c3
  https://www.illumos.org/issues/2762

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson 3bc7e0fb0f Illumos #3090 and #3102
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@dfbb943217
  illumos changeset: 13777:b1e53580146d
  https://www.illumos.org/issues/3090
  https://www.illumos.org/issues/3102

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #939
2013-01-08 10:35:42 -08:00
Christopher Siden 9ae529ec5d Illumos #2619 and #2747
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@53089ab7c8
  illumos/illumos-gate@ad135b5d64
  illumos changeset: 13700:2889e2596bd6
  https://www.illumos.org/issues/2619
  https://www.illumos.org/issues/2747

NOTE: The grub specific changes were not ported.  This change
must be made to the Linux grub packages.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:35 -08:00
Ned Bass 37f000c5aa Fix gcc array subscript above bounds warning
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c

    } else {
            int i;
            ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
            /* the blkptrs we are losing better be unallocated */
            for (i = dn->dn_next_nblkptr[txgoff];
                i < dnp->dn_nblkptr; i++)
                    ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));

This usage is in fact safe, since the ASSERT ensures the index does
not exceed to maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning.  To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body.  Since a dnode contains
at most 3 block pointers this overhead should be negligible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #950
2013-01-07 11:21:52 -08:00
Matt Johnston 72938d6905 Use cv_wait_io() which will will account for iowait
Update zio_wait() to use cv_wait_io() to ensure the iowait time
is properly accounted for.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:52:52 -08:00
Matt Johnston 72f53c5694 Revert part of "Log I/Os longer than zio_delay_max (30s default)"
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several 100 seconds.

The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device.  This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.

This change isn't in any way harmful but it does conflict with
a required change to properly account from I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely  on io_schedule() via cv_wait_io().

The additional debugging information which was added to the
delay event has been intentionally left in place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:51:04 -08:00
Brian Behlendorf 65d56083b4 Fix zpool on zvol lock inversion deadlock
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock.  But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.

To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock().  Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.

When it is contended we risk a lock inversion if we were to
block waiting for the lock.  Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again.  This process can be repeated safely until both
locks are acquired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #612
2012-12-20 09:57:39 -08:00
Brian Behlendorf d5446cfc52 Revert "Remove TSD zfs_fsyncer_key"
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-20 09:56:28 -08:00
Brian Behlendorf 31f2b5abdf Remove TSD zfs_fsyncer_key
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit().  This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.

However, under Linux I'm not seeing the same contention that
was originally described.  Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD.  If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.

This just leaves one small ZFS TSD consumer.  If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#174
2012-12-19 09:08:01 -08:00
Prakash Surya 84daaddedb Set elevator for DM devices despite vdev_wholedisk
The current state of udev and devicer-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:

    $ ls -d /dev/dm-*
    /dev/dm-0
    /dev/dm-1
    /dev/dm-2
    /dev/dm-3

it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:

    $ ls -d /dev/sd*
    /dev/sda
    /dev/sda1
    /dev/sdb
    /dev/sdb1

Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.

As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.

This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1149
2012-12-18 15:12:40 -08:00
Jorgen Lundman 6c2856726f Fix using zvol as slog device
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented.  This prevented
a zpool from use a zvol for its slog device.

This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c.  Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system.  If
they match we know the device is a zvol.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1131
2012-12-18 11:02:28 -08:00
Brian Behlendorf 8780c53961 Update SAs when an inode is dirtied
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed.  That change
accidentally resulted in unexpected ctime updates which upset tools
like git.  That was always a horrible hack and I'm happy it will
never make it in to a tagged release.

The right fix is something I initially resisted doing because I
was worried about the additional overhead.  However, in hindsight
the overhead isn't as bad as I feared.

This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied.  We leverage this
callback to keep the znode SAs strictly in sync with the inode.

However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime.  This will cover the callpath of most concern to us.

  ->filemap_page_mkwrite->file_update_time->update_time->
      mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
Closes #1140
2012-12-14 12:18:54 -08:00
Ned Bass 7afcf5b1da Avoid ELOOP on auto-mounted snapshots
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns.  Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
2012-12-13 08:57:11 -08:00
Brian Behlendorf 2ae1031962 Linux 3.7 compat, schedule_delayed_work()
Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers.   While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.

I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel.  But frankly
the compatibility code to use this kernel interface has always
been problematic.

Therefore, this patch abandons direct use the of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1053
2012-12-12 10:47:05 -08:00
Richard Yao e4d89e9cfc Switch KM_SLEEP to KM_PUSHPAGE
When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning.  See commit
b8d06fca08 for additional details.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1138
2012-12-10 09:44:45 -08:00
Brian Behlendorf 53c7411919 Revert "Fix unlink/xattr deadlock"
This reverts commit b00131d43c which
is no longer needed due to e89260a1c8.

This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently.  This effectively prevents
the deadlock condition from ever occuring.

Therefore we can safely revert back to the upstream synchronous
cleanup code.  This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #457
2012-12-05 13:41:30 -08:00
Brian Behlendorf d3aa3ea96e Preserve inode mtime/ctime in .writepage()
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().

But more importantly than preserving the exact time add the missing
call to sa_bulk_update().  This ensures that the znode modifications
are written to disk as part of the transaction.  Without this the
inode may mistaken rollback to the previous on-disk znode state.

Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode.  This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.

Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
2012-12-05 13:00:25 -08:00
Brian Behlendorf e89260a1c8 Directory xattr znodes hold a reference on their parent
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent.  Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it.  Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.

This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.

The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX.  Otherwise we
run the risk of deadlocking on the open create TX.

Ideally the security xattr should be fully constructed
before the new inode is unlocked.  However, doing so would
require far more extensive changes to ZFS.

This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #671
2012-12-03 12:10:46 -08:00
Brian Behlendorf c3275b56a1 Add load_nvlist() error handling
Add the missing error handling to load_nvlist().  There's no good
reason this needs to be fatal.  All callers of load_nvlist() do
correctly handle an error condition and it is preferable that an
error be returned.  This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1120
2012-11-30 13:48:17 -08:00
Brian Behlendorf 004324ecc6 Disable page allocation warnings for super block
Due to the slightly increased size of the ZFS super block
caused by 30315d2 there are now allocation warnings.  The
allocation size is still small (just over 8k) and super
blocks are rarely allocated so we suppress the warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1101
2012-11-30 11:04:44 -08:00
Brian Behlendorf f74a147c02 Fix NULL deref when zvol_alloc() fails
If zvol_alloc() fails zv will be set to NULL and dereferenced
in out_dmu_objset_disown.  To avoid this entirely the zv->objset
line is moved up in to the success block.

Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1109
2012-11-27 14:10:31 -08:00
George Wilson 32a9872bba Illumos #2671: zpool import should not fail if vdev ashift has increased
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>

Refererces to Illumos issue:
      https://www.illumos.org/issues/2671

This patch has been slightly modified from the upstream Illumos
version.  In the upstream implementation a warning message is
logged to the console.  To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.

The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool.  Depending on your workload
this may impact pool performance.  Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.

NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #955
2012-11-15 11:05:59 -08:00
Brian Behlendorf 4c837f0d93 Fix "allocating allocated segment" panic
Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().

Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record.  The mmap
case which relies on zfs_putpage() was not taking the range
lock.  This code was accidentally dropped when the function
was rewritten for the Linux VFS.

This patch adds the required range locking to zfs_putpage().

It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upsteam
code and there's no harm in being extra careful here.

Original-patch-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #541
2012-11-09 19:01:09 -08:00
Brian Behlendorf e26ade5101 Fix zvol+btrfs hang
When using a zvol to back a btrfs filesystem the btrfs mount
would hang.  This was due to the bio completion callback used
in btrfs assuming that lower level drivers would never modify
the bio->bi_io_vecs after they were submitted via bio_submit().
If they are modified btrfs will miscalculate which pages need
to be unlocked resulting in a hang.

It's worth mentioning that other file systems such as ext[234]
and xfs work fine because they do not make the same assumption
in the bio completion callback.

The most straight forward way to fix the issue is to present
the semantics expected by btrfs.  This is done by cloning the
bios attached to each request and then using the clones bvecs
to perform the required accounting.  The clones are freed after
each read/write and the original unmodified bios are linked back
in to the request.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #469
2012-11-09 12:24:51 -08:00
Brian Behlendorf 9dcb971983 Log I/Os longer than zio_delay_max (30s default)
There have been reports of ZFS deadlocking due to what appears to
be a lost IO.  This patch addes some debugging to determine the
exact state of the IO which neither 1) completed, 2) failed, or
3) timed out after zio_delay_max (30) seconds.

This information will be logged using the ZFS FMA infrastructure
as a 'delay' event and posted to the internal zevent log.  By
default the last 64 events will be kept in the log but the limit
is configurable via the zfs_zevent_len_max module option.

To dump the contents of the log use the 'zpool events -v' command
and look for the resource.fs.zfs.delay event.  It will include
various information about the pool, vdev, and zio which may shed
some light on the issue.

In the context of this change the 120 second kernel blocked thread
watchdog has been disabled for synchronous IOs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #930
2012-11-02 15:45:59 -07:00
Brian Behlendorf e95853a331 Add txgs-<pool> kstat file
Create a kstat file which contains useful statistics about the
last N txgs processed.  This can be helpful when analyzing pool
performance.  The new KSTAT_TYPE_TXG type was added for this
purpose and it tracks the following statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth;       - Creation time
  nread        - Bytes read
  nwritten;    - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time;   - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time;   - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-02 15:45:56 -07:00
Brian Behlendorf e8fd45a0f9 Add ddt_object_count() error handling
The interface for the ddt_zap_count() function assumes it can
never fail.  However, internally ddt_zap_count() is implemented
with zap_count() which can potentially fail.  Now because there
was no way to return the error to the caller a VERIFY was used
to ensure this case never happens.

Unfortunately, it has been observed that pools can be damaged in
such a way that zap_count() fails.  The result is that the pool can
not be imported without hitting the VERIFY and crashing the system.

This patch reworks ddt_object_count() so the error can be safely
caught and returned to the caller.  This allows a pool which has
be damaged in this way to be safely rewound for import.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #910
2012-10-29 08:57:45 -07:00
Brian Behlendorf 178e73b376 Revert "Don't ashift-align vdev read requests."
This reverts commit a5c20e2a0a which
accidentally introduced a regression for real 4k sector devices.
See issue #1065 for details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1065
2012-10-24 15:25:33 -07:00
Brian Behlendorf f21e5c6a17 Remove 'Resized bio's/dio' warning
The following warning was originally added to provide visibility
in to how often a dio gets heavily fragmented in to over 16 bios.
This can happen due to constraints imposed by the block device
and may have a negitive impact on performance but is otherwise
harmless.  To prevent needless confusion and worry the message
has been removed.

  kernel: WARNING: Resized bio's/dio to 32

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-22 10:17:10 -07:00
Brian Behlendorf c7dfc08629 Quote snapshot and mountpoint for .zfs automount
When automounting a snapshot in the .zfs/snapshot directory
make sure to quote both the dataset name and the mount point.
This ensures that if either component contains spaces, which
are allowed, they get handled correctly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1027
2012-10-17 13:26:18 -07:00
Etienne Dechamps 5d7a86d114 Use the slog even with logbias=throughput.
In the current code, logbias=throughput implies the following:
 1) All synchronous writes are logged in indirect mode.
 2) The slog is not used.

(1) makes sense because it avoids writing the data twice, which is
obviously a good thing when the user wants maximum pool throughput.

(2), however, is a surprising decision. Considering all writes are
indirect, the log record doesn't contain the actual data, only pointers
to DMU blocks. As a result, log records written in logbias=throughput
mode are quite small, and as such, it doesn't make any sense to write
them to the main pool since slogs are usually optimized for small
synchronous writes.

In fact, the current behavior is actually harmful for performance,
because log blocks and data blocks from dmu_sync() seldom have the same
allocation size and as a result are usually allocated from different
metaslabs. This means that if a spindle has to write both log blocks and
DMU blocks (which is likely to happen under heavy load), it will have to
seek between the two. Allocating the log blocks from the slog pool
instead of the main pool avoids these unnecessary seeks.

This commit makes ZFS use the slog on datasets with logbias=throughput.
Real-life performance testing shows a 50% synchronous write performance
increase with some large commit sizes, and no negative effect in other
cases.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
2012-10-17 08:56:46 -07:00
Etienne Dechamps 920dd524fb Add FASTWRITE algorithm for synchronous writes.
Currently, ZIL blocks are spread over vdevs using hint block pointers
managed by the ZIL commit code and passed to metaslab_alloc(). Spreading
log blocks accross vdevs is important for performance: indeed, using
mutliple disks in parallel decreases the ZIL commit latency, which is
the main performance metric for synchronous writes. However, the current
implementation suffers from the following issues:

1) It would be best if the ZIL module was not aware of such low-level
details. They should be handled by the ZIO and metaslab modules;

2) Because the hint block pointer is managed per log, simultaneous
commits from multiple logs might use the same vdevs at the same time,
which is inefficient;

3) Because dmu_write() does not honor the block pointer hint, indirect
writes are not spread.

The naive solution of rotating the metaslab rotor each time a block is
allocated for the ZIL or dmu_sync() doesn't work in practice because the
first ZIL block to be written is actually allocated during the previous
commit. Consequently, when metaslab_alloc() decides the vdev for this
block, it will do so while a bunch of other allocations are happening at
the same time (from dmu_sync() and other ZILs). This means the vdev for
this block is chosen more or less at random. When the next commit
happens, there is a high chance (especially when the number of blocks
per commit is slightly less than the number of the disks) that one disk
will have to write two blocks (with a potential seek) while other disks
are sitting idle, which defeats spreading and increases the commit
latency.

This commit introduces a new concept in the metaslab allocator:
fastwrites. Basically, each top-level vdev maintains a counter
indicating the number of synchronous writes (from dmu_sync() and the
ZIL) which have been allocated but not yet completed. When the metaslab
is called with the FASTWRITE flag, it will choose the vdev with the
least amount of pending synchronous writes. If there are multiple vdevs
with the same value, the first matching vdev (starting from the rotor)
is used. Once metaslab_alloc() has decided which vdev the block is
allocated to, it updates the fastwrite counter for this vdev.

The rationale goes like this: when an allocation is done with
FASTWRITE, it "reserves" the vdev until the data is written. Until then,
all future allocations will naturally avoid this vdev, even after a full
rotation of the rotor. As a result, pending synchronous writes at a
given point in time will be nicely spread over all vdevs. This contrasts
with the previous algorithm, which is based on the implicit assumption
that blocks are written instantaneously after they're allocated.

metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to
manually increase or decrease fastwrite counters, respectively. They
should be used with caution, as there is no per-BP tracking of fastwrite
information, so leaks and "double-unmarks" are possible. There is,
however, an assert in the vdev teardown code which will fire if the
fastwrite counters are not zero when the pool is exported or the vdev
removed. Note that as stated above, marking is also done implictly by
metaslab_alloc().

ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to
the metaslab when allocating (assuming ZIO does the allocation, which is
only true in the case of dmu_sync). This flag will also trigger an
unmark when zio_done() fires.

A side-effect of the new algorithm is that when a ZIL stops being used,
its last block can stay in the pending state (allocated but not yet
written) for a long time, polluting the fastwrite counters. To avoid
that, I've implemented a somewhat crude but working solution which
unmarks these pending blocks in zil_sync(), thus guaranteeing that
linguering fastwrites will get pruned at each sync event.

The best performance improvements are observed with pools using a large
number of top-level vdevs and heavy synchronous write workflows
(especially indirect writes and concurrent writes from multiple ZILs).
Real-life testing shows a 200% to 300% performance increase with
indirect writes and various commit sizes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
2012-10-17 08:56:41 -07:00
Brian Behlendorf a298dbde92 Condition variable usage, zp->r_{rd,wr}_cv
The following incorrect usage of cv_broadcast() was caught by
code inspection.  The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-15 16:02:03 -07:00
Brian Behlendorf 8c0712fd88 Condition variable usage, zilog->zl_cv_batch
The following incorrect usage of cv_signal and cv_broadcast()
was caught by code inspection.  The cv_signal and cv_broadcast()
functions must be called under the associated mutex to preventing
racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-15 16:01:58 -07:00
Brian Behlendorf 99db9bfde7 Condition variable usage, zevent_cv
The following incorrect usage of cv_broadcast() was caught by
code inspection.  The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-15 16:01:54 -07:00
Massimo Maggi 6f53a6a229 Switch KM_SLEEP to KM_PUSHPAGE
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1038
2012-10-15 09:32:38 -07:00
Brian Behlendorf c418410393 Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE
Prevent users from setting the zfs_vdev_aggregation_limit tuning
larger than SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520
2012-10-15 09:28:43 -07:00
Yuxuan Shui 45ca2d91cb Return positive error number in zfsctl_shares_lookup.
Otherwise it will cause zpl_shares_lookup() to return a invalid
pointer when an error occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes #626 #885 #947 #977
2012-10-15 09:11:56 -07:00
Yuxuan Shui 558ef6d080 Linux 3.6 compat, iops->create()
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create.  Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 14:42:25 -07:00
Yuxuan Shui 8f195a908f Linux 3.6 compat, iops->lookup()
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup.  Instead
only the inamedata->flags are passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 13:06:54 -07:00
Yuxuan Shui 3c20361075 Linux 3.6 compat, sget()
As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.

ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 13:06:48 -07:00
Yuxuan Shui af26c4d4ab Linux 3.6 compat, sops->write_super() removed
The .write_super callback was removed the the super_operations
structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e.
All file systems are now expected to self manage writing any dirty
state assoicated with their super block.

ZFS never made use of this callback so it can simply be removed
from the super_operations structure.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 11:33:56 -07:00
Etienne Dechamps a5c20e2a0a Don't ashift-align vdev read requests.
Currently, the size of read and write requests on vdevs is aligned
according to the vdev's ashift, allocating a new ZIO buffer and padding
if need be.

This makes sense for write requests to prevent read/modify/write if the
write happens to be smaller than the device's internal block size.

For reads however, the rationale is less clear. It seems that the
original code aligns reads because, on Solaris, device drivers will
outright refuse unaligned requests.

We don't have that issue on Linux. Indeed, Linux block devices are able
to accept requests of any size, and take care of alignment issues
themselves.

As a result, there's no point in enforcing alignment for read requests
on Linux. This is a nice optimization opportunity for two reasons:
- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks
  the path that the additional, useless padding data has to travel.
  For example, when using 4k-sector drives that lie about their sector
  size, using 512b read requests instead of 4k means that there will
  be less data traveling down the ATA/SCSI interface, even though the
  drive actually reads 4k from the platter.

The only exception is raidz, because raidz needs to read the whole
allocated block for parity.

This patch removes alignment enforcement for read requests, except on
raidz. Note that we also remove an assertion that checks that we're
aligning a top-level vdev I/O, because that's not the case anymore for
repair writes that results from failed reads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1022
2012-10-12 12:01:56 -07:00
Richard Yao b68503fb30 Remove vmem_size() consumers
There are currently three vmem_size() consumers all of which are
part of the ARC implemention.  However, since the expected behavior
of the Linux and Solaris virtual memory subsystems are so different
the behavior in each of these instances needs to be reevaluated.

* arc_evict_needed() - This is actually dead code.  Arena support
was never added to the SPL and zio_arena is always NULL.  This
support isn't needed so we simply remove this dead code.

* arc_memory_throttle() - On Solaris where virtual memory constitutes
almost all of the address space we can reasonably expect there to be
a fairly large amount free.  However, on Linux by default we only
have about 100MB total and that's heavily used by the ARC.  So the
expectation on Linux is that this will usually be a small value.
Therefore we remove the vmem_size() check for i386 systems because
the expectation is that it will be less than the zfs_write_limit_max.

* arc_init() - Here vmem_size() is used to initially size the ARC.
Since the ARC is currently backed by the virtual address space it
makes sense to use this as a limit on the ARC for 32-bit systems.
This code can be removed when the ARC is backed by the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #831
2012-10-12 10:03:03 -07:00
Brian Behlendorf 87d98efe9e Fix zfs_txg_timeout module parameter
Allow the zfs_txg_timeout variable to be dynamically tuned at run
time.  By pulling it down out of the variable declaration it will
be evaluted each time through the loop.

The zfs_txg_timeout variable is now declared extern in a the common
sys/txg.h header rather than locally in dsl_scan.c.  This prevents
potential type mismatches if the global variable needs to be used
elsewhere.

Move the module_param() code in to the same source file where
zfs_txg_timeout is declared.  This is the most logical location.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 15:07:09 -07:00
Richard Yao 7df05a4266 Fix zfs_write_limit_max integer size mismatch on 32-bit systems
Commit c409e4647f introduced a
number of module parameters.  This required several types to be
changed to accomidate the required module parameters Linux macros.

Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart.  If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.

The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.

We correct that by changing the extern declaration to use the unsigned
long type and move these extern definitions in to the common arc.h
header. This should make ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.

Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #749
2012-10-11 11:09:25 -07:00
Cyril Plisko 15fd274973 Make zfs_immediate_write_sz a module paramater
zfs_immediate_write_sz variable is a tunable, but lacks proper
module_param() instrumentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1032
2012-10-11 11:09:21 -07:00
Cyril Plisko 5b7e5b5ab9 txg is spelled as tgx in places
Term 'transaction group' is commonly abbreviated as txg in ZFS sources.
There are some places (Linux specific MODULE_PARAM_DESC() macros)
where it is incorrectly spelled as 'tgx'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1030
2012-10-11 09:19:08 -07:00
Massimo Maggi beb999445a Switch KM_SLEEP to KM_PUSHPAGE
Prevent snapshot_check to initiate I/O during memory allocation.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1023
2012-10-08 10:19:05 -07:00
Brian Behlendorf 7bd04f2d7d Set default zvol elevator to noop
It doesn't make sense for a zvol to use the default system I/O
scheduler because it is a virtual device.  Therefore, we change
the default scheduler to 'noop' for zvols provided that the
elevator_change() function is available.  This interface has
been available since Linux 2.6.36 and appears in the RHEL 6.x
kernels.

We deliberately do not implement the method for older kernels
because it was racy and could result in system crashes.  It's
better to simply manually tune the scheduler for these kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1017
2012-10-05 12:39:59 -07:00
Etienne Dechamps 089fa91bc5 Align DISCARD requests on zvols.
Currently, when processing DISCARD requests, zvol_discard() calls
dmu_free_long_range() with the precise offset and size of the request.

Unfortunately, this is not optimal for requests that are not aligned to
the zvol block boundaries. Indeed, in the case of an unaligned range,
dnode_free_range() will zero out the unaligned parts. Not only is this
useless since we are not freeing any space by doing so, it is also slow
because it translates to a read-modify-write operation.

This patch fixes the issue by rounding up the discard start offset to
the next volume block boundary, and rounding down the discard end
offset to the previous volume block boundary.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1010
2012-10-04 16:01:44 -07:00
Chris Dunlop d75d6f294e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fc for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1002
2012-10-04 10:44:09 -07:00
Matthew Ahrens 04434775b7 Illumos #3100: zvol rename fails with EBUSY when dirty.
illumos/illumos-gate@2e2c135528
Illumos changeset: 13780:6da32a929222

3100 zvol rename fails with EBUSY when dirty

Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #995
2012-10-03 13:59:02 -07:00
George Wilson 65947351e7 Illumos #3129, #3130
illumos/illumos-gate@d6afdce20f
Illumos changeset: 13794:7c5e0e746b2c

3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
     0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3129
  https://www.illumos.org/issues/3130

Ported by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #994
2012-10-03 13:59:02 -07:00
Brian Behlendorf 6d1d976b2c Modify vdev_elevator_switch() to use elevator_change()
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.

Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #906
2012-10-03 13:31:44 -07:00
Cyril Plisko 393b44c711 Implement .commit_metadata hook for NFS export
In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook.  All it takes there
is to make sure changes are committed to ZIL.  Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #969
2012-10-03 10:49:45 -07:00
Chris Wedgwood 23a61ccc1b zvol_probe should return NULL when the device isn't found.
Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel
doesn't expect and as such we can oops.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #949
Closes #931
Closes #789
Closes #743
Closes #730
2012-10-03 10:39:12 -07:00
Bill Pijewski 37abac6d55 Illumos #2703: add mechanism to report ZFS send progress
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2703

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-19 13:39:06 -07:00
Chris Siden 1bd201e70d Illumos #1948: zpool list should show more detailed pool info
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1948

Ported by:	Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #685
2012-09-19 13:39:05 -07:00
Brian Behlendorf 95fd8c9a7f Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #973
2012-09-19 11:52:36 -07:00
Brian Behlendorf ba367276d8 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-17 11:22:23 -07:00
Cyril Plisko 49d39798f2 ZFS replay transaction error 5
When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.

The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #933
2012-09-17 11:06:58 -07:00
Brian Behlendorf 8312c6df55 Clear PG_writeback for sync I/O error case
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.

The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit.  Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #961
2012-09-14 15:53:47 -07:00
Brian Behlendorf 5915791096 Move iput() after zfs_inode_update()
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed.  This will result in a
NULL dereference as shown by the stack trace is issue #782.

This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #782
2012-09-12 14:22:52 -07:00
Brian Behlendorf 4ca9a43644 Remove zvol device node
The 'zfs destroy' changes in 330d06f disrupted how zvol devices
get removed on ZoL.  However, it basically boils down to the
fact that we are no longer reliably calling zvol_remove_minor()
via zfs_ioc_destroy_snaps().

Therefore we add the missing call and handle things similarly
to the existing zfs_unmount_snap() case.  Ideally we would check
if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the
right thing as in zfs_ioc_destroy().  However, it looks like
it would be fairly expensive to get the type, and it's harmless
to simply attempt the umount and minor removal.

This is also an issue in the latest FreeBSD and Illumos code.
It was being tracked under the following issue, and we may want
to refresh our code when they settle on what they want to do
about it upstream.

  https://www.illumos.org/issues/3170

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #903
2012-09-10 10:25:08 -07:00
Cyril Plisko 04f9432d3b Make ZFS filesystem id persistent across different machines
Use ZFS dataset fsid guid as a unique file system id, similar to what is
done on Illumos/OpenSolaris.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #888
2012-09-06 12:47:11 -07:00
Brian Behlendorf ebcfc8a534 Disable page allocation warnings for ARC buffers
Buffers for the ARC are normally backed by the SPL virtual slab.
However, if memory is low, AND no slab objects are available,
AND a new slab cannot be quickly constructed a new emergency
object will be directly allocated.

These objects can be as large as order 5 on a system with 4k
pages.  And because they are allocated with KM_PUSHPAGE, to
avoid a potential deadlock, they are not allowed to initiate I/O
to satisfy the allocation.  This can result in the occasional
allocation failure.

However, since these allocations are allowed to block and
perform operations such as memory compaction they will eventually
succeed.  Since this is not unexpected (just unlikely) behavior
this patch disables the warning for the allocation failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #465
2012-09-06 11:53:08 -07:00
Brian Behlendorf cafa9709f3 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-05 08:44:58 -07:00
Brian Behlendorf 0ef0ff546e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 16:00:06 -07:00
Brian Behlendorf 594b4dd82a Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 08:41:12 -07:00
Chris Dunlop 20a083cbe2 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-02 10:15:49 -07:00
Brian Behlendorf b404a3f07f Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-08-31 17:39:29 -07:00
Brian Behlendorf 2b2861362f Clear PG_writeback after zil_commit() for sync I/O
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback.  However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit.  There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.

This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds.  One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #700
Closes #907
2012-08-30 20:16:28 -07:00
Richard Yao b8d06fca08 Switch KM_SLEEP to KM_PUSHPAGE
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.

  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf 991fc1d7ae mzap_upgrade() must use kmem_alloc()
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Brian Behlendorf 8630650a8d Annotate KM_PUSHPAGE call paths with PF_NOFS
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.

This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected.   In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected.  When debugging is enabled
any misuse will be treated as a fatal error.

This patch is entirely for debugging.  We should be careful to
NOT become dependant on it fixing up the incorrect allocations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Brian Behlendorf 86dd0fd922 Pre-allocate vdev I/O buffers
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests.  Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...

1) These buffers are short lived.  They are only required for
the life of a single I/O at which point they can be used by
the next I/O.

2) The maximum number of concurrent buffers needed by a vdev is
small.  It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.

By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it.  This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock.  This is
particularly important when memory is already low on the system.

It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation.  The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Richard Yao 44f21da41c Revert Disable direct reclaim for z_wr_* threads
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0ccf1f eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated.  The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Richard Yao 62c4165a1b Revert Fix zpl_writepage() deadlock
The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC.   That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks.  This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.

The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it.  This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Richard Yao b876dac776 Revert Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC.   Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf cd38ac58a3 rmdir(2) should return ENOTEMPTY
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries.  However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):

ENOTEMPTY
   pathname contains entries other than . and .. ; or, pathname has
   ..  as its final component.  POSIX.1-2001 also allows EEXIST for
   this condition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #895
2012-08-26 13:55:45 -07:00
Christopher Siden 9e11c7eee2 Illumos #3085: zfs diff panics, then panics in a loop on booting
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3085

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-25 12:32:25 -07:00
Simon Klinkert c578f007ff Illumos #2901: zfs receive fails for exabyte sparse files
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/2901

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-25 12:28:29 -07:00
Javen Wu a47587389e Drop spill buffer reference
When calling sa_update() and friends it is possible that a spill
buffer will be needed to accomidate the update.  When this happens
a hold is taken on the new dbuf and that hold must be released
before calling dmu_tx_commit().  Failing to release the hold will
cause a copy of the dbuf to be made in dbuf_sync_leaf().  This is
done to ensure further updates to the dbuf never sneak in to the
syncing txg.

This could be left to the sa_update() caller.  But then the caller
would need to be aware of this internal SA implementation detail.
It is therefore preferable to handle this all internally in the
SA implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #503
Closes #513
2012-08-25 09:26:10 -07:00
Brian Behlendorf f828e63a0d Revert "Use SA_HDL_PRIVATE for SA xattrs"
This reverts commit ec2626ad3f which
caused consistency problems between the shared and private handles.
Reverting this change should resolve issues #709 and #727.  It
will also reintroduce an arc_anon memory leak which is addressed
by the next commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #709
Closes #727
2012-08-25 09:25:56 -07:00
Prakash Surya 15a9e03368 Wrap smp_processor_id in kpreempt_[dis|en]able
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
2012-08-24 13:19:06 -07:00
Brian Behlendorf 4047414a6a Export dmu_buf_rele() symbol
While I'd like to remove the various pragmas in module/zfs/dbuf.c.
There are consumers such as Lustre which still depend on dmu_buf_*
versions of the symbols.  Until all consumers can be converted to
use only the dbuf_* names leave this symbol exported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-14 08:38:19 -07:00
Brian Behlendorf bafc4e9e2a Suppress 'zfs_sb_create' memory warning
When mutex debugging is enabled in your kernel the increased
size of the mutex structures can push the zfs_sb_t type beyond
the 8k warning threshold.  This isn't harmful so we suppress
the warning for this case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #628
2012-08-10 16:43:32 -07:00
Brian Behlendorf 8f576c2321 Export dbuf_* symbols
Export these symbols so they may be used by other ZFS consumers
besides the ZPL.

Remove three stale prototype definites from dbuf.h.  The actual
implementations of these functions were removed/renamed long ago.

It would be good in the long term to remove the existing pragmas
we inherited from Solaris and simply use the dbuf_* names.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-10 16:45:13 -07:00
Dan McDonald d96eb2b153 Illumos #1693: persistent 'comment' field for a zpool
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1693

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #678
2012-08-08 11:49:37 -07:00
Etienne Dechamps ee5fd0bb80 Set zvol discard_granularity to the volblocksize.
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.

In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).

With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #862
2012-08-07 14:55:31 -07:00
Etienne Dechamps 7c0e570888 Limit the number of blocks to discard at once.
The number of blocks that can be discarded in one BLKDISCARD ioctl on a
zvol is currently unlimited. Some applications, such as mkfs, discard
the whole volume at once and they use the maximum possible discard size
to do that. As a result, several gigabytes discard requests are not
uncommon.

Unfortunately, if a large amount of data is allocated in the zvol, ZFS
can be quite slow to process discard requests. This is especially true
if the volblocksize is low (e.g. the 8K default). As a result, very
large discard requests can take a very long time (seconds to minutes
under heavy load) to complete. This can cause a number of problems, most
notably if the zvol is accessed remotely (e.g. via iSCSI), in which case
the client has a high probability of timing out on the request.

This patch solves the issue by adding a new tunable module parameter:
zvol_max_discard_blocks. This indicates the maximum possible range, in
zvol blocks, of one discard operation. It is set by default to 16384
blocks, which appears to be a good tradeoff. Using the default
volblocksize of 8K this is equivalent to 128 MB. When using the maximum
volblocksize of 128K this is equivalent to 2 GB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #858
2012-07-31 09:46:09 -07:00
Matthew Ahrens 330d06f90d Illumos #1644, #1645, #1646, #1647, #1708
1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
     destroying multiple snapshots
1708 adjust size of zpool history data

References:
  https://www.illumos.org/issues/1644
  https://www.illumos.org/issues/1645
  https://www.illumos.org/issues/1646
  https://www.illumos.org/issues/1647
  https://www.illumos.org/issues/1708

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  This change has reordered
all of the new ioctl()s to the end of the list.  This should
help minimize this issue in the future.

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #826
Closes #664
2012-07-31 09:25:30 -07:00
Etienne Dechamps 2ee4a18b2a Add script for builtin module building.
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building ZFS as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the zfs source package.

To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the zfs
source package.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
2012-07-26 13:45:09 -07:00
Richard Yao 739a1a82e0 Linux 3.5 compat, end_writeback() changed to clear_inode()
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict().   This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.

However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics.  This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #784
2012-07-23 12:29:36 -07:00
Richard Yao ea1fdf46e2 Linux 3.5 compat, iops->truncate_range() removed
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:32 -07:00
Richard Yao 756c3e5a9c Linux 3.5 compat, eops->encode_fh() takes inodes
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes.  This interface used to
take the child dentry and a bool describing if the parent is needed.

NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:23 -07:00
Brian Behlendorf fc173c8589 Disable .zfs directory on 32-bit systems
The .zfs control directory implementation currently relies on
the fact that there is a direct 1:1 mapping from an object id
to its inode number.  This works well as long as the system
uses a 64-bit value to store the inode number.

Unfortunately, the Linux kernel defines the inode number as
an 'unsigned long' type.  This means that for 32-bit systems
will only have 32-bit inode numbers but we still have 64-bit
object ids.

This problem is particularly acute for the .zfs directories
which leverage those upper 32-bits.  This is done to avoid
conflicting with object ids which are allocated monotonically
starting from 0.  This is likely to also be a problem for
datasets on 32-bit systems with more than ~2 billion files.

The right long term fix must remove the simple 1:1 mapping.
Until that's done the only safe thing to do is to disable the
.zfs directory on 32-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 12:20:57 -07:00
Brian Behlendorf 2a4a9dc2f0 Add ddt_object_load() error handling
Add the missing error handling to ddt_object_load().  There's no
good reason this needs to be fatal.  It is preferable that an
error be returned.  This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 10:36:21 -07:00
Brian Behlendorf 10be533e33 Add 'inline' keyword
The '__attribute__((always_inline))' does not strictly imply
'inline'.  Newer versions of gcc detect this misuse and issue
the following warning.  Including the missing 'inline' resolves
the build warning.

    ./module/zfs/dsl_scan.c:758:1:error: always_inline function
    might not be inlinable [-Werror=attributes]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-19 13:41:00 -07:00
Richard Yao 0a6b03d3b8 Fix build failures on PaX/GRSecurity patched kernels
Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a
dialect of C that relies on a GCC plugin. In particular, struct
file_operations has been marked do_const in the PaX/GRSecurity dialect,
which causes GCC to consider all instances of it as const. This caused
failures in the autotools checks and the ZFS source code.

To address this, we modify the autotools checks to take into account
differences between the PaX C dialect and the regular C dialect. We also
modify struct zfs_acl's z_ops member to be a pointer to a function
pointer table. Lastly, we modify zpl_put_link() to address a PaX change
to the function prototype of nd_get_link().  This avoids compiler errors
in the PaX/GRSecurity dialect.

Note that the change in zpl_put_link() causes a warning that becomes a
build failure when debugging is enabled. Fixing that warning requires
ryao/spl@5ca50ef459.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #484
2012-07-17 09:22:43 -07:00
Etienne Dechamps b5a28807cd Move partition scanning from userspace to module.
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.

This change also depends on API changes which are available in 2.6.37
and latter kernels.  The build system has been updated to detect this,
but there is no compatibility mode for older kernels.  This means that
online expansion will NOT be available in older kernels.  However, it
will still be possible to expand the vdev offline.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808
2012-07-17 09:17:31 -07:00
George Wilson c7f2d69de3 Illumos #1949, #1953
1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1949
  https://www.illumos.org/issues/1953

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:33:31 -07:00
Garrett D'Amore 3541dc6d02 Illumos #1748: desire support for reguid in zfs
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1748

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.

Ported by:     Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:08:56 -07:00
Brian Behlendorf 42d3b990cf Update incorrect ddt_zap_lookup() assertion
When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated.  The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
2012-07-03 15:14:34 -07:00
Etienne Dechamps b6ad9671ac Add ZIL statistics.
The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.

This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.

Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #786
2012-06-29 09:56:51 -07:00
Pawel Jakub Dawidek 0cee24064a Speed up 'zfs list -t snapshot -o name -s name'
FreeBSD #xxx:  Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:

  # zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot
properties.

Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:

    before:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 525s
        warm cache: 218s

    after:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 1.7s
        warm cache: 1.1s

NOTE: This patch only appears in FreeBSD.  If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
2012-06-14 09:49:04 -07:00
Darik Horn 74497b7ab6 Add zvol_inhibit_dev module option.
ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.

When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.

The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-06-13 17:05:16 -07:00
Etienne Dechamps ee191e802c Make zil_slog_limit a tunable module parameter.
zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.

The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #783
2012-06-12 08:45:53 -07:00
Richard Yao 6a0936babc Linux 3.4 compat, d_make_root() replaces d_alloc_root()
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776
2012-06-11 10:04:49 -07:00
Etienne Dechamps ab85f8455b Honor logbias when writing to ZVOLs.
The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #774
2012-06-11 09:43:48 -07:00
Brian Behlendorf 710114089f Revert "Disable direct reclaim on zvols"
This reverts commit ce90208cf9.  This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels.  This issue was filed as #710.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710
2012-04-30 14:26:49 -07:00
Brian Behlendorf b39d3b9f7b Linux 3.3 compat, iops->create()/mkdir()/mknod()
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'.  To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef.  There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701
2012-04-30 12:52:38 -07:00
Richard Yao ce90208cf9 Disable direct reclaim on zvols
Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #342
2012-04-30 11:25:36 -07:00
Richard Yao 518b487602 Update ARC memory limits to account for SLUB internal fragmentation
23bdb07d4e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.

This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #660
2012-04-30 10:04:34 -07:00
Brian Behlendorf 302f753f16 Integrate ARC more tightly with Linux
Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem.  It would attempt to recognize low memory situtions
before they occured and evict data from the cache.  It would also
make assessments about if there was enough free memory to perform
a specific operation.

This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available.  To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces.  While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory.  This has manifested
itself in the past as a spinning arc_reclaim_thread.

This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface.  That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback.  The Linux VM will call this
function when it needs memory.  The ARC is then responsible for
attempting to free the requested amount of memory if possible.

Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.

* Removed the hdr_recl() reclaim callback which is redundant
  with the broader arc_shrinker_func().

* Reduced arc_grow_retry to 5 seconds from 60.  This is now used
  internally in the ARC with arc_no_grow to indicate that direct
  reclaim was recently performed.  This typically indicates a
  rapid change in memory demands which the kswapd threads were
  unable to keep ahead of.  As long as direct reclaim is happening
  once every 5 seconds arc growth will be paused to avoid further
  contributing to the existing memory pressure.  The more common
  indirect reclaim paths will not set arc_no_grow.

* arc_shrink() has been extended to take the number of bytes by
  which arc_c should be reduced.  This allows for a more granual
  reduction of the arc target.  Since the kernel provides a
  reclaim value to the arc_shrinker_func() this value is used
  instead of 1<<arc_shrink_shift.

* arc_reclaim_needed() has been removed.  It was used to determine
  if the system was under memory pressure and relied extensively
  on Solaris specific VM interfaces.  In most case the new code
  just checks arc_no_grow which indicates that within the last
  arc_grow_retry seconds direct memory reclaim occurred.

* arc_memory_throttle() has been updated to always include the
  amount of evictable memory (arc and page cache) in its free
  space calculations.  This space is largely available in most
  call paths due to direct memory reclaim.

* The Solaris pageout code was also removed to avoid confusion.
  It has always been disabled due to proc_pageout being defined
  as NULL in the Linux port.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-30 10:03:05 -07:00
Brian Behlendorf afec56b43f Add zfs_mdcomp_disable module option
Expose the zfs_mdcomp_disable variable as a module option.  This
can be used to disable compression of zfs meta data which is
enabled by default.  This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-27 16:28:02 -07:00
Brian Behlendorf ebf8e3a237 Illumos #1909: disk sync write perf regression when slog is used post oi_148
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererces to Illumos issue:
  https://www.illumos.org/issues/1909

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #680
2012-04-19 16:26:29 -07:00
Prakash Surya 409dc1a570 Use KM_PUSHPAGE in l2arc_write_buffers
There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:

    crash> bt 4123
    PID: 4123   TASK: ffff88062f8c1500  CPU: 6   COMMAND: "l2arc_feed"
      0 [ffff88062511d610] schedule at ffffffff814eeee0
      1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
      2 [ffff88062511d748] mutex_lock at ffffffff814f041b
      3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
      4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
      5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
      6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
      7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
      8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
      9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
     10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
     11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
     12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
     13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
     14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
     15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
     16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
     17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
     18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
     19 [ffff88062511dee8] kthread at ffffffff81090a86
     20 [ffff88062511df48] kernel_thread at ffffffff8100c14a

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-17 11:56:21 -07:00
Martin Matuska 7d5cd71da6 Illumos #1346: zfs incremental receive may leave behind temporary clones
1356 zfs dataset prefetch code not working
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1346
  https://www.illumos.org/issues/1356

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #647
2012-04-11 12:02:27 -07:00
Albert Lee 22cd4a4653 Illumos #1475: zfs spill block hold can access invalid spill blkptr
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1475

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648
2012-04-11 11:46:30 -07:00
George Wilson 5ffb9d1d05 Illumos #1951: leaking a vdev when removing an l2cache device
1952 memory leak when adding a file-based l2arc device
1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References to Illumos issues:
  https://www.illumos.org/issues/1951
  https://www.illumos.org/issues/1952
  https://www.illumos.org/issues/1954

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #650
2012-04-11 11:32:06 -07:00
Martin Matuska b129c6590e OS-926: zfs panic in zfs_fill_zplprops_impl()
This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling.  This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #652
2012-04-11 11:29:19 -07:00
Andriy Gapon 3adfc400f5 Illumos #1680: zfs vdev_file_io_start: validate vdev before using vdev_tsd
vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

References to Illumos issue:
  https://www.illumos.org/issues/1680

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #655
2012-04-11 11:23:18 -07:00
Brian Behlendorf f0fd83be65 Export additional dsl symbols
Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions.  They allow
us to cleanly register a callback which is called when a
dataset property is modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-11 09:26:55 -07:00
Gunnar Beutner 1f0d8a566f Fixed a NULL pointer dereference bug in zfs_preumount
When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.

This bug can be reproduced using this shell script:

 #!/bin/sh
 (
 while true; do
 	zfs create -o mountpoint=legacz tank/bar
 	zfs destroy tank/bar
 done
 ) &

 (
 while true; do
 	mount -t zfs tank/bar /mnt
 	umount /mnt
 done
 ) &

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #639
2012-04-05 11:29:42 -07:00
Brian Behlendorf fc41c6402b Properly expose the mfu ghost list kstats
Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats.  This was harmless but
confusing since memory usage could be over reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-27 15:08:22 -07:00
Brian Behlendorf 1c5de20ae2 Add --enable-debug-dmu-tx configure option
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:25:17 -07:00
Brian Behlendorf 99ea23c583 Enhance a dmu_tx_dirty_buf() assertion
The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.

  SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
  ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:24:05 -07:00
Brian Behlendorf 4b5d425f14 Add ZFS_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:14:35 -07:00
Brian Behlendorf 9ed86e7cc7 Account for .zfs ctldir inodes
Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used.  This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().

In my option there's a kernel bug here too which allows this to
happen.  They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #617
2012-03-22 15:43:55 -07:00
Brian Behlendorf ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Brian Behlendorf 49be0ccf1f Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-21 14:51:44 -07:00
Brian Behlendorf c8df41538d Revert "Add zio constructor/destructor"
This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio.  This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered.  It
may also have caused problems when issues logical I/Os.

Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix).  I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.

This reverts commit 2c6d0b1e07.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604
2012-03-21 14:51:01 -07:00
Brian Behlendorf 77a405ae52 Add missing NULL in zpl_xattr_handlers
The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated.  This NULL was
accidentally missing which could result in a NULL dereference.

Interestingly this issue only manifested itself on certain 32-bit
systems.  Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594
2012-03-15 15:18:29 -07:00
Brian Behlendorf 0ece356db5 Add sa_spill_rele() interface
Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle.  This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold.  Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-07 16:28:00 -08:00
Brian Behlendorf 2c6d0b1e07 Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-07 16:06:23 -08:00
Brian Behlendorf ec2626ad3f Use SA_HDL_PRIVATE for SA xattrs
A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit().  If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made.  This is done to
prevent data from leaking in to the syncing txg.  As a result
the original dirty spill block will remain cached.

Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache.  It's far safer and simpler just to use a private
handle for xattrs.  Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.

These forever dirty buffers can be noticed in the arcstats under
the anon_size.  On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase.  Eventually, if enough dirty data builds up
your system it will appear to hang.  This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.

As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513
2012-03-02 13:20:48 -08:00
Brian Behlendorf 570827e129 Add 'dmu_tx' kstats entry
Keep counters for the various reasons that a thread may end up
in txg_wait_open() waiting on a new txg.  This can be useful
when attempting to determine why a particular workload is
under performing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:59:10 -08:00
Brian Behlendorf 13be560d89 Add arc_state_t stats to arcstats
To ensure the arc is behaving properly we need greater visibility
in to exactly how it's managing the systems memory.  This patch
takes one step in that direction be adding the current arc_state_t
for the anon, mru, mru_ghost, mfu, and mfs_ghost lists.  The l2
arc_state_t is already well represented in the arcstats.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:58:59 -08:00
Alex Zhuravlev a473d90cee Export symbols for zero-copy
Export additional symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-17 12:43:02 -08:00
Richard Yao b41c9906dc Support ashift=13 for 8KB SSD block sizes
New SSDs are now available which use an internal 8k block size.
To make sure ZFS can get the maximum performance out of these
devices we're increasing the maximum ashift to 13 (8KB).

This value is still small enough that we can fit 16 uberblocks
in the vdev ring label.  However, I don't want to increase this
any futher or it will limit the ability the safely roll back a
pool to recover it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #565
2012-02-13 12:25:27 -08:00
Brian Behlendorf b10c77f70a Export symbols for zero-copy
Exported the required symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-10 11:56:55 -08:00
Brian Behlendorf a31acb462d Use spl_debug_* helpers
When configuring the spl debug log support use the provided wrapper
functions.  This ensures that if --disable-debug-log was used when
buiding the spl the functions will have no effect.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:37:48 -08:00
Etienne Dechamps 30930fba21 Add support for DISCARD to ZVOLs.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:19:38 -08:00
Etienne Dechamps cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps aec69371a6 Check permissions in zfs_space().
This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:37 -08:00
Etienne Dechamps 5cb63a57f8 Implement the truncate_range() inode operation.
This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:32 -08:00
Etienne Dechamps dde9380a1b Use 32 as the default number of zvol threads.
Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.

This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:

> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.

We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.

On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.

We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #392
2012-02-08 13:58:10 -08:00
Etienne Dechamps 34037afe24 Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps b18019d2d8 Fix synchronicity for ZVOLs.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps 56c34bac44 Support "sync=always" for ZVOLs.
Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #374.
2012-02-07 16:23:06 -08:00
Brian Behlendorf 47621f3d76 Linux 3.3 compat, sops->show_options()
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'.  Add an autoconf check
to detect the API change and then conditionally define the expected
interface.  In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #549
2012-02-03 10:02:01 -08:00
Brian Behlendorf d7e398ce1a Cleanup ZFS debug infrastructure
Historically the internal zfs debug infrastructure has been
scattered throughout the code.  Since we expect to start making
more use of this code this patch performs some cleanup.

* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
  files.  This includes moving the zfs_flags and zfs_recover
  variables, plus moving the zfs_panic_recover() function.

* Remove the existing unused functionality in zfs_debug.c and
  replace it with code which correctly utilized the spl logging
  infrastructure.

* Remove the __dprintf() function from zfs_ioctl.c.  This is
  dead code, the dprintf() functionality in the kernel relies
  on the spl log support.

* Remove dprintf() from hdr_recl().  This wasn't particularly
  useful and was missing the required format specifier anyway.

* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
  functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:24:30 -08:00
Brian Behlendorf 0c5dde492f Allow multiple values per directory entry
When using zfs to back a Lustre filesystem it's advantageous to
to store a fid with the object id in the directory zap.  The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.

This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id.  The zpl
code will just ignore additional entries.  This allows the ZoL
count to mount datasets which are being used as Lustre server
backends.

Once the upstream feature flags support is merged in this change
should be updated to a read-only feature.  Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:22:08 -08:00
Brian Behlendorf e29be02e46 Export symbol zfs_attr_table
Export the zfs_attr_table symbol so it may be used by non-zpl
consumers which are still interested in writing a zpl compatible
dataset (e.g. Lustre).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-27 09:23:36 -08:00
Richard Laager 57a4eddc4d Allow setting bootfs on any pool
The vdev_is_bootable() restrictions are no longer necessary
with recent GRUB2 code.  FreeBSD has implemented the same
change, except that I moved the Solaris comment to be inside
the #ifdef __sun__ block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #317
2012-01-17 13:49:07 -08:00
Ned Bass 08d08ebba2 Reduce number of zio free threads
As described in Issue #458 and #258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu signficantly reduces contention on the taskq spinlock and
improves throughput.

Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue.  It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue.  When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it.  They then all pile into the work wait queue to wait for
new work items.  So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.

Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem.  It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread.  In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling.  So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Issue #458
2012-01-17 08:54:00 -08:00
Darik Horn 96b91ef0d6 Apply the ZoL coding standard to zpl_xattr.c
Make the indenting in the zpl_xattr.c file consistent with the Sun
coding standard by removing soft tabs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-12 15:12:03 -08:00
Brian Behlendorf 166dd49de0 Linux 3.2 compat, security_inode_init_security()
The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.

This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.

In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigerous by passing the expected
types.  This is done to ensure we always properly the detect the
correct form for the security_inode_init_security() API.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516
2012-01-12 15:06:39 -08:00
Brian Behlendorf ab26409db7 Linux 3.1 compat, super_block->s_shrink
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed, they include:

* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpect collapses
of the ARC cache.  This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink().  This effectively shrunk the entire cache
when really we just needed to reclaim meta data.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292
2012-01-11 11:46:02 -08:00
Darik Horn 28eb9213d8 Linux 3.2 compat: set_nlink()
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit

  SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
2011-12-16 20:02:52 -08:00
Garrett D'Amore a38718a63d Illumos #734: Use taskq_dispatch_ent() interface
It has been observed that some of the hottest locks are those
of the zio taskqs.  Contention on these locks can limit the
rate at which zios are dispatched which limits performance.

This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock.  This has the effect
of improving system performance.

Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/734

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #482
2011-12-14 09:19:30 -08:00
Brian Behlendorf 30a9524e45 Set zvol_major/zvol_threads permissions
The zvol_major and zvol_threads module options were being created
with 0 permission bits.  This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`.  This patch fixes the issue by updating
the permission bits to 0444.  For the moment these options must
be read-only because they are used during module initialization.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392
2011-12-07 09:27:50 -08:00
Brian Behlendorf 23bdb07d4e Update default ARC memory limits
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.

To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB.  The rational for
this is as follows:

* Desktop Systems (less than 8GB of memory)

  Limiting the ARC to 1/2 of memory is desirable for desktop
  systems which have highly dynamic memory requirements.  For
  example, launching your web browser can suddenly result in a
  demand for several gigabytes of memory.  This memory must be
  reclaimed from the ARC cache which can take some time.  The
  user will experience this reclaim time as a sluggish system
  with poor interactive performance.  Thus in this case it is
  preferable to leave the memory as free and available for
  immediate use.

* Server Systems (more than 8GB of memory)

  Using all but 4GB of memory for the ARC is preferable for
  server systems.  These systems often run with minimal user
  interaction and have long running daemons with relatively
  stable memory demands.  These systems will benefit most by
  having as much data cached in memory as possible.

These values should work well for most configurations.  However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC.  This can still be accomplished
by setting the 'zfs_arc_max' module option.

Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation.  Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory.  How much more memory get's
consumed will be determined by how badly fragmented the slabs are.

In the long term this can be mitigated by slab defragmentation code
which was OpenSolaris solution.  Or preferably, using the page cache
to back the ARC under Linux would be even better.  See issue #75
for the benefits of more tightly integrating with the page cache.

This change also fixes a issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory.  The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken.  This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75
2011-12-05 12:02:12 -08:00
Brian Behlendorf f31b3ebe6e Allow xattrs on symlinks
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call.  Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.

The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes.  This was done simply to be consistent with the Solaris
behavior.

Upon futher reflection I believe this should be allowed under
Linux.  The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system.  This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #272
2011-11-29 10:24:24 -08:00
Brian Behlendorf 82a37189aa Implement SA based xattrs
The current ZFS implementation stores xattrs on disk using a hidden
directory.  In this directory a file name represents the xattr name
and the file contexts are the xattr binary data.  This approach is
very flexible and allows for arbitrarily large xattrs.  However,
it also suffers from a significant performance penalty.  Accessing
a single xattr can requires up to three disk seeks.

  1) Lookup the dnode object.
  2) Lookup the dnodes's xattr directory object.
  3) Lookup the xattr object in the directory.

To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk.  When
the xattr is to large to store in the inode then a single external
block is allocated for them.  In practice most xattrs are small
and this approach works well.

The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization.  When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block.  If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.

This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below).  When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.

However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations.  If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.

----------------------------------------------------------------------
   Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
      |            setxattr            |            getxattr
bytes |  ext4     xfs zfs-dir  zfs-sa  |  ext4     xfs zfs-dir  zfs-sa
------+--------------------------------+------------------------------
1     |  2.33   31.88   21.50    4.57  |  2.35    2.64    6.29    2.43
32    |  2.79   30.68   21.98    4.60  |  2.44    2.59    6.78    2.48
256   |  3.25   31.99   21.36    5.92  |  2.32    2.71    6.22    3.14
1024  |  3.30   32.61   22.83    8.45  |  2.40    2.79    6.24    3.27
4096  |  3.57  317.46   22.52   10.73  |  2.78   28.62    6.90    3.94
16384 |   n/a 2342.39   34.30   19.20  |   n/a   45.44  145.90    7.55
65536 |   n/a 2941.39  128.15  131.32* |   n/a  141.92  256.85  262.12*

Legend:
* ext4      - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs       - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir   - Directory based xattrs only.
* zfs-sa    - Prefer SAs but spill in to directories as needed, a
              trailing * indicates overflow in to directories occured.

NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
2011-11-28 15:45:51 -08:00
Brian Behlendorf ca5fd24984 Limit maximum ashift value to 12
While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSIZE) that is actually unsafe.  What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size.  Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.

Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblock in the vdev label
ring buffer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #425
2011-11-11 14:50:48 -08:00
Brian Behlendorf adcd70bd1a Linux 3.1 compat, fops->fsync()
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback.  In
addition imutex is no longer held by the caller and the callback
is responsible for taking the lock if required.

This commit updates the code to provide a zpl_fsync() function
for the updated API.  Implementations for the previous two APIs
are also maintained for compatibility.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #445
2011-11-10 10:03:08 -08:00
Brian Behlendorf 5547c2f1bf Simplify BDI integration
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code.  The updated code now just
registers the bdi during mount and destroys it during unmount.

The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it.  Luckily the bdi_setup_and_register() function
is trivial.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2011-11-08 10:19:03 -08:00
Brian Behlendorf 591fb62f19 Disown dataset in zfs_sb_create()
Fix an unlikely failure cause in zfs_sb_create() which could
leave the dataset owned on error and thus unavailable until
after a reboot.  Disown the dataset if SA are expected but
are in fact missing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-11-08 10:18:40 -08:00
Brian Behlendorf ae6ba3dbe6 Improve meta data performance
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound.  A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.

It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory.  By default a kmem_cache will
dynamically determine the type of memory used based on the object
size.  For large objects virtual memory is usually preferable
and for small object physical memory is a better choice.  See
the spl_slab_alloc() function for a longer discussion on this.

However, there is a certain amount of gray area when defining a
'large' object.  For the following caches it turns out they were
just over the line:

  * dnode_cache
  * zio_cache
  * zio_link_cache
  * zio_buf_512_cache
  * zfs_data_buf_512_cache

Now because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized.
We can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses.  This entirely avoids the
need to serialize on the virtual address space lock.

As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block of this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
2011-11-03 10:19:21 -07:00
Brian Behlendorf 6a95d0b74c Fix NULL deref in balance_pgdat()
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
zpl_putpage() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #287
2011-11-03 10:15:39 -07:00
Gunnar Beutner a7b125e9a5 Fix a race condition in zfs_getattr_fast()
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #431
2011-11-03 10:13:09 -07:00
Xin Li c475167627 Illumos #1661: Fix flaw in sa_find_sizes() calculation
When calculating space needed for SA_BONUS buffers, hdrsize is
always rounded up to next 8-aligned boundary. However, in two places
the round up was done against sum of 'total' plus hdrsize. On the
other hand, hdrsize increments by 4 each time, which means in certain
conditions, we would end up returning with will_spill == 0 and
(total + hdrsize) larger than full_space, leading to a failed
assertion because it's invalid for dmu_set_bonus.

Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1661

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426
2011-10-24 09:57:52 -07:00
Darik Horn 3cee2262a6 Change sun.com URLs to zfsonlinux.org
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline.  Change these error messages
to use the zfsonlinux.org mirror instead.

This commit depends on:

  zfsonlinux/zfsonlinux.github.com@8e10ead3dc

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-24 09:52:21 -07:00
Brian Behlendorf 6f2255ba8a Set mtime on symbolic links
Register the setattr/getattr callbacks for symlinks.  Without these
the generic inode_setattr() and generic_fillattr() functions will
be used.  In the setattr case this will only result in the inode being
updated in memory, the dirty_inode callback would also normally run
but none is registered for zfs.

The straight forward fix is to set the setattr/getattr callbacks
for symlinks so they are handled just like files and directories.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #412
2011-10-18 15:49:31 -07:00
Alexander Stetsenko 8d35c1499d Illumos #755: dmu_recv_stream builds incomplete guid_to_ds_map
An incomplete guid_to_ds_map would cause restore_write_byref() to fail
while receiving a de-duplicated backup stream.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D`Amore <garrett@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/755
- https://github.com/illumos/illumos-gate/commit/ec5cf9d53a

Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #372
2011-10-18 11:18:14 -07:00
Brian Behlendorf 86f35f34f4 Export symbols for the VFS API
Export all symbols already marked extern in the zfs_vfsops.h
header.  Several non-static symbols have also been added to
the header and exportewd.  This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.

Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base.  All other zfsvfs_* functions have already been renamed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:25:59 -07:00
Brian Behlendorf e45aa45298 Export symbols for the full SA API
Export all the symbols for the system attribute (SA) API.  This
allows external module to cleanly manipulate the SAs associated
with a dnode.  Documention for the SA API can be found in the
module/zfs/sa.c source.

This change also removes the zfs_sa_uprade_pre, and
zfs_sa_uprade_post prototypes.  The functions themselves were
dropped some time ago.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-05 15:59:56 -07:00
Andreas Dilger baab063016 zpl: Fix "df -i" to have better free inodes value
Due to the confusion in Linux statfs between f_frsize and f_bsize
the blocks counts were changed to be in units of z_max_blksize
instead of SPA_MINBLOCKSIZE as it is on other platforms.

However, the free files calculation in zfs_statvfs() is limited by
the free blocks count, since each dnode consumes one block/sector.
This provided a reasonable estimate of free inodes, but on Linux
this meant that the free inodes count was underestimated by a large
amount, since 256 512-byte dnodes can fit into a 128kB block, and
more if the max blocksize is increased to 1MB or larger.

Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since
DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and
may even change per dataset, and devices with large sectors setting
ashift will also use a larger blocksize.

Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT)
to more accurately compute the maximum number of dnodes that can
be created.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #413
Closes #400
2011-09-28 11:27:10 -07:00
Brian Behlendorf dee28b0700 Export symbols for the full ZAP API
Export all the symbols for the ZAP API.  This allows external modules
to cleanly interface with ZAP type objects.  Previously only a subset
of the functionality was exposed.  Documention for the ZAP API can be
found in the sys/zap.h header.

This change also removes a duplicate zap_increment_int() prototype.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-27 16:12:36 -07:00
Brian Behlendorf fa6e5ced2f Suppress kmem_alloc() warning in zfs_prop_set_special()
Suppress the warning for this large kmem_alloc() because it is not
that far over the warning threshhold (8k) and it is short lived.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-15 20:26:51 -07:00
Brian Behlendorf 2708f716c0 Fix usage of zsb after free
Caught by code inspection, the variable zsb was referenced after
being freed.  Move the kmem_free() to the end of the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-09 10:29:48 -07:00
Brian Behlendorf 95d9fd028b Fix incompatible pointer type warning
This warning was accidentally introduced by commit
f3ab88d646 which updated the
.readpages() implementation.  The fix is to simply cast
the helper function to the appropriate type when passed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 15:16:30 -07:00
Brian Behlendorf f3ab88d646 Correctly lock pages for .readpages()
Unlike the .readpage() callback which is passed a single locked page
to be populated.  The .readpages() callback is passed a list of unlocked
pages which are all marked for read-ahead (PG_readahead set).  It is
the responsibly of .readpages() to ensure to pages are properly locked
before being populated.

Prior to this change the requested read-ahead pages would be updated
outside of the page lock which is unsafe.  The unlocked pages would then
be unlocked again which is harmless but should have been immediately
detected as bug.  Unfortunately, newer kernels failed detect this issue
because the check is done with a VM_BUG_ON which is disabled by default.
Luckily, the old Debian Lenny 2.6.26 kernel caught this because it
simply uses a BUG_ON.

The straight forward fix for this is to update the .readpages() callback
to use the read_cache_pages() helper function.  The helper function will
ensure that each page in the list is properly locked before it is passed
to the .readpage() callback.  In addition resolving the bug, this results
in a nice simplification of the existing code.

The downside to this change is that instead of passing one large read
request to the dmu multiple smaller ones are submitted.  All of these
requests however are marked for readahead so the lower layers should
issue a large I/O regardless.  Thus most of the request should hit the
ARC cache.

Futher optimization of this code can be done in the future is a perform
analysis determines it to be worthwhile.  But for the moment, it is
preferable that code be correct and understandable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #355
2011-08-08 13:24:52 -07:00
Brian Behlendorf 76659dc110 Add backing_device_info per-filesystem
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk.  The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance.  Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.

The replacement for pdflush is called bdi (backing device info).  The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback.  The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.

For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme.  However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.

Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.

However, there is one critical case where not implementing the bdi
functionality can cause problems.  If an application handles a page
fault it can enter the balance_dirty_pages() callpath.  This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.

Without a registered backing_device_info for the filesystem the
dirty pages will not get written out.  Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.

This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block.  It is then registered
when the filesystem mounted and unregistered on unmount.  It will
not be registered for mounted snapshots which are read-only.  This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2011-08-04 13:37:38 -07:00
Brian Behlendorf 3c0e5c0f45 Cleanup mmap(2) writes
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct.  In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache.  This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.

Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds.  However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.

This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean.  When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction.  The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.

This process is normally entirely asynchronous and will be repeated
for every dirty page.  This may initially sound inefficient but most
of these pages will end up in a few txgs.  That means when they are
eventually written to disk they should be nicely batched.  However,
there is room for improvement.  It may still be desirable to batch
up the pages in to larger writes for the dmu.  This would reduce
the number of callbacks and small 4k buffer required by the ARC.

Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set.  Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-02 10:34:55 -07:00
Martin Matuska cddafdcbc5 Illumos #1313: Integer overflow in txg_delay()
The function txg_delay() is used to delay txg (transaction group)
threads in ZFS.  The timeout value for this function is calculated
using:

    int timeout = ddi_get_lbolt() + ticks;

Later, the actual wait is performed:

    while (ddi_get_lbolt() < timeout &&
        tx->tx_syncing_txg < txg-1 && !txg_stalled(dp))
            (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock,
                timeout - ddi_get_lbolt());

The ddi_get_lbolt() function returns current uptime in clock ticks
and is typed as clock_t.  The clock_t type on 64-bit architectures
is int64_t.

The "timeout" variable will overflow depending on the tick frequency
(e.g. for 1000 it will overflow in 28.855 days). This will make the
expression "ddi_get_lbolt() < timeout" always false - txg threads will
not be delayed anymore at all. This leads to a slowdown in ZFS writes.

The attached patch initializes timeout as clock_t to match the return
value of ddi_get_lbolt().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #352
2011-08-01 12:09:43 -07:00
Alexander Stetsenko 0b7936d5c2 Illumos #278: get rid zfs of python and pyzfs dependencies
Remove all python and pyzfs dependencies for consistency and
to ensure full functionality even in a mimimalist environment.

Reviewed by: gordon.w.ross@gmail.com
Reviewed by: trisk@opensolaris.org
Reviewed by: alexander.r.eremin@gmail.com
Reviewed by: jerry.jelinek@joyent.com
Approved by: garrett@nexenta.com

References to Illumos issue and patch:
- https://www.illumos.org/issues/278
- https://github.com/illumos/illumos-gate/commit/1af68beac3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Issue #160

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-01 12:09:36 -07:00
Martin Matuska ca5252204a Illumos #1043: Recursive zfs snapshot destroy fails
Prior to revision 11314 if a user was recursively destroying
snapshots of a dataset the target dataset was not required to
exist.  The zfs_secpolicy_destroy_snaps() function introduced
the security check on the target dataset, so since then if the
target dataset does not exist, the recursive destroy is not
performed.  Before 11314, only a delete permission check on
the snapshot's master dataset was performed.

Steps to reproduce:
zfs create pool/a
zfs snapshot pool/a@s1
zfs destroy -r pool@s1

Therefore I suggest to fallback to the old security check, if
the target snapshot does not exist and continue with the destroy.

References to Illumos issue and patch:
- https://www.illumos.org/issues/1043
- https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Eric Schrock 3e31d2b080 Illumos #883: ZIL reuse during remount corruption
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place.  There is a very
good description of the issue and fix in Illumus #883.

Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Matt Ahrens f5fc4acaa7 Illumos #1092: zfs refratio property
Add a "REFRATIO" property, which is the compression ratio based on
data referenced. For snapshots, this is the same as COMPRESSRATIO,
but for filesystems/volumes, the COMPRESSRATIO is based on the
data "USED" (ie, includes blocks in children, but not blocks
shared with the origin).

This is needed to figure out how much space a filesystem would
use if it were not compressed (ignoring snapshots).

Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Mark Musante <Mark.Musante@oracle.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/1092
- https://github.com/illumos/illumos-gate/commit/187d6ac08a

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
George Wilson 6d974228ef Illumos #1051: zfs should handle imbalanced luns
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full.  It should instead fail quickly on the full LUNs and
move onto devices which have more availability.

Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Garrett D'Amore 2cc6c8db12 Illumos #175: zfs vdev cache consumes excessive memory
Note that with the current ZFS code, it turns out that the vdev
cache is not helpful, and in some cases actually harmful.  It
is better if we disable this.  Once some time has passed, we
should actually remove this to simplify the code.  For now we
just disable it by setting the zfs_vdev_cache_size to zero.
Note that Solaris 11 has made these same changes.

References to Illumos issue and patch:
- https://www.illumos.org/issues/175
- https://github.com/illumos/illumos-gate/commit/b68a40a845

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Gordon Ross ef3c1dea70 Illumos #764: panic in zfs:dbuf_sync_list
Hypothesis about what's going on here.

At some time in the past, something, i.e. dnode_reallocate()
calls one of:
dbuf_rm_spill(dn, tx);

These will do:
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx)
dbuf_undirty(db, tx)

Currently dbuf_undirty can leave a spill block in dn_dirty_records[],
(it having been put there previously by dbuf_dirty) and free it.
Sometime later, dbuf_sync_list trips over this reference to free'd
(and typically reused) memory.

Also, dbuf_undirty can call dnode_clear_range with a bogus
block ID. It needs to test for DMU_SPILL_BLKID, similar to
how dnode_clear_range is called in dbuf_dirty().

References to Illumos issue and patch:
- https://www.illumos.org/issues/764
- https://github.com/illumos/illumos-gate/commit/3f2366c2bb

Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Mark.Maybe@oracle.com
Reviewed by: Albert Lee <trisk@nexenta.com
Approved by: Garrett D'Amore <garrett@nexenta.com>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Tim Haley 7b8518cb8d Illumos #xxx: zdb -vvv broken after zfs diff integration
References to Illumos issue and patch:
- https://github.com/illumos/illumos-gate/commit/163eb7ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:02 -07:00
Brian Behlendorf beb9826902 Fix txg_sync_thread deadlock
Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE.
Because these functions are called from txg_sync_thread we
must ensure they don't reenter the zfs filesystem code via
the .writepage callback.  This would result in a deadlock.

This deadlock is rare and has only been observed once under
an abusive mmap() write workload.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-22 15:24:57 -07:00
Brian Behlendorf 22872ff5da Use zfs_mknode() to create dataset root
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies.  This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.

Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.

This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code.  There have been minor
modifications to accomidate the Linux VFS but that is all.

This code fixes issue #116 by bootstraping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory.  This ensures it is created properly with
support for system attributes.  Previously it wasn't which
is why it behaved differently that all other directories
when modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #116
2011-07-20 19:52:26 -07:00
Brian Behlendorf 9fd91daeef Honor setgit bit on directories
Newly created files were always being created with the fsuid/fsgid
in the current users credentials.  This is correct except in the
case when the parent directory sets the 'setgit' bit.  In this
case according to posix the newly created file/directory should
inherit the gid of the parent directory.  Additionally, in the
case of a subdirectory it should also inherit the 'setgit' bit.

Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #262
2011-07-20 14:07:13 -07:00
Brian Behlendorf fe0ed8f910 Fix 'make install' overly broad 'rm'
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels.  This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against.  This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.

The fix for this issue is to only remove extraneous build products
when DESTDIR is set.  This almost exclusively indicates we are
building packages and installed the build products in to a temporary
staging location.  Additionally, limit the removal the unneeded
build products to the target kernel version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #328
2011-07-20 09:38:51 -07:00
Brian Behlendorf cfc9a5c88f Fix zpl_writepage() deadlock
Disable the normal reclaim path for zpl_putpage().  This ensures that
all memory allocations under this call path will never enter direct
reclaim.  If this were to happen the VM might try to write out
additional pages by calling zpl_putpage() again resulting in a
deadlock.

This sitution is typically handled in Linux by marking each offending
allocation GFP_NOFS.  However, since much of the code used is common
it makes more sense to use PF_MEMALLOC to flag the entire call tree.
Alternately, the code could be updated to pass the needed allocation
flags but that's a more invasive change.

The following example of the above described deadlock was triggered
by test 074 in the xfstest suite.

Call Trace:
 [<ffffffff814dcdb2>] down_write+0x32/0x40
 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs]
 [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs]
 [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs]
 [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs]
 [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110
 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50
 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl]
 [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl]
 [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs]
 [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs]
 [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs]
 [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs]
 [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs]
 [<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs]
 [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0
 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs]
 [<ffffffff81122521>] do_writepages+0x21/0x40
 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240
 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #327
2011-07-19 16:17:01 -07:00
Brian Behlendorf abd39a8289 Fix zio_execute() deadlock
To avoid deadlocking the system it is crucial that all memory
allocations performed in the zio_execute() call path are marked
KM_PUSHPAGE (GFP_NOFS).  This ensures that while a z_wr_iss
thread is processing the syncing transaction group it does
not re-enter the filesystem code and deadlock on itself.

Call Trace:
 [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl]
 [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs]
 [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs]
 [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs]
 [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff81159932>] kmem_getpages+0x62/0x170
 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270
 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160
 [<ffffffff8115b059>] __kmalloc+0x189/0x220
 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl]
 [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs]
 [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs]
 [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs]
 [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs]
 [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs]
 [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs]
 [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs]
 [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs]
 [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl]
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #291
2011-07-19 11:55:42 -07:00
Brian Behlendorf a140dc5469 Fix mmap(2)/write(2)/read(2) deadlock
When modifing overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages.  Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.

The most straight forward fix is to simply not take the zfs range
lock in the mmap(2) case.  The individual pages will still be locked
thus serializing access.  Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do.  This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.

This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks.  It's up to each filesystem
to perform its own file locking.  Under Linux the VFS provides many
of these services.

It may be possible/desirable at a latter date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now its safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
2011-07-19 11:55:42 -07:00
Brian Behlendorf 61f218b090 Fix send/recv 'dataset is busy' errors
This commit fixes a regression which was accidentally introduced by
the Linux 2.6.39 compatibility chanages.  As part of these changes
instead of holding an active reference on the namepsace (which is
no longer posible) a reference is taken on the super block.  This
reference ensures the super block remains valid while it is in use.

To handle the unlikely race condition of the filesystem being
unmounted concurrently with the start of a 'zfs send/recv' the
code was updated to only take the super block reference when there
was an existing reference.  This indicates that the filesystem is
active and in use.

Unfortunately, in the 'zfs recv' case this is not the case.  The
newly created dataset will not have a super block without an
active reference which results in the 'dataset is busy' error.

The most straight forward fix for this is to simply update the
code to always take the reference even when it's zero.  This
may expose us to very very unlikely concurrent umount/send/recv
case but the consequences of that are minor.

Closes #319
2011-07-15 16:37:19 -07:00
Brian Behlendorf 057e8eee35 Improve fstat(2) performance
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper.  However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date.  Unfortunately, this isn't
the case for the blksize, blocks, and atime fields.  At the
moment the authoritative values are still stored in the znode.

This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode.  At some
latter date we should be able to strictly use the inode values
and further improve performance.

The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex.  This overhead is
unavoidable until the inode is kept strictly up to date.  The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros.  These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation.  However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.

=================== Performance Tests ========================

This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop.  The test results
show the zfs stat(2) performance is now only 22% slower than
ext4.  This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.

filesystem    | test-1  test-2  test-3  | average | times-ext4
--------------+-------------------------+---------+-----------
ext4          |  7.785s  7.899s  7.284s |  7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat  |  9.224s  9.398s  9.485s |  9.369s | 1.223x

The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files.  The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache.  As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.

A little surprisingly the zfs cold cache performance is better
than ext4.  This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times.  By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.

filesystem    | cold    | hot
--------------+---------+--------
ext4          | 13.318s | 1.040s
zfs-0.6.0-rc4 |  4.982s | 1.762s
zfs-faststat  |  4.933s | 1.345s
2011-07-11 09:11:22 -07:00
Brian Behlendorf abd8610cd5 Add L2ARC tunables
The performance of the L2ARC can be tweaked by a number of tunables, which
may be necessary for different workloads:

     l2arc_write_max         max write bytes per interval
     l2arc_write_boost       extra write bytes during device warmup
     l2arc_noprefetch        skip caching prefetched buffers
     l2arc_headroom          number of max device writes to precache
     l2arc_feed_secs         seconds between L2ARC writing
     l2arc_feed_min_ms       min feed interval in milliseconds
     l2arc_feed_again        turbo L2ARC warmup
     l2arc_norw              no reads during writes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #316
2011-07-08 12:44:11 -07:00
Gunnar Beutner 3c9609b322 Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE.
The remaining code that is guarded by HAVE_SHARE ifdefs is related to the
.zfs/shares functionality which is currently not available on Linux.

On Solaris the .zfs/shares directory can be used to set permissions for
SMB shares.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Gunnar Beutner 46e18b3f0f Implemented sharing datasets via NFS using libshare.
The sharenfs and sharesmb properties depend on the libshare library
to export datasets via NFS and SMB. This commit implements the base
libshare functionality as well as support for managing NFS shares.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Brian Behlendorf 285226eff3 Always allow non-user xattrs
Under Linux you may only disable USER xattrs.  The SECURITY,
SYSTEM, and TRUSTED xattr namespaces must always be available
if xattrs are supported by the filesystem.  The enforcement
of USER xattrs is performed in the zpl_xattr_user_* handlers.

Under Solaris there is only a single xattr namespace which
is managed globally.
2011-07-01 13:39:48 -07:00
Rohan Puri a89c3e0bd5 Support mandatory locks (nbmand)
The Linux kernel already has support for mandatory locking.  This
change just replaces the Solaris mandatory locking calls with the
Linux equivilants.  In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS.  However, for now we'll leave it in place even
if it is redundant just in case we missed something.

The original patch to update the code to support mandatory locking
was done by Rohan Puri.  This patch is an updated version which is
compatible with the previous mount option handling changes.

Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #222
Closes #253
2011-07-01 13:39:40 -07:00
Brian Behlendorf 2cf7f52bc4 Linux compat 2.6.39: mount_nodev()
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure.  When using the new
interface the caller must now use the mount_nodev() helper.

Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers.  This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use.  It provides our only entry point
in to the namespace layer for manipulating certain mount options.

This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly.  It also allowed me
to keep more of the original Solaris code unmodified.  Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do.  However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem.  Thus keeping a back
reference from the filesystem to the namespace is complicated.

Rather than introduce some ugly hack to get the vfsmount and
continue as before.  I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.

This commit updates the code to remove this vfsmount back
reference entirely.  All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.

Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code.  This
change which fairly involved has turned out nicely.

Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
2011-07-01 13:36:39 -07:00
Brian Behlendorf 5c03efc379 Linux compat 2.6.39: security_inode_init_security()
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.

Closes #246
Closes #217
Closes #187
2011-07-01 12:40:08 -07:00
Brian Behlendorf e2e7aa2df8 Add ZFS specific mmap() checks
Under Linux the VFS handles virtually all of the mmap() access
checks.  Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.

However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored.  Note, currently the code
to modify these attributes has not been implemented under Linux.

* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
  attributes are set a file may not be mmaped with write access.

* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
  read or exec access.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-01 12:23:46 -07:00
Brian Behlendorf f0b2486034 Remove unused MMAP functions
The following functions were required for the OpenSolaris mmap
implementation.  Because the Linux VFS does most the most heavy
lifting for us they are not required and are being removed to
keep the code clean and easy to understand.

  * zfs_null_putapage()
  * zfs_frlock()
  * zfs_no_putpage()

Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
2011-07-01 12:22:57 -07:00
Prasad Joshi dde471ef5a MMAP Optimization
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.

ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.

The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
2011-07-01 12:22:52 -07:00
Prasad Joshi 218b8eafbd Use truncate_setsize in zfs_setattr
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().

zfs_setattr uses zfs_freesp to free the disk space belonging to the
file.  As truncate_setsize may release the page cache and flushing
the dirty data to disk, it must be called before the zfs_freesp.

Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:52 -07:00
Prasad Joshi b312979252 Tear down and flush the mmap region
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.

The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:19 -07:00
Brian Behlendorf 7e7baecaa3 Linux 3.0 compat, shrinker compatibility
To accomindate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated.  Unfortunately, this
couldn't be done cleanly without slightly adjusting the comapt
API.  See spl commit a55bcaad18.

This commit updates the ZFS code to use the slightly modified
API.  You must use the latest SPL if your building ZFS.
2011-06-21 14:36:39 -07:00
Gunnar Beutner b00131d43c Fix unlink/xattr deadlock
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:

1. File is created.
2. xattr is created for that file (behind the scenes a xattr
   directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
   inode from that directory are evicted by prune_icache();
   prune_icache() acquires a lock on both inodes before it
   calls ->evict() on the inodes

When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.

This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches

This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain().  Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.

To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been update to block until the iput taskq
has been drained.  This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #266
2011-06-20 13:47:03 -07:00
Gunnar Beutner 6f0cf71e0d Removed erroneous zfs_inode_destroy() calls from zfs_rmnode().
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.

The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.

Fixes #282
2011-06-20 10:30:17 -07:00
Christian Kohlschütter df30f56639 Add "ashift" property to zpool create
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly.  This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes.  The drive may behave this way
for compatibility reasons.  For example, the WDC WD20EARS disks are
known to exhibit this behavior.

When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).

This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time.  This can significantly improve
performance in certain cases.  However, it will have an impact on your
total pool capacity.  See the updated ashift property description
in the zpool.8 man page for additional details.

Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior.  The most common ashift values are 9 and 12.

  Example:
  zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd

Closes #280

Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-06-17 16:35:49 -07:00
Brian Behlendorf 96801d2906 Linux 2.6.37 compat, WRITE_FLUSH_FUA
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER.  This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.

This change simply updates the bio's to use the new kernel API
which should be absolutely safe.  However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.

Closes #281
2011-06-17 14:37:26 -07:00
Brian Behlendorf e95b3bdcbb Fix stack ddt_class_contains()
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf 5b8c7bbcea Fix stack ddt_zap_lookup()
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf c7f8f831a4 Revert "Fix stack traverse_visitbp()"
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.

This reverts commit 6656bf5621.
2011-05-31 12:17:27 -07:00
Brian Behlendorf 2fac4c2a74 Make tgx_sync_thread zio's async
The majority of the recursive operations performed by the dsl
are done either in the context of the tgx_sync_thread or during
pool import.  It is these recursive operations which contribute
greatly to the stack depth.  When this recursion is coupled with
a synchronous I/O in the same context overflow becomes possible.

Previously to handle this case I have focused on keeping the
individual stack frames as light as possible.  This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code.  However, there is a better solution.

If we treat all zio's issued by the tgx_sync_thread as async then
we can use the tgx_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts.  This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O.  However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel.  That should mean for CPU intensive pipelines
such as compression or dedup performance may be improved.

With this change in place the worst case stack usage observed so
far is 6902 bytes.  This is still higher than I'd like but
significantly improved.  Additional changes to specific functions
should improve this further.  This change allows us to revent
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
2011-05-31 12:17:27 -07:00
Brian Behlendorf f74fae8b30 Fix 4K sector support
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux.  While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.

  sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
  cannot create 'large-sector': one or more devices is currently unavailable

1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors.  Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful.  To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes.  The higher
levels of ZFS want the value in bytes anyway so this is cleaner.

2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors.  The existing code would scale
the byte offset by the logical sector size.   Until now this was
always 512 so it never caused problems.  Trying a 4K sector drive
clearly exposed the issue.  The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.

With these changes I'm now able to create ZFS pools using 4K
sector drives.  No issues were observed during fairly extensive
testing.  This is also a low risk change if your using 512b
sectors devices because none of the logic changes.

Closes #256
2011-05-27 11:38:53 -07:00
Brian Behlendorf 2b8cad6159 Use vmem_alloc() for zfs_ioc_userspace_many()
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size.  In practice this works out
to exactly 27200 bytes.  Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
2011-05-20 14:23:18 -07:00
Brian Behlendorf f01b360e67 Pass caller's credential in zfsdev_ioctl()
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented.  So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.

However, one exception is quota handling which does require the
credential.  Now that proper credentials are supported we can
safely start passing the callers credential.  This is also an
initial step towards fully implemented the zfs secpolicy.
2011-05-20 10:12:25 -07:00
Brian Behlendorf 3fd70ee6b0 Fix 'negative objects to delete' warning
Normally when the arc_shrinker_func() function is called the return
value should be:

   >=0 - To indicate the number of freeable objects in the cache, or
   -1  - To indicate this cache should be skipped

However, when the shrinker callback is called with 'nr_to_scan' equal
to zero.  The caller simply wants the number of freeable objects in
the cache and we must never return -1.  This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.

This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected.  As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.

Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min.  This is done to prevent the Linux page cache from
completely crowding out the ARC.  This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.

Closes #218
Closes #243
2011-05-18 10:29:22 -07:00
Brian Behlendorf e814770f2e Update synchronous open zfs_close() comment
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux.  The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped.  This differs from Solaris where zfs_close()
is called for each close.

Closes #237
2011-05-13 08:20:06 -07:00
Brian Behlendorf c91d229809 Merge pull request #235 from nedbass/rdev
Don't store rdev in SA for FIFOs and sockets
2011-05-09 16:41:28 -07:00
Ned A. Bass aa6d8c1086 Don't store rdev in SA for FIFOs and sockets
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute.  While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets.  Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed.  Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.

Closes #216
2011-05-09 13:35:07 -07:00
Brian Behlendorf 21ade34764 Disable direct reclaim for z_wr_* threads
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.

  ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
  ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()

It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches.  To avoid
this we are forced to disable all reclaim for these threads.  It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.

Closes #232
2011-05-06 15:26:26 -07:00
Brian Behlendorf 3117dd0b90 Handle NULL in nfsd .fsync() hook
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels.  But basically there are three cases we need to
consider.

Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.

Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

For once it looks like we've gotten lucky.  The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely.  Since the dentry is still passed in both cases
this is possible.  The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.

Closes #230
2011-05-06 12:33:45 -07:00
Brian Behlendorf 34b84cb831 Use vmem_alloc() for zfs_ioc_pool_get_history()
The default buffer size when requesting history is 128k.  This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc().  This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
2011-05-06 09:59:52 -07:00
Brian Behlendorf c409e4647f Add missing ZFS tunables
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values.  However, in practice sometimes you do need to tweak these
values for one reason or another.  In those cases it's nice not to
have to resort to rebuilding from source.  All tunables are visable
to modinfo and the list is as follows:

$ modinfo module/zfs/zfs.ko
filename:       module/zfs/zfs.ko
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     8EAB1D71DACE05B5AA61567
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm:           zvol_major:Major number for zvol device (uint)
parm:           zvol_threads:Number of threads for zvol device (uint)
parm:           zio_injection_enabled:Enable fault injection (int)
parm:           zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm:           zio_delay_max:Max zio millisec delay before posting event (int)
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm:           zil_replay_disable:Disable intent logging replay (int)
parm:           zfs_nocacheflush:Disable cache flushes (bool)
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)
parm:           zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm:           zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm:           zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm:           zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm:           zfs_vdev_scheduler:I/O scheduler (charp)
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm:           zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm:           zfs_zevent_len_max:Max event queue length (int)
parm:           zfs_zevent_cols:Max event column width (int)
parm:           zfs_zevent_console:Log events to the console (int)
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)
parm:           zfs_scan_idle:Idle window in clock ticks (int)
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm:           zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm:           zfs_no_write_throttle:Disable write throttling (int)
parm:           zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm:           zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm:           zfs_write_limit_min:Min tgx write limit (ulong)
parm:           zfs_write_limit_max:Max tgx write limit (ulong)
parm:           zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm:           zfs_write_limit_override:Override tgx write limit (ulong)
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm:           zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
2011-05-04 10:02:37 -07:00
Brian Behlendorf 5f35b19007 Fully update inode when created
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values.  This was done
for the znode and for most of the values in the inode, but not all
of them.  This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated.  This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.

Closes #116
Closes #146
Closes #164
2011-05-02 14:04:19 -07:00
Brian Behlendorf df554c148e Fix 'zfs set volsize=N pool/dataset'
This change fixes a kernel panic which would occur when resizing
a dataset which was not open.  The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.

The code has also been updated to correctly notify the kernel
when the block device capacity changes.  For 2.6.28 and newer
kernels the capacity change will be immediately detected.  For
earlier kernels the capacity change will be detected when the
device is next opened.  This is a known limitation of older
kernels.

Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0

Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes #68
Closes #84
2011-05-02 08:54:40 -07:00
Gunnar Beutner 055656d4f4 Implemented NFS export_operations.
Implemented the required NFS operations for exporting ZFS datasets
using the in-kernel NFS daemon.
2011-04-29 12:36:13 -07:00
Brian Behlendorf 5476e6952c Suppress 'vdev_metaslab_init' memory warning
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks.  However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
supressed.
2011-04-27 09:35:18 -07:00
Brian Behlendorf 40a39e1103 Conserve stack in dsl_scan_visit()
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack.  This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap.  After this change dsl_scan_visit() has been reduced in size
by 320 bytes.

This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf b81c4ac9af Conserve stack in dsl_scan_visitbp()
This function is called recursively so everything possible must be
done to limit its stack consumption.  The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford.  By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon.  This yeilds a total
stack saving of 390 bytes on our 8k stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf 7a060636b0 Conserve stack in dsl_scan_visitbp()
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume
considerable stack resulting in a stack overflow (>8k).  The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode().  While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.

Issue #174
2011-04-26 13:37:35 -07:00
Brian Behlendorf 701b1f8168 Fix zvol deadlock
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group.  This results in the system
attempting to write out data to the disk to free memory.  Unfortunately,
this can't succeed because the the thread doing reclaim is holding open
the txg which must be closed to be synced to disk.  To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.

Closes #191
2011-04-26 12:56:35 -07:00
Brian Behlendorf e2448b0e62 Fix spurious -EFAULT when setting I/O scheduler
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev.  This was caused an improperly formatted
user mode helper command.

This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.

Closes #169
2011-04-22 14:55:35 -07:00
Brian Behlendorf 6a8f9b6bf0 Enforce ARC meta-data limits
This change ensures the ARC meta-data limits are enforced.  Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance.  The cache is aggressively
reclaimed but this is a soft and not a hard limit.  The cache may
exceed the set limit briefly before being brought under control.

By default 25% of the ARC capacity can be used for meta-data.  This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.

Closes #193
2011-04-21 13:49:31 -07:00
Gunnar Beutner 36df284366 Fixed a use-after-free bug in zfs_zget().
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #180
2011-04-21 13:48:01 -07:00
Brian Behlendorf d247f2a3cc Suppress 'zfs receive' memory warning
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size.  This sort of thing in general should
be avoided.  However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag.  This can be revisited latter if/when it becomes an issue.

Closes #178
2011-04-20 10:22:31 -07:00
Gunnar Beutner bec30953cd Truncate the xattr znode when updating existing attributes.
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #203
2011-04-19 14:14:40 -07:00
Gunnar Beutner 274b7e79f3 Added missing initialization for va.va_dentry in zfs_get_xattrdir.
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it.  This will result in the following BUG.  It's
important to note that while the xattr directory has an inode
assoicated with it we never create a dentry for it.

  kernel BUG at fs/dcache.c:1418!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #202
2011-04-19 13:57:41 -07:00
Brian Behlendorf 0fe3d820f5 Fix gcc compiler warning, dsl_pool_create()
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'os' as being set but never used.  This generates a
warning and a build failure when using --enable-debug.  However,
the code is correct we only want to use 'os' for the kernel space
builds.  To suppress the warning the call was wrapped with a
VERIFY() which has the nice side effect of ensuring the 'os'
actually never is NULL.  This was observed under Fedora 15.

  module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
  module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used
  [-Werror=unused-but-set-variable]
2011-04-19 09:04:51 -07:00
Brian Behlendorf e30c0ada6d Linux 2.6.39 compat, invalidate_inodes()
Update code to use the spl_invalidate_inodes() wrapper.  This hides
some of the complexity of determining if invalidate_inodes() was
exported, and if so what is its prototype.  The second argument
of spl_invalidate_inodes() determined the behavior of how dirty
inodes are handled.  By passing a zero we are indicated that we
want those inodes to be treated as busy and skipped.
2011-04-19 08:57:23 -07:00
Brian Behlendorf 0d3ac5e735 Linux 2.6.29 compat, credentials
The .sync_fs fix as applied did not use the updated SPL credential
API.  This broke builds on Debian Lenny, this change applies the
needed fix to use the portable API.  The original credential changes
are part of commit 81e97e2187.
2011-04-07 14:27:09 -07:00
Brian Behlendorf eec8164771 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Disable the normal reclaim path for the txg_sync thread.  This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim.  If this is allowed to
happen the system can deadlock.  Direct reclaim call path:

  ->shrink_icache_memory->prune_icache->dispose_list->
  clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
2011-04-07 09:52:16 -07:00
Brian Behlendorf 7cb67b45f3 Add direct+indirect ARC reclaim
Under OpenSolaris all memory reclaim is done asyncronously.  Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satify the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.

This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it.  The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments latter the
arc_reclaim thread would free unused memory from the ARC.

This change leaves the arc_thread in place to manage the fundamental
ARC behavior.  But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed.  It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
2011-04-07 09:52:10 -07:00
Brian Behlendorf 1834f2d8b7 Add missing arcstats
The following useful values were missing the arcstats.  This change
adds them in to provide greater visibility in to the arcs behavior.

arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_meta_used                   4    624774592
arc_meta_limit                  4    400785408
arc_meta_max                    4    625594176
2011-04-07 09:52:05 -07:00
Brian Behlendorf c85b224faf Call d_instantiate before unlocking inode
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked.  To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t.  There are
cases such as replay when a dentry is not available.  In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
2011-04-07 09:51:57 -07:00
Brian Behlendorf d433c20651 Fix `make distclean` for `./configure --with-config=user
Making distclean in module
    make[1]: Entering directory `/zfs/module'
    make -C  SUBDIRS=`pwd`  clean
    make: Entering an unknown directory
    make: *** SUBDIRS=/zfs/module: No such file or directory.  Stop.

When using --with-config=user the 'distclean' target would fail
because it assumes the kernel configuration infrastrure is set up.
This is not the case, nor does it need to be, because the
'--with-config=user' option will prune the entire ./module subtree
from SUBDIRS.  This prevents most build rules from operating in the
./module directory.

However, the 'dist*' rules will still traverse this directory
because it is listed in DIST_SUBDIRS.  This is correct because we
need to ensure the dist rules package the directory contents
regardless of the configuration for the 'dist' rule.  The correct
way to handle this is to only invoke the kernel build system as
part of the 'clean' rule when CONFIG_KERNEL_TRUE is set.

Initial fix provided by Darik Horn <dajhorn@vanadac.com>.
This commit is a slightly refined form of the original.
2011-04-05 13:33:28 -07:00
Brian Behlendorf bfd214af01 Fix inflated load average
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state.  These threads are usually in the process of performing IO and are
thus counted against the load average.  The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.

This change makes them sleep interruptibly.  Some care is required however
because these threads may now be woken early by signals.  In this case the
callers are all careful to check that the required conditions are met after
waking up.  If we're woken early due to a signal they will simply go back
to sleep.  In this case these changes are safe.

Closes #175
2011-03-31 17:07:12 -07:00
Brian Behlendorf 7a1cdc0775 Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29
Since these hooks are currently unused they are being removed to
allow support of older kernels.
2011-03-22 12:17:24 -07:00
Brian Behlendorf 81e97e2187 Linux 2.6.29 compat, credentials
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct.  Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
2011-03-22 12:15:54 -07:00
Brian Behlendorf d6bd8eaae4 Fix evict() deadlock
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks.  These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously.  Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.

This commit addresses a deadlock caused by inode eviction.  A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held.  Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock.  To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.

This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated.  This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
2011-03-22 12:14:55 -07:00
Brian Behlendorf 691f6ac4c2 Use KM_PUSHPAGE instead of KM_SLEEP
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.

However, this increases the posibility of deadlocking the system
on a zfs write thread.  If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim.  This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.

To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread.  In this way forward progress in ensured.  The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock.  If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
2011-03-22 12:14:55 -07:00
Brian Behlendorf 0de19dad9c Register .remount_fs handler
Register the missing .remount_fs handler.  This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags.  However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags.  Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
2011-03-15 13:33:29 -07:00
Brian Behlendorf 03f9ba9d99 Register .sync_fs handler
Register the missing .sync_fs handler.  This is a noop in most cases
because the usual requirement is that sync just be initiated.  As part
of the DMU's normal transaction processing txgs will be frequently
synced.  However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk.  With the
addition of the .sync_fs handler this is now properly implemented.
2011-03-15 13:33:29 -07:00
Brian Behlendorf 04516a45b2 Don't set I/O Scheduler for Partitions
ZFS should only change the i/o scheduler for a disk when it has
ownership of the whole disk.  This is basically the same logic as
adjusting the write cache behavior on a disk.  This change updates
the vdev disk code to skip partitions when setting the i/o scheduler.

Closes #152
2011-03-10 13:34:17 -08:00