Commit Graph

329 Commits

Author SHA1 Message Date
George Wilson c7f2d69de3 Illumos #1949, #1953
1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1949
  https://www.illumos.org/issues/1953

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:33:31 -07:00
Garrett D'Amore 3541dc6d02 Illumos #1748: desire support for reguid in zfs
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1748

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.

Ported by:     Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:08:56 -07:00
Brian Behlendorf 42d3b990cf Update incorrect ddt_zap_lookup() assertion
When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated.  The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
2012-07-03 15:14:34 -07:00
Etienne Dechamps b6ad9671ac Add ZIL statistics.
The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.

This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.

Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #786
2012-06-29 09:56:51 -07:00
Pawel Jakub Dawidek 0cee24064a Speed up 'zfs list -t snapshot -o name -s name'
FreeBSD #xxx:  Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:

  # zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot
properties.

Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:

    before:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 525s
        warm cache: 218s

    after:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 1.7s
        warm cache: 1.1s

NOTE: This patch only appears in FreeBSD.  If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
2012-06-14 09:49:04 -07:00
Darik Horn 74497b7ab6 Add zvol_inhibit_dev module option.
ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.

When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.

The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-06-13 17:05:16 -07:00
Etienne Dechamps ee191e802c Make zil_slog_limit a tunable module parameter.
zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.

The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #783
2012-06-12 08:45:53 -07:00
Richard Yao 6a0936babc Linux 3.4 compat, d_make_root() replaces d_alloc_root()
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776
2012-06-11 10:04:49 -07:00
Etienne Dechamps ab85f8455b Honor logbias when writing to ZVOLs.
The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #774
2012-06-11 09:43:48 -07:00
Brian Behlendorf 710114089f Revert "Disable direct reclaim on zvols"
This reverts commit ce90208cf9.  This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels.  This issue was filed as #710.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710
2012-04-30 14:26:49 -07:00
Brian Behlendorf b39d3b9f7b Linux 3.3 compat, iops->create()/mkdir()/mknod()
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'.  To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef.  There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701
2012-04-30 12:52:38 -07:00
Richard Yao ce90208cf9 Disable direct reclaim on zvols
Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #342
2012-04-30 11:25:36 -07:00
Richard Yao 518b487602 Update ARC memory limits to account for SLUB internal fragmentation
23bdb07d4e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.

This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #660
2012-04-30 10:04:34 -07:00
Brian Behlendorf 302f753f16 Integrate ARC more tightly with Linux
Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem.  It would attempt to recognize low memory situtions
before they occured and evict data from the cache.  It would also
make assessments about if there was enough free memory to perform
a specific operation.

This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available.  To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces.  While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory.  This has manifested
itself in the past as a spinning arc_reclaim_thread.

This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface.  That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback.  The Linux VM will call this
function when it needs memory.  The ARC is then responsible for
attempting to free the requested amount of memory if possible.

Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.

* Removed the hdr_recl() reclaim callback which is redundant
  with the broader arc_shrinker_func().

* Reduced arc_grow_retry to 5 seconds from 60.  This is now used
  internally in the ARC with arc_no_grow to indicate that direct
  reclaim was recently performed.  This typically indicates a
  rapid change in memory demands which the kswapd threads were
  unable to keep ahead of.  As long as direct reclaim is happening
  once every 5 seconds arc growth will be paused to avoid further
  contributing to the existing memory pressure.  The more common
  indirect reclaim paths will not set arc_no_grow.

* arc_shrink() has been extended to take the number of bytes by
  which arc_c should be reduced.  This allows for a more granual
  reduction of the arc target.  Since the kernel provides a
  reclaim value to the arc_shrinker_func() this value is used
  instead of 1<<arc_shrink_shift.

* arc_reclaim_needed() has been removed.  It was used to determine
  if the system was under memory pressure and relied extensively
  on Solaris specific VM interfaces.  In most case the new code
  just checks arc_no_grow which indicates that within the last
  arc_grow_retry seconds direct memory reclaim occurred.

* arc_memory_throttle() has been updated to always include the
  amount of evictable memory (arc and page cache) in its free
  space calculations.  This space is largely available in most
  call paths due to direct memory reclaim.

* The Solaris pageout code was also removed to avoid confusion.
  It has always been disabled due to proc_pageout being defined
  as NULL in the Linux port.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-30 10:03:05 -07:00
Brian Behlendorf afec56b43f Add zfs_mdcomp_disable module option
Expose the zfs_mdcomp_disable variable as a module option.  This
can be used to disable compression of zfs meta data which is
enabled by default.  This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-27 16:28:02 -07:00
Brian Behlendorf ebf8e3a237 Illumos #1909: disk sync write perf regression when slog is used post oi_148
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererces to Illumos issue:
  https://www.illumos.org/issues/1909

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #680
2012-04-19 16:26:29 -07:00
Prakash Surya 409dc1a570 Use KM_PUSHPAGE in l2arc_write_buffers
There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:

    crash> bt 4123
    PID: 4123   TASK: ffff88062f8c1500  CPU: 6   COMMAND: "l2arc_feed"
      0 [ffff88062511d610] schedule at ffffffff814eeee0
      1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
      2 [ffff88062511d748] mutex_lock at ffffffff814f041b
      3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
      4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
      5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
      6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
      7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
      8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
      9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
     10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
     11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
     12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
     13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
     14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
     15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
     16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
     17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
     18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
     19 [ffff88062511dee8] kthread at ffffffff81090a86
     20 [ffff88062511df48] kernel_thread at ffffffff8100c14a

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-17 11:56:21 -07:00
Martin Matuska 7d5cd71da6 Illumos #1346: zfs incremental receive may leave behind temporary clones
1356 zfs dataset prefetch code not working
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1346
  https://www.illumos.org/issues/1356

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #647
2012-04-11 12:02:27 -07:00
Albert Lee 22cd4a4653 Illumos #1475: zfs spill block hold can access invalid spill blkptr
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1475

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648
2012-04-11 11:46:30 -07:00
George Wilson 5ffb9d1d05 Illumos #1951: leaking a vdev when removing an l2cache device
1952 memory leak when adding a file-based l2arc device
1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References to Illumos issues:
  https://www.illumos.org/issues/1951
  https://www.illumos.org/issues/1952
  https://www.illumos.org/issues/1954

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #650
2012-04-11 11:32:06 -07:00
Martin Matuska b129c6590e OS-926: zfs panic in zfs_fill_zplprops_impl()
This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling.  This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #652
2012-04-11 11:29:19 -07:00
Andriy Gapon 3adfc400f5 Illumos #1680: zfs vdev_file_io_start: validate vdev before using vdev_tsd
vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

References to Illumos issue:
  https://www.illumos.org/issues/1680

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #655
2012-04-11 11:23:18 -07:00
Brian Behlendorf f0fd83be65 Export additional dsl symbols
Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions.  They allow
us to cleanly register a callback which is called when a
dataset property is modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-11 09:26:55 -07:00
Gunnar Beutner 1f0d8a566f Fixed a NULL pointer dereference bug in zfs_preumount
When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.

This bug can be reproduced using this shell script:

 #!/bin/sh
 (
 while true; do
 	zfs create -o mountpoint=legacz tank/bar
 	zfs destroy tank/bar
 done
 ) &

 (
 while true; do
 	mount -t zfs tank/bar /mnt
 	umount /mnt
 done
 ) &

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #639
2012-04-05 11:29:42 -07:00
Brian Behlendorf fc41c6402b Properly expose the mfu ghost list kstats
Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats.  This was harmless but
confusing since memory usage could be over reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-27 15:08:22 -07:00
Brian Behlendorf 1c5de20ae2 Add --enable-debug-dmu-tx configure option
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:25:17 -07:00
Brian Behlendorf 99ea23c583 Enhance a dmu_tx_dirty_buf() assertion
The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.

  SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
  ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:24:05 -07:00
Brian Behlendorf 4b5d425f14 Add ZFS_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:14:35 -07:00
Brian Behlendorf 9ed86e7cc7 Account for .zfs ctldir inodes
Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used.  This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().

In my option there's a kernel bug here too which allows this to
happen.  They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #617
2012-03-22 15:43:55 -07:00
Brian Behlendorf ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Brian Behlendorf 49be0ccf1f Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-21 14:51:44 -07:00
Brian Behlendorf c8df41538d Revert "Add zio constructor/destructor"
This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio.  This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered.  It
may also have caused problems when issues logical I/Os.

Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix).  I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.

This reverts commit 2c6d0b1e07.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604
2012-03-21 14:51:01 -07:00
Brian Behlendorf 77a405ae52 Add missing NULL in zpl_xattr_handlers
The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated.  This NULL was
accidentally missing which could result in a NULL dereference.

Interestingly this issue only manifested itself on certain 32-bit
systems.  Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594
2012-03-15 15:18:29 -07:00
Brian Behlendorf 0ece356db5 Add sa_spill_rele() interface
Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle.  This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold.  Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-07 16:28:00 -08:00
Brian Behlendorf 2c6d0b1e07 Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-07 16:06:23 -08:00
Brian Behlendorf ec2626ad3f Use SA_HDL_PRIVATE for SA xattrs
A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit().  If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made.  This is done to
prevent data from leaking in to the syncing txg.  As a result
the original dirty spill block will remain cached.

Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache.  It's far safer and simpler just to use a private
handle for xattrs.  Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.

These forever dirty buffers can be noticed in the arcstats under
the anon_size.  On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase.  Eventually, if enough dirty data builds up
your system it will appear to hang.  This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.

As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513
2012-03-02 13:20:48 -08:00
Brian Behlendorf 570827e129 Add 'dmu_tx' kstats entry
Keep counters for the various reasons that a thread may end up
in txg_wait_open() waiting on a new txg.  This can be useful
when attempting to determine why a particular workload is
under performing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:59:10 -08:00
Brian Behlendorf 13be560d89 Add arc_state_t stats to arcstats
To ensure the arc is behaving properly we need greater visibility
in to exactly how it's managing the systems memory.  This patch
takes one step in that direction be adding the current arc_state_t
for the anon, mru, mru_ghost, mfu, and mfs_ghost lists.  The l2
arc_state_t is already well represented in the arcstats.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:58:59 -08:00
Alex Zhuravlev a473d90cee Export symbols for zero-copy
Export additional symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-17 12:43:02 -08:00
Richard Yao b41c9906dc Support ashift=13 for 8KB SSD block sizes
New SSDs are now available which use an internal 8k block size.
To make sure ZFS can get the maximum performance out of these
devices we're increasing the maximum ashift to 13 (8KB).

This value is still small enough that we can fit 16 uberblocks
in the vdev ring label.  However, I don't want to increase this
any futher or it will limit the ability the safely roll back a
pool to recover it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #565
2012-02-13 12:25:27 -08:00
Brian Behlendorf b10c77f70a Export symbols for zero-copy
Exported the required symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-10 11:56:55 -08:00
Brian Behlendorf a31acb462d Use spl_debug_* helpers
When configuring the spl debug log support use the provided wrapper
functions.  This ensures that if --disable-debug-log was used when
buiding the spl the functions will have no effect.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:37:48 -08:00
Etienne Dechamps 30930fba21 Add support for DISCARD to ZVOLs.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:19:38 -08:00
Etienne Dechamps cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps aec69371a6 Check permissions in zfs_space().
This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:37 -08:00
Etienne Dechamps 5cb63a57f8 Implement the truncate_range() inode operation.
This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:32 -08:00
Etienne Dechamps dde9380a1b Use 32 as the default number of zvol threads.
Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.

This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:

> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.

We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.

On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.

We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #392
2012-02-08 13:58:10 -08:00
Etienne Dechamps 34037afe24 Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps b18019d2d8 Fix synchronicity for ZVOLs.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps 56c34bac44 Support "sync=always" for ZVOLs.
Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #374.
2012-02-07 16:23:06 -08:00