Commit Graph

1439 Commits

Author SHA1 Message Date
Brian Behlendorf 3fd70ee6b0 Fix 'negative objects to delete' warning
Normally when the arc_shrinker_func() function is called the return
value should be:

   >=0 - To indicate the number of freeable objects in the cache, or
   -1  - To indicate this cache should be skipped

However, when the shrinker callback is called with 'nr_to_scan' equal
to zero.  The caller simply wants the number of freeable objects in
the cache and we must never return -1.  This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.

This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected.  As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.

Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min.  This is done to prevent the Linux page cache from
completely crowding out the ARC.  This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.

Closes #218
Closes #243
2011-05-18 10:29:22 -07:00
Alexey Shvetsov d9bfe0f57a Fix distribution detection for gentoo
Also this may fix other distros because some of them also provide
/etc/lsb-release not only ubuntu.

Closes #244
2011-05-14 08:54:48 -07:00
Brian Behlendorf e814770f2e Update synchronous open zfs_close() comment
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux.  The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped.  This differs from Solaris where zfs_close()
is called for each close.

Closes #237
2011-05-13 08:20:06 -07:00
Alexey Shvetsov 6f582dc708 Remove root 'ls' after mount workaround
This workaround was introduced to workaround issue #164.  This
issue was fixed by commit 5f35b19 so the workaround can be safely
dropped from both the zfs.fedora and zfs.gentoo init scripts.
2011-05-12 15:01:35 -07:00
Alexey Shvetsov 06abcdd3f4 Fix zfs.gentoo init script logic
* Fix zfs.ko module check
* Check 'zfs umount -a' return value
2011-05-12 14:45:57 -07:00
Alexey Shvetsov 04c22478a7 Make zfs.gentoo init script more gentoo style.
* Improved compatibility with openrc
* Removed LOCKFILE
* Improved checksystem() function
* Remove /etc/mtab check for /
* General cleanup
2011-05-12 14:42:43 -07:00
Brian Behlendorf c91d229809 Merge pull request #235 from nedbass/rdev
Don't store rdev in SA for FIFOs and sockets
2011-05-09 16:41:28 -07:00
Ned A. Bass aa6d8c1086 Don't store rdev in SA for FIFOs and sockets
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute.  While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets.  Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed.  Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.

Closes #216
2011-05-09 13:35:07 -07:00
Brian Behlendorf 21ade34764 Disable direct reclaim for z_wr_* threads
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.

  ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
  ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()

It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches.  To avoid
this we are forced to disable all reclaim for these threads.  It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.

Closes #232
2011-05-06 15:26:26 -07:00
Brian Behlendorf 3117dd0b90 Handle NULL in nfsd .fsync() hook
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels.  But basically there are three cases we need to
consider.

Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.

Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

For once it looks like we've gotten lucky.  The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely.  Since the dentry is still passed in both cases
this is possible.  The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.

Closes #230
2011-05-06 12:33:45 -07:00
Brian Behlendorf 6ee44e32be Fix awk usage
The zpool_id and zpool_layout helper scripts have been updated to
use the more common /usr/bin/awk symlink.  On Fedora/Redhat systems
there are both /bin/awk and /usr/bin/awk symlinks to your installed
version of awk.  On Debian/Ubuntu systems only the /usr/bin/awk
symlink exists.

Additionally, add the '\<' token to the beginning of the regex
pattern to prevent partial matches.  This pattern only appears to
work with gawk despite the mawk man page claiming to support this
extended regex.  Thus you will need to have gawk installed to use
these optional helper scripts.  A comment has been added to the
script to reflect this reality.
2011-05-06 10:16:04 -07:00
Brian Behlendorf 34b84cb831 Use vmem_alloc() for zfs_ioc_pool_get_history()
The default buffer size when requesting history is 128k.  This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc().  This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
2011-05-06 09:59:52 -07:00
Brian Behlendorf 3613204cd7 Allow mounting of read-only snapshots
With the addition of the mount helper we accidentally regressed
the ability to manually mount snapshots.  This commit updates
the mount helper to expect the possibility of a ZFS_TYPE_SNAPSHOT.
All snapshot will be automatically treated as 'legacy' type mounts
so they can be mounted manually.
2011-05-05 10:13:38 -07:00
Brian Behlendorf c409e4647f Add missing ZFS tunables
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values.  However, in practice sometimes you do need to tweak these
values for one reason or another.  In those cases it's nice not to
have to resort to rebuilding from source.  All tunables are visable
to modinfo and the list is as follows:

$ modinfo module/zfs/zfs.ko
filename:       module/zfs/zfs.ko
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     8EAB1D71DACE05B5AA61567
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm:           zvol_major:Major number for zvol device (uint)
parm:           zvol_threads:Number of threads for zvol device (uint)
parm:           zio_injection_enabled:Enable fault injection (int)
parm:           zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm:           zio_delay_max:Max zio millisec delay before posting event (int)
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm:           zil_replay_disable:Disable intent logging replay (int)
parm:           zfs_nocacheflush:Disable cache flushes (bool)
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)
parm:           zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm:           zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm:           zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm:           zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm:           zfs_vdev_scheduler:I/O scheduler (charp)
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm:           zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm:           zfs_zevent_len_max:Max event queue length (int)
parm:           zfs_zevent_cols:Max event column width (int)
parm:           zfs_zevent_console:Log events to the console (int)
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)
parm:           zfs_scan_idle:Idle window in clock ticks (int)
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm:           zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm:           zfs_no_write_throttle:Disable write throttling (int)
parm:           zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm:           zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm:           zfs_write_limit_min:Min tgx write limit (ulong)
parm:           zfs_write_limit_max:Max tgx write limit (ulong)
parm:           zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm:           zfs_write_limit_override:Override tgx write limit (ulong)
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm:           zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
2011-05-04 10:02:37 -07:00
Brian Behlendorf 8db77dd7ed Prep zfs-0.6.0-rc4 tag
Create the fourth 0.6.0 release candidate tag (rc4).
2011-05-03 10:29:05 -07:00
Brian Behlendorf 712f8bd87b Add Gentoo/Lunar/Redhat Init Scripts
Every distribution has slightly different requirements for their
init scripts.  Because of this the zfs package contains several
init scripts for various distributions.  These scripts have been
contributed by, and are supported by, the larger zfs community.
Init scripts for Gentoo/Lunar/Redhat have been contributed by:

  Gentoo - devsk <devsku@gmail.com>
  Lunar  - Jean-Michel Bruenn <jean.bruenn@ip-minds.de>
  Redhat - Fajar A. Nugraha <list@fajar.net>
2011-05-02 15:59:13 -07:00
Brian Behlendorf 5f35b19007 Fully update inode when created
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values.  This was done
for the znode and for most of the values in the inode, but not all
of them.  This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated.  This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.

Closes #116
Closes #146
Closes #164
2011-05-02 14:04:19 -07:00
Brian Behlendorf df554c148e Fix 'zfs set volsize=N pool/dataset'
This change fixes a kernel panic which would occur when resizing
a dataset which was not open.  The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.

The code has also been updated to correctly notify the kernel
when the block device capacity changes.  For 2.6.28 and newer
kernels the capacity change will be immediately detected.  For
earlier kernels the capacity change will be detected when the
device is next opened.  This is a known limitation of older
kernels.

Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0

Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes #68
Closes #84
2011-05-02 08:54:40 -07:00
Alejandro R. Sedeño e90a3de3e8 Add zpl_export.c to the list of targets. 2011-04-29 19:50:35 -07:00
Brian Behlendorf 94e954257a Correct MAXUID
The uid_t on most systems is in fact and unsigned 32-bit value.
This is almost always correct, however you could compile your
kernel to use an unsigned 16-bit value for uid_t.  In practice
I've never encountered a distribution which does this so I'm
willing to overlook this corner case for now.

Closes #165
2011-04-29 14:03:12 -07:00
Gunnar Beutner 055656d4f4 Implemented NFS export_operations.
Implemented the required NFS operations for exporting ZFS datasets
using the in-kernel NFS daemon.
2011-04-29 12:36:13 -07:00
Brian Behlendorf 5476e6952c Suppress 'vdev_metaslab_init' memory warning
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks.  However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
supressed.
2011-04-27 09:35:18 -07:00
Brian Behlendorf 40a39e1103 Conserve stack in dsl_scan_visit()
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack.  This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap.  After this change dsl_scan_visit() has been reduced in size
by 320 bytes.

This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf b81c4ac9af Conserve stack in dsl_scan_visitbp()
This function is called recursively so everything possible must be
done to limit its stack consumption.  The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford.  By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon.  This yeilds a total
stack saving of 390 bytes on our 8k stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf 7a060636b0 Conserve stack in dsl_scan_visitbp()
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume
considerable stack resulting in a stack overflow (>8k).  The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode().  While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.

Issue #174
2011-04-26 13:37:35 -07:00
Brian Behlendorf 44e9e34793 Merged pull request #212 from dajhorn/hostid.
Use gethostid in the Linux convention.
2011-04-26 13:30:27 -07:00
Brian Behlendorf 701b1f8168 Fix zvol deadlock
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group.  This results in the system
attempting to write out data to the disk to free memory.  Unfortunately,
this can't succeed because the the thread doing reclaim is holding open
the txg which must be closed to be synced to disk.  To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.

Closes #191
2011-04-26 12:56:35 -07:00
Darik Horn 492b8e9e7b Use gethostid in the Linux convention.
Disable the gethostid() override for Solaris behavior because Linux systems
implement the POSIX standard in a way that allows a negative result.

Mask the gethostid() result to the lower four bytes, like coreutils does in
/usr/bin/hostid, to prevent junk bits or sign-extension on systems that have an
eight byte long type. This can cause a spurious hostid mismatch that prevents
zpool import on 64-bit systems.
2011-04-25 10:36:17 -05:00
Brian Behlendorf a1cc0b3290 Fix 32-bit MAXOFFSET_T definition
Having MAXOFFSET_T defined to 0x7fffffffl was artificially limiting
the maximum file size on 32-bit systems.  In reality MAXOFFSET_T is
used when working with 'long long' types and as such we now define
it as LLONG_MAX.  This resolves the 2GB file size limit for files
and additionally allows zvols greater than 2GB on 32-bit systems.

Closes #136
Closes #81
2011-04-22 16:21:26 -07:00
Brian Behlendorf e2448b0e62 Fix spurious -EFAULT when setting I/O scheduler
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev.  This was caused an improperly formatted
user mode helper command.

This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.

Closes #169
2011-04-22 14:55:35 -07:00
Brian Behlendorf 6a8f9b6bf0 Enforce ARC meta-data limits
This change ensures the ARC meta-data limits are enforced.  Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance.  The cache is aggressively
reclaimed but this is a soft and not a hard limit.  The cache may
exceed the set limit briefly before being brought under control.

By default 25% of the ARC capacity can be used for meta-data.  This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.

Closes #193
2011-04-21 13:49:31 -07:00
Gunnar Beutner 36df284366 Fixed a use-after-free bug in zfs_zget().
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #180
2011-04-21 13:48:01 -07:00
Brian Behlendorf d247f2a3cc Suppress 'zfs receive' memory warning
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size.  This sort of thing in general should
be avoided.  However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag.  This can be revisited latter if/when it becomes an issue.

Closes #178
2011-04-20 10:22:31 -07:00
Aniruddha Shankar 9caef54224 Added required runlevel info for init on Debian
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #208
2011-04-20 09:55:57 -07:00
Brian Behlendorf 3fce1d0962 Update zconfig.sh to use new zvol names
This change should have occured when we commited the new udev
rules for zvols.  Basically, the test script is just out of date.
We need to update it to use the /dev/zvol/ device names, and
to expect the more common -partN suffixes.

I added a udev_trigger() call in zconfig_partition() and
zconfig_zvol_device_stat() to ensure that all the udev rules have
run before.  This ensures the devices are available to subsequent
commands and closes a small race.

Finally, I was forced added a small 'sleep 1' to test 10.  I
was observing occassional failures in my VM due to the device
still claiming to be busy.  Delaying betwen the various methods
of adding/removing a vdev avoids the issue.

Closes #207
2011-04-19 16:33:41 -07:00
Brian Behlendorf 18f4238277 Add parted and lsscsi dependencies to zfs-test
The zfault.sh and zconfig.sh test scripts requires the parted
utility, the lsscsi utility, and the scsi_debug module.  To
ensure the utilities are available they have been added as
dependencies to zfs-test package.  Checking for scsi_debug
is a little more problematic because if it's missing you will
need to build it.  For clarity the documention has been updated
to mention this.

Closes #205
Closes #206
2011-04-19 15:22:46 -07:00
Brian Behlendorf cbb1e401f9 Add Gunnar Beutner to AUTHORS for his contributions 2011-04-19 14:14:51 -07:00
Gunnar Beutner bec30953cd Truncate the xattr znode when updating existing attributes.
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #203
2011-04-19 14:14:40 -07:00
Gunnar Beutner 274b7e79f3 Added missing initialization for va.va_dentry in zfs_get_xattrdir.
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it.  This will result in the following BUG.  It's
important to note that while the xattr directory has an inode
assoicated with it we never create a dentry for it.

  kernel BUG at fs/dcache.c:1418!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #202
2011-04-19 13:57:41 -07:00
Richard Laager 826ab7ad19 Support IEC base-2 prefixes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 12:56:21 -07:00
Richard Laager 4da4a9e1a7 Cleanup various Sun/Solaris-isms
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:42:42 -07:00
Richard Laager 251eb26d17 Update the version in the zpool upgrade example
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:40:45 -07:00
Richard Laager 3b2041509f Normalize the deferred destruction language
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:40:26 -07:00
Richard Laager 0d122e21ff Improve the wording about hot spares
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:39:37 -07:00
Richard Laager 6b92390fe8 Improve some quoting consistency
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:38:56 -07:00
Richard Laager 577468215b Remove a stray tab
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:38:27 -07:00
Richard Laager 25d4782bac Linux has "partitions", not "slices"
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:38:04 -07:00
Richard Laager 54e5f2264d Use Linux disk names in zpool.8
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:37:31 -07:00
Richard Laager 1dc3fea59e More and correct an example in zpool.8
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:36:17 -07:00
Richard Laager 1fe2e23771 Change /dev/dsk -> /dev in the man pages
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-19 11:35:21 -07:00