Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.
ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.
The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().
zfs_setattr uses zfs_freesp to free the disk space belonging to the
file. As truncate_setsize may release the page cache and flushing
the dirty data to disk, it must be called before the zfs_freesp.
Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.
Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.
Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
To accomindate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated. Unfortunately, this
couldn't be done cleanly without slightly adjusting the comapt
API. See spl commit a55bcaad18.
This commit updates the ZFS code to use the slightly modified
API. You must use the latest SPL if your building ZFS.
While the splat tests were originally designed to stress test
the Solaris primatives. I am extending them to include some kernel
compatibility tests. Certain linux APIs have changed frequently.
These tests ensure that added compatibility is working properly
and no unnoticed regression have slipped in.
Test 1 and 2 add basic regression tests for shrink_icache_memory
and shrink_dcache_memory. These are simply functional tests to
ensure we can call these functions safely. Checking for correct
behavior is more difficult since other running processes will
influence the behavior. However, these functions are provided
by the kernel so if we can successfully call them we assume they
are working correctly.
Test 3 checks that shrinker functions are being registered and
called correctly. As of Linux 3.0 the shrinker API has changed
four different times so I felt the need to add a trivial test
case to ensure each variant works as expected.
Update the the wrapper macros for the memory shrinker to handle
this 4th API change. The callback function now takes a
shrink_control structure. This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:
1. File is created.
2. xattr is created for that file (behind the scenes a xattr
directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
inode from that directory are evicted by prune_icache();
prune_icache() acquires a lock on both inodes before it
calls ->evict() on the inodes
When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.
This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches
This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.
To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been update to block until the iput taskq
has been drained. This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#266
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.
The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.
Fixes#282
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER. This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.
This change simply updates the bio's to use the new kernel API
which should be absolutely safe. However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.
Closes#281
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes. This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes. This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.
This reverts commit 6656bf5621.
The majority of the recursive operations performed by the dsl
are done either in the context of the tgx_sync_thread or during
pool import. It is these recursive operations which contribute
greatly to the stack depth. When this recursion is coupled with
a synchronous I/O in the same context overflow becomes possible.
Previously to handle this case I have focused on keeping the
individual stack frames as light as possible. This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code. However, there is a better solution.
If we treat all zio's issued by the tgx_sync_thread as async then
we can use the tgx_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts. This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O. However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel. That should mean for CPU intensive pipelines
such as compression or dedup performance may be improved.
With this change in place the worst case stack usage observed so
far is 6902 bytes. This is still higher than I'd like but
significantly improved. Additional changes to specific functions
should improve this further. This change allows us to revent
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux. While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.
sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
cannot create 'large-sector': one or more devices is currently unavailable
1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors. Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful. To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes. The higher
levels of ZFS want the value in bytes anyway so this is cleaner.
2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors. The existing code would scale
the byte offset by the logical sector size. Until now this was
always 512 so it never caused problems. Trying a 4K sector drive
clearly exposed the issue. The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.
With these changes I'm now able to create ZFS pools using 4K
sector drives. No issues were observed during fairly extensive
testing. This is also a low risk change if your using 512b
sectors devices because none of the logic changes.
Closes#256
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size. In practice this works out
to exactly 27200 bytes. Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented. So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.
However, one exception is quota handling which does require the
credential. Now that proper credentials are supported we can
safely start passing the callers credential. This is also an
initial step towards fully implemented the zfs secpolicy.
Normally when the arc_shrinker_func() function is called the return
value should be:
>=0 - To indicate the number of freeable objects in the cache, or
-1 - To indicate this cache should be skipped
However, when the shrinker callback is called with 'nr_to_scan' equal
to zero. The caller simply wants the number of freeable objects in
the cache and we must never return -1. This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.
This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected. As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.
Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min. This is done to prevent the Linux page cache from
completely crowding out the ARC. This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.
Closes#218Closes#243
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux. The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped. This differs from Solaris where zfs_close()
is called for each close.
Closes#237
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute. While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets. Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.
Closes#216
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.
->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()
It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches. To avoid
this we are forced to disable all reclaim for these threads. It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.
Closes#232
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs. To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels. But basically there are three cases we need to
consider.
Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.
Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
For once it looks like we've gotten lucky. The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely. Since the dentry is still passed in both cases
this is possible. The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.
Closes#230
The default buffer size when requesting history is 128k. This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc(). This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visable
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values. This was done
for the znode and for most of the values in the inode, but not all
of them. This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated. This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.
Closes#116Closes#146Closes#164
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes#68Closes#84
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks. However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
supressed.
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack. This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap. After this change dsl_scan_visit() has been reduced in size
by 320 bytes.
This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.
Issue #174
This function is called recursively so everything possible must be
done to limit its stack consumption. The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford. By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon. This yeilds a total
stack saving of 390 bytes on our 8k stack.
Issue #174
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume
considerable stack resulting in a stack overflow (>8k). The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode(). While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.
Issue #174
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group. This results in the system
attempting to write out data to the disk to free memory. Unfortunately,
this can't succeed because the the thread doing reclaim is holding open
the txg which must be closed to be synced to disk. To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.
Closes#191
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev. This was caused an improperly formatted
user mode helper command.
This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.
Closes#169
This change ensures the ARC meta-data limits are enforced. Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance. The cache is aggressively
reclaimed but this is a soft and not a hard limit. The cache may
exceed the set limit briefly before being brought under control.
By default 25% of the ARC capacity can be used for meta-data. This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.
Closes#193
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#180
Change the SPL kernel messages for module loading and module
unloading so that they are similar to the ZFS kernel messages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 1814251453.
Demote the gawk call back to awk and ensure that stderr is attached. GNU gawk
tolerates a missing stderr handle, but many utilities do not, which could be
why a regular awk call was unexplainably failing on some systems.
Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Provide a call_usermodehelper() alternative by letting the hostid be passed as
a module parameter like this:
$ modprobe spl spl_hostid=0x12345678
Internally change the spl_hostid variable to unsigned long because that is the
type that the coreutils /usr/bin/hostid returns.
Move the hostid command into GET_HOSTID_CMD for consistency with the similar
GET_KALLSYMS_ADDR_CMD invocation.
Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The function zlib_deflate_workspacesize() now take 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K). The caller can now specific the windowBits and
memLevel compression parameters to get a smaller workspace.
For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments. When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through. When it's not we assume the worst case and
a maximally sized workspace is used.
The path_lookup() function has been renamed to kern_path_parent()
and the flags argument has been removed. The only behavior now
offered is that of LOOKUP_PARENT. The spl already always passed
this flag so dropping the flag does not impact us.
This is a long over due compatibility change. Way, way, way back
in 2007 there was a push to remove all consumers of SPIN_LOCK_UNLOCKED.
Finally, in 2011 with 2.6.39 all the consumers have been updated
and SPIN_LOCK_UNLOCKED was removed. It's about time we use the
new API as well, this change does exactly that. DEFINE_SPINLOCK()
was available as far back as 2.6.12 so there doesn't need to be
any additional autoconf-foo for this change.
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size. This sort of thing in general should
be avoided. However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag. This can be revisited latter if/when it becomes an issue.
Closes#178
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#203
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it. This will result in the following BUG. It's
important to note that while the xattr directory has an inode
assoicated with it we never create a dentry for it.
kernel BUG at fs/dcache.c:1418!
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#202
Flagged by the default -Wunused-but-set-variable gcc option when
running under Fedora 15. Since it's correct this variable is
entirely unused this commit removes it.
To resolve a potiential filesystem corruption issue a second
argument was added to invalidate_inodes(). This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block. When only the legacy API is available the second
argument will be dropped for compatibility.
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'os' as being set but never used. This generates a
warning and a build failure when using --enable-debug. However,
the code is correct we only want to use 'os' for the kernel space
builds. To suppress the warning the call was wrapped with a
VERIFY() which has the nice side effect of ensuring the 'os'
actually never is NULL. This was observed under Fedora 15.
module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used
[-Werror=unused-but-set-variable]
Update code to use the spl_invalidate_inodes() wrapper. This hides
some of the complexity of determining if invalidate_inodes() was
exported, and if so what is its prototype. The second argument
of spl_invalidate_inodes() determined the behavior of how dirty
inodes are handled. By passing a zero we are indicated that we
want those inodes to be treated as busy and skipped.
The .sync_fs fix as applied did not use the updated SPL credential
API. This broke builds on Debian Lenny, this change applies the
needed fix to use the portable API. The original credential changes
are part of commit 81e97e2187.
Disable the normal reclaim path for the txg_sync thread. This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim. If this is allowed to
happen the system can deadlock. Direct reclaim call path:
->shrink_icache_memory->prune_icache->dispose_list->
clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
Under OpenSolaris all memory reclaim is done asyncronously. Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satify the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.
This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it. The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments latter the
arc_reclaim thread would free unused memory from the ARC.
This change leaves the arc_thread in place to manage the fundamental
ARC behavior. But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed. It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
The following useful values were missing the arcstats. This change
adds them in to provide greater visibility in to the arcs behavior.
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_meta_used 4 624774592
arc_meta_limit 4 400785408
arc_meta_max 4 625594176
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked. To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t. There are
cases such as replay when a dentry is not available. In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache. After the entries are
pruned any slabs which they may have been using are reaped.
Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count. However, in practice this doesn't need to be
exactly correct. We simply need to reclaim some useful fraction
(but not all) of the cache. The caller can determine if more
needs to be done.
One of the most common things you want to know when looking at
the slab is how much memory is being used. This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.
slab_kmem_total - Total kmem slab size
slab_kmem_avail - Alloc'd kmem slab size
slab_kmem_max - Max observed kmem slab size
slab_vmem_total - Total vmem slab size
slab_vmem_avail - Alloc'd vmem slab size
slab_vmem_max - Max observed vmem slab size
NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
The 'slab_fail', 'slab_create', and 'slab_destroy' columns in the slab
output have been removed because they are virtually always zero and
not very useful.
The much more useful 'size' and 'alloc' columns have been added which
show the total slab size and how much of the total size has been
allocated to objects.
Finally, the formatting has been updated to be much more human
readable while still being friendly for tool like awk to parse.
The Linux shrinker has gone through three API changes since 2.6.22.
Rather than force every caller to understand all three APIs this
change consolidates the compatibility code in to the mm-compat.h
header. The caller then can then use a single spl provided
shrinker API which does the right thing for your kernel.
SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask);
SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks);
spl_register_shrinker(&shrinker_struct);
spl_unregister_shrinker(&&shrinker_struct);
spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);
Making distclean in module
make[1]: Entering directory `/zfs/module'
make -C SUBDIRS=`pwd` clean
make: Entering an unknown directory
make: *** SUBDIRS=/zfs/module: No such file or directory. Stop.
When using --with-config=user the 'distclean' target would fail
because it assumes the kernel configuration infrastrure is set up.
This is not the case, nor does it need to be, because the
'--with-config=user' option will prune the entire ./module subtree
from SUBDIRS. This prevents most build rules from operating in the
./module directory.
However, the 'dist*' rules will still traverse this directory
because it is listed in DIST_SUBDIRS. This is correct because we
need to ensure the dist rules package the directory contents
regardless of the configuration for the 'dist' rule. The correct
way to handle this is to only invoke the kernel build system as
part of the 'clean' rule when CONFIG_KERNEL_TRUE is set.
Initial fix provided by Darik Horn <dajhorn@vanadac.com>.
This commit is a slightly refined form of the original.
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state. These threads are usually in the process of performing IO and are
thus counted against the load average. The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.
This change makes them sleep interruptibly. Some care is required however
because these threads may now be woken early by signals. In this case the
callers are all careful to check that the required conditions are met after
waking up. If we're woken early due to a signal they will simply go back
to sleep. In this case these changes are safe.
Closes#175
Solaris credentials don't have an fsuid/fsguid field but Linux
credentials do. To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.
Addititionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code. This simplification
means we only have one version of the helper to keep to to date.
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29
Since these hooks are currently unused they are being removed to
allow support of older kernels.
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct. Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks. These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously. Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.
This commit addresses a deadlock caused by inode eviction. A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock. To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.
This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated. This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.
However, this increases the posibility of deadlocking the system
on a zfs write thread. If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim. This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.
To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread. In this way forward progress in ensured. The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock. If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
As part of vmalloc() a __pte_alloc_kernel() allocation may occur. This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur. This reclaim can trigger file IO which
can result in a deadlock. This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.
An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3). This code can be properly autoconf'ed
when the upstream bug is fixed.
1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
Register the missing .remount_fs handler. This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags. However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags. Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
Register the missing .sync_fs handler. This is a noop in most cases
because the usual requirement is that sync just be initiated. As part
of the DMU's normal transaction processing txgs will be frequently
synced. However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk. With the
addition of the .sync_fs handler this is now properly implemented.
ZFS should only change the i/o scheduler for a disk when it has
ownership of the whole disk. This is basically the same logic as
adjusting the write cache behavior on a disk. This change updates
the vdev disk code to skip partitions when setting the i/o scheduler.
Closes#152
Due to an uninitialized variable files opened with O_APPEND may
overwrite the start of the file rather than append to it. This
was introduced accidentally when I removed the Solaris vnodes.
The zfs_range_lock_writer() function used to key off zf->z_vnode
to determine if a znode_t was for a zvol of zpl object. With
the removal of vnodes this was replaced by the flag zp->z_is_zvol.
This flag was used to control the append behavior for range locks.
Unfortunately, this value was never properly initialized after
the vnode removal. However, because most of memory is usually
zeros it happened to be set correctly most of the time making
the bug appear racy. Properly initializing zp->z_is_zvol to
zero completely resolves the problem with O_APPEND.
Closes#126
Move 'bulk' and 'xattr_bulk' from the stack to the heap to minimize
stack space usage. These two arrays consumed 448 bytes on the stack
and have been replaced by two 8 byte points for a total stack space
saving of 432 bytes. The zfs_setattr() path had been previously
observed to overrun the stack in certain circumstances.
The original range lock implementation had to be modified by commit
8926ab7 because it was unsafe on Linux. In particular, calling
cv_destroy() immediately after cv_broadcast() is dangerous because
the waiters may still be asleep. Thus the following cv_destroy()
will free memory which may still be in use.
This was fixed by updating cv_destroy() to block on waiters but
this in turn introduced a deadlock. The deadlock was resolved
with the use of a taskq to move the offending free outside the
range lock. This worked well but using the taskq for the free
resulted in a serious performace hit. This is somewhat ironic
because at the time I felt using the taskq might improve things
by making the free asynchronous.
This patch refines the original fix and moves the free from the
taskq to a private free list. Then items which must be free'd
are simply inserted in to the list. When the range lock is dropped
it's safe to free the items. The list is walked and all rl_t
entries are freed.
This change improves small cached read performance by 26x. This
was expected because for small reads the number of locking calls
goes up significantly. More surprisingly this change significantly
improves large cache read performance. This probably attributable
to better cpu/memory locality. Very likely the same processor
which allocated the memory is now freeing it.
bs ext3 zfs zfs+fix faster
----------------------------------------------
512 435 3 79 26x
1k 820 7 160 22x
2k 1536 14 305 21x
4k 2764 28 572 20x
8k 3788 50 1024 20x
16k 4300 86 1843 21x
32k 4505 138 2560 18x
64k 5324 252 3891 15x
128k 5427 276 4710 17x
256k 5427 413 5017 12x
512k 5427 497 5324 10x
1m 5427 521 5632 10x
Closes#142
In the original implementation the zfs_open()/zfs_close() hooks
were dropped for simplicity. This was functional but not 100%
correct with the expected ZFS sematics. Updating and re-adding the
zfs_open()/zfs_close() hooks resolves the following issues.
1) The ZFS_APPENDONLY file attribute is once again honored. While
there are still no Linux tools to set/clear these attributes once
there are it should behave correctly.
2) Minimal virus scan file attribute hooks were added. Once again
this support in disabled but the infrastructure is back in place.
3) Most importantly correctly handle assigning files which were
opened syncronously to the intent log. Without this change O_SYNC
modifications could be lost during a system crash even though they
were marked synchronous.
Filesystems like ZFS must use what the kernel calls an anonymous super
block. Basically, this is just a filesystem which is not backed by a
single block device. Normally this block device's dev_t is stored in
the super block. For anonymous super blocks a unique reserved dev_t
is assigned as part of get_sb().
This sb->s_dev must then be set in the returned stat structures as
stat->st_dev. This allows userspace utilities to easily detect the
boundries of a specific filesystem. Tools such as 'du' depend on this
for proper accounting.
Additionally, under OpenSolaris the statfs->f_fsid is set to the device
id. To preserve consistency with OpenSolaris we also set the fsid to
the device id. Other Linux filesystem (ext) set the fsid to a unique
value determined by the filesystems uuid. This value is unique but
maintains no relationship to the device id. This may be desirable
when exporting NFS filesystem because it minimizes to chance of a
client observing the same fsid from two different servers.
Closes#140
The AT_ versions of these macros are used on Solaris and while they
map to their Linux equivilants the code has been updated to use the
ATTR_ versions.
Move 'tmpxvattr' from the stack to the heap to minimize stack
space usage. This is enough to get us below the 1024 byte stack
frame warning. That however is still a large stack frame and it
should be further reduced by moving the 'bulk' and 'xattr_bulk'
sa_bulk_attr_t variables to the heap in a future patch.
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go. One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible. They
would be replaced with their Linux equivalents. This would not only
be good for performance, but for the general readability and health of
the code. The Solaris and Linux VFS are different beasts and should
be treated as such. Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.
This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t. There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR. But it didn't look that hard to come
back soon and replace it all with a native Linux type.
However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.
Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things. This commit reverts many of
my previous commits which removed xvattr related code. It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.
The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working. However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types. For now that's
a price I'm willing to pay. Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.
Closes#111
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines. This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.
This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support. By removing it from the spl we ensure not conflict
with the higher level packages.
This just leaves minimal vnode support for basical manipulation
of files. This code is does have the proper support functions
in the spl and a set of regression tests.
Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
Print the supported zpool and filesystem versions at module load
time. This change removes an ambiguity and adds information that
system administrators care about. The phrase "ZFS pool version %s"
is the same as zpool upgrade -v so that the operator is familiar
with the message.
ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5
ZFS: Unloaded module v0.6.0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress. The test case simply takes
a 128k buffer, it compresses the buffer, it them uncompresses the
buffer, and finally it compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place. In reality the current implementation
was non-functional, it just was compilable.
The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use. A kmem_cache was
added for the workspace area because we require a large chunk
of memory. This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow. Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release. A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
There were two cases when attempting to set the vdev block device
scheduler which would causes console warnings.
The first case was when the vdev used a loop, ram, dm, or other
such device which doesn't support a configurable scheduler. In
these cases attempting to set a scheduler is pointless and can
be safely skipped.
The secord case is slightly more troubling. We were seeing
transient cases where setting the elevator would return -EFAULT.
On retry everything is fine so there appears to be a small window
where this is possible. To handle that case we silently retry
up to three times before reporting the warning.
In all of the above cases the warning is harmless and at worse you
may see slightly different performance characteristics from one
or more of your vdevs.
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.
Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
(include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
(include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
/dev/[dataset_name] symlink. For partitions on zvol, it will create
/dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
cmd/zvol_id).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove custom code to pack/unpack dev_t's. Under Linux all dev_t's
are an unsigned 32-bit value even on 64-bit platforms. The lower
20 bits are used for the minor number and the upper 12 for the major
number.
This means if your importing a pool from Solaris you may get strange
major/minor numbers. But it doesn't really matter because even if
we add compatibility code to translate the encoded Solaris major/minor
they won't do you any good under Linux. You will still need to
recreate the dev_t with a major/minor which maps to reserved major
numbers used under Linux.
Dropping this code also resolves 32-bit builds by removing the
offending 32-bit compatibility code.
ASSERT3P should be used instead of ASSERT3U when comparing
pointers. Using ASSERT3U with the cast causes a compiler
warning for 32-bit builds which is fatal with --enable-debug.
The underlying storage pool actually uses multiple block
size. Under Solaris frsize (fragment size) is reported as
the smallest block size we support, and bsize (block size)
as the filesystem's maximum block size. Unfortunately,
under Linux the fragment size and block size are often used
interchangeably. Thus we are forced to report both of them
as the filesystem's maximum block size.
Closes#112
Because the secpolicy_* macros are all currently defined to (0).
And because the caller of this function does not check the return
code. The compiler complains that this statement has no effect
which is correct and OK. To suppress the warning explictly cast
the result to (void).
Generally it's a good idea to use enums for switch statements,
but in this case it causes warning because the enum is really a
set of flags. These flags are OR'ed together in some cases
resulting in values which are not part of the original enum.
This causes compiler warning such as this about invalid cases.
error: case value ‘33’ not in enumerated type ‘zprop_source_t’
To handle this we simply case the enum to an int for the switch
statement. This leaves all other enum type checking in place
and effectively disabled these warnings.
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules. This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.
Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address. The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn. Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address. All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
Long term we should try to get this, or another, interface made
available to modules again.
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers. While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header. This change moves this block device
Linux compatibility code in to the linux/blkdev_compat.h header
and updates all the correct #include locations. This is not a
functional change or bug fix, it is just code cleanup.
When changing the uid/gid of a file via zfs_setattr() use the
Posix id passed in iattr->ia_uid/gid. While the zfs_fuid_create()
code already had the fuid support disabled for Linux it was
returning the uid/gid from the credential. With this change
the 'chown' command which relies on setxattr is now working
properly.
Also remove a little stray white space which was in front of
zfs_update_inode() call and the end of zfs_setattr().
Under Linux sys_symlink(2) should result in a inode being created
with one reference for the inode itself, and a second reference on
the inode which is held by the new dentry. Under Solaris this
appears not to be the case. Their zfs_symlink() handler drops
the inode reference before returning.
The result of this under Linux is that the reference count for
symlinks is always one smaller than it should have been. This
results in a BUG() when the symlink is unlinked. To handle this
the Linux port now keeps the inode reference which differs from
the Solaris behavior. This results in correct reference counts.
Closes#96
The zfs_readlink() function returns a Solaris positive error value
and that needs to be converted to a Linux negative error value.
While in this case nothing would actually go wrong, it's still
incorrect and should be fixed if for no other reason than clarity.
This patch addresses three issues related to symlinks.
1) Revert the zfs_follow_link() function to a modified version
of the original zfs_readlink(). The only changes from the
original OpenSolaris version relate to using Linux types.
For the moment this means no vnode's and no zfsvfs_t. The
caller zpl_follow_link() was also updated accordingly. This
change was reverted because it was slightly gratuitious.
2) Update zpl_follow_link() to use local variables for the
link buffer. I'd forgotten that iov.iov_base is updated by
uiomove() so after the call to zfs_readlink() it can not longer
be used. We need our own private copy of the link pointer.
3) Allocate MAXPATHLEN instead of MAXPATHLEN+1. By default
MAXPATHLEN is 4096 bytes which is a full page, adding one to
it pushes it slightly over a page. That means you'll likely
end up allocating 2 pages which is wasteful of memory and
possibly slightly slower.
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing. This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may still not be
called, this leads to non-zero reference count errors. See lustre.org
bug 23931.
While the attr/xattr hooks were already in place for regular
files this hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported. This patch registers these additional callbacks
for both directory and special inode types.
Under Linux when creating a fifo or socket type device in the ZFS
filesystem it's critical that the rdev is stored in a SA. This
was already being correctly done for character and block devices,
but that logic needed to be extended to include FIFOs and sockets.
This patch takes care of device creation but a follow on patch
may still be required to verify that the dev_t is being correctly
packed/unpacked from the SA.
It was noticed that when you have zvols in multiple datasets
not all of the zvol devices are created at module load time.
Fajarnugraha did the leg work to identify that the root cause of
this bug is a non-zero return value from zvol_create_minors_cb().
Returning a non-zero value from the dmu_objset_find_spa() callback
function results in aborting processing the remaining children in
a dataset. Since we want to ensure that the callback in run on
all children regardless of error simply unconditionally return
zero from the zvol_create_minors_cb(). This callback function
is solely used for this purpose so surpressing the error is safe.
Closes#96
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback. It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods. The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added. The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.
The fsync() callback in the file_operations structure used to take
3 arguments. The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers. To
handle this a compatibility prototype was added to ensure the right
prototype is used. Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure. To handle this we define an
appropriate xattr_handler_t typedef which can be used. This was
the preferred solution because it keeps the code clean and readable.
Initial testing has shown the the right IO scheduler to use under Linux
is noop. This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization. While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device. This yields the largest possible requests for the
device with the lowest total overhead.
While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option. You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory. In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
The following warning was observed under normal operation. It's
not fatal but it's something to be addressed long term. Flag the
offending allocation with KM_NODEBUG to suppress the warning and
flag the call site.
SPL: Showing stack for process 21761
Pid: 21761, comm: iozone Tainted: P ----------------
2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
[<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl]
[<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl]
[<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs]
[<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs]
[<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs]
[<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs]
[<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs]
[<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs]
[<ffffffff8116d905>] vfs_read+0xb5/0x1a0
[<ffffffff8116da41>] sys_read+0x51/0x90
[<ffffffff81013172>] system_call_fastpath+0x16/0x1b
When performing a 'zfs rollback' it's critical to invalidate
the previous dcache and inode cache. If we don't there will
stale cache entries which when accessed will result in EIOs.
With the recent SPL change (d599e4fa) that forces cv_destroy()
to block until all waiters have been woken. It is now unsafe
to call cv_destroy() under the zp->z_range_lock() because it
is used as the condition variable mutex. If there are waiters
cv_destroy() will block until they wake up and aquire the mutex.
However, they will never aquire the mutex because cv_destroy()
will not return allowing it's caller to drop the lock. Deadlock.
To avoid this cv_destroy() is now run asynchronously in a taskq.
This solves two problems:
1) It is no longer run under the zp->z_range_lock so no deadlock.
2) Since cv_destroy() may now block we don't want this slowing
down zfs_range_unlock() and throttling the system.
This was not as much of an issue under OpenSolaris because their
cv_destroy() implementation does not do anything. They do however
risk a bad paging request if cv_destroy() returns, the memory holding
the condition variable is free'd, and then the waiters wake up and
try to reference it. It's a very small unlikely race, but it is
possible.
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.
The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC. This has been shown to work
well for the common read(2)/write(2) case. However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache. To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache. The code is careful
to keep both copies synchronized.
When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated. For a read(2) data will be read first from the page
cache then the ARC if needed. Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.
New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region. These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage(). This will occur due to either a sync or the usual
page aging behavior. Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.
While this implementation ensures correct behavior it does have
have some drawbacks. The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files. It also adds additional complexity to the code keeping
both caches synchronized.
Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers. The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index. The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both. It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.
Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy. However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
The Linux specific xattr operations have all been located in the
file zpl_xattr.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
The Linux specific super block operations have all been located in the
file zpl_super.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
The Linux specific inode operations have all been located in the
file zpl_inode.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
The Linux specific file operations have all been located in the
file zpl_file.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks. In also
adds all the new zpl_* file to the Makefile.in. This is not a
standalone commit, you required the following zpl_* commits.
For the moment exactly how to handle xvattr is not clear. This
change largely consists of the code to comment out the offending
bits until something reasonable can be done.
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset. This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it. As such I'll just summerize the intent
of the changes which are all (or partly) in this commit. Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants. More specifically:
1) Replace all instances of zfsvfs_t with zfs_sb_t. While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this. While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.
2) Replace vnode_t with struct inode. The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one. There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date. The code has been updated
to remove all need for this type.
3) Replace all vtype_t's with umode types. Along with this shift
all uses of types to mode bits. The Solaris code would pass a
vtype which is redundant with the Linux mode. Just update all the
code to use the Linux mode macros and remove this redundancy.
4) Remove using of vn_* helpers and replace where needed with
inode helpers. The big example here is creating iput_aync to
replace vn_rele_async. Other vn helpers will be addressed as
needed but they should be be emulated. They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.
5) Update znode alloc/free code. Under Linux it's common to
embed the inode specific data with the inode itself. This removes
the need for an extra memory allocation. In zfs this information
is called a znode and it now embeds the inode with it. Allocators
have been updated accordingly.
6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.
This will be the first and last of these to large to review commits.
For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object. It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.
For portability reasons it's handy to be able to create a root
znode and basic filesystem components without requiring the full
cooperation of the VFS. We are committing to this to simply the
filesystem creations code.
This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement. These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
Minor update to ensure zfs_sync() is disabled if a kernel oops/panic
is triggered. As the comment says 'data integrity is job one'. This
change could have been done by defining panicstr to oops_in_progress
in the SPL. But I felt it was better to use the native Linux API
here since to be clear.
This flag does not need to be support under Linux. As the comment
says it was only there to support fsflush() for old filesystem like
UFS. This is not needed under Linux.
Mount option parsing is still very Linux specific and will be
handled above this zfs filesystem layer. Honoring those mount
options once set if of course the responsibility of the lower
layers.
This variable was used to ensure that the ZFS module is never
removed while the filesystem is mounted. Once again the generic
Linux VFS handles this case for us so it can be removed.
The functions zfs_mount_label_policy(), zfs_mountroot(), zfs_mount()
will not be needed because most of what they do is already handled
by the generic Linux VFS layer. They all call zfs_domount() which
creates the actual dataset, the caller of this library call which
will be in the zpl layer is responsible for what's left.
Under Linux we don't need to reserve a major or minor number for
the filesystem. We can rely on the VFS to handle colisions without
this being handled by the lower ZFS layers.
Additionally, there is no need to keep a zfsfstype around. We are
not limited on Linux by the OpenSolaris infrastructure which needed
this. The upper zpl layer can specify the filesystem type.
The ZFS code is being restructured to act as a library and a stand
alone module. This allows us to leverage most of the existing code
with minimal modification. It also means we need to drop the Solaris
vfs/vnode functions they will be replaced by Linux equivilants and
updated to be Linux friendly.
For the moment we have left ZFS unchanged and it updates many values
as part of the znode. However, some of these values should be set
in the inode. For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.
This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode. At
which point zfs_update_inode() can be dropped entirely. Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure. This differs
significantly from Solaris.
Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS. This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode(). This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
Basic compilation of the bulk of zfs_znode.c has been enabled. After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces. The following commits
will systematically replace update the requiter interfaces. There
are of course pros and cons to this decision.
Pros:
* This simplifies intergration with Linux in the long term. There is
no longer any need to manage vnodes which are a foreign concept to
the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.
Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.
This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux. While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep. This code is expected to be short lived and
removed as soon as all the remaining uio based APIs and updated.
The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common(). Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static function to this file. This way I don't
need to reimplement this functionality from scratch in the SPL.
Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed. If
that turns out to be the case they can then be removed.
Remove unneeded bootfs functions. This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.
Certain NFS/SMB share functionality is not yet in place. These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled. I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL. So I'm just
renaming the wrapper here to HAVE_SHARE. They still won't be
compiled until all the share issues are worked through. Share
support is the last missing piece from zfs_ioctl.c.
The zfs_check_global_label() function is part of the HAVE_MLSLABEL
support which was previously commented out by a HAVE_ZPL check.
Since we're still deciding what to do about mls labels wrap it
with the preexisting macro to keep it compiled out.
Unlike Solaris the Linux implementation embeds the inode in the
znode, and has no use for a vnode. So while it's true that fragmention
of the znode cache may occur it should not be worse than any of the
other Linux FS inode caches. Until proven that this is a problem it's
just added complexity we don't need.
These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios. However, this
patch readds then with they dea they can just be reworked and
the uio bits dropped.
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters. However, I've now seen several instances in
OpenSolaris code where they do the following:
cv_broadcast();
cv_destroy();
This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT. This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request. But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.
Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled. This may take some time because each waiter must
acquire the mutex.
This change may have an impact on performance for frequently
created and destroyed condition variables. That however is a price
worth paying it avoid crashing your system. If performance issues
are observed they can be addressed by the caller.
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the object won't be to large and performce is
much better for a kmem cache for them to be kmem backed.
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking. This causes processes
to go in the D state which increases the load average. The load
average is the summation of processes in D state and run queue.
To avoid this it can be desirable to sleep interruptibly. These
processes do not count against the load average but may be woken by
a signal. It is up to the caller to determine why the process
was woken it may be for one of three reasons.
1) cv_signal()/cv_broadcast()
2) the timeout expired
3) a signal was received
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem. This avoids the need to have to ugly
up the code with the required #define's at every call site. At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
During a rename we need to be careful to destroy and create a
new minor for the ZVOL _only_ if the rename succeeded. The previous
code would both destroy you minor device unconditionally, it would
also fail to create the new minor device on success.
These compiler warnings were introduced when code which was
previously #ifdef'ed out by HAVE_ZPL was re-added for use
by the posix layer. All of the following changes should be
obviously correct and will cause no semantic changes.
The issue is that cv_timedwait() sleeps uninterruptibly to block signals
and avoid waking up early. Under Linux this counts against the load
average keeping it artificially high. This change allows the arc to
sleep interruptibly which mean it may be woken up early due to a signal.
Normally this means some extra care must be taken to handle a potential
signal. But for the arcs usage of cv_timedwait() there is no harm in
waking up before the timeout expires so no extra handling is required.
To validate the correct behavior of the TSD interfaces it's
important that we add a regression test. This test is designed
to minimally exercise the fundamental TSD behavior, it does not
attempt to validate all potential corner cases.
The test will first create 32 keys via tsd_create() and register
a common destructor. Next 16 wait threads will be created each
of which set/verify a random value for all 32 keys, then block
waiting to be released by the control thread. Meanwhile the
control thread verifies that none of the destructors have been
run prematurely.
The next phase of the test is to create 16 exit threads which
set/verify a random value for all 32 keys. They then immediately
exit. This is is designed to verify tsd_exit() which will be
called via thread_exit(). This must result in all registered
destructors being run and the memory for the tsd being free'd.
After this tsd_destroy() is verified by destroying all 32 keys.
Once again we must see the expected number of destructors run
and the tsd memory free'd. At this point the blocked threads
are released and they exit calling tsd_exit() which should do
very little since all the tsd has already been destroyed.
If this all goes off without a hitch the test passes. To ensure
no memory has been leaked, I have manually verified that after
spl module unload no memory is reported leaked.
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels. This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.
The majority of the entries in the hash table are for specific tsd
entries. These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.
The hash table contains two additional type of entries. They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create(). It is used to store the address of the destructor function
and it is used as an anchor point. All tsd entries which use the same
key will be linked to this entry. This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.
The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key. The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it. This
list is using during tsd_exit() to ensure all registered destructors
are run for the process. The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid. Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
For debugging purposes the condition varaibles keep track of the
mutex used during a wait. The idea is to validate that all callers
always use the same mutex. Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe. My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex. However, there is overhead in doing this and it does
appear to be allowed under Solaris.
To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped. This ensures that while the condition variable
is in use the incorrect mutex case is detected. It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.
Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant. The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex. It turns out that was not the case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit fixes a sign extension bug affecting l2arc devices. Extremely
large offsets may be passed down to the low level block device driver on
reads, generating errors similar to
attempt to access beyond end of device
sdbi1: rw=14, want=36028797014862705, limit=125026959
The unwanted sign extension occurrs because the function arc_read_nolock()
stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel.
This offset is then passed to zio_read_phys() as a uint64_t argument, causing
sign extension for values of 0x80000000 or greater. To avoid this, we store
the offset in a uint64_t.
This change also changes a few daddr_t struct members to uint64_t in the libspl
headers to avoid similar bugs cropping up in the future. We also add an ASSERT
to __vdev_disk_physio() to check for invalid offsets.
Closes#66
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the ZFS code only contains support
back to 2.6.18 vintage kernels, I'm not adding an autoconf check
for this and simply moving everything to use fops->unlocked_ioctl().
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags. The new flag is called REQ_SYNC. To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function. Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.
Preferred interface for flagging a synchronous bio:
2.6.12-2.6.29: BIO_RW_SYNC
2.6.30-2.6.35: BIO_RW_SYNCIO
2.6.36-2.6.xx: REQ_SYNC
Commit 3ee56c292b changed an ENOTSUP return value
in one location to ENOTSUPP to fix user programs seeing an invalid ioctl()
error code. However, use of ENOTSUP is widespread in the zfs module. Instead
of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be
consistent with user space. The changed return value in the above commit is
therefore no longer needed, so this commit reverses it to maintain consistency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't. So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.
module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
enumerated type ‘krw_type_t’
Support for rolling back datasets require a functional ZPL, which we currently
do not have. The zfs command does not check for ZPL support before attempting
a rollback, and in preparation for rolling back a zvol it removes the minor
node of the device. To prevent the zvol device node from disappearing after a
failed rollback operation, this change wraps the zfs_do_rollback() function in
an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL. This is
consistent with the behavior of other ZPL dependent commands such as mount.
The orginal error message observed with this bug was rather confusing:
internal error: Unknown error 524
Aborted
This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but
Linux actually has no such error code. It should instead return EOPNOTSUPP, as
that is how ENOTSUP is defined in user space. With that we would have gotten
the somewhat more helpful message
cannot rollback 'tank/fish': unsupported version
This is rather a moot point with the above changes since we will no longer make
that ioctl call without a ZPL. But, this change updates the error code just in
case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Increasing the default zio_wr_int thread count from 8 to 16 improves
write performence by 13% on large systems. More testing need to be
done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum
of 8 threads.
Linux kernel thread names are expected to be short. This change shortens
the zio thread names to 10 characters leaving a few chracters to append
the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.
On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL
file pointer if the user tries to mount a zvol without a filesystem on it.
This change adds checks to prevent a null pointer dereference.
Closes#73.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped. This was discovered while writing the zfault.sh
tests to validate common failure modes.
This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough. In this case 1024 byte is the
default. If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t. This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error. Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.
The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit. This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.
This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured. Unfortunately,
even with FAILFAST set we may retry and request and not get an
error. This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack. However
by setting the limit at 30s we can log the event even if no error
was returned.
Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram. This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.
None of this code changes the internal behavior of ZFS. Currently
it is simply for reporting excessively long delays.
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
By default the Solaris code does not log speculative or soft io errors
in either 'zpool status' or post an event. Under Linux we don't want
to change the expected behavior of 'zpool status' so these io errors
are still suppressed there.
However, since we do need to know about these events for Linux FMA and
the 'zpool events' interface is new we do post the events. With the
addition of the zio_flags field the posted events now contain enough
information that a user space consumer can identify and discard these
events if it sees fit.
All the upper layers of zfs expect zio->io_error to be positive. I was
careful but I missed one instance in vdev_disk_physio_completion() which
could return a negative error. To ensure all cases are always caught I
had additionally added an ASSERT() to check this before zio_interpret().
Finally, as a debugging aid when zfs is build with --enable-debug all
errors from the backing block devices will be reported to the console
with an error message like this:
ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440
Observed during failure mode testing, dsl_scan_setup_sync() allocates
73920 bytes. This is way over the limit of what is wise to do with a
kmem_alloc() and it should probably be moved to a slab. For now I'm
just flagging it with KM_NODEBUG to quiet the error until this can be
revisited.
This commit fixes a bug in vdev_disk_open() in which the whole_disk property
was getting set to 0 for disk devices, even when it was stored as a 1 when the
zpool was created. The whole_disk property lets us detect when the partition
suffix should be stripped from the device name in CLI output. It is also used
to determine how writeback cache should be set for a device.
When an existing zpool is imported its configuration is read from the vdev
label by user space in zpool_read_label(). The whole_disk property is saved in
the nvlist which gets passed into the kernel, where it in turn gets saved in
the vdev struct in vdev_alloc(). Therefore, this value is available in
vdev_disk_open() and should not be overridden by checking the provided device
path, since that path will likely point to a partition and the check will
return the wrong result.
We also add an ASSERT that the whole_disk property is set. We are not aware of
any cases where vdev_disk_open() should be called with a config that doesn't
have this property set. The ASSERT is there so that when debugging is enabled
we can identify any legitimate cases that we are missing. If we never hit the
ASSERT, we can at some point remove it along with the conditional whole_disk
check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This callback is needed for properly accounting the per-uid and per-gid
space usage. Even if we don't have the ZPL, we still need this callback
in order to have proper on-disk ZPL compatibility and to be able to use
Lustre quotas.
Fortunately, the callback doesn't have any ZPL/VFS dependencies so we
can just move it out of #ifdef HAVE_ZPL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Required for the DB_DNODE_ENTER()/DB_DNODE_EXIT() helpers.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In my machine, dnode_hold_impl() allocates 9992 bytes in DEBUG mode and it
causes a large stream of stack traces in the logs. Instead, use KM_NODEBUG
to quiet down this known large alloc.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices. Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The upstream ZFS code has correctly moved to a faster native sha2
implementation. Unfortunately, under Linux that's going to be a little
problematic so we revert the code to the more portable version contained
in earlier ZFS releases. Using the native sha2 implementation in Linux
is possible but the API is slightly different in kernel version user
space depending on which libraries are used. Ideally, we need a fast
implementation of SHA256 which builds as part of ZFS this shouldn't be
that hard to do but it will take some effort.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This branch contains the majority of the changes required to cleanly
intergrate with Linux style special devices (/dev/zfs). Mainly this
means dropping all the Solaris style callbacks and replacing them
with the Linux equivilants.
This patch also adds the onexit infrastructure needed to track
some minimal state between ioctls. Under Linux it would be easy
to do this simply using the file->private_data. But under Solaris
they apparent need to pass the file descriptor as part of the ioctl
data and then perform a lookup in the kernel. Once again to keep
code change to a minimum I've implemented the Solaris solution.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ZFS update to onnv_141 brought with it support for a
security label attribute called mlslabel. This feature
depends on zones to work correctly and thus I am disabling
it under Linux. Equivilant functionality could be added
at some point in the future.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch leverages the Solaris style FMA call points
in ZFS to create a user space visible event notification system
under Linux. This new system is called zevent and it unifies
all previous Solaris style ereports and sysevent notifications.
Under this Linux specific scheme when a sysevent or ereport event
occurs an nvlist describing the event is created which looks almost
exactly like a Solaris ereport. These events are queued up in the
kernel when they occur and conditionally logged to the console.
It is then up to a user space application to consume the events
and do whatever it likes with them.
To make this possible the existing /dev/zfs ABI has been extended
with two new ioctls which behave as follows.
* ZFS_IOC_EVENTS_NEXT
Get the next pending event. The kernel will keep track of the last
event consumed by the file descriptor and provide the next one if
available. If no new events are available the ioctl() will block
waiting for the next event. This ioctl may also be called in a
non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK. In the
non-blocking case if no events are available ENOENT will be returned.
It is possible that ESHUTDOWN will be returned if the ioctl() is
called while module unloading is in progress. And finally ENOMEM
may occur if the provided nvlist buffer is not large enough to
contain the entire event.
* ZFS_IOC_EVENTS_CLEAR
Clear are events queued by the kernel. The kernel will keep a fairly
large number of recent events queued, use this ioctl to clear the
in kernel list. This will effect all user space processes consuming
events.
The zpool command has been extended to use this events ABI with the
'events' subcommand. You may run 'zpool events -v' to output a
verbose log of all recent events. This is very similar to the
Solaris 'fmdump -ev' command with the key difference being it also
includes what would be considered sysevents under Solaris. You
may also run in follow mode with the '-f' option. To clear the
in kernel event queue use the '-c' option.
$ sudo cmd/zpool/zpool events -fv
TIME CLASS
May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync
class = "ereport.fs.zfs.config.sync"
ena = 0x40982b7897700001
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0xed976600de75dfa6
(end detector)
time = 0x4bec8bc3 0x2e5aed98
pool = "zpios"
pool_guid = 0xed976600de75dfa6
pool_context = 0x0
While the 'zpool events' command is handy for interactive debugging
it is not expected to be the primary consumer of zevents. This ABI
was primarily added to facilitate the addition of a user space
monitoring daemon. This daemon would consume all events posted by
the kernel and based on the type of event perform an action. For
most events simply forwarding them on to syslog is likely enough.
But this interface also cleanly allows for more sophisticated
actions to be taken such as generating an email for a failed drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add autoconf style build infrastructure to the ZFS tree. This
includes autogen.sh, configure.ac, m4 macros, some scripts/*,
and makefiles for all the core ZFS components.
Due to limited stack space recursive functions are frowned upon in
the Linux kernel. However, they often are the most elegant solution
to a problem. The following code preserves the recursive function
traverse_visitbp() but moves the local variables AND function
arguments to the heap to minimize the stack frame size. Enough
space is initially allocated on the stack for 20 levels of recursion.
This change does ugly-up-the-code but it reduces the worst case
usage from roughly 4160 bytes to 960 bytes on x86_64 archs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Implement zio_execute() as a wrapper around the static function
__zio_execute() so that we can force __zio_execute() to be inlined.
This reduces stack overhead which is important because __zio_execute()
is called recursively in several zio code paths. zio_execute() itself
cannot be inlined because it is externally visible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Eliminated local variables pointing to members of the zio struct.
Just refer to the struct members directly. This saved about 32 bytes per
call, but this function can be called recurisvely up to 19 levels deep,
so we potentially save up to 608 bytes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Certain function must never be automatically inlined by gcc because
they are stack heavy or called recursively. This patch flags all
such functions I've found as 'noinline' to prevent gcc from making
the optimization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce kernel stack usage by lzjb_compress() by moving uint16 array
off the stack and on to the heap. The exact performance implications
of this I have not measured but we absolutely need to keep stack
usage to a minimum. If/when this becomes and issue we optimize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Decrease stack usage for various call paths by forcing certain
functions to be inlined. By inlining the functions the overhead
of a new stack frame is removed at the cost of increased code size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To reduce stack overhead this topic branch moves the 128 byte
blkptr_t data strucutre in dsl_scan_visitbp() to the heap.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce stack usage in dsl_deleg_get, gcc flagged it as consuming a
whopping 1040 bytes or potentially 1/4 of a 4K stack. This patch
moves all the large structures and buffer off the stack and on to
the heap. This includes 2 zap_cursor_t structs each 52 bytes in
size, 2 zap_attribute_t structs each 280 bytes in size, and 1
256 byte char array. The total saves on the stack is 880 bytes
after you account for the 5 new pointers added.
Also the source buffer length has been increased from MAXNAMELEN
to MAXNAMELEN+strlen(MOS_DIR_NAME)+1 as described by the comment in
dsl_dir_name(). A buffer overrun may have been possible with the
slightly smaller buffer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Move dsl_dataset_t local variable from the stack to the heap.
This reduces the stack usage of this function from 2048 bytes
to 176 bytes for x84_64 arches.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce stack usage by 276 bytes by moving the snaparg struct from the
stack to the heap. We have limited stack space we must not waste.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit preserves the recursive function dbuf_hold_impl() but moves
the local variables and function arguments to the heap to minimize
the stack frame size. Enough space is initially allocated on the
stack for 20 levels of recursion. This technique was based on commit
34229a2f2a which reduced stack usage of
traverse_visitbp().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The dnode_move() functionality is only used in the kernel build.
As such we should be careful to wrap all of the related code
with '#ifdef _KERNEL' to avoid gcc warnings about unused code.
Interestingly this looks like an upstream bug as well. If for some
reason we are unable to get a zvols statistics, because perhaps the
zpool is hopelessly corrupt, we would trigger the VERIFY. This
commit adds the proper error handling just to propagate the error
back to user space. Now the user space tools still must handle this
properly but in the worst case the tool will crash or perhaps have
some missing output. That's far far better than crashing the host.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The zio_taskq_dispatch() function may be called at interrupt time
and it is critical that we never sleep.
Additionally, wrap taskq_dispatch() in a while loop because it may
fail. This is non optimal but is OK for now.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Do not use zmod.h in userspace.
This has also been filed with the ZFS team. It makes the userspace
libzpool code use the zlib API, instead of the Solaris-only and
non-standard zmod.h. The zlib API is almost identical and is a de
facto standard, so this is a no-brainer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
If your only going to allow one allocator to be used and it is defined
at compile time there is no point including the others in the build.
This patch could/should be refined for Linux to make the metaslab
configurable at run time. That might be a bit tricky however since
you would need to quiese all IO. Short of that making it configurable
as a module load option would be a reasonable compromise.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove all instances of list handling where the API is not used
and instead list data members are directly accessed. Doing this
sort of thing is bad for portability.
Additionally, ensure that list_link_init() is called on newly
created list nodes. This ensures the node is properly initialized
and does not rely on the assumption that zero'ing the list_node_t
via kmem_zalloc() is the same as proper initialization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Move xiou stat structures from a header to the dmu.c source as is
done with all the other kstat interfaces. This information is local
to dmu.c registered the xuio kstat and should stay that way.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Replace non-fatal assertion with warning. This was being observed
during testing and it should not be fatal.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the linux kernel 'current' is defined to mean the current process
and can never be used as a local variable in a function. Simply
replace all usage of 'current' with 'curr' in this function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The upstream commit cb code had a few bugs:
1) The arguments of the list_move_tail() call in txg_dispatch_callbacks()
were reversed by mistake. This caused the commit callbacks to not be
called at all.
2) ztest had a bug in ztest_dmu_commit_callbacks() where "error" was not
initialized correctly. This seems to have caused the test to always take
the simulated error code path, which made ztest unable to detect whether
commit cbs were being called for transactions that successfuly complete.
3) ztest had another bug in ztest_dmu_commit_callbacks() where the commit
cb threshold was not being compared correctly.
4) The commit cb taskq was using 'max_ncpus * 2' as the maxalloc argument
of taskq_create(), which could have caused unnecessary delays in the txg
sync thread.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix non-c90 compliant code, for the most part these changes
simply deal with where a particular variable is declared.
Under c90 it must alway be done at the very start of a block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO. In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow. This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The
original mask is then restored at vn_close() time. Hats off to the
loop driver which does something similiar for the same reason.
[...]
shrink_slab+0xdc/0x153
try_to_free_pages+0x1da/0x2d7
__alloc_pages+0x1d7/0x2da
do_generic_mapping_read+0x2c9/0x36f
file_read_actor+0x0/0x145
__generic_file_aio_read+0x14f/0x19b
generic_file_aio_read+0x34/0x39
do_sync_read+0xc7/0x104
vfs_read+0xcb/0x171
:spl:vn_rdwr+0x2b8/0x402
:zfs:vdev_file_io_start+0xad/0xe1
[...]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.
However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.
One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set. Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.
A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.
Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
In cmd/splat.c there was a comparison between an __u32 and an int. To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.
In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.
Commit 55abb0929e removed the never
used format1 argument of spl_debug_msg(). That in turn resulted
in some deadcode which should be removed since it's now useless.
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
The threads in the splat atomic:64-bit test share the data structure
atomic_priv_t ap, which lives on the kernel stack of the splat user-space
utility. If splat terminates before the threads, accesses to that memory
location by the other threads become invalid. Splat synchronizes with
the threads with the call:
wait_event_interruptible(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
Apparently, the SIGINT wakes and terminates splat prematurely, so that
GPFs or other bad things happen when the threads subsequently access ap.
This commit prevents this by using the uninterruptible form:
wait_event(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h. This will simplify
handling any further API changes in the future.
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this. That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.
Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests. Much to my surprise the unsigned
64-bit division regression tests failed.
This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel. This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed. After some investigation this turned out to be exactly
the case.
Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl. There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.
The update implementation now passed all the unsigned and signed
regression tests. This should be functional, but not fast, which is
good enough for out purposes. If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform. I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
For some reason when awk invoked by the usermode helper the command
always fails. Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distro I tested
with all had gawk installed instead of awk. Anyway, the simplest
thing to do here is to just make gawk mandatory. I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.
I didn't notice at the time but user_path_dir() was not introduced
at the same time as set_fs_pwd() change. I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25. This means
builds against 2.6.25-2.6.26 kernels were broken.
To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.
Use 3 threads and 8 tasks. Dispatch the final 3 tasks with TQ_FRONT.
The first three tasks keep the worker threads busy while we stuff the
queues. Use msleep() to force a known execution order, assuming
TQ_FRONT is properly honored. Verify that the expected completion
order occurs.
The splat_taskq_test5_order() function may be useful in more than
one test. This commit generalizes it by renaming the function to
splat_taskq_test_order() and adding a name argument instead of
assuming SPLAT_TASKQ_TEST5_NAME as the test name.
The documentation for splat taskq regression test #5 swaps the two required
completion orders in the diagram. This commit corrects the error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
On open() and initialize the buffer with the SPL version string. The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker
threads pull tasks from this high priority queue before the default
pending queue.
Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task. The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed. The upstream code has been updated
to depend entirely on the the procname member. To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels. Older kernels are
supported by having it expand to .ctl_name = X just as before.
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus. This can result in incorrect behavior from
mutex_owned() and mutex_owner(). This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.
Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.
Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit(). When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().
Finally, improve the mutex regression tests. For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread. This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.
As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().
The call to wake_up() must be moved under the spin lock because
once we drop the lock 'tp' may no longer be valid because the
creating thread has exited. This basic thread implementation
was correct, this was simply a flaw in the test case.
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
This fix was long overdue. Most of the ground work was laid long
ago to include the exact function and line number in the error message
which there was an issue with a memory allocation call. However,
probably due to lack of time at the moment that informatin never
made it in to the error message. This patch fixes that and trys
to standardize the kmem debug messages as well.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes. This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris. Minor updates were
required to all the call sites where it was included of course.
Running 'zpool create' on a 32-bit machine with an SPL compiled with
gcc 4.4.4 led to a stack overlow. This turned out to be due to some
sort of 'optimization' by gcc:
uint64_t __umoddi3(uint64_t dividend, uint64_t divisor)
{
return dividend - divisor * (dividend / divisor);
}
This code was supposed to be using __udivdi3 to implement /, but gcc
instead implemented it via __umoddi3 itself.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove RW_COUNT() from the rwlock implementation. The idea was that it
could be used as a generic wrapper for getting at the internal state
of a rwlock. While a good idea it's proven problematic to keep it
correct for multiple archs and internal implementation changes. In
short it hasn't been worth the trouble.
With that and simplicity in mind things have been updated to use the
rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD()
functions. As for rw_upgrade() it remains only implemented for
the generic rwsem implemenation. It remains to be determined if its
worth the effort of adding a custom implementation for each arch.
While I may prefer to have the system panic on an SBUG and to get
crash dump for analysis. I suspect most peoples systems are not
configured from crash dump and the best thing to so is to simply
halt the thread and print an error to the console. This way they
have a good chance of actually saving the stack trace and debug log.
Remove the kmem_set_warning() hack used by the kmem-splat regression
tests with a per-allocation flag called __GFP_NOWARN. This matches
the lower level linux flag of similar by slightly different function.
The idea is you can then explicitly set this flag on requests where
you know your breaking the max 8k rule but you need/want to do it
anyway.
This is currently used by the regression tests where we intentionally
push things to the limit but don't want the log noise. Additionally,
we are forced to use it in spl_kmem_cache_create() because by default
NR_CPUS is very large and theres no easy way to handle that.
Finally, I've added a stack_dump() call to the warning when it is
trigger to make to clear exactly where the allocation is taking place.
Using /tmp/ is a preferable default, it can always be overriden
using the module option on a case-by-case basis.
Additionally standardize some log messages based on the same
default log level used by the kernel.
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
While this does incur slightly more overhead we should be using
do_posix_clock_monotonic_gettime() for gethrtime() as described
by the existing comment.
This is a minor extension to the condition variable API to allow
for reasonable signal handling on Linux. The cv_wait() function by
definition must wait unconditionally for cv_signal()/cv_broadcast()
before waking it. This makes it impossible to woken by a signal
such as SIGTERM. The cv_wait_interruptible() function was added
to handle this case. It behaves identically to cv_wait() with the
exception that it waits interruptibly allowing a signal to wake it
up. This means you do need to be careful and check issig() after
waking.
When dumping a debug log first check that it is safe to create
a new thread and block waiting for it. If we are in an atomic
context or irqs and disabled it is not safe to sleep and we
must write out of the debug log from the current process.
During module init spl_setup()->The vn_set_pwd("/") was failing
with -EFAULT because user_path_dir() and __user_walk() both
expect 'filename' to be a user space address and it's not in
this case. To handle this the data segment size is increased
to to ensure strncpy_from_user() does not fail with -EFAULT.
Additionally, I've added a printk() warning to catch this and
log it to the console if it ever reoccurs. I thought everything
was working properly here because there consequences of this
failing are subtle and usually non-critical.
We need dependent packages to be able to include spl_config.h to
build properly. This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed. This solution
works as long as the spl_config.h is included before your projects
config.h. That turns out to be easier said than done. In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.
To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command. This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file. This eliminates
the problem entirely and makes header safe for inclusion.
Also in this change I have removed the few places in the code where
spl_config.h is included. It is now added to the gcc compile line
to ensure the config results are always available.
Finally, I have also disabled the verbose kernel builds. If you
want them back you can always build with 'make V=1'. Since things
are working now they don't need to be on by default.
Allowing MAX_ORDER-1 sized allocations for kmem based slabs have
been observed to result in deadlocks. To help prvent this limit
max kmem based slab size to MAX_ORDER-3. Just for the record
callers should not be creating slabs like this, but if they do
we should still handle it as safely as we can.
As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handle() prototype. It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.
I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions. It's not pretty but API compat changes rarely are.
Fix panic() string, which was being used as a format string, instead of an already-formatted string.
Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
This test case verifies the correct behavior of taskq_wait_id().
In particular it ensure the the following two cases are handled
properly:
1) Task ids larger than the waited for task id can run and
complete as long as there is an available worker thread.
2) All task ids lower than the waited one must complete before
unblocking even if the waited task id itself has completed.
In the initial version of taskq_lowest_id() the entire pending and
work list was locked under the tq->tq_lock to determine the lowest
outstanding taskqid. At the time this done because I was rushed
and wanted to make sure it was right... fast was secondary. Well now
fast is important too so I carefully thought through the pending
and work list management and convinced myself it is safe and correct
to simply check the first entry. I added a large comment to the source
to explain this. But basically as long as we are careful to ensure the
pending and work list stay sorted this is safe and fast.
The motivation for this chance was that I was observing as much as
10% of the total CPU time go to waiting on the tq->tq_lock when the
pending list was long. This resolves that problems and frees up
that CPU time for something useful.
This regression test could crash in splat_kmem_cache_test_reclaim()
due to a race between the slab relclaim and the normal exiting of
the thread. Specifically, the kct structure could be free'd by
the thread performing the allocations while the reclaim function
was also working on that's threads kct structure. The simplest
fix is to extend the kcp->kcp_lock over the reclaim to prevent
the kct from being freed. A better fix would be to ref count
these structures, but since is just a regression this locking
change is enough. Surprisingly this was only observed commonly
under RHEL5.4 but all platform could have hit this.
I must have been in a hurry when I wrote the vnode regression tests
because the error code handling is not correct. The Solaris vnode
API returns positive errno's, these need to be converted to negative
errno's for Linux before being passed back to user space. Otherwise
the test hardness with report the failure but errno will not be set
with the correct error code.
Additionally tests 3, 4, 6, and 7 may fail in the test file already
exists. To avoid false positives a user mode helper has added to
remove the test files in /tmp/ before running the actual test.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing. This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.
Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations. Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total. To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.
The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available. On 32-bit archs this may
not be true, and actually that's just fine. In that case the kernel will
will never be able to allocate more the 32-bits worth anyway. So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
The big fix here is the removal of kmalloc() in kv_alloc(). It used
to be true in previous kernels that kmallocs over PAGE_SIZE would
always be pages aligned. This is no longer true atleast in 2.6.31
there are no longer any alignment expectations. Since kv_alloc()
requires the resulting address to be page align we no only either
directly allocate pages in the KMC_KMEM case, or directly call
__vmalloc() both of which will always return a page aligned address.
Additionally, to avoid wasting memory size is always a power of two.
As for cleanup several helper functions were introduced to calculate
the aligned sizes of various data structures. This helps ensure no
case is accidentally missed where the alignment needs to be taken in
to account. The helpers now use P2ROUNDUP_TYPE instead of P2ROUNDUP
which is safer since the type will be explict and we no longer count
on the compiler to auto promote types hopefully as we expected.
Always wnforce minimum (SPL_KMEM_CACHE_ALIGN) and maximum (PAGE_SIZE)
alignment restrictions at cache creation time.
Use SPL_KMEM_CACHE_ALIGN in splat alignment test.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time. To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.
From linux-2.6.31 mm/page_alloc.c:1166
/*
* __GFP_NOFAIL is not to be used in new code.
*
* All __GFP_NOFAIL callers should be fixed so that they
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with
* __GFP_NOFAIL.
*/
WARN_ON_ONCE(order > 1);
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.
min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels. For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it. To summerize:
1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly. This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.
2) --enable-debug-kmem=yes by default. This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload. Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.
3) --enable-debug-kmem-tracking=no by default. This option was added
to provide a configure option to enable to detailed memory allocation
tracking. This support was always there but you had to know where to
turn it on. By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.
4) --enable-debug-kstat removed. After further reflection I can't see
why you would ever really want to turn this support off. It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c. We can now always assume the top level directory
will be there.
5) --enable-debug-callb removed. This never really did anything, it was
put in provisionally because it might have been needed. It turns out
it was not so I am just removing it to prevent confusion.
Previously Solaris style atomic primitives were implemented simply by
wrapping the desired operation in a global spinlock. This was easy to
implement at the time when I wasn't 100% sure I could safely layer the
Solaris atomic primatives on the Linux counterparts. It however was
likely not good for performance.
After more investigation however it does appear the Solaris primitives
can be layered on Linux's fairly safely. The Linux atomic_t type really
just wraps a long so we can simply cast the Solaris unsigned value to
either a atomic_t or atomic64_t. The only lingering problem for both
implementations is that Solaris provides no atomic read function. This
means reading a 64-bit value on a 32-bit arch can (and will) result in
word breaking. I was very concerned about this initially, but upon
further reflection it is a limitation of the Solaris API. So really
we are just being bug-for-bug compatible here.
With this change the default implementation is layered on top of Linux
atomic types. However, because we're assuming a lot about the internal
implementation of those types I've made it easy to fall-back to the
generic approach. Simply build with --enable-atomic_spinlocks if
issues are encountered with the new implementation.
The cmn_err/vcmn_err functions are layered on top of the debug
system which usually expects a newline at the end. However, there
really doesn't need to be a newline there and there in fact should
not be for the CE_CONT case so let's just drop the warning.
Also we make a half-hearted attempt to handle a leading ! which
means only send it to the syslog not the console. In this case
we just send to the the debug logs and not the console.
As of 2.6.25 kobj->k_name was replaced with kobj->name. Some distros
such as RHEL5 (2.6.18) add a patch to prevent this from being a problem
but other older distros such as SLES10 (2.6.16) have not. To avoid
the whole issue I'm updating the code to use kobject_set_name() which
does what I want and has existed all the way back to 2.6.11.
Ricardo has pointed out that under Solaris the cwd is set to '/'
during module load, while under Linux it is set to the callers cwd.
To handle this cleanly I've reworked the module *_init()/_exit()
macros so they call a *_setup()/_cleanup() function when any SPL
dependent module is loaded or unloaded. This gives us a chance to
perform any needed modification of the process, in this case changing
the cwd. It also handily provides a way to avoid creating wrapper
init()/exit() functions because the Solaris and Linux prototypes
differ slightly. All dependent modules should now call the spl
helper macros spl_module_{init,exit}() instead of the native linux
versions.
Unfortunately, it appears that under Linux there has been no consistent
API in the kernel to set the cwd in a module. Because of this I have
had to add more autoconf magic than I'd like. However, what I have
done is correct and has been tested on RHEL5, SLES11, FC11, and CHAOS
kernels.
In addition, I have change the rootdir type from a 'void *' to the
correct 'vnode_t *' type. And I've set rootdir to a non-NULL value.
For a generic explanation of why mutexs needed to be reimplemented
to work with the kernel lock profiling see commits:
e811949a57 and
d28db80fd0
The specific changes made to the mutex implemetation are as follows.
The Linux mutex structure is now directly embedded in the kmutex_t.
This allows a kmutex_t to be directly case to a mutex struct and
passed directly to the Linux primative.
Just like with the rwlocks it is critical that these functions be
implemented as '#defines to ensure the location information is
preserved. The preprocessor can then do a direct replacement of
the Solaris primative with the linux primative.
Just as with the rwlocks we need to track the lock owner. Here
things get a little more interesting because depending on your
kernel version, and how you've built your kernel Linux may already
do this for you. If your running a 2.6.29 or newer kernel on a
SMP system the lock owner will be tracked. This was added to Linux
to support adaptive mutexs, more on that shortly. Alternately, your
kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES
in the kernel build. If neither of the above things is true for
your kernel the kmutex_t type will include and track the lock owner
to ensure correct behavior. This is all handled by a new autoconf
check called SPL_AC_MUTEX_OWNER.
Concerning adaptive mutexs these are a very recent development and
they did not make it in to either the latest FC11 of SLES11 kernels.
Ideally, I'd love to see this kernel change appear in one of these
distros because it does help performance. From Linux kernel commit:
0d66bf6d3514b35eb6897629059443132992dbd7
"Testing with Ingo's test-mutex application...
gave a 345% boost for VFS scalability on my testbox"
However, if you don't want to backport this change yourself you
can still simply export the task_curr() symbol. The kmutex_t
implementation will use this symbol when it's available to
provide it's own adaptive mutexs.
Finally, DEBUG_MUTEX support was removed including the proc handlers.
This was done because now that we are cleanly integrated with the
kernel profiling all this information and much much more is available
in debug kernel builds. This code was now redundant.
Update mutexs validated on:
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
The behavior of RW_*_HELD was updated because it was not quite right.
It is not sufficient to return non-zero when the lock is help, we must
only do this when the current task in the holder.
This means we need to track the lock owner which is not something
tracked in a Linux semaphore. After some experimentation the
solution I settled on was to embed the Linux semaphore at the start
of a larger krwlock_t structure which includes the owner field.
This maintains good performance and allows us to cleanly intergrate
with the kernel lock analysis tools. My reasons:
1) By placing the Linux semaphore at the start of krwlock_t we can
then simply cast krwlock_t to a rw_semaphore and pass that on to
the linux kernel. This allows us to use '#defines so the preprocessor
can do direct replacement of the Solaris primative with the linux
equivilant. This is important because it then maintains the location
information for each rw_* call point.
2) Additionally, by adding the owner to krwlock_t we can keep this
needed extra information adjacent to the lock itself. This removes
the need for a fancy lookup to get the owner which is optimal for
performance. We can also leverage the existing spin lock in the
semaphore to ensure owner is updated correctly.
3) All helper functions which do not need to strictly be implemented
as a define to preserve location information can be done as a static
inline function.
4) Adding the owner to krwlock_t allows us to remove all memory
allocations done during lock initialization. This is good for all
the obvious reasons, we do give up the ability to specific the lock
name. The Linux profiling tools will stringify the lock name used
in the code via the preprocessor and use that.
Update rwlocks validated on:
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools. This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.
The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named locked. Since the
rwsem was embedded in a wrapper structure the name was always the
same. The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.
The solution was to directly map a krwlock to a Linux rwsem using
a typedef there by eliminating the wrapper structure. This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible. It can only be done by directly accessing some of
the internal data member of the rwsem structure.
For example, the Linux API provides a different function for dropping
a reader vs writer lock. Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is. This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock. Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extracted this
information directly from arch specific lock implementation. This is
all do able, and what I did, but it does complicate things considerably.
The good news is that in addition to the profiling benefits of this
change. We may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.
The only function I was forced to sacrafice was rw_owner() because this
information is simply not stored anywhere in the rwsem. Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.
In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests. Each class of test was extended to provide
more API coverage and to be more rigerous in checking for misbehavior.
This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing. The full
SPLAT regression test suite was run numberous times on all of the following
platforms. This includes various kernels ranging from 2.6.16 to 2.6.29.
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
Basically everything we need to monitor the global memory state of
the system is now cleanly available via global_page_state(). The
problem is that this interface is still fairly recent, and there
has been one change in the page state enum which we need to handle.
These changes basically boil down to the following:
- If global_page_state() is available we should use it. Several
autoconf checks have been added to detect the correct enum names.
- If global_page_state() is not available check to see if
get_zone_counts() symbol is available and use that.
- If the get_zone_counts() symbol is not exported we have no choice
be to dynamically aquire it at load time. This is an absolute
last resort for old kernel which we don't want to patch to
cleanly export the symbol.
This interface is going away, and it's not as if most callers actually
use crhold/crfree when working with credentials. So it'll be okay
they we're not taking a reference on the task structure the odds of
it going away while working with a credential and pretty small.
The previous credential implementation simply provided the needed types and
a couple of dummy functions needed. This update correctly ties the basic
Solaris credential API in to one of two Linux kernel APIs.
Prior to 2.6.29 the linux kernel embeded all credentials in the task
structure. For these kernels, we pass around the entire task struct as if
it were the credential, then we use the helper functions to extract the
credential related bits.
As of 2.6.29 a new credential type was added which we can and do fairly
cleanly layer on top of. Once again the helper functions nicely hide
the implementation details from all callers.
Three tests were added to the splat test framework to verify basic
correctness. They should be extended as needed when need credential
functions are added.
The slab_overcommit test case could hang on a system with fragmented
memory because it was creating a kmem based slab with 256K objects.
To avoid this I've removed the KMC_KMEM flag which allows the slab
to decide if it should be kmem or vmem backed based on the object
side. The slab_lock test shares this code and will also be effected.
But the point of these two tests is to stress cache locking and
memory overcommit, the type of slab is not critical. In fact, allowing
the slab to do the default smart thing is preferable.
Simply pass the ioctl on to the normal handler. If the ioctl
helper macros are used correctly this should be safe as they
will handle the packing/unpacking of the data encoded in the
ioctl command. And actually, if the caller does not use the
IO* macros at all, and just passes small values, it will probably
be OK as well. We only get in to trouble if they try and use
the upper 32-bits. Endianness is not really a concern here, we
we are pretty much assumed they user and kernel will match.
used to scale the number of threads based on the number of online
CPUs. As CPUs are added/removed we should rescale the thread
count appropriately, but currently this is only done at create.
- Kernel modules should be built using the LINUX_OBJ Makefiles and
not the LINUX Makefiles to ensure the proper install paths are used.
- Install modules in to addon/spl/
- Ensure no additional kernel module build products are packaged.
- Simplified spl.spec.in which supports RHEL, CHAOS, SLES, FEDORA.
- Properly honor --prefix in build system and rpm spec file.
- Add '--define require_kdir' to spec file to support building
rpms against kernel sources installed in non-default locations.
- Add '--define require_kobj' to spec file to support building
rpms against kernel object installed in non-default locations.
- Stop suppressing errors in autogen.sh script.
- Improved logic to detect missing kernel objects when they are
not located with the source. This is the common case for SLES
as well as in-tree chaos kernel builds and is done to simply
support for multiple arches.
- Moved spl-devel build products to /usr/src/spl-<version>, a
spl symlink is created to reference the last installed version.
- Proper ioctl() 32/64-bit binary compatibility. We need to ensure the
ioctl data itself is always packed the same for 32/64-bit binaries.
Additionally, the correct thing to do is encode this size in bytes
as part of the command using _IOC_SIZE().
- Minor formatting changes to respect the 80 character limit.
- Move all SPLAT_SUBSYSTEM_* defines in to splat-ctl.h.
- Increase SPLAT_SUBSYSTEM_UNKNOWN because we were getting close
to accidentally using it for a real registered subsystem.
- Add compat_ioctl() handler, by default 64-bit SLES systems build 32-bit
ELF binaries. For the 32-bit binaries to pass ioctl information to a
64-bit kernel a compatibility handler needs to be registered. In our
case no additional conversions are needed to convert 32-bit ioctl()
commands to 64-bit commands so we can just call the default handler.
- Initial SLES testing uncovered a long standing bug in the debug
tracing. The tcd_for_each() macro expected a NULL to terminate
the trace_data[i] array but this was only ever true due to luck.
All trace_data[] iterators are now properly capped by TCD_TYPE_MAX.
- SPLAT_MAJOR 229 conflicted with a 'hvc' device on my SLES system.
Since this was always an arbitrary choice I picked something else.
- The HAVE_PGDAT_LIST case should set pgdat_list_addr to the value stored
at the address of the memory location returned by kallsyms_lookup_name().
- Prior to 2.6.17 there were no *_pgdat helper functions in mm/mmzone.c.
Instead for_each_zone() operated directly on pgdat_list which may or
may not have been exported depending on how your kernel was compiled.
Now new configure checks determine if you have the helpers or not, and
if the needed symbols are exported. If they are not exported then they
are dynamically aquired at runtime by kallsyms_lookup_name().
- Enable builds for powerpc ISA type.
- Add DIV_ROUND_UP and roundup macros if unavailable.
- Cast 64-bit values for %lld format string to (long long) to
quiet compile warning.
- Configure check for SLES specific API change to vfs_unlink()
and vfs_rename() which added a 'struct vfsmount *' argument.
This was for something called the linux-security-module, but
it appears that it was never adopted upstream.
- Configure check, the div64_64() function was renamed to
div64_u64() as of 2.6.26.
- Configure check, the global_page_state() fuction was introduced
in 2.6.18 kernels. The earlier 2.6.16 based SLES10 must not try
and use it, thankfully get_zone_counts() is still available.
- To simplify debugging poison all symbols aquired dynamically
using spl_kallsyms_lookup_name() with SYMBOL_POISON.
- Add console messages when the user mode helpers fail.
- spl_kmem_init_globals() use bit shifts instead of division.
- When the monotonic clock is unavailable __gethrtime() must perform
the HZ division as an 'unsigned long long' because the SPL only
implements __udivdi3(), and not __divdi3() for 'long long' division
on 32-bit arches.
We need dependent packages to be able to include spl_config.h so they
can leverage the configure checks the SPL has done. This is important
because several of the spl headers need the results of these checks to
work properly. Unfortunately, the autoheader build product is always
private to a particular build and defined certain common things.
(PACKAGE, VERSION, etc). This prevents other packages which also use
autoheader from being include because the definitions conflict. To
avoid this problem the SPL build system leverage AH_BOTTOM to include
a spl_unconfig.h at the botton of the autoheader build product. This
custom include undefs all known shared symbols to prevent the confict.
This does however mean that those definition are also not availble
to the SPL package either. The SPL package therefore uses the
equivilant SPL_META_* definitions.
In the interests of portability I have added a FC10/i686 box to
my list of development platforms. The hope is this will allow me
to keep current with upstream kernel API changes, and at the same
time ensure I don't accidentally break x86 support. This patch
resolves all remaining issues observed under that environment.
1) SPL_AC_ZONE_STAT_ITEM_FIA autoconf check added. As of 2.6.21
the kernel added a clean API for modules to get the global count
for free, inactive, and active pages. The SPL attempts to detect
if this API is available and directly map spl_global_page_state()
to global_page_state(). If the full API is not available then
spl_global_page_state() is implemented as a thin layer to get
these values via get_zone_counts() if that symbol is available.
2) New kmem:vmem_size regression test added to validate correct
vmem_size() functionality. The test case acquires the current
global vmem state, allocates from the vmem region, then verifies
the allocation is correctly reflected in the vmem_size() stats.
3) Change splat_kmem_cache_thread_test() to always use KMC_KMEM
based memory. On x86 systems with limited virtual address space
failures resulted due to exhaustig the address space. The tests
really need to problem exhausting all memory on the system thus
we need to use the physical address space.
4) Change kmem:slab_lock to cap it's memory usage at availrmem
instead of using the native linux nr_free_pages(). This provides
additional test coverage of the SPL Linux VM integration.
5) Change kmem:slab_overcommit to perform allocation of 256K
instead of 1M. On x86 based systems it is not possible to create
a kmem backed slab with entires of that size. To compensate for
this the number of allocations performed in increased by 4x.
6) Additional autoconf documentation for proposed upstream API
changes to make additional symbols available to modules.
7) Console error messages added when spl_kallsyms_lookup_name()
fails to locate an expected symbol. This causes the module to fail
to load and we need to know exactly which symbol was not available.
I'm very surprised this has not surfaced until now. But the taskq_wait()
implementation work only wait successfully the first time it was called.
Subsequent usage of taskq_wait() on the taskq would not wait.
The issue was caused by tq->tq_lowest_id being set to MAX_INT after the
first wait completed. This caused subsequent waits which check that the
waiting id is less than the lowest taskq id to always succeed. The fix
is to ensure that tq->tq_lowest_id is never set larger than tq->tq_next.id.
Additional fixes which were added to this patch include:
1) Fix a race by placing the taskq_wait_check() in the tq->tq_lock spinlock.
2) taskq_wait() should wait for the largest outstanding id.
3) Multiple spelling corrections.
4) Added taskq wait regression test to validate correct behavior.
Mainly for portability reasons I have rebased the mutex tests on Solaris
taskqs instead of linux work queues. The linux workqueue API changed post
2.6.18 kernels and using task queues avoids having to conditionally detect
which workqueue API to use.
Additionally, this is basically free additional testing for the task queues.
Much to my surprise after updating these test cases they did expose a long
standing bug in the taskq_wait() implementation. This patch does not
address that issue but the followup patch does.
Fixes hostid mismatch which leads to assertion failure when the hostid/hw_serial is a 10-character decimal number:
$ zpool status
pool: lustre
state: ONLINE
lt-zpool: zpool_main.c:3176: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.
zsh: 5262 abort zpool status
An update to the build system to properly support all commonly
used Makefile targets these include:
make all # Build everything
make install # Install everything
make clean # Clean up build products
make distclean # Clean up everything
make dist # Create package tarball
make srpm # Create package source RPM
make rpm # Create package binary RPMs
make tags # Create ctags and etags for everything
Extra care was taken to ensure that the source RPMs are fully
rebuildable against Fedora/RHEL/Chaos kernels. To build binary
RPMs from the source RPM for your system simply run:
rpmbuild --rebuild spl-x.y.z-1.src.rpm
This will produce two binary RPMs with correct 'requires'
dependencies for your kernel. One will contain all spl modules
and support utilities, the other is a devel package for compiling
additional kernel modules which are dependant on the spl.
spl-x.y.z-1_<kernel version>.x86_64.rpm
spl-devel-x.y.2-1_<kernel version>.x86_64.rpm
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address. The function name itself can them become a
define which calls a function pointer. This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel. This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.
This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name(). There are
autoconf checks to detect if the symbol is exported and if so to
use it directly. We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported. Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement. The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.
Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
of $LINUX because Module.symvers is a build product. When
$LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
$LINUX/include search paths to allow proper compilation when
the kernel target build directory is not the source directory.
Minimal support added for the zone_get_hostid() function. Only
global zones are supported therefore this function must be called
with a NULL argumment. Additionally, I've added the HW_HOSTID_LEN
define and updated all instances where a hard coded magic value
of 11 was used; "A good riddance of bad rubbish!"
Accidentally leaked list item li in error path. The fix is to
adjust this error path to ensure the allocated list item which
has not yet been added to the list gets freed. To do this we
simply add a new goto label slightly earlier to use the existing
cleanup logic and minimize the number of unique return points.
This was a false positive the callpath being walked is impossible
because the splat_kmem_cache_test_kcp_alloc() function will ensure
kcp->kcp_kcd[0] is initialized to NULL. However, there is no harm
is making this explicit for the test case so I'm adding a line to
clearly set it to correct the analysis.
This check was originally added to detect double initializations
of mutex types (which it did find). Unfortunately, Coverity is
right that there is a very small chance we could trigger the
assertion by accident because an uninitialized stack variable
happens to contain the mutex magic. This is particularly unlikely
since we do poison the mutexs when destroyed but still possible.
Therefore I'm simply removing the assertion.
- The previous magazine ageing sceme relied on the on_each_cpu()
function to call spl_magazine_age() on each cpu. It turns out
this could deadlock with do_flush_tlb_all() which also relies
on the IPI based on_each_cpu(). To avoid this problem a per-
magazine delayed work item is created and indepentantly
scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
spl_kmem_cache_t, they were hold overs from previous cleanup.
- struct work_struct work
- struct timer_list timer
- spl_slab_reclaim() 'continue' changed back to 'break' from commit
37db7d8cf9. The original was correct,
I have added a comment to ensure this does not happen again.
- spl_slab_reclaim() further optimized by moving the destructor call
in spl_slab_free() outside the skc->skc_lock. This minimizes the
length of time the spin lock is held, allows the destructors to
be invoked concurrently for different objects, and as a bonus makes
it safe (although unwise) to sleep in the destructors.
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added whichs allows callers of
spl_slab_reclaim() to cap the number of slabs reclaimed.
On Solaris all eligible slabs are always reclaimed, and this
is still the default behavior. However, I suspect that is
not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
slab/object free lists. Soft lockups were observed when
freeing large numbers of vmalloc'd slabs/objets.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
incorrect 'break' to 'continue' to ensure all slabs are
checked.
- spl_cache_age() reworked to avoid a deadlock with
do_flush_tlb_all() which occured because we slept waiting
for completion in spl_cache_age(). To waiting for magazine
reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
back to using for_each_online_cpu() instead of the
spl_on_each_cpu() approach which was of course a bad idea
due to memory allocations which Ricardo pointed out.
Added support for Solaris swapfs_minfree, and swapfs_reserve tunables.
In additional availrmem is now available and return a reasonable value
which is reasonably analogous to the Solaris meaning. On linux we
return the sun of free and inactive pages since these are all easily
reclaimable.
All tunables are available in /proc/sys/kernel/spl/vm/* and they may
need a little adjusting once we observe the real behavior. Some of
the defaults are mapped to similar linux counterparts, others are
straight from the OpenSolaris defaults.
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree. These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.
When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered. When a minor is unregistered we use the vnode
interface to unlink the special file.
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
if the older 4 argument version of on_each_cpu() should be
used or the new 3 argument version. The retry argument was
dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers. The old way this
worked was to pass a data point when initialized the workqueue.
The new API assumed the work item is embedding in a structure
and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
type checked in the bit operations. This silences the warnings.
- Updated autogen products and splat tests accordingly
- Added slab work queue task which gradually ages and free's slabs
from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
maximum number of objects without create to large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock. We
now unlink the objects and slabs from the cache itself and attach
them to a private work list. The contents of the list are then
subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performace optimizations by minimize the usage of the large
per-cache skc_lock. This includes the addition of KMC_BIT_REAPING
bit mask which is used to prevent concurrent reaping, and to defer
new slab creation when reaping is occuring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
destroyed, this is used to catch any task accessing the cache while
it is being destroyed.
- Add comments to all the functions and additional comments to try
and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
rigerously stress the cache implementation and look for any problems.
This includes correctness and performance tests.
- Updated portable work queue interfaces