Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	0cb3dafccd	Update SPLAT to use kmutex_t for portability For consistency throughout the code update the SPLAT infrastructure to use the wrapped mutex interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:07:28 -07:00
Brian Behlendorf	6203295438	Make license compatibility checks consistent Apply the license specified in the META file to ensure the compatibility checks are all performed consistently. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:07:28 -07:00
Brian Behlendorf	81857a34d1	Fix bug in SPLAT taskq:front While running SPLAT on a kernel with CONFIG_DEBUG_ATOMIC_SLEEP enabled the taskq:front was flagged as a test which might sleep in an unsafe context. Specifically, the splat_vprint() function which internally takes a mutex was being called under a spin lock. Moving the log function outside the spin lock cleanly solves this issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-03 10:42:20 -07:00
Turbo Fredriksson	e3020723dc	Linux 3.16 compat: smp_mb__after_clear_bit() The smp_mb__{before,after}_clear_bit functions have been renamed smp_mb__{before,after}_atomic. Rather than adding a compatibility function to handle this the code has been updated to use smp_wmb(). This has the advantage of being a stable functionally equivalent interface. On many architectures smp_mb__after_clear_bit() expands to smp_wmb(). Others might be able to do something slightly more efficient but this will be safe and correct on all of them. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #386	2014-09-22 16:24:55 -07:00
Richard Yao	ec18fe3ce8	Cleanup vn_rename() and vn_remove() zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented Linux 3.6+ support by adding duplicate vn_rename and vn_remove functions. The new ones were cleaner, but the duplicate functions made the codebase less maintainable. This adds some compatibility shims that allow us to retire the older vn_rename and vn_remove in favor of the new ones on old kernels. The result is a net 143 line reduction in lines of code and a cleaner codebase. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #370	2014-08-13 16:25:44 -07:00
Ned Bass	2fc44f66ec	Linux 3.17 compat: remove wait_on_bit action function Linux kernel 3.17 removes the action function argument from wait_on_bit(). Add autoconf test and compatibility macro to support the new interface. The former "wait_on_bit" interface required an 'action' function to be provided which does the actual waiting. There were over 20 such functions in the kernel, many of them identical, though most cases can be satisfied by one of just two functions: one which uses io_schedule() and one which just uses schedule(). This API change was made to consolidate all of those redundant wait functions. References: torvalds/linux@7431620 Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #378	2014-08-11 14:17:00 -07:00
Brian Behlendorf	f2297b5a89	Set spl_kmem_cache_slab_limit=16384 to default For small objects the Linux slab allocator should be used to make the most efficient use of the memory. However, large objects are not supported by the Linux slab and therefore the SPL implementation is preferred. A cutoff of 16K was determined to be optimal for architectures using 4K pages. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Issue #356 Closes #379	2014-08-08 08:51:45 -07:00
Brian Behlendorf	c1aef26944	Set spl_kmem_cache_reclaim=0 to default Reinstate the correct default behavior of returning the number of objects in the cache for reclaim. This behavior was disabled in recent releases due to occasional reports of spinning in shrink_slabs(). Those issues have been resolved and can no longer be reproduced. See commit `376dc35`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Issue #358 Closes #379	2014-08-08 08:50:03 -07:00
Tim Chase	7f23e00109	Add functions and macros as used upstream. Added highbit64() and howmany() which are used in recent upstream code. Both highbit() and highbit64() should at some point be re-factored to use the optimized fls() and fls64() functions. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #363	2014-07-22 09:47:48 -07:00
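The two helpers named in commit 7f23e00109 can be illustrated in user space. This is a minimal sketch, not the SPL code: it assumes the Illumos convention that highbit64() returns the 1-based index of the highest set bit (0 when no bit is set) and that howmany() is a round-up division. The name highbit64_sketch is hypothetical, and the loop stands in for the fls64()-based version the commit message anticipates.

```c
#include <stdint.h>

/* howmany(x, y): number of y-sized chunks needed to cover x (rounds up). */
#ifndef howmany
#define howmany(x, y) (((x) + ((y) - 1)) / (y))
#endif

/*
 * Hypothetical loop-based stand-in for highbit64(): returns the 1-based
 * position of the highest set bit, or 0 if v is zero.
 */
static int highbit64_sketch(uint64_t v)
{
    int h = 0;

    while (v != 0) {
        h++;
        v >>= 1;
    }
    return h;
}
```

For example, howmany(10, 4) covers 10 bytes with three 4-byte chunks, and highbit64_sketch(1) reports bit position 1.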
Brian Behlendorf	377e12f14a	Rate limit debugging stack traces There have been issues in the past where excessive debug logging to the console has resulted in significant performance impacts. In the vast majority of these cases only a few stack traces are required to diagnose the issue. Therefore, stack traces dumped to the console will now be limited to 5 every 60s. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Closes #374	2014-07-22 09:47:24 -07:00
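The "5 every 60s" policy from commit 377e12f14a can be sketched as a fixed-window limiter. This is a user-space illustration only, and the SPL's own implementation may work differently. The struct, constant, and function names below are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

#define TRACE_BURST     5   /* traces allowed per window (from the commit text) */
#define TRACE_INTERVAL  60  /* window length in seconds (from the commit text) */

/* Hypothetical limiter state: start of the current window and traces emitted in it. */
struct trace_limit {
    time_t window_start;
    int emitted;
};

/* Returns true if a stack trace may be dumped at time 'now'. */
static bool trace_allowed(struct trace_limit *rl, time_t now)
{
    /* Open a fresh window once the previous one has expired. */
    if (now - rl->window_start >= TRACE_INTERVAL) {
        rl->window_start = now;
        rl->emitted = 0;
    }
    if (rl->emitted >= TRACE_BURST)
        return false;
    rl->emitted++;
    return true;
}
```

A caller would check trace_allowed() before dumping a stack and silently skip the dump when it returns false.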
Tim Chase	f6a869614e	Safer debugging and assertion macros. SPL's debugging and assertion macros used the typical do/while(0) form for if/else friendliness, however, this limits their use in contexts where a do loop is not valid; such as within another multi-statement style macro. The following macros have been converted to not use do/while(0): PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL PANIC has been converted to a wrapper around the new spl_PANIC() function. The other macros have been converted to use the "&&" operator for the branch-prediction conditional and also to use spl_PANIC(). The __ASSERT() macro was not touched. It is only used by the debugging infrastructure and that code, including this macro, will be retired when the tracepoint patches are merged. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #367	2014-07-01 15:14:43 -07:00
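The macro change in commit f6a869614e is easier to see in a small example. The sketch below is a user-space approximation, not the SPL macros themselves. panic_sketch(), VERIFY_SKETCH(), and CHECKED_ADD() are hypothetical names, and the branch-prediction hint is omitted. The point is that an "&&"-based macro is an expression, so it can be used where a do/while(0) statement cannot, such as inside a comma expression in another macro.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a panic function; returns int so it can sit in an expression. */
static int panic_sketch(const char *file, int line, const char *expr)
{
    fprintf(stderr, "%s:%d: verification failed: %s\n", file, line, expr);
    abort();
    return 0; /* not reached */
}

/* Expression form: the panic call is only evaluated when the condition is false. */
#define VERIFY_SKETCH(cond) \
    ((void)(!(cond) && panic_sketch(__FILE__, __LINE__, #cond)))

/* Usable inside another expression macro, which a do/while(0) form would not allow. */
#define CHECKED_ADD(a, b) \
    (VERIFY_SKETCH((a) >= 0), VERIFY_SKETCH((b) >= 0), (a) + (b))
```

With these definitions CHECKED_ADD(2, 3) evaluates to 5, while a negative operand aborts the program.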
Brian Behlendorf	376dc35e22	Add spl_kmem_cache_reclaim module option The correct behavior for all registered shrinkers is to return the number of objects in their cache. In theory this allows the Linux VM to balance memory reclaim across all registered caches. In commit `b9b3715` this behavior was disabled in favor of returning -1 which notifies the VM that no additional objects are available for reclaim. This was done as a workaround to resolve thrashing in shrink_slabs() which could occur when memory was low and numerous cores were in reclaim. Unfortunately, this has been observed to increase the likelihood of OOM events when SPL slab consumers are responsible for consuming the majority of memory. Therefore, this patch makes this behavior tunable. Setting the spl_kmem_cache_reclaim module option to 0x1 will result in the shrinker only being called once. This is the default behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Closes #358	2014-05-22 10:30:12 -07:00
Brian Behlendorf	a073aeb060	Add KMC_SLAB cache type For small objects the Linux slab allocator has several advantages over its counterpart in the SPL. These include: 1) It is more memory-efficient and packs objects more tightly. 2) It is continually tuned to maximize performance. Therefore it makes sense to layer the SPLs slab allocator on top of the Linux slab allocator. This allows us to leverage the advantages above while preserving the Illumos semantics we depend on. However, there are some things we need to be careful of: 1) The Linux slab allocator was never designed to work well with large objects. Because the SPL slab must still handle this use case a cut off limit was added to transition from Linux slab backed objects to kmem or vmem backed slabs. spl_kmem_cache_slab_limit - Objects less than or equal to this size in bytes will be backed by the Linux slab. By default this value is zero which disables the Linux slab functionality. Reasonable values for this cut off limit are in the range of 4096-16386 bytes. spl_kmem_cache_kmem_limit - Objects less than or equal to this size in bytes will be backed by a kmem slab. Objects over this size will be vmem backed instead. This value defaults to 1/8 a page, or 512 bytes on an x86_64 architecture. 2) Be aware that using the Linux slab may inadvertently introduce new deadlocks. Care has been taken previously to ensure that all allocations which occur in the write path use GFP_NOIO. However, there may be internal allocations performed in the Linux slab which do not honor these flags. If this is the case a deadlock may occur. The path forward is definitely to start relying on the Linux slab. But for that to happen we need to start building confidence that there aren't any unexpected surprises lurking for us. And ideally need to move completely away from using the SPLs slab for large memory allocations. This patch is a first step. 
NOTES: 1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab backed caches but it is not supported for kmem/vmem backed caches. 2) Regardless of the spl_kmem_cache_*_limit settings a cache may be explicitly set to a given type by passing the KMC_KMEM, KMC_VMEM, or KMC_SLAB flags during cache creation. 3) The constructors, destructors, and reclaim callbacks are all functional and will be called regardless of the cache type. 4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to the issues involved in presenting correct object accounting. Instead they will appear in /proc/slabinfo under the same names. 5) Several kmem SPLAT tests needed to be fixed because they relied incorrectly on internal kmem slab accounting. With the updated test cases all the SPLAT tests pass as expected. 6) An autoconf test was added to ensure that the __GFP_COMP flag was correctly added to the default flags used when allocating a slab. This is required to ensure all pages in higher order slabs are properly refcounted, see `ae16ed9`. 7) When using the SLUB allocator there is no need to attempt to set the __GFP_COMP flag. This has been the default behavior for the SLUB since Linux 2.6.25. 8) When using the SLUB it may be desirable to set the slub_nomerge kernel parameter to prevent caches from being merged. Original-patch-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #356	2014-05-22 10:28:01 -07:00
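The size cutoffs described in commit a073aeb060 amount to a three-way decision. The following is a minimal sketch of that decision as the message describes it, not the SPL source. The enum and function names are hypothetical, and the explicit KMC_KMEM, KMC_VMEM, and KMC_SLAB overrides from note 2 are deliberately left out.

```c
#include <stddef.h>

/* Hypothetical labels for the three backing stores named in the commit message. */
enum cache_backing {
    BACKING_LINUX_SLAB,
    BACKING_KMEM,
    BACKING_VMEM
};

/*
 * slab_limit plays the role of spl_kmem_cache_slab_limit (0 disables the
 * Linux slab path) and kmem_limit the role of spl_kmem_cache_kmem_limit.
 */
static enum cache_backing pick_backing(size_t obj_size, size_t slab_limit,
    size_t kmem_limit)
{
    if (slab_limit != 0 && obj_size <= slab_limit)
        return BACKING_LINUX_SLAB;
    if (obj_size <= kmem_limit)
        return BACKING_KMEM;
    return BACKING_VMEM;
}
```

With a slab limit of 0 and a kmem limit of 512, a 256-byte object lands on a kmem slab. Raising the slab limit to 16384 moves the same object onto the Linux slab.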
Chunwei Chen	ad3412efd7	Linux 3.15: vfs_rename() added a flags argument Detect the updated vfs_rename() interface and call it with an extra flags argument. References: torvalds/linux@520c8b1 Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #355	2014-05-07 13:38:17 -07:00
Andrey Vesnovaty	703371d8c7	Evenly distribute the taskq threads across available CPUs The problem is described in commit `aeeb4e0c0a`. However, instead of disabling the binding to CPU altogether we just keep the last CPU index across calls to taskq_create() and thus achieve even distribution of the taskq threads across all available CPUs. The implementation is based on the assumption that task queue initialization is performed serially. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #336	2014-04-25 15:29:18 -07:00
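The idea in commit 703371d8c7, remembering the last CPU index across taskq_create() calls, can be sketched as a round-robin counter. This is an illustration under assumptions: last_cpu, next_cpu(), and assign_taskq_threads() are hypothetical names, and the unsynchronized static mirrors the commit's stated assumption that task queues are initialized serially.

```c
/* Hypothetical state kept across calls; no locking, since creation is assumed serial. */
static int last_cpu = -1;

/* Advance to the next CPU index, wrapping around at ncpus. */
static int next_cpu(int ncpus)
{
    last_cpu = (last_cpu + 1) % ncpus;
    return last_cpu;
}

/* Record which CPU each thread of a newly created queue would be bound to. */
static void assign_taskq_threads(int nthreads, int ncpus, int *cpu_of_thread)
{
    int i;

    for (i = 0; i < nthreads; i++)
        cpu_of_thread[i] = next_cpu(ncpus);
}
```

Creating two 2-thread queues on a 4-CPU system this way places them on CPUs 0 and 1, then 2 and 3, rather than stacking both queues on CPUs 0 and 1.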
Chunwei Chen	ae16ed992b	Fix crash when using ZFS on Ceph rbd When using __get_free_pages to get high order memory, only the first page's _count is set to 1; the others' will be 0. When an internal page gets passed into rbd, it will eventually go into tcp_sendpage. There, get_page and put_page are called on it, and it gets freed erroneously when _count drops back to 0. The solution to this problem is to use a compound page. All pages in a high order compound page share a single _count, so get_page and put_page in tcp_sendpage will not cause _count to drop to 0. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #251	2014-04-25 15:26:52 -07:00
Richard Yao	89aa97059d	Change spl_kmem_cache_expire default setting to 2 This behavior is more consistent with the way memory reclaim is expected to work under Linux. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #349	2014-04-14 16:29:01 -07:00
Andrey Vesnovaty	bdfbe594a1	Expose max/min objs per slab and max slab size By default maximal number of objects in slab can't exceed (162 - 1) and slab size can't exceed 32M. Today's high end servers having couple hundreds of RAM available for ARC may run into a trouble with virtual memory because of the restriction mentioned above. Problem: Reasons for very high number of virtual memory allocations: Real slab size very small relative to the size of the entire RAM * Slabs allocated on virtual memory and fill entire ARC The result is very high number of allocated virtual memory ranges (hundreds of ranges). When virtual memory subsystem manages high number of ranges its performance become so poor that it freezes from time to time. Solution: Number of objects per slab should be increased taking into account maximal slab size which can also be increased if needed. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #337	2014-04-14 09:42:04 -07:00
Chunwei Chen	545e9ac00a	Add ddi_time_after and friends When comparing times gotten from ddi_get_lbolt, we have to take account of wrap around of jiffies. Therefore, we cannot use 't1 < t2'. Instead we should use 't1 - t2 < 0'. This patch adds ddi_time_after and friends to address this issue. They have strict type restriction, clock_t for vanilla and int64_t for 64 version, to prevent type conversion from screwing things up. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #335	2014-04-14 09:32:01 -07:00
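The wrap-around problem that commit 545e9ac00a addresses can be demonstrated with a small user-space sketch. This is not the ddi_time_after() implementation: tick_after() and tick_before() are hypothetical names, and the counters are taken as unsigned so the subtraction wraps without signed-overflow undefined behavior. The cast to a signed type assumes the usual two's-complement conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if tick 'a' is later than tick 'b', even when the counter has
 * wrapped between the two samples. The difference is computed modulo
 * 2^64 and then interpreted as signed.
 */
static bool tick_after(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) < 0;
}

/* True if tick 'a' is earlier than tick 'b'. */
static bool tick_before(uint64_t a, uint64_t b)
{
    return tick_after(b, a);
}
```

If b was sampled just before the counter wrapped and a just after, a plain a > b comparison reports the wrong order, while tick_after(a, b) still returns true.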
Richard Yao	acf0ade362	Simplify hostid logic There is plenty of compatibility code for a hw_hostid that isn't used by anything. At the same time, there are apparently issues with the current hostid logic. coredumb in #zfsonlinux on freenode reported that Fedora 17 changes its hostid on every boot, which required force importing his pool. A suggestion by wca was to adopt FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does not exist Adopting FreeBSD's behavior permits us to eliminate plenty of code, including a userland helper that invokes the system's hostid as a fallback. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #224	2014-04-14 09:04:41 -07:00
Tim Chase	3ceb71e896	Call kthread_create() correctly with fixed arguments. The kernel's kthread_create() function is defined as "..." and there is no va_list variant at the moment. The task name is pre-formatted into a local buffer and passed to kthread_create() with fixed arguments. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #347	2014-04-11 09:41:40 -07:00
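The approach in commit 3ceb71e896, pre-formatting the name and then passing fixed arguments, can be sketched in user space. Everything named here is hypothetical: variadic_only_create() is a stand-in for an API such as kthread_create() that takes "..." but has no va_list variant, and create_wrapper() plays the role of the calling wrapper. The buffer size of 32 is an arbitrary choice for the illustration.

```c
#include <stdarg.h>
#include <stdio.h>

/* Records the name the stand-in API was asked to use, for inspection. */
static char created_name[32];

/* Stand-in for a variadic-only API: it accepts "..." and offers no va_list variant. */
static void variadic_only_create(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(created_name, sizeof (created_name), fmt, ap);
    va_end(ap);
}

/*
 * The wrapper cannot forward its own "..." directly, so it formats the
 * name into a local buffer and passes it on with a fixed "%s" format.
 */
static void create_wrapper(const char *namefmt, ...)
{
    char name[32];
    va_list ap;

    va_start(ap, namefmt);
    vsnprintf(name, sizeof (name), namefmt, ap);
    va_end(ap);

    variadic_only_create("%s", name);
}
```

Passing the formatted name through "%s" also keeps any stray '%' characters in the name from being interpreted as format specifiers a second time.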
Tim Chase	ed650dee76	De-inline spl_kthread_create(). The function was defined as a static inline with variable arguments which causes gcc to generate errors on some distros. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #346	2014-04-09 19:17:12 -07:00
Tim Chase	17a527cb0f	Support post-3.13 kthread_create() semantics. Provide spl_kthread_create() as a wrapper to the kernel's kthread_create() to provide pre-3.13 semantics. Re-try if the call is interrupted or if it would have returned -ENOMEM. Otherwise return NULL. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #339	2014-04-08 12:44:42 -07:00
Brian Behlendorf	e19101e08f	splat cred:groupmember: Fix false positives Due to certain assumptions made in the cred:groupmember test it could result in false positives when run on specific distributions. This was solely a bug in the test case and not in the groupmember() function which the test case was validating. To prevent future false positives the test case has been rewritten to be both more rigorous and to make fewer assumptions about the system. Minor style cleanup was done to cr_groups_search() and groupmember() functions. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-04-08 12:44:41 -07:00
Brian Behlendorf	668d2a0da5	splat kmem:slab_reclaim: Test cleanup By setting __GFP_NORETRY the kernel memory reclaim logic was allowed to abort early and dump a failed allocation stack to the console. Since this was done in a tight loop to fill memory it could result in a large number of stacks being dumped to the console. This in turn slowed down the test sufficiently so it exceeded the time limit and failed. To resolve this issue the __GFP_NORETRY flag is being removed. This is how it should have been called originally to ensure we're simulating the behavior of most callers which will use the GFP_KERNEL flag. In addition, the reclaim granularity of 1000 objects was far too coarse for this to be a realistic test. For kmem:slab_reclaim there might only be a few thousand objects total in the cache. Therefore, the SPLAT_KMEM_OBJ_RECLAIM constant for these tests was lowered. This will cause the reclaim callback to run more frequently which makes for a better test case. The frequency of the cache reaping in kmem:slab_reap was increased to accommodate the reduced number of objects released during the reclaim. These changes only impact the test cases and were done to remove false positives caused by the test case itself. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-04-08 12:44:41 -07:00
Brian Behlendorf	aeeb4e0c0a	Remove default taskq thread to CPU bindings When this code was written it appears to have been assumed that every taskq would have a large number of threads. In this case it would make sense to attempt to evenly bind the threads over all available CPUs. However, it failed to consider that creating taskqs with a small number of threads will cause the CPUs with lower ids to become over-subscribed. For this reason the kthread_bind() call is being removed and we're leaving the kernel to schedule these threads as it sees fit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #325	2014-01-07 10:46:24 -08:00
Brian Behlendorf	921a35adeb	Add module versioning Use the standard Linux MODULE_VERSION macro to expose the installed spl and splat module versions. This will also automatically add a checksum of the .c files and headers in "srcversion". See: /sys/module/spl/version /sys/module/spl/srcversion /sys/module/splat/version /sys/module/splat/srcversion Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#1923 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-12-06 11:03:43 -08:00
Richard Yao	50a0749eba	Linux 3.13 compat: Pass NULL for new delegated inode argument This check was originally added for SLES10, `a093c6a`, to check for a 'struct vfsmount *' argument which they added. However, since SLES10 is based on a 2.6.16 kernel which is no longer supported this functionality was dropped. The checks were refactored to support Linux 3.13 without concern for historical versions. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #312	2013-12-02 10:37:49 -08:00
Richard Yao	3e96de17d7	Linux 3.13 compat: Remove unused flags variable from __cv_init() GCC 4.8.1 complained about an unused flags variable when building against Linux 2.6.26.8: /var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c: In function ‘__cv_init’: /var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:39:6: error: variable ‘flags’ set but not used [-Werror=unused-but-set-variable] int flags = KM_SLEEP; ^ cc1: all warnings being treated as errors Additionally, the superfluous code uses a preempt_count variable that is no longer available on Linux 3.13. Deleting the unnecessary code fixes a Linux 3.13 compatibility issue. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #312	2013-12-02 10:11:19 -08:00
Ned Bass	184c687387	Emulate illumos interface cv_timedwait_hires() Needed for Illumos #3582. This interface is supposed to support a variable-resolution timeout with nanosecond granularity. This implementation rounds up to microsecond resolution, as nanosecond- precision timing is rarely needed for real-world performance tuning and may incur unnecessary busy-waiting. usleep_range() is used if available, otherwise udelay() or msleep() are used depending on the length of the delay interval. Add flags from sys/callo.h as these are used to control the behavior of cv_timedwait_hires(). Specifically, CALLOUT_FLAG_ABSOLUTE Normally, the expiration passed to the timeout API functions is an expiration interval. If this flag is specified, then it is interpreted as the expiration time itself. CALLOUT_FLAG_ROUNDUP Roundup the expiration time to the next resolution boundary. If this flag is not specified, the expiration time is rounded down. References: https://www.illumos.org/issues/3582 illumos/illumos-gate@0689f76 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #304	2013-11-04 09:49:24 -08:00
Ned Bass	f483a97a41	3537 add kstat_waitq_enter and friends These kstat interfaces are required to port "Illumos #3537 want pool io kstats" to ZFS on Linux. kstat_waitq_enter() kstat_waitq_exit() kstat_runq_enter() kstat_runq_exit() Additionally, zero out the ks_data buffer in __kstat_create() so that the kstat_io_t counters are initialized to zero. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:41:52 -07:00
Cyril Plisko	ffbf0e57c2	Kstat to use private lock by default While porting Illumos #3537 I found that ks_lock member of kstat_t structure is different between Illumos and SPL. It is a pointer to the kmutex_t in Illumos, but the mutex lock itself in SPL. Apparently Illumos kstat API allows consumer to override the lock if required. With SPL implementation it is not possible anymore. Things were alright until the first attempt to actually override the lock. Porting of Illumos #3537 introduced such code for the first time. In order to provide the Solaris/Illumos like functionality we: 1. convert ks_lock to "kmutex_t *ks_lock" 2. create a new field "kmutex_t ks_private_lock" 3. On kstat_create() ks_lock = &ks_private_lock Thus if consumer doesn't care we still have our internal lock in use. If, however, consumer does care she has a chance to set ks_lock to anything else before calling kstat_install(). The rest of the code will use ks_lock regardless of its origin. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #286	2013-10-25 13:41:30 -07:00
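The three-step change listed in commit ffbf0e57c2 can be sketched with simplified types. This is an illustration, not the SPL structures. struct lock_sketch is a hypothetical stand-in for kmutex_t that only counts acquisitions, and the function names are invented for the example. Only the ks_lock and ks_private_lock field names come from the commit message.

```c
/* Hypothetical stand-in for kmutex_t; it only tracks state for the demo. */
struct lock_sketch {
    int held;
    int acquisitions;
};

static void lock_enter(struct lock_sketch *l) { l->held = 1; l->acquisitions++; }
static void lock_exit(struct lock_sketch *l) { l->held = 0; }

struct kstat_sketch {
    struct lock_sketch ks_private_lock; /* step 2: internal default lock */
    struct lock_sketch *ks_lock;        /* step 1: pointer the consumer may override */
    long value;
};

/* Step 3: on creation, ks_lock points at the private lock. */
static void kstat_sketch_create(struct kstat_sketch *ks)
{
    ks->ks_private_lock.held = 0;
    ks->ks_private_lock.acquisitions = 0;
    ks->ks_lock = &ks->ks_private_lock;
    ks->value = 0;
}

/* All later code goes through ks_lock, whatever it points to. */
static void kstat_sketch_update(struct kstat_sketch *ks, long v)
{
    lock_enter(ks->ks_lock);
    ks->value = v;
    lock_exit(ks->ks_lock);
}
```

A consumer that does nothing gets the private lock. One that assigns its own lock to ks_lock before installing the kstat has every later update serialized on that lock instead.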
Brian Behlendorf	ce07767f79	Revert "Add KSTAT_TYPE_TXG type" This reverts commit `dba79fcbf2` in favor of using the generic KSTAT_TYPE_RAW callbacks. The advantage of this approach is that arbitrary types can be added without the need to add them to the SPL. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #296	2013-10-16 14:48:35 -07:00
Prakash Surya	56d40a686b	Add callbacks for displaying KSTAT_TYPE_RAW kstats The current implementation for displaying kstats of type KSTAT_TYPE_RAW is rather crude. This patch attempts to enhance this handling by allowing a kstat user to register formatting callbacks which can optionally be used. The callbacks allow the user to implement functions for interpreting their data and transposing it into a character buffer. This buffer, containing a string representation of the raw data, is then displayed through the current /proc textual interface. Additionally the kstats are made writable because it's now possible to provide a useful handler via the existing ks_update() interface. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #296	2013-10-16 14:48:35 -07:00
Brian Behlendorf	429fe89cee	Consistently use local_irq_disable/local_irq_enable It was observed that spl_kmem_cache_alloc() uses local_irq_save() and saves the interrupt state in a local variable. This would normally be fine except that spl_kmem_cache_alloc() calls spl_cache_refill() which re-enables interrupts. It is then possible that while interrupts are enabled the process is rescheduled to a different cpu before they are disabled again. This could result in us restoring the saved interrupt state from one cpu to another. What the consequences of this are aren't perfectly clear, but this is clearly a bug and it has the potential to cause issues. The code has been updated to just use local_irq_enable() and local_irq_disable() to avoid this. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-09 14:00:56 -07:00
Richard Yao	df2c0f1849	Replace current_kernel_time() with getnstimeofday() current_kernel_time() is used by the SPLAT, but it is not meant for performance measurement. We modify the SPLAT to use getnstimeofday(), which is equivalent to the gethrestime() function on Solaris. Additionally, we update gethrestime() to invoke getnstimeofday(). Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #279	2013-10-09 13:28:30 -07:00
Richard Yao	f7fd6ddd96	Linux 3.8 compat: Use kuid_t/kgid_t when required When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/gid_t are replaced by kuid_t/kgid_t, which are structures instead of integral types. This causes any code that uses an integral type to fail to build. The User Namespace functionality introduced in Linux 3.8 requires CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any kernel that supported it. We resolve this by converting between the new kuid_t/kgid_t structures and the original uid_t/gid_t types. Original-patch-by: DHE Rewrite-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #260	2013-08-09 10:09:29 -07:00
Richard Yao	e3c4d44886	PaX/GrSecurity Linux 3.8.y compat: Use __no_const on struct ctl_table The PaX team started constifying `struct ctl_table` as of their Linux 3.8.0 patchset. This led to zfsonlinux/spl#225 and Gentoo bug #463012. While investigating our options, I learned that there is a preprocessor directive called CONSTIFY_PLUGIN that we can use to detect the presence of the PaX changes and adjust the code accordingly. The PaX Team had suggested adopting ctl_table_no_const, but supporting older kernels required declaring that whenever the CONSTIFY_PLUGIN was set. Future compiler changes could potentially cause that to break in the presence of -Werror, so instead we define our own spl_ctl_table typedef and use that. This should be compatible with all PaX kernels. This introduces a Linux kernel version number check to prevent a build failure on versions of the PaX GCC plugin that existed for kernels before Linux 3.8.0. Affected versions of the PaX plugin will trigger a compiler error when they see no_const cast on a non-constified structure. Ordinarily, we would need an autotools check to catch that. However, it is safe to do a kernel version check instead of an autotools check in this specific instance because the affected versions of the PaX GCC plugin only exist for Linux kernels before 3.8.0 and the constification of `struct ctl_table` by the PaX developers only occurs in Linux 3.8.0 and later. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #225	2013-08-08 09:51:34 -07:00
Richard Yao	251e7a779b	Fix race in spl_kmem_cache_reap_now() The current code contains a race condition that triggers when bit 2 in spl.spl_kmem_cache_expire is set, spl_kmem_cache_reap_now() is invoked and another thread is concurrently accessing its magazine. spl_kmem_cache_reap_now() currently invokes spl_cache_flush() on each magazine in the same thread when bit 2 in spl.spl_kmem_cache_expire is set. This is unsafe because there is one magazine per CPU and the magazines are lockless, so it is impossible to guarantee that another CPU is not using its magazine when this function is called. The solution is to only touch the local CPU's magazine and leave other CPU's magazines to other CPUs. Reported-by: DHE Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #274	2013-08-08 09:14:41 -07:00
Richard Yao	ba06298072	Linux 3.11 compat: Replace num_physpages with totalram_pages num_physpages was removed by torvalds/linux@cfa11e08ed, so lets replace it with totalram_pages. This is a bug fix as much as it is a compatibility fix because num_physpages did not reflect the number of pages actually available to the kernel: http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html Also, there are known issues with memory calculations when ZFS is in a Xen dom0. There is a chance that using totalram_pages could resolve them. This conjecture is untested at the time of writing. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #273	2013-08-08 09:14:29 -07:00
Brian Behlendorf	ceb3872825	Fix KMC_OFFSLAB type caches Because spl_slab_size() was always returning -ENOSPC for caches of type KMC_OFFSLAB the cache could never be created. Additionally the slab size is rounded up to a page which is what kv_alloc() expects. The kv_alloc() code will minimally allocate a page, in the KMC_OFFSLAB case this could be reduced. The basic regression tests kmem:slab_small, kmem:slab_large, and kmem:slab_align regression were updated to test KMC_OFFSLAB. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ying Zhu <casualfisher@gmail.com> Closes #266	2013-07-30 15:39:23 -07:00
Brian Behlendorf	b9b3715346	Return -1 for generic kmem cache shrinker It has been observed that it's possible to get in a state where shrink_slabs() will spin repeatedly invoking the generic kmem cache shrinker. It fails to detect it's not making forward progress reclaiming from the cache and doesn't give up. To ensure this never occurs we unconditionally return -1 after reclaiming what we can. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes zfsonlinux/zfs#1276 Closes zfsonlinux/zfs#1598 Closes zfsonlinux/zfs#1432	2013-07-30 15:33:24 -07:00
James H	c47efbc7fd	Modify gethrestime to use current_kernel_time() This allows us to get nanosecond resolution. It also means we use the same time source as utimensat(now) etc. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #255	2013-07-15 09:17:19 -07:00
Brian Behlendorf	ab4e74cc38	Fix bogus kmem leak warning Commit `5c7a036` correctly relocated the creation of a taskq and the registration of the kmem_cache_shrinker after the initialization of the kmem tracking code. However, the cleanup of these structures was not done before the leak checks in spl_kmem_fini(). This resulted in an incorrect 'kmem leaked' warning even though there was no actual leak. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#1569	2013-07-10 15:08:22 -07:00
Brian Behlendorf	b1424adda5	Fix --enable-debug-kmem-tracking option This code has gotten somewhat stale and no longer builds cleanly against modern kernels. The two issues addressed here are as follows: * The hlist_*_rcu interfaces in the kernel have been relatively unstable. Since this isn't performance critical code just use the long standing hlist_* variants. * In older kernels the hash_ptr() function takes a 'void *' but in newer kernels it expects a 'const void *'. To silence the compiler warnings about this explicitly cast it to a 'void *'. The memset function is a similar case but it always expects a 'void *'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #256	2013-07-09 09:23:54 -07:00
Richard Yao	f2a745c41d	Linux 3.10 compat: Do not rely on struct proc_dir_entry definition Linux kernel commit torvalds/linux#59d8053f moved the definition of struct proc_dir_entry from include/linux/proc_fs.h to the private header fs/proc/internal.h. The SPL relied on that to map Solaris' kstat to entries in /proc/spl/kstat. Since the proc_dir_entry structure is now private the only safe thing to do is wrap the opaque proc handle with our own structure. This actually ends up simplifying the code and is good because it moves us away from depending on implementation details of /proc. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #257	2013-07-08 15:25:18 -07:00
Yuxuan Shui	79a7ab2581	Linux 3.10 compat: add missing include of linux/slab.h Linux kernel commit torvalds/linux@0d01ff2 changes some includes we were depending on through linux/proc_fs.h. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #257	2013-07-08 15:21:28 -07:00
Yuxuan Shui	1ddf9722dc	Linux 3.10 compat: replace PDE()->data with PDE_DATA() Linux kernel commit torvalds/linux@d9dda78b renamed PDE() to PDE_DATA(). To handle this detect the preferred interface and define a PDE_DATA() wrapper for consistency. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #257	2013-07-08 15:14:21 -07:00
Tim Chase	5c7a0369e2	Fix --enable-debug-kmem-tracking option Re-order initialization in spl_kmem_init to allow for kmem tracing to work. The spl_kmem_init function calls taskq_create prior to initializing the tracking (calling spl_kmem_init_tracking). Since taskq_create uses kmem_alloc, NULL dereferences occur because the global kmem_list hasn't had its next & prev pointers initialized yet. This commit moves the calls to spl_kmem_init_tracking earlier in the spl_kmem_init function in order that the subsequent kmem_alloc calls (by taskq_create) work properly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #243	2013-06-18 11:40:33 -07:00
Brian Behlendorf	99c452bbba	Fix taskq_wait_id() The existing taskq_wait_id() function can incorrectly block indefinitely. Reimplement it more simply using wait_event() in a similar fashion to taskq_wait_all(). This flaw was uncovered in the context of moving vn_rdwr() to a taskq. Previously taskq_wait_id() had no consumers outside the SPLAT task framework which is why the issue went unnoticed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-05-03 14:32:29 -07:00
Jan Engelhardt	a9e86ac4fd	gitignore: anchor entries at their respective directory .ko is specific to module, .m4 to config, etc. Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-04-02 11:07:52 -07:00
Richard Yao	feaf1e321d	Do not call cond_resched() in spl_slab_reclaim() Calling cond_resched() after each object is freed and then after each slab is freed can cause slabs of objects to live for excessive periods of time following reclamation. This interferes with the kernel's own memory management when called from kswapd and can cause direct reclaim to occur in response to memory pressure that should have been resolved. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>	2013-03-21 12:58:44 -07:00
Richard Yao	4a31e5aa9b	Linux 3.9 compat: Switch to hlist_for_each{,_rcu} torvalds/linux@b67bfe0d42 changed hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle this by switching to hlist_for_each{,_rcu}, which works across all supported kernels. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-14 10:43:34 -07:00
Richard Yao	8274ed5988	Drop support for 3 argument version of set_fs_pwd This was a suggestion that Brian Behlendorf made when reviewing an early pull request for Linux 3.9 support. This commit was made intentionally easy to revert should we ever have a reason to reintroduce support for older kernels. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-14 10:43:31 -07:00
Richard Yao	a54718cfe0	Linux 3.9 compat: set_fs_root takes const struct path * torvalds/linux@dcf787f391 enforces const-correctness in passing struct path *. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-14 10:43:29 -07:00
Richard Yao	2a305c34c8	Linux 3.9 compat: vfs_getattr takes two arguments The function prototype of vfs_getattr previously took struct vfsmount * and struct dentry * as arguments. These would always be defined together in a struct path *. torvalds/linux@3dadecce20 modified vfs_getattr to take struct path * as an argument instead. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-14 10:43:26 -07:00
Richard Yao	bc90df6688	Linux 3.9 compat: Do not depend on f_vfsmnt torvalds/linux@182be68478 removed the preprocessor definition for f_vfsmnt. The ability to access the mountpoint via ->f_path.mnt has been stable for a long time, so we switch to that. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-14 10:43:23 -07:00
Ned Bass	3d6af2dd6d	Refresh links to web site Update links to refer to the official ZFS on Linux website instead of @behlendorf's personal fork on github. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-04 19:09:34 -08:00
Brian Behlendorf	0298f3d67f	Add KMODDIR to install target Provide a mechanism to control the directory name the modules are installed in. The kernel provides INSTALL_MOD_DIR for this but it was hardcoded to be 'addon/spl'. Add a KMODDIR variable which can be passed to 'make install' to override the default directory name. While we're here change the default from 'addon/spl' to 'extra' which is the kernel.org default. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-01 16:55:06 -08:00
Brian Behlendorf	4bf3909e51	Disable automatic log dumping Long ago infrastructure was added to the SPL to keep an internal debug log of the last few seconds of activity. This was helpful during the early development, but these days it is no longer needed. I haven't had to resort to this debug buffer to resolve an issue for several years now. Today better, more generic tools like systemtap and ftrace have evolved to the point where they can be used for this purpose. Along with the stack trace dumped to the system console, and in rare cases a crash dump, we almost always have the debug we need. Therefore, I'm disabling the code which automatically dumps this log to disk during an assertion except for the case where spl_debug_panic_on_bug is set (disabled by default). This should be viewed as a first step towards either: a) retiring this infrastructure and complexity entirely, or b) integrating this logging more properly with ftrace. As part of this change I'm also removing from the packages the undocumented spl utility which is used to decode the binary logs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-02-05 16:13:27 -08:00
Brian Behlendorf	6ef94aa67a	Fix tsd_get/set() race with tsd_exit/destroy() The tsd_exit() and tsd_destroy() functions remove entries from hash bins without taking the hash bin lock. They do take the table lock, but tsd_get() and tsd_set() only take the hash bin lock to allow for maximum concurrency. The result is that while tsd_get() and tsd_set() are traversing the hash bin list it can be modified by another thread which happens to hash to the same value. To avoid this add the needed locking to tsd_exit() and tsd_destroy(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #174	2013-01-31 13:54:59 -08:00
Brian Behlendorf	0936c3449f	Add spl_kmem_cache_expire module option Cache aging was implemented because it was part of the default Solaris kmem_cache behavior. The idea is that per-cpu objects which haven't been accessed in several seconds should be returned to the cache. On the other hand Linux slabs never move objects back to the slabs unless there is memory pressure on the system. This behavior is now configurable through the 'spl_kmem_cache_expire' module option. The value is a bit mask with the following meaning. 0x1 - Solaris style cache aging eviction is enabled. 0x2 - Linux style low memory eviction is enabled. Both methods may be safely enabled simultaneously, but by default both are disabled. It has never been clear if the kmem cache aging (which has been around from day one) actually does any good. It has however been the source of numerous bugs so I wouldn't mind retiring it entirely. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#1227 Closes #210	2013-01-28 09:34:12 -08:00
Brian Behlendorf	84dd1f4f15	Remove spl_invalidate_inodes() This functionality is no longer required by ZFS, see commit zfsonlinux/zfs@7b3e34ba5a. Since there are no other consumers, and because it adds additional autoconf complexity which must be maintained the spl_invalidate_inodes() function has been removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#795	2013-01-17 11:40:47 -08:00
Brian Behlendorf	d4899f4747	kmem-cache: Fix slab ageing soft lockup Commit `a10287e00d` slightly reworked the slab ageing code such that it is no longer dependent on the Linux delayed work queue interfaces. This was good for portability and performance, but it requires us to use the on_each_cpu() function to execute the spl_magazine_age() function. That means that the function is now executing in interrupt context whereas before it was scheduled in normal process context. And that means we need to be slightly more careful about the locking in the interrupt handler. With the reworked code it's possible that we'll be holding the skc->skc_lock and be interrupted to handle the spl_magazine_age() IRQ. This will result in a deadlock and soft lockup errors unless we're careful to detect the contention and avoid taking the lock in the interrupt handler. So that's what this patch does. Alternately, (and slightly more conventionally) we could have used spin_lock_irqsave() to prevent this race entirely but I'd prefer to avoid disabling interrupts as much as possible due to performance concerns. There is absolutely no penalty for us not aging objects out of the magazine due to contention. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Closes zfsonlinux/zfs#1193	2013-01-14 10:07:58 -08:00
Ned Bass	8842263bd0	call_usermodehelper() should wait for process As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the process). A number of call sites used the number 1 instead of the constant name, so the behavior was not as expected on kernels with this change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-09 16:54:19 -08:00
Brian Behlendorf	1c7b3eaf87	RHEL 6.4 compat, fallocate() In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was introduced after the fallocate() function was moved from the inode_operations to the file_operations structure. Therefore, the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined it was safe to use f_ops->fallocate(). Unfortunately, the RHEL6.4 kernel has only backported the FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change. To address this compatibility issue the spl_filp_fallocate() helper function was added to properly detect which interface is available. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 09:53:13 -08:00
Matt Johnston	46a75aadb7	Add cv_wait_io() to account I/O time Under Linux when a task is waiting on I/O it should call the io_schedule() function for proper accounting. The Solaris cv_wait() function provides no way to specify what the cv is waiting on therefore cv_wait_io() is introduced. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #206	2013-01-07 10:29:26 -08:00
Brian Behlendorf	034f1b331e	Fix spl_kmem_init_kallsyms_lookup() panic Due to I/O buffering the helper may return successfully before the proc handler has a chance to execute. To catch this case wait up to 1 second to verify spl_kallsyms_lookup_name_fn was updated to a non SYMBOL_POISON value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#699 Closes zfsonlinux/zfs#859	2012-12-19 09:06:35 -08:00
Brian Behlendorf	33e94ef1dd	kmem-cache: Use a taskq for async allocations Shift the asynchronous allocations over to use the taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires. This code never actually used the delay functionality; it was just done this way to leverage the existing compatibility code. All that is required is a thread context to perform the allocation in. The only thing clever in this change is that we take advantage of the preallocated task queue entries to avoid a memory allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	a10287e00d	kmem-cache: Use taskqs for ageing Shift the cache and magazine ageing functionality over to the new delayed taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires. However, the delayed taskq interface does not allow us to schedule a task for a specific cpu so the ageing code was slightly reworked. The magazine ageing delay has been directly linked to the cache ageing function. The spl_cache_age() function invokes on_each_cpu() in order to run spl_magazine_age() on each cpu. It then blocks waiting for them to complete and promptly reclaims any free slabs. While restructuring the code wasn't the primary goal I think the new code is far more understandable and maintainable. It also should help minimize magazine thrashing because free slabs are immediately released after the magazine is aged. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	296a8e596d	kmem-cache: spl_kmem_cache_create() may always sleep When this code was originally written I went overboard and allowed for the possibility of creating a cache in an atomic context. In practice there are no callers which ever do this. This makes sense since a cache is by design a long lived data structure. To prevent abuse of this function going forward I'm removing the code which is supposed to handle an atomic context. All allocators have been updated to use KM_SLEEP and the might_sleep() debug macro has been added to immediately detect atomic callers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	a5a98e7260	splat taskq:front: Reduce stack frame The slightly increased size of the taskq_ent_t when debugging is enabled has pushed the taskq:front splat test over the frame size limit. To resolve this dynamically allocate the taskq_ent_t structures so they are part of the heap instead of the stack. In function 'splat_taskq_test6_impl' error: the frame size of 1648 bytes is larger than 1024 bytes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	94ff5d38e3	splat taskq:order: Reduce stack frame The slightly increased size of the taskq_ent_t when debugging is enabled has pushed the taskq:order splat test over the frame size limit. To resolve this dynamically allocate the taskq_ent_t structures so they are part of the heap instead of the stack. In function 'splat_taskq_test5_impl' error: the frame size of 1680 bytes is larger than 1024 bytes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	3238e71763	splat taskq:cancel: Add test case Add a test case for taskq_cancel_id() to verify it is working properly. Just like taskq:delay we start by dispatching 100 tasks. However this time 1/3 of the tasks use taskq_dispatch() and will be run immediately, and 2/3 use taskq_dispatch_delay(). The idea is to create a busy taskq with active, pending, and delayed tasks. After all the items have been successfully dispatched the test begins randomly canceling known task ids. It will do this for 5 seconds, randomly canceling a task id and then sleeping for a few milliseconds. The task being canceled may have already run, still be on the pending list, or may be currently being executed by a worker thread. The idea is to ensure we catch any subtle race conditions. Once all the non-canceled tasks have completed we cross check the number of tasks which ran with the number of tasks which were successfully canceled. Additionally, we verify that the taskq_cancel_id() function never blocks longer than needed. This time is bounded by the longest run time of the task which was dispatched. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:49 -08:00
Brian Behlendorf	2f35782620	splat taskq:delay: Add test case Add a test case for taskq_dispatch_delay() to verify it is working properly. The test dispatches 100 tasks to a taskq with random expiration times spread over 5 seconds. As each task expires and gets executed by a worker thread it verifies that it was run at the correct time. Once all the delayed tasks have been executed we double check that all the dispatched tasks were successful. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	d9acd930b5	taskq delay/cancel functionality Add the ability to dispatch a delayed task to a taskq. The desired behavior is for the task to be queued but not executed by a worker thread until the expiration time is reached. To achieve this two new functions were added. * taskq_dispatch_delay() - This function behaves exactly like taskq_dispatch() however it takes a third 'expire_time' argument. The caller should pass the desired time the task should be executed as an absolute value in jiffies. The task is guaranteed not to run before this time, it may run slightly later if all the worker threads are busy. * taskq_cancel_id() - Given a task id attempt to cancel the task before it gets executed. This is primarily useful for canceling delay tasks but can be used for canceling any previously dispatched task. There are three possible return values. 0 - The task was found and canceled before it was executed. ENOENT - The task was not found, either it was already run or an invalid task id was supplied by the caller. EBUSY - The task is currently executing and may not be canceled. This function will block until the task has been completed. * taskq_wait_all() - The taskq_wait_id() function was renamed taskq_wait_all() to more clearly reflect its actual behavior. It is only currently used by the splat taskq regression tests. * taskq_wait_id() - Historically, the only difference between this function and taskq_wait() was that you passed the task id. In both functions you would block until ALL lower task ids had executed. This was semantically correct but could be very slow particularly if there were delay tasks submitted. To better accommodate the delay tasks this function was reimplemented. It will now only block until the passed task id has been completed. This is actually a fairly low risk change for a few reasons. * Only new ZFS callers will make use of the new interfaces and very little common code was changed to support the new functions. * The existing taskq_wait() implementation was not changed just slightly refactored. * The newly optimized taskq_wait_id() implementation was never used by ZFS so we can't accidentally introduce a new bug there. NOTE: This functionality does not exist in the Illumos taskqs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	aed8671cb0	taskq style, remove #define wrappers When the taskq implementation was originally written I wrapped all the API functions in #define's. This was done as a preventative measure to ensure that a taskq symbol never conflicted with an existing kernel symbol. However, in practice the taskq symbols never conflicted. The only major conflicts occurred with the kmem cache API. Since this added layer of obfuscation never bought us anything for the taskq's I'm removing it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	472a34caff	taskq style, convert spaces to soft tabs Update the taskq implementation to conform with the style used throughout the rest of the code. There are no functional changes in this commit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Steven Johnson	794f145bf9	splat linux:shrinker: Fix fail-safe Ensure the fail-safe is reset between successive tests. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:04:29 -08:00
Steven Johnson	ca072ee70f	splat linux:shrinker: Fix race condition Ensure the test thread blocks until the shrinker has completed its work. This is done by putting the test thread to sleep and waking it each time the shrinker callback runs. Once the shrinker size drops to zero or we time out the test is allowed to proceed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #96 Closes #125 Closes #182	2012-12-12 09:04:11 -08:00
Steven Johnson	9b88fa165f	splat taskq:front: Fix race The taskq:front test has a race condition where task 4 and 8 race to complete, due to an incorrectly calculated set of delay "factors" (T). If task 4 wins and actually finishes first, the verification of the order of completion will fail. The delays calculated to order task completion do not take into account the terminal line in the table, and so are all off by a factor of 1. This causes all the tasks in all queues to finish sooner than expected and the accumulated error is the root cause of tasks 4 and 8 racing to complete first. Before the change the "actual" table looks like I commented in #130. I changed: * the table in the comment to correctly reflect the test and the factor timings needed. * the individual task delay factors of T so that ONLY 1 task will complete every 2T (on average). * 1T was reduced from 100ms to 50ms. This halves the duration of the test and makes any remaining raciness more likely to cause failures, but it did not cause the test to fail. * simplified the delay factor logic by using a table look-up instead of a switch. * Added a "task started" message so that with -v it is possible to see the order tasks are started. * Moved the "task completed" message inside the spinlock so that with -v the message truly reflects the absolute order of completion as guaranteed by the spinlock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #130	2012-12-05 12:23:40 -08:00
Brian Behlendorf	053678f3b0	Handle errors from spl_kern_path_locked() When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added by commit `bcb1589` an entirely new vn_remove() implementation was added. That function did not properly handle an error from spl_kern_path_locked() which would result in a panic. This patch addresses the issue by returning the error to the caller. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #187	2012-12-03 12:06:25 -08:00
Brian Behlendorf	b84412a6e8	Linux compat 3.7, kernel_thread() The preferred kernel interface for creating threads has been kthread_create() for a long time now. However, several of the SPLAT tests still use the legacy kernel_thread() function which has finally been dropped (mostly). Update the condvar and rwlock SPLAT tests to use the modern interface. Frankly this is something we should have done a long time ago. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #194	2012-12-03 09:36:21 -08:00
Brian Behlendorf	043f9b5724	Disable FS reclaim when allocating new slabs Allowing the spl_cache_grow_work() function to reclaim inodes allows for two unlikely deadlocks. Therefore, we clear __GFP_FS for these allocations. The two deadlocks are: * While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function calls kmem_cache_alloc() which happens to need to allocate a new slab. To allocate the new slab we enter FS level reclaim and attempt to evict several inodes. To evict these inodes we need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it just happens that obj1 and obj2 use the same hashed lock. * Similar to the first case however instead of getting blocked on the hash lock we block in txg_wait_open() which is waiting for the next txg which isn't coming because the txg_sync thread is blocked in kmem_cache_alloc(). Note this isn't a 100% fix because vmalloc() won't strictly honor __GFP_FS. However, in practice this is sufficient because several very unlikely things must all occur concurrently. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#1101	2012-11-27 13:43:27 -08:00
Brian Behlendorf	dc1b30224f	Never spin in kmem_cache_alloc() If we are reaping from the cache and a concurrent allocation occurs then the caller must block until the reaping is complete. This is signaled by the clearing of the KMC_BIT_REAPING bit. Otherwise the caller will be in a tight loop which takes and releases the skc->skc_cache lock. When there are multiple concurrent callers the system will thrash on the lock and appear to lock up. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 15:48:39 -08:00
Brian Behlendorf	a1af8fb1ea	Optimize spl_kmem_cache_free() Because only virtual slabs may have emergency objects and these objects are guaranteed to have physical addresses, it can be easily determined if the passed object is a virtual slab object or an emergency object. This allows us to completely optimize the emergency object free case out of the common free path. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:19 -08:00
Brian Behlendorf	ed3163484d	Track emergency object in rbtree In the initial implementation emergency objects were tracked on a per-cache list. The assumption was that under normal operation we would never allocate more than a handful of these objects. So the cost of walking the list during free was expected to be negligible. However real world usage has shown that emergency objects tend to be allocated in batches. A deadlock will be detected and several thousand emergency objects will be allocated before the original blocked slab allocation can complete. Therefore the original list has been replaced by a red black tree which is sorted by the memory address of each allocated object. This bounds the worst case insertion and removal time to O(log n) which minimizes contention on the associated spin lock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:19 -08:00
Brian Behlendorf	165f13c33a	Improved vmem cached deadlock detection The entire goal of performing the slab allocations asynchronously is to be able to detect when a vmalloc() deadlocks. In this case, and only this case, do we want to start allocating emergency objects. The trick here is to minimize false positives because the overhead of tracking emergency objects is far higher than normal slab objects. With that goal in mind the code was reworked to be less sensitive to slow allocations by increasing the wait time. Once a cache is marked deadlocked all subsequent allocations which can not be satisfied with existing cache objects will immediately allocate new emergency objects. This behavior persists until the asynchronous allocation completes and clears the deadlocked flag. The result of these tweaks is that far fewer emergency objects get created which is important because this minimizes the cost of releasing them later in kmem_cache_free(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:15 -08:00
Brian Behlendorf	1112486356	splat kmem:slab_overcommit: Disabled Disable this test because it may result in an OOM event on the system which can result in the test infrastructure being killed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:57 -08:00
Brian Behlendorf	b8296bf3e6	splat atomic:64-bit: Create thread outside spin lock The Fedora 3.6 debug kernel identified the following issue where we create a thread under a spin lock. This isn't safe because sleeping could result in a deadlock. Therefore the lock is changed to a mutex so it's safe to sleep. BUG: sleeping function called from invalid context at mm/slub.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 10583, name: splat 1 lock held by splat/10583: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:57 -08:00
Brian Behlendorf	0e149d4204	splat: Fix log buffer locking The Fedora 3.6 debug kernel identified the following issue where we call copy_to_user() under a spin lock(). This used to be safe in older kernels but no longer appears to be true so the spin lock was changed to a mutex. None of this code is performance critical so allowing the process to sleep is harmless. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:56 -08:00
Brian Behlendorf	df870a697f	splat: Cleanup headers Restructure the SPLAT headers such that each test only includes the minimal set of headers it requires. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:56 -08:00
Brian Behlendorf	d2733258d0	Condition variable reference counts Reference count every entry and exit from the condition variable functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast(). This allows us to safely block in cv_destroy() until all consumers have been scheduled and are no longer accessing the condition variable memory. In addition poison the magic value at the start of cv_destroy() to ensure there are never any new callers after cv_destroy() is called. The consumer is responsible for ensuring this never occurs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:55 -08:00
Brian Behlendorf	dba79fcbf2	Add KSTAT_TYPE_TXG type Add a new kstat type for tracking useful statistics about a TXG. The new KSTAT_TYPE_TXG type can be used to track the following statistics per-txg. txg - Unique txg number state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted birth - Creation time nread - Bytes read nwritten - Bytes written reads - IOPs read writes - IOPs write open_time - Length in nanoseconds the txg was open quiesce_time - Length in nanoseconds the txg was quiescing sync_time - Length in nanoseconds the txg was syncing Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-02 15:17:40 -07:00
Brian Behlendorf	71c9f0b003	Make kstat.ks_update() callback atomic Move the kstat ks_update() callback under the ks_lock. This enables dynamically sized kstats without modification to the kstat API. * Create a kstat with the KSTAT_FLAG_VIRTUAL flag. * Register a ->ks_update() callback which does: o Frees any existing ks_data buffer. o Set ks_data_size to the kstat array size. o Set ks_data to an allocated buffer of size ks_data_size o Populate the array of buffers with the required data. The buffer allocated in the ks_update() callback is guaranteed to remain allocated and valid while the proc sequence handler iterates over the buffer. The lock will not be dropped until kstat_seq_stop() function is run making it safe for concurrent access. To allow the ks_update() callback to perform memory allocations the lock was changed to a mutex. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-23 09:36:19 -07:00
Brian Behlendorf	1e0c2c2ccf	Linux 3.7 compat, __clear_close_on_exec() removed Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec() function out of include/linux/fdtable.h and in to fs/file.c making it unavailable to the SPL. Now as it turns out we only used this function to tear down some test infrastructure for the vn_getf()/vn_releasef() SPLAT regression tests. Rather than implement even more autoconf compatibility code to handle this we just remove the test case. This also allows us to drop three existing autoconf tests. This does mean the SPLAT tests will no longer verify these functions but historically they have never been a problem. And if we feel we absolutely need this test coverage I'm sure a more portable version of the test case could be added. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #183	2012-10-18 13:36:44 -07:00
Yuxuan Shui	bcb15891ab	Linux 3.6 compat, kern_path_locked() added The kern_path_parent() function was removed from Linux 3.6 because it was observed that all the callers just want the parent dentry. The simpler kern_path_locked() function replaces kern_path_parent() and does the lookup while holding the ->i_mutex lock. This is good news for the vn implementation because it removes the need for us to handle the locking. However, it makes it harder to implement a single readable vn_remove()/vn_rename() function which is usually what we prefer. Therefore, we implement a new version of vn_remove()/vn_rename() for Linux 3.6 and newer kernels. This allows us to leave the existing working implementation untouched, and to add a simpler version for newer kernels. Long term I would very much like to see all of the vn code removed since what this code enabled is generally frowned upon in the kernel. But that can't happen until we either abandon the zpool.cache file or implement alternate infrastructure to update it correctly in user space. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #154	2012-10-14 16:26:21 -07:00
Massimo Maggi	dea3505dff	Switch KM_SLEEP to KM_PUSHPAGE In this particular instance the allocation occurred in the context of sys_msync()->...->zpl_putpage() where we must be careful not to initiate additional I/O. Signed-off-by: Massimo Maggi <massimo@mmmm.it> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-11 16:22:29 -07:00
Etienne Dechamps	bbdc6ae495	Add interface for file hole punching. This adds an interface to "punch holes" (deallocate space) in VFS files. The interface is identical to the Solaris VOP_SPACE interface. This interface is necessary for TRIM support on file vdevs. This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which was introduced in 2.6.38. For a brief time before 2.6.38 this was done using the truncate_range inode operation, which was quickly deprecated. This patch only supports FALLOC_FL_PUNCH_HOLE. This adds support for the truncate_range() inode operation to VOP_SPACE() for file hole punching. This API is deprecated and removed in 3.5, so it's only useful for old kernels. On tmpfs, the truncate_range() inode operation translates to shmem_truncate_range(). Unfortunately, this function expects the end offset to be inclusive and aligned to the end of a page. If it is not, the kernel will stop with a BUG_ON(). This patch fixes the issue by adapting to the constraints set forth by shmem_truncate_range(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #168	2012-10-04 16:22:07 -07:00
Brian Behlendorf	3050c9314f	Switch KM_SLEEP to KM_PUSHPAGE Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim. This change was originally part of `cd5ca4b` but was reverted by `330fe01`. It always should have had its own commit for exactly this reason. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-12 12:27:09 -07:00

382 Commits