Archive-Team/zfs

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	034f1b331e	Fix spl_kmem_init_kallsyms_lookup() panic Due to I/O buffering the helper may return successfully before the proc handler has a chance to execute. To catch this case wait up to 1 second to verify spl_kallsyms_lookup_name_fn was updated to a non SYMBOL_POISON value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#699 Closes zfsonlinux/zfs#859	2012-12-19 09:06:35 -08:00
Brian Behlendorf	33e94ef1dd	kmem-cache: Use a taskq for async allocations Shift the asynchronous allocations over to use the taskq interfaces. This allows us to abandon the kernels delayed work queue interface and all the compatibility code it requires. This code never actually used the delay functionality it was just done this way to leverage the existing compatibility code. All that is required is a thread context to perform the allocation in. The only thing clever in this change is that we take advantage of the preallocated task queue entries to avoid a memory allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	a10287e00d	kmem-cache: Use taskqs for ageing Shift the cache and magazine ageing functionality over to the new delayed taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires. However, the delayed taskq interface does not allow us to schedule a task for a specific cpu so the ageing code was slightly reworked. The magazine ageing delay has been directly linked to the cache ageing function. The spl_cache_age() function invokes on_each_cpu() in order to run spl_magazine_age() on each cpu. It then blocks waiting for them to complete and promptly reclaims any free slabs. While restructuring the code wasn't the primary goal I think the new code is far more understandable and maintainable. It also should help minimize magazine thrashing because free slabs are immediately released after the magazine is aged. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	296a8e596d	kmem-cache: spl_kmem_cache_create() may always sleep When this code was originally written I went overboard and allowed for the possibility of creating a cache in an atomic context. In practice there are no callers which ever do this. This makes sense since a cache is by design a long lived data structure. To prevent abuse of this function going forward I'm removing the code which is supposed to handle an atomic context. All allocators have been updated to use KM_SLEEP and the might_sleep() debug macro has been added to immediately detect atomic callers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	a5a98e7260	splat taskq:front: Reduce stack frame The slightly increased size of the taskq_ent_t when debugging is enabled has pushed the taskq:front splat test over frame size limit. To resolve this dynamically allocate the taskq_ent_t structures so they are part of the heap instead of the stack. In function 'splat_taskq_test6_impl' error: the frame size of 1648 bytes is larger than 1024 bytes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	94ff5d38e3	splat taskq:order: Reduce stack frame The slightly increased size of the taskq_ent_t when debugging is enabled has pushed the taskq:order splat test over frame size limit. To resolve this dynamically allocate the taskq_ent_t structures so they are part of the heap instead of the stack. In function 'splat_taskq_test5_impl' error: the frame size of 1680 bytes is larger than 1024 bytes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	3238e71763	splat taskq:cancel: Add test case Add a test case for taskq_cancel_id() to verify it is working properly. Just like taskq:delay we start by dispatching 100 tasks. However this time 1/3 of the tasks use taskq_dispatch() and will be run immediately, and 2/3 use taskq_dispatch_delay(). The idea is to create a busy taskq with both active, pending, and delayed tasks. After all the items have been successfully dispatched the test begins randomly canceling known task ids. It will do this for 5 seconds randomly canceling a task id and then sleeping for a few milliseconds. The task being canceled may have already run, still be on the pending list, or may be currently being executed by a worker thread. The idea is to ensure we catch any subtle race conditions. Once all the non-canceled tasks have completed we cross check the number of tasks which ran with the number of tasks which were successfully canceled. Additionally, we verify that the taskq_cancel_id() function never blocks longer than needed. This time is bounded by the longest run time of the task which was dispatched. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:49 -08:00
Brian Behlendorf	2f35782620	splat taskq:delay: Add test case Add a test case for taskq_dispatch_delay() to verify it is working properly. The test dispatches 100 tasks to a taskq with random expiration times spread over 5 seconds. As each task expires and gets executed by a worker thread it verifies that it was run at the correct time. Once all the delayed tasks have been executed we double check that all the dispatched tasks were successful. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	d9acd930b5	taskq delay/cancel functionality Add the ability to dispatch a delayed task to a taskq. The desired behavior is for the task to be queued but not executed by a worker thread until the expiration time is reached. To achieve this two new functions were added. * taskq_dispatch_delay() - This function behaves exactly like taskq_dispatch() however it takes a third 'expire_time' argument. The caller should pass the desired time the task should be executed as an absolute value in jiffies. The task is guaranteed not to run before this time, it may run slightly later if all the worker threads are busy. * taskq_cancel_id() - Given a task id attempt to cancel the task before it gets executed. This is primarily useful for canceling delay tasks but can be used for canceling any previously dispatched task. There are three possible return values. 0 - The task was found and canceled before it was executed. ENOENT - The task was not found, either it was already run or an invalid task id was supplied by the caller. EBUSY - The task is currently executing and may not be canceled. This function will block until the task has been completed. * taskq_wait_all() - The taskq_wait_id() function was renamed taskq_wait_all() to more clearly reflect its actual behavior. It is only currently used by the splat taskq regression tests. * taskq_wait_id() - Historically, the only difference between this function and taskq_wait() was that you passed the task id. In both functions you would block until ALL lower task ids had executed. This was semantically correct but could be very slow particularly if there were delay tasks submitted. To better accommodate the delay tasks this function was reimplemented. It will now only block until the passed task id has been completed. This is actually a fairly low risk change for a few reasons. * Only new ZFS callers will make use of the new interfaces and very little common code was changed to support the new functions. * The existing taskq_wait() implementation was not changed just slightly refactored. * The newly optimized taskq_wait_id() implementation was never used by ZFS so we can't accidentally introduce a new bug there. NOTE: This functionality does not exist in the Illumos taskqs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	aed8671cb0	taskq style, remove #define wrappers When the taskq implementation was originally written I wrapped all the API functions in #define's. This was done as a preventative measure to ensure that a taskq symbol never conflicted with an existing kernel symbol. However, in practice the taskq symbols never conflicted. The only major conflicts occurred with the kmem cache API. Since this added layer of obfuscation never bought us anything for the taskq's I'm removing it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	472a34caff	taskq style, convert spaces to soft tabs Update the taskq implementation to conform with the style used throughout the rest of the code. There are no functional changes in this commit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Steven Johnson	794f145bf9	splat linux:shrinker: Fix fail-safe Ensure the fail-safe is reset between successive tests. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:04:29 -08:00
Steven Johnson	ca072ee70f	splat linux:shrinker: Fix race condition Ensure the test thread blocks until the shrinker has completed its work. This is done by putting the test thread to sleep and waking it each time the shrinker callback runs. Once the shrinker size drops to zero or we time out the test is allowed to proceed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #96 Closes #125 Closes #182	2012-12-12 09:04:11 -08:00
Steven Johnson	9b88fa165f	splat taskq:front: Fix race The taskq:front test has a race condition where task 4 and 8 race to complete, due to an incorrectly calculated set of delay "factors" (T). If task 4 wins and actually finishes first, the verification of the order of completion will fail. The delays calculated to order task completion do not take into account the terminal line in the table, and so are all off by a factor of 1. This causes all the tasks in all queues to finish sooner than expected and the accumulated error is the root cause of tasks 4 and 8 racing to complete first. Before the change the "actual" table looks like I commented in #130. I changed: * the table in the comment to correctly reflect the test and the factor timings needed. * the individual task delay factors of T so that ONLY 1 task will every 2T. (on average) * 1T was reduced from 100ms to 50ms. This halves the duration of the test and makes any remaining raciness more likely to cause failures, but it did not cause the test to fail. * simplified the delay factor logic by using a table look-up instead of a switch. * Added a "task started" message so that with -v it is possible to see the order tasks are started. * Moved the "task completed" message inside the spinlock so that with -v the message truly reflects the absolute order of completion as guaranteed by the spinlock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #130	2012-12-05 12:23:40 -08:00
Brian Behlendorf	053678f3b0	Handle errors from spl_kern_path_locked() When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added by commit `bcb1589` an entirely new vn_remove() implementation was added. That function did not properly handle an error from spl_kern_path_locked() which would result in a panic. This patch addresses the issue by returning the error to the caller. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #187	2012-12-03 12:06:25 -08:00
Brian Behlendorf	b84412a6e8	Linux compat 3.7, kernel_thread() The preferred kernel interface for creating threads has been kthread_create() for a long time now. However, several of the SPLAT tests still use the legacy kernel_thread() function which has finally been dropped (mostly). Update the condvar and rwlock SPLAT tests to use the modern interface. Frankly this is something we should have done a long time ago. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #194	2012-12-03 09:36:21 -08:00
Brian Behlendorf	043f9b5724	Disable FS reclaim when allocating new slabs Allowing the spl_cache_grow_work() function to reclaim inodes allows for two unlikely deadlocks. Therefore, we clear __GFP_FS for these allocations. The two deadlocks are: * While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function calls kmem_cache_alloc() which happens to need to allocate a new slab. To allocate the new slab we enter FS level reclaim and attempt to evict several inodes. To evict these inodes we need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it just happens that obj1 and obj2 use the same hashed lock. * Similar to the first case however instead of getting blocked on the hash lock we block in txg_wait_open() which is waiting for the next txg which isn't coming because the txg_sync thread is blocked in kmem_cache_alloc(). Note this isn't a 100% fix because vmalloc() won't strictly honor __GFP_FS. However, in practice this is sufficient because several very unlikely things must all occur concurrently. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#1101	2012-11-27 13:43:27 -08:00
Brian Behlendorf	dc1b30224f	Never spin in kmem_cache_alloc() If we are reaping from the cache and a concurrent allocation occurs then the caller must block until the reaping is complete. This is signaled by the clearing of the KMC_BIT_REAPING bit. Otherwise the caller will be in a tight loop which takes and releases the skc->skc_cache lock. When there are multiple concurrent callers the system will thrash on the lock and appear to lock up. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 15:48:39 -08:00
Brian Behlendorf	a1af8fb1ea	Optimize spl_kmem_cache_free() Because only virtual slabs may have emergency objects and these objects are guaranteed to have physical addresses, it can be easily determined if the passed object is a virtual slab object or an emergency object. This allows us to completely optimize the emergency object free case out of the common free path. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:19 -08:00
Brian Behlendorf	ed3163484d	Track emergency object in rbtree In the initial implementation emergency objects were tracked on a per-cache list. The assumption was that under normal operation we would never allocate more than a handful of these objects. So the cost of walking the list during free was expected to be negligible. However real world usage has shown that emergency objects tend to be allocated in batches. A deadlock will be detected and several thousand emergency objects will be allocated before the original blocked slab allocation can complete. Therefore the original list has been replaced by a red black tree which is sorted by the memory address of each allocated object. This bounds the worst case insertion and removal time to O(log n) which minimizes contention on the associated spin lock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:19 -08:00
Brian Behlendorf	165f13c33a	Improved vmem cached deadlock detection The entire goal of performing the slab allocations asynchronously is to be able to detect when a vmalloc() deadlocks. In this case, and only this case, do we want to start allocating emergency objects. The trick here is to minimize false positives because the overhead of tracking emergency objects is far higher than normal slab objects. With that goal in mind the code was reworked to be less sensitive to slow allocations by increasing the wait time. Once a cache is marked deadlocked all subsequent allocations which can not be satisfied with existing cache objects will immediately allocate new emergency objects. This behavior persists until the asynchronous allocation completes and clears the deadlocked flag. The result of these tweaks is that far fewer emergency objects get created which is important because this minimizes the cost of releasing them later in kmem_cache_free(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:15 -08:00
Brian Behlendorf	1112486356	splat kmem:slab_overcommit: Disabled Disable this test because it may result in an OOM event on the system which can result in the test infrastructure being killed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:57 -08:00
Brian Behlendorf	b8296bf3e6	splat atomic:64-bit: Create thread outside spin lock The Fedora 3.6 debug kernel identified the following issue where we create a thread under a spin lock. This isn't safe because sleeping could result in a deadlock. Therefore the lock is changed to a mutex so it's safe to sleep. BUG: sleeping function called from invalid context at mm/slub.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 10583, name: splat 1 lock held by splat/10583: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:57 -08:00
Brian Behlendorf	0e149d4204	splat: Fix log buffer locking The Fedora 3.6 debug kernel identified the following issue where we call copy_to_user() under a spin lock(). This used to be safe in older kernels but no longer appears to be true so the spin lock was changed to a mutex. None of this code is performance critical so allowing the process to sleep is harmless. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:56 -08:00
Brian Behlendorf	df870a697f	splat: Cleanup headers Restructure the SPLAT headers such that each test only includes the minimal set of headers it requires. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:56 -08:00
Brian Behlendorf	d2733258d0	Condition variable reference counts Reference count every entry and exit from the condition variable functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast(). This allows us to safely block in cv_destroy() until all consumers have been scheduled and are no longer accessing the condition variable memory. In addition poison the magic value at the start of cv_destroy() to ensure there are never any new callers after cv_destroy() is called. The consumer is responsible for ensuring this never occurs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:55 -08:00
Brian Behlendorf	dba79fcbf2	Add KSTAT_TYPE_TXG type Add a new kstat type for tracking useful statistics about a TXG. The new KSTAT_TYPE_TXG type can be used to track the following statistics per-txg. txg - Unique txg number state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted birth - Creation time nread - Bytes read nwritten - Bytes written reads - IOPs read writes - IOPs write open_time - Length in nanoseconds the txg was open quiesce_time - Length in nanoseconds the txg was quiescing sync_time - Length in nanoseconds the txg was syncing Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-02 15:17:40 -07:00
Brian Behlendorf	71c9f0b003	Make kstat.ks_update() callback atomic Move the kstat ks_update() callback under the ks_lock. This enables dynamically sized kstats without modification to the kstat API. * Create a kstat with the KSTAT_FLAG_VIRTUAL flag. * Register a ->ks_update() callback which does: o Frees any existing ks_data buffer. o Set ks_data_size to the kstat array size. o Set ks_data to an allocated buffer of size ks_data_size o Populate the array of buffers with the required data. The buffer allocated in the ks_update() callback is guaranteed to remain allocated and valid while the proc sequence handler iterates over the buffer. The lock will not be dropped until kstat_seq_stop() function is run making it safe for concurrent access. To allow the ks_update() callback to perform memory allocations the lock was changed to a mutex. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-23 09:36:19 -07:00
Brian Behlendorf	1e0c2c2ccf	Linux 3.7 compat, __clear_close_on_exec() removed Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec() function out of include/linux/fdtable.h and in to fs/file.c making it unavailable to the SPL. Now as it turns out we only used this function to tear down some test infrastructure for the vn_getf()/vn_releasef() SPLAT regression tests. Rather than implement even more autoconf compatibility code to handle this we just remove the test case. This also allows us to drop three existing autoconf tests. This does mean the SPLAT tests will no longer verify these functions but historically they have never been a problem. And if we feel we absolutely need this test coverage I'm sure a more portable version of the test case could be added. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #183	2012-10-18 13:36:44 -07:00
Yuxuan Shui	bcb15891ab	Linux 3.6 compat, kern_path_locked() added The kern_path_parent() function was removed from Linux 3.6 because it was observed that all the callers just want the parent dentry. The simpler kern_path_locked() function replaces kern_path_parent() and does the lookup while holding the ->i_mutex lock. This is good news for the vn implementation because it removes the need for us to handle the locking. However, it makes it harder to implement a single readable vn_remove()/vn_rename() function which is usually what we prefer. Therefore, we implement a new version of vn_remove()/vn_rename() for Linux 3.6 and newer kernels. This allows us to leave the existing working implementation untouched, and to add a simpler version for newer kernels. Long term I would very much like to see all of the vn code removed since what this code enabled is generally frowned upon in the kernel. But that can't happen until we either abandon the zpool.cache file or implement alternate infrastructure to update it correctly in user space. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #154	2012-10-14 16:26:21 -07:00
Massimo Maggi	dea3505dff	Switch KM_SLEEP to KM_PUSHPAGE In this particular instance the allocation occurred in the context of sys_msync()->...->zpl_putpage() where we must be careful not to initiate additional I/O. Signed-off-by: Massimo Maggi <massimo@mmmm.it> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-11 16:22:29 -07:00
Etienne Dechamps	bbdc6ae495	Add interface for file hole punching. This adds an interface to "punch holes" (deallocate space) in VFS files. The interface is identical to the Solaris VOP_SPACE interface. This interface is necessary for TRIM support on file vdevs. This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which was introduced in 2.6.38. For a brief time before 2.6.38 this was done using the truncate_range inode operation, which was quickly deprecated. This patch only supports FALLOC_FL_PUNCH_HOLE. This adds support for the truncate_range() inode operation to VOP_SPACE() for file hole punching. This API is deprecated and removed in 3.5, so it's only useful for old kernels. On tmpfs, the truncate_range() inode operation translates to shmem_truncate_range(). Unfortunately, this function expects the end offset to be inclusive and aligned to the end of a page. If it is not, the kernel will stop with a BUG_ON(). This patch fixes the issue by adapting to the constraints set forth by shmem_truncate_range(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #168	2012-10-04 16:22:07 -07:00
Brian Behlendorf	3050c9314f	Switch KM_SLEEP to KM_PUSHPAGE Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim. This change was originally part of `cd5ca4b` but was reverted by `330fe01`. It always should have had its own commit for exactly this reason. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-12 12:27:09 -07:00
Brian Behlendorf	9b51f21841	Remove TQ_SLEEP -> KM_SLEEP mapping When the taskq code was originally written it seemed like a good idea to simply map TQ_SLEEP to KM_SLEEP. Unfortunately, this assumed that the TQ_* flags would never conflict with any of the Linux GFP_* flags. When adding the TQ_PUSHPAGE support in commit `cd5ca4b` this invariant was accidentally broken. Therefore to support TQ_PUSHPAGE, which is needed for Linux, and prevent any further confusion I have removed this direct mapping. The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined in terms of their KM_* counterparts. Instead a simple mapping function is introduced to convert TQ_* -> KM_* where needed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #171	2012-09-12 11:41:42 -07:00
Brian Behlendorf	330fe010e4	Revert "Switch KM_SLEEP to KM_PUSHPAGE" This reverts commit `cd5ca4b2f8` due to conflicts in the higher TQ_ bits which caused incorrect behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-12 10:07:48 -07:00
Brian Behlendorf	3c60f5054c	Debug cv_destroy() with mutex held There still appears to be a race in the condition variables where ->cv_mutex is set after we are woken from the cv_destroy wait queue. This might be possible when cv_destroy() is called immediately after cv_broadcast(). We had some troubles with this previously but there may still be a small race, see commit `d599e4f`. The following patch closes one small race and improves the ASSERTs such that they log the offending value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> zfsonlinux/zfs#943	2012-09-10 10:23:26 -07:00
Brian Behlendorf	95331f4437	Set KMC_NOEMERGENCY for zlib workspaces The workspace required by zlib to perform compression is roughly 512KB (order-7). These allocations are so large that we should never attempt to directly kmalloc an emergency object for them. It is far preferable to asynchronously vmalloc an additional slab in case it's needed. Then simply block waiting for an existing object to be released or for the new slab to be allocated. This can be accomplished by disabling emergency slab objects by passing the KMC_NOEMERGENCY flag at slab creation time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> zfsonlinux/zfs#917	2012-09-07 14:36:26 -07:00
Brian Behlendorf	cb5c2acebb	Add KMC_NOEMERGENCY slab flag Provide a flag to disable the use of emergency objects for a specific kmem cache. There may be instances where under no circumstances should you kmalloc() an emergency object. For example, when your cache contains very large objects (>128k). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-07 14:27:03 -07:00
Brian Behlendorf	46b3945d5d	Suppress task_hash_table_init() large allocation warning When various kernel debugging options are enabled this allocation may be larger than usual as shown by the following warning. It is in no way harmful so we suppress the warning. SPL: large kmem_alloc(40960, 0x80d0) at tsd_hash_table_init:358 (76495/76495) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #93	2012-08-30 21:02:52 -07:00
Brian Behlendorf	efcd0ca32d	Enhance SPLAT kmem:slab_overcommit test After the emergency slab objects were merged I started observing timeout failures in the kmem:slab_overcommit test. These were due to the inefficient way the slab_overcommit reclaim function was implemented, and due to the additional cost of potentially allocating tens of thousands of emergency objects and tracking them on a single list. This patch addresses the first concern by enhancing the test case to track all of the allocated objects as a linked list. This allows for a cleaner version of the reclaim function to simply release SPLAT_KMEM_OBJ_RECLAIM objects. Since this touches some common code all the tests which share these data structures were also updated. After making these changes slab_overcommit is reliably passing. However, there is certainly additional cleanup which could be done here. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-30 15:49:00 -07:00
Brian Behlendorf	cd5ca4b2f8	Switch KM_SLEEP to KM_PUSHPAGE Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	500e95c884	Revert "Disable vmalloc() direct reclaim" This reverts commit `2092cf68d8`. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	617f79de6a	Revert "Fix NULL deref in balance_pgdat()" This reverts commit `b8b6e4c453`. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	bc03e07a7c	Revert "Detect kernels that honor gfp flags passed to vmalloc()" This reverts commit `36811b4430`. Which is no longer required because there is now SPL code in place to safely handle the deadlocks the kernel patch was designed to address. Therefore we can unconditionally use vmalloc() and drop all the PF_MEMALLOC code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	d47e664ad4	Revert "Add TASKQ_NORECLAIM flag" This reverts commit `372c257233`. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:42 -07:00
Brian Behlendorf	e2dcc6e2b8	Emergency slab objects This patch is designed to resolve a deadlock which can occur with __vmalloc() based slabs. The issue is that the Linux kernel does not honor the flags passed to __vmalloc(). This makes it unsafe to use in a writeback context. Unfortunately, this is a use case ZFS depends on for correct operation. Fixing this issue in the upstream kernel was pursued and patches are available which resolve the issue. https://bugs.gentoo.org/show_bug.cgi?id=416685 However, these changes were rejected because upstream felt that using __vmalloc() in the context of writeback should never be done. Their solution was for us to rewrite parts of ZFS to accommodate the Linux VM. While that is probably the right long term solution, and it is something we want to pursue, it is not a trivial task and will likely destabilize the existing code. This work has been planned for the 0.7.0 release but in the meanwhile we want to improve the SPL slab implementation to accommodate this expected ZFS usage. This is accomplished by performing the __vmalloc() asynchronously in the context of a work queue. This doesn't prevent the possibility of the worker thread deadlocking. However, the caller can now safely block on a wait queue for the slab allocation to complete. Normally this will occur in a reasonable amount of time and the caller will be woken up when the new slab is available. The objects will then get cached in the per-cpu magazines and everything will proceed as usual. However, if the __vmalloc() deadlocks for the reasons described above, or is just very slow, then the callers on the wait queues will time out. When this rare situation occurs they will attempt to kmalloc() a single minimally sized object using the GFP_NOIO flags. This allocation will not deadlock because kmalloc() will honor the passed flags and the caller will be able to make forward progress. As long as forward progress can be maintained then even if the worker thread is deadlocked the critical thread will make progress. This will eventually allow the deadlocked worker thread to complete and normal operation will resume. These emergency allocations will likely be slow since they require contiguous pages. However, their use should be rare so the impact is expected to be minimal. If that turns out not to be the case in practice further optimizations are possible. One additional concern is if these emergency objects are long lived. Right now they are simply tracked on a list which must be walked when an object is freed. If they accumulate on a system and the list grows, freeing objects will become more expensive. This could be handled relatively easily by using a hash instead of a list, but that optimization (if needed) is left for a follow up patch. Additionally, these emergency objects could be repacked in to existing slabs as objects are freed if the kmem_cache_set_move() functionality was implemented. See issue https://github.com/zfsonlinux/spl/issues/26 for full details. This work would also help reduce ZFS's memory fragmentation problems. The /proc/spl/kmem/slab file has had two new columns added at the end. The 'emerg' column reports the current number of these emergency objects in use for the cache, and the following 'max' column shows the historical worst case. These values should give us a good idea of how often these objects are needed. Based on these values under real use cases we can tune the default behavior. Lastly, as a side benefit using a single work queue for the slab allocations should reduce cpu contention on the global virtual address space lock. This should manifest itself as reduced cpu usage for the system. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:42 -07:00
Prakash Surya	08850eddcb	Avoid calling smp_processor_id in spl_magazine_age The spl_magazine_age function had the implied assumption that it will remain on its current cpu through its execution. In order to support preempt enabled kernels, this assumption had to be removed. The spl_kmem_magazine structure now holds the cpu id of the cpu it is local to. This allows spl_magazine_age to use this field when scheduling work to be done by the magazine's local cpu. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #98	2012-08-24 09:43:22 -07:00
Richard Yao	15d0411297	Remove Makefile from non-toplevel .gitignore files When building SPL support into the kernel, ./copy-builtin will copy non-toplevel .gitignore files. These files list /Makefile, which causes git-archive to omit ./module/{spl,splat}/Makefile. The absence of these files results in build failures when SPL is selected. ZFS is unaffected because it puts Makefile in the toplevel .gitignore, which is not copied. We fix SPL by emulating that behavior. Reported-by: Fabio Erculiani <lxnay@gentoo.org> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #152	2012-08-23 12:49:04 -07:00
Prakash Surya	9baf44bc17	Wrap trace_set_debug_header in trace_[get|put]_tcd To properly support CONFIG_PREEMPT enabled kernels, we must refrain from using a CPU index when preemption is enabled. As a result, this change moves the trace_set_debug_header call (which calls smp_processor_id) within trace_get_tcd and trace_put_tcd (which disable and enable preemption respectively). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #160	2012-08-23 10:01:20 -07:00
Richard Yao	6576a1a70d	Fix incorrect type in spl_kmem_cache_set_move() parameter A preprocessor definition renders this harmless. However, it is a good idea to change this to be consistent. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>	2012-08-01 16:35:18 -07:00
Etienne Dechamps	a9f2397ee9	Determine the hostid on demand. Currently, the SPL tries to determine the hostid at module load. The hostid is usually determined by running the userland program "hostid" during module initialization. Unfortunately, when the module initializes, it may be way too soon to be able to run any userland programs. This is especially true when the module is compiled directly inside the kernel (built-in); in that case, the SPL would try to run hostid when the kernel is still initializing, which of course is doomed to fail. This patch fixes the issue by deferring hostid generation until something actually needs the hostid (that is, when zone_get_hostid() is called), thus switching from an "on-initialization" model to an "on-demand" (lazy loading) model. ZFS only needs the hostid when some pool operations are requested, and this always happens way after the kernel has finished initialization, thus solving the problem. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 15:14:02 -07:00
Etienne Dechamps	c167aadb27	Add script for builtin module building. This commit introduces a "copy-builtin" script designed to prepare a kernel source tree for building SPL as a builtin module. The script makes a full copy of all needed files, thus making the kernel source tree fully independent of the spl source package. To achieve that, some compilation flags (-include, -I) have been moved to module/Makefile. This Makefile is only used when compiling external modules; when compiling builtin modules, a Kbuild file generated by the configure-builtin script is used instead. This makes sure Makefiles inside the kernel source tree do not contain references to the spl source package. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 15:13:09 -07:00
Etienne Dechamps	38b5ff4d07	Fix undefined reference on spl_mutex_spin_max(). Commit `3160d4f56b` changed the set of conditions under which spl_mutex_spin_max would be implemented as a function by changing an #if in sys/mutex.h. The corresponding implementation file spl-mutex.c, however, has not been updated to reflect the change. This results in undefined reference errors on spl_mutex_spin_max under the following condition: ((!CONFIG_SMP || CONFIG_DEBUG_MUTEXES) && HAVE_MUTEX_OWNER && HAVE_TASK_CURR) This patch fixes the issue by using the same #if in sys/mutex.h and spl-mutex.c. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 14:54:53 -07:00
Etienne Dechamps	94aac9c9bc	Use MODULE variable in module Makefile like zfs. In zfs, each module Makefile contains a MODULE variable which contains the name of the module, and the following declarations reference this variable. In spl, there is a MODULES variable which is never used. Rename it to MODULE and use it like in zfs. This improves consistency between the two build systems. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 14:53:48 -07:00
Brian Behlendorf	e8267acd25	32-bit compat, hostid_read() Explicitly cast the sizeof in hostid_read() to prevent the following compiler warning on 32-bit systems. module/spl/spl-generic.c:490:10: error: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'unsigned int' [-Werror=format] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-07-20 11:14:04 -07:00
Richard Yao	36811b4430	Detect kernels that honor gfp flags passed to vmalloc() zfsonlinux/spl@2092cf68d8 used PF_MEMALLOC to work around a bug in the Linux kernel where allocations did not honor the gfp flags passed to vmalloc(). Unfortunately, PF_MEMALLOC has the side effect of permitting allocations to allocate pages outside of ZONE_NORMAL. This has been observed to result in the depletion of ZONE_DMA32. A kernel patch is available in the Gentoo bug tracker for this issue. https://bugs.gentoo.org/show_bug.cgi?id=416685 This negates any benefit PF_MEMALLOC provides, so we introduce an autotools check to disable the use of PF_MEMALLOC on systems with patched kernels. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #126	2012-07-11 11:44:27 -07:00
Richard Yao	973e8269bd	Constify memory management functions This prevents warnings in ZFS that were caused by changes necessary to support PaX patched kernels. When debugging is enabled, these warnings become build failures. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #131	2012-07-03 16:07:27 -07:00
Brian Behlendorf	44e406d712	PowerPC Compatibility Usage of get_current() is not supported across all architectures. The correct interface to use is the '#define current' which will map to the appropriate function, usually current_thread_info(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #119	2012-07-02 09:33:09 -07:00
Richard Yao	e0093fea58	Linux 3.4 compat, __clear_close_on_exec replaces FD_CLR torvalds/linux@1dce27c5aa introduced __clear_close_on_exec() as a replacement for FD_CLR. Further commits appear to have removed FD_CLR from the Linux source tree. This causes the following failure: error: implicit declaration of function '__FD_CLR' [-Werror=implicit-function-declaration] To correct this we update the code to use the current __clear_close_on_exec() interface for readability. Then we introduce an autotools check to determine if __clear_close_on_exec() is available. If it isn't then we define some compatibility logic which uses the older FD_CLR() interface. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #124	2012-06-13 16:18:51 -07:00
Brian Behlendorf	eaac9ba510	Fix uninit variable in slab reclaim test Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test as potentially uninitialized. In practice this will never occur but to keep gcc happy we initialize the variable to zero. Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>	2012-06-13 16:17:22 -07:00
Brian Behlendorf	2371321e8a	Fix invalid context bug In the module unload path the vm_file_cache was being destroyed under a spin lock. Because this operation might sleep it was possible, although very very unlikely, that this could result in a deadlock. This issue was identified by using a Linux debug kernel and has been fixed by moving the kmem_cache_destroy() out from under the spin lock. There is no need to lock this operation here. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#771	2012-06-11 09:17:45 -07:00
Jorgen Lundman	93b0dc92ea	Fix ARM 64-bit division Correctly implementing 64-bit division for ARM requires more than just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols. They need to be implemented in such a way that the quotient and remainder are left in specific registers after the division operation completes. This change updates the wrapper functions to accomplish this according to the official ARM Run-time ABI. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#706	2012-05-22 09:27:11 -07:00
Brian Behlendorf	38d31a1e57	Remove Solaris module emulation Originally I believed that these interfaces would be needed. However, in practice it turned out that it was more straightforward and maintainable to use the native Linux interfaces. As such, this is all dead code and can be safely removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #109	2012-05-18 13:57:44 -07:00
Prakash Surya	a9a7a01cf5	Add SPLAT test to exercise slab direct reclaim This test is designed to verify that direct reclaim is functioning as expected. We allocate a large number of objects thus creating a large number of slabs. We then apply memory pressure and expect that the direct reclaim path can easily recover those slabs. The registered reclaim function will free the objects and the slab shrinker will call it repeatedly until at least a single slab can be freed. Note it may not be possible to reclaim every last slab via direct reclaim without a failure because the shrinker_rwsem may be contended. For this reason, quickly reclaiming 3/4 of the slabs is considered a success. This should all be possible within 10 seconds. For reference, on a system with 2G of memory this test takes roughly 0.2 seconds to run. It may take longer on larger memory systems but should still easily complete in the allotted 10 seconds. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #107	2012-05-07 11:55:59 -07:00
Brian Behlendorf	b78d4b9d98	Ensure a minimum of one slab is reclaimed To minimize the chance of triggering an OOM during direct reclaim, the kmem caches have been improved to make a best effort to reclaim at least one slab when a reclaim function is registered. This helps avoid the case where objects are released but they are spread over multiple slabs so no memory gets reclaimed. Care has been taken to avoid deadlocking if the reclaim function is unable to make forward progress. Additionally, the reclaim function may be skipped entirely if there are already free slabs which can be safely reaped. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #107	2012-05-07 11:54:28 -07:00
Brian Behlendorf	06089b9e19	Ensure direct reclaim forward progress The Linux direct reclaim path uses this out of band value to determine if forward progress is being made. Normally this is incremented by kmem_freepages() which is part of the various Linux slab implementations. However, since we are using none of that infrastructure we're responsible for incrementing this count. If no forward progress is detected and a subsequent allocation fails the OOM killer will be invoked. If there was forward progress additional reclaim will be attempted via the page cache and registered shrinker until the allocation succeeds. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #107	2012-05-07 11:54:19 -07:00
Prakash Surya	c0e0fc14e3	Ignore slab cache age and delay in direct reclaim When memory pressure triggers direct memory reclaim, a slab's age and delay should not prevent it from being freed. This patch ensures these values are ignored, allowing an empty slab to be freed in this code path no matter the value of its age and delay. This prevents needless scanning of the partial slabs and has been observed to significantly reduce the total cpu usage. In addition, it should allow for snappier reclaim under memory pressure. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #102	2012-05-07 11:50:04 -07:00
Prakash Surya	cef7605c34	Throttle number of freed slabs based on nr_to_scan Previously, the SPL tried to maintain Solaris semantics by freeing all available (empty) slabs from its slab caches when the shrinker was called. This is not desirable when running on Linux. To make the SPL shrinker more Linux friendly, the actual number of freed slabs from each of the slab caches is now derived from nr_to_scan and skc_slab_objs. Additionally, an accounting bug was fixed in spl_slab_reclaim() which could cause us to reclaim one more slab than requested. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #101	2012-05-07 11:46:15 -07:00
Jorgen Lundman	ef6f91ce0c	Add missing 64-bit divide for 32-bit ARM Leverage the existing generic 64-bit division operations which were originally implemented for x86 to support ARM. All that is required is to make the symbols available to the linker with the expected names. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-05-03 10:07:54 -07:00
Prakash Surya	05b8f50c33	Update a comment to reflect new taskq internals As of the removal of the taskq work list made in commit: commit `2c02b71b14` Author: Prakash Surya <surya1@llnl.gov> Date: Mon Dec 5 17:32:48 2011 -0800 Replace tq_work_list and tq_threads in taskq_t To lay the ground work for introducing the taskq_dispatch_prealloc() interface, the tq_work_list and tq_threads fields had to be replaced with new alternatives in the taskq_t structure. the comment above taskq_wait_check has been incorrect. This change is an attempt at bringing that description more in line with the current implementation. Essentially, references to the old task work list had to be updated to reference the new taskq thread active list. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2012-04-30 10:49:15 -07:00
Brian Behlendorf	b29012b999	Remove condition variable names Long ago I added support to the spl for condition variable names because I thought they might be needed. It turns out they aren't. In fact the official Solaris cv_init(9F) man page discourages their use in the kernel. cv_init(9F) Parameters name - Descriptive string. This is obsolete and should be NULL. (Non-NULL strings are legal, but they're a waste of kernel memory.) Therefore, I'm removing them from the spl to reclaim this memory and adding an ASSERT() to ensure no new consumers are added which make use of the name. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-06 12:06:19 -07:00
Brian Behlendorf	0835057ee7	Add SPL_META_RELEASE to module load/unload messages Include the ZFS_META_RELEASE in the module load/unload messages to more clearly indicate exactly what version of the SPL has been loaded. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:11:50 -07:00
Brian Behlendorf	9a8b7a7458	Add basic dynamic kstat support Add the bare minimum functionality to support dynamic kstats. A complete kstat implementation should be done as part of issue #84. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #84	2012-02-02 11:28:00 -08:00
Brian Behlendorf	4b2220f0b9	Add --enable-debug-log configure option Until now the notion of an internal debug logging infrastructure was conflated with enabling ASSERT()s. This patch clarifies things by cleanly breaking the two subsystem apart. The result of this is the following behavior. --enable-debug - Enable/disable code wrapped in ASSERT()s. --disable-debug ASSERT()s are used to check invariants and are never required for correct operation. They are disabled by default because they may impact performance. --enable-debug-log - Enable/disable the debug log infrastructure. --disable-debug-log This infrastructure allows the spl code and its consumer to log messages to an in-kernel log. The granularity of the logging can be controlled by a debug mask. By default the mask disables most debug messages resulting in a negligible performance impact. Because of this the debug log is enabled by default. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-02 11:27:54 -08:00
Ned Bass	3c6ed5410b	Taskq locking optimizations Testing has shown that tq->tq_lock can be highly contended when a large number of small work items are dispatched. The lock hold time is reduced by the following changes: 1) Use exclusive threads in the work_waitq When a single work item is dispatched we only need to wake a single thread to service it. The current implementation uses non-exclusive threads so all threads are woken when the dispatcher calls wake_up(). If a large number of threads are in the queue this overhead can become non-negligible. 2) Conditionally add/remove threads from work waitq Taskq threads need only add themselves to the work wait queue if there are no pending work items. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #32	2012-01-19 14:42:49 -08:00
Ned Bass	0bb43ca282	Revert "Taskq locking optimizations" This reverts commit `ec2b41049f`. A race condition was introduced by which a wake_up() call can be lost after the taskq thread determines there are no pending work items, leading to deadlock: 1. taskq thread enables interrupts 2. dispatcher thread runs, queues work item, calls wake_up() 3. taskq thread runs, adds self to waitq, sleeps This could easily happen if an interrupt for an IO completion was outstanding at the point where the taskq thread reenables interrupts, just before the call to add_wait_queue_exclusive(). The handler would run immediately within the race window. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #32	2012-01-19 14:42:39 -08:00
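The lost wake-up described in the revert above can be illustrated with a small user-space sketch. This is not the SPL code: it uses Python's `threading.Condition` as a stand-in for a kernel wait queue, and the `TinyTaskq` class and `run_demo` helper are hypothetical names invented for this example. The point it shows is the usual remedy for this class of race: the worker checks for pending work and goes to sleep under the same lock the dispatcher holds while queueing, so a wake-up issued between the check and the sleep cannot be missed.

```python
import threading
from collections import deque


class TinyTaskq:
    """Hypothetical, simplified task queue (illustration only)."""

    def __init__(self):
        self.cond = threading.Condition()
        self.pending = deque()
        self.done = []

    def dispatch(self, item):
        # Queue the item and wake one waiter while holding the lock.
        with self.cond:
            self.pending.append(item)
            self.cond.notify()

    def worker(self, count):
        for _ in range(count):
            with self.cond:
                # The predicate is re-checked under the lock, so work that
                # was queued before we started waiting is never overlooked.
                while not self.pending:
                    self.cond.wait()
                item = self.pending.popleft()
            self.done.append(item)


def run_demo():
    tq = TinyTaskq()
    tq.dispatch("a")  # dispatched before any worker is waiting
    t = threading.Thread(target=tq.worker, args=(2,))
    t.start()
    tq.dispatch("b")
    t.join(timeout=5)
    return tq.done
```

In the racy ordering from the commit message, the check for pending work and the act of joining the wait queue were separated by a window in which the dispatcher could run; here both happen inside one critical section.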
Ned Bass	ec2b41049f	Taskq locking optimizations Testing has shown that tq->tq_lock can be highly contended when a large number of small work items are dispatched. The lock hold time is reduced by the following changes: 1) Use exclusive threads in the work_waitq When a single work item is dispatched we only need to wake a single thread to service it. The current implementation uses non-exclusive threads so all threads are woken when the dispatcher calls wake_up(). If a large number of threads are in the queue this overhead can become non-negligible. 2) Conditionally add/remove threads from work waitq outside of tq_lock Taskq threads need only add themselves to the work wait queue if there are no pending work items. Furthermore, the add and remove function calls can be made outside of the taskq lock since the wait queues are protected from concurrent access by their own spinlocks. 3) Call wake_up() outside of tq->tq_lock Again, the wait queues are protected by their own spinlock, so the dispatcher functions can drop tq->tq_lock before calling wake_up(). A new splat test taskq:contention was added in a prior commit to measure the impact of these changes. The following table summarizes the results using data from the kernel lock profiler. tq_lock time %diff Wall clock (s) %diff original: 39117614.10 0 41.72 0 exclusive threads: 31871483.61 18.5 34.2 18.0 unlocked add/rm waitq: 13794303.90 64.7 16.17 61.2 unlocked wake_up(): 1589172.08 95.9 16.61 60.2 Each row reflects the average result over 5 test runs. /proc/lock_stats was zeroed out before and collected after each run. Column 1 is the cumulative hold time in microseconds for tq->tq_lock. The tests are cumulative; each row reflects the code changes of the previous rows. %diff is calculated with respect to "original" as 100*(orig-new)/orig. Although calling wake_up() outside of the taskq lock dramatically reduced the taskq lock hold time, the test actually took slightly more wall clock time. 
This is because the point of contention shifts from the taskq lock to the wait queue lock. But the change still seems worthwhile since it removes our taskq implementation as a bottleneck, assuming the small increase in wall clock time to be statistical noise. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #32	2012-01-18 10:36:57 -08:00
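The %diff columns in the table above follow directly from the stated formula, 100*(orig-new)/orig. A short sketch that recomputes them from the figures quoted in the commit message (the function and variable names are illustrative, not taken from the SPL sources):

```python
def pct_diff(orig, new):
    """Percentage reduction of `new` relative to `orig`."""
    return 100.0 * (orig - new) / orig


# (tq_lock hold time in microseconds, wall clock in seconds),
# copied from the table in the commit message.
results = {
    "original": (39117614.10, 41.72),
    "exclusive threads": (31871483.61, 34.2),
    "unlocked add/rm waitq": (13794303.90, 16.17),
    "unlocked wake_up()": (1589172.08, 16.61),
}

base_lock, base_wall = results["original"]
for name, (lock_us, wall_s) in results.items():
    lock_diff = pct_diff(base_lock, lock_us)
    wall_diff = pct_diff(base_wall, wall_s)
    print(f"{name:24s} {lock_diff:5.1f} {wall_diff:5.1f}")
```

Rounded to one decimal place, the recomputed values agree with the published columns, including the slight wall clock regression between the last two rows (61.2 versus 60.2).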
Ned Bass	cf5d23fa1e	Add taskq contention splat test Add a test designed to generate contention on the taskq spinlock by using a large number of threads (100) to perform a large number (131072) of trivial work items from a single queue. This simulates conditions that may occur with the zio free taskq when a 1TB file is removed from a ZFS filesystem, for example. This test should always pass. Its purpose is to provide a benchmark to easily measure the effectiveness of taskq optimizations using statistics from the kernel lock profiler. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #32	2012-01-18 10:36:51 -08:00
Darik Horn	966e5200d3	Fix `make distclean` for `--with-config=user` Apply the same fix to SPL that was applied to ZFS earlier at: zfsonlinux/zfs@d433c20651 Additionally quote @LINUX_SYMBOLS@ because it is a null substitution in this configuration, which results in a `[ -f ]` expression that incorrectly evaluates to true. # ./configure --with-config=user # make distclean Making distclean in module make[1]: Entering directory `/spl/module' make -C SUBDIRS=`pwd` clean make: Entering an unknown directory make: *** SUBDIRS=/spl/module: No such file or directory. Stop. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-01-17 10:06:00 -08:00
Brian Behlendorf	5f6c14b1ed	Proxmox VE kernel compat, invalidate_inodes() The Proxmox VE kernel contains a patch which renames the function invalidate_inodes() to invalidate_inodes_check(). In the process it adds a 'check' argument and a '#define invalidate_inodes(x)' compatibility wrapper for legacy callers. Therefore, if either of these functions is exported invalidate_inodes() can be safely used. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #58	2011-12-21 14:29:45 -08:00
Prakash Surya	8f2503e0af	Store copy of tqent_flags prior to servicing task A preallocated taskq_ent_t's tqent_flags must be checked prior to servicing the taskq_ent_t. Once a preallocated taskq entry is serviced, the ownership of the entry is handed back to the caller of taskq_dispatch, thus the entry's contents can potentially be mangled. In particular, this is a problem in the case where a preallocated taskq entry is serviced, and the caller clears its tqent_flags field. Thus, when the function returns and task_done is called, it looks as though the entry is not a preallocated task (when in fact it is a preallocated task). In this situation, task_done will place the preallocated taskq_ent_t structure onto the taskq_t's free list. This is a huge mistake. If the taskq_ent_t is then freed by the caller of taskq_dispatch, the taskq_t's free list will hold a pointer to garbage data. Even worse, if nothing has overwritten the freed memory before the pointer is dereferenced, it may still look as though it points to a valid list_head belonging to a taskq_ent_t structure. Thus, the task entry's flags are now copied prior to servicing the task. This copy is then checked to see if it is a preallocated task, and determine if the entry needs to be passed down to the task_done function. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #71	2011-12-16 16:54:00 -08:00
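The fix described above amounts to taking a snapshot of the entry's flags before the task function runs, and consulting only that snapshot afterwards. A minimal user-space sketch of the idea follows; it is not the SPL implementation, and the `TaskqEnt` class, the `TQENT_FLAG_PREALLOC` value, and the `service` and `clobbering_task` helpers are all hypothetical names chosen for illustration.

```python
TQENT_FLAG_PREALLOC = 0x1  # hypothetical flag value, for illustration only


class TaskqEnt:
    """Hypothetical stand-in for a task queue entry."""

    def __init__(self, func, flags=0):
        self.func = func
        self.flags = flags


def service(ent, free_list):
    # Copy the flags BEFORE running the task: once the task has run, a
    # preallocated entry belongs to its owner again and may be modified.
    flags = ent.flags
    ent.func(ent)
    # Decide from the snapshot, never from ent.flags, whether the entry
    # should be recycled onto the queue's own free list.
    if not (flags & TQENT_FLAG_PREALLOC):
        free_list.append(ent)


def clobbering_task(ent):
    # Simulates an owner that clears the flags while the task is serviced.
    ent.flags = 0
```

Had `service` re-read `ent.flags` after the call, the clobbered preallocated entry would have been mistaken for a dynamically allocated one and pushed onto the free list, which is the failure the commit message describes.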
Prakash Surya	e7e5f78e7b	Swap taskq_ent_t with taskqid_t in taskq_thread_t The taskq_t's active thread list is sorted based on its tqt_ent->tqent_id field. The list is kept sorted solely by inserting new taskq_thread_t's in their correct sorted location; no other means is used. This means that once inserted, if a taskq_thread_t's tqt_ent->tqent_id field changes, the list runs the risk of no longer being sorted. Prior to the introduction of the taskq_dispatch_prealloc() interface, this was not a problem as a taskq_ent_t actively being serviced under the old interface should always have a static tqent_id field. Thus, once the taskq_thread_t is added to the taskq_t's active thread list, the taskq_thread_t's tqt_ent->tqent_id field would remain constant. Now, this is no longer the case. Currently, if using the taskq_dispatch_prealloc() interface, any given taskq_ent_t actively being serviced _may_ have its tqent_id value incremented. This happens when the preallocated taskq_ent_t structure is recursively dispatched. Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id field silently modified from under its feet. If this were to happen to a taskq_thread_t on a taskq_t's active thread list, this would compromise the integrity of the order of the list (as the list _may_ no longer be sorted). To get around this, the taskq_thread_t's taskq_ent_t pointer was replaced with its own static copy of the tqent_id. So, as a taskq_ent_t is pulled off of the taskq_t's pending list, a static copy of its tqent_id is made and this copy is used to sort the active thread list. Using a static copy is key in ensuring the integrity of the order of the active thread list. Even if the underlying taskq_ent_t is recursively dispatched (and has its tqent_id modified), this static copy stored inside the taskq_thread_t will remain constant. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #71	2011-12-16 13:26:54 -08:00
Prakash Surya	699d5ee8a9	Exercise new taskq interface in splat-taskq tests The splat-taskq test functions were slightly modified to exercise the new taskq interface in addition to the old interface. If the old interface passes each of its tests, the new interface is exercised. Both sub tests (old interface and new interface) must pass for each test as a whole to pass. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #65	2011-12-13 16:10:57 -08:00
Prakash Surya	44217f7aad	Implement taskq_dispatch_prealloc() interface This patch implements the taskq_dispatch_prealloc() interface which was introduced by the following illumos-gate commit. It allows for a preallocated taskq_ent_t to be used when dispatching items to a taskq. This eliminates a memory allocation which helps minimize lock contention in the taskq when dispatching functions. commit 5aeb94743e3be0c51e86f73096334611ae3a058e Author: Garrett D'Amore <garrett@nexenta.com> Date: Wed Jul 27 07:13:44 2011 -0700 734 taskq_dispatch_prealloc() desired 943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:57 -08:00
Prakash Surya	ac1e5b6033	Add Test: "Single task queue, recursive dispatch" Added another splat taskq test to ensure tasks can be recursively submitted to a single task queue without issue. When the taskq_dispatch_prealloc() interface is introduced, this use case can potentially cause a deadlock if a taskq_ent_t is dispatched while its tqent_list field is not empty. This _should_ never be a problem with the existing taskq_dispatch() interface. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:57 -08:00
Prakash Surya	2c02b71b14	Replace tq_work_list and tq_threads in taskq_t To lay the ground work for introducing the taskq_dispatch_prealloc() interface, the tq_work_list and tq_threads fields had to be replaced with new alternatives in the taskq_t structure. The tq_threads field was replaced with tq_thread_list. Rather than storing the pointers to the taskq's kernel threads in an array, they are now stored as a list. In addition to laying the ground work for the taskq_dispatch_prealloc() interface, this change could also enable taskq threads to be dynamically created and destroyed as threads can now be added and removed to this list relatively easily. The tq_work_list field was replaced with tq_active_list. Instead of keeping a list of taskq_ent_t's which are currently being serviced, a list of taskq_threads currently servicing a taskq_ent_t is kept. This frees up the taskq_ent_t's tqent_list field when it is being serviced (i.e. now when a taskq_ent_t is being serviced, its tqent_list field will be empty). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:50 -08:00
Prakash Surya	046a70c93b	Replace struct spl_task with struct taskq_ent The spl_task structure was renamed to taskq_ent, and all of its fields were renamed to have a prefix of 'tqent' rather than 't'. This was to align with the naming convention which the ZFS code assumes. Previously these fields were private so the name never mattered. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 12:28:09 -08:00
Prakash Surya	ed948fa72b	Add SPLAT_TEST_FINI call for SPLAT_TASKQ_TEST6_ID This change adds the neglected SPLAT_TEST_FINI call for the SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_* tests. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #64	2011-12-13 12:26:16 -08:00
Prakash Surya	e05bec805b	Fix a typo referencing an incorrect symbol The splat_taskq_test4_common function was incorrectly referencing the splat_taskq-test13_func symbol, when it meant to be using the splat_taskq_test4_func symbol. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #61	2011-11-21 16:52:36 -08:00
Brian Behlendorf	1114ae6ae7	Prepend spl_ to all init/fini functions This is a bit of cleanup I'd been meaning to get to for a while to reduce the chance of a type conflict. Well that conflict finally occurred with the kstat_init() function which conflicts with a function in the 2.6.32-6-pve kernel. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #56	2011-11-11 09:18:28 -08:00
Brian Behlendorf	fe71c0e567	Linux 3.1 compat, shrink_*cache_memory As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory functions have been removed. This same task is now accomplished more cleanly with per super block shrinkers. This unfortunately leaves us no easy way to support the dnlc_reduce_cache() function. This support has always been entirely optional. So when no reasonable interface is available allow the dnlc_reduce_cache() function to effectively become a no-op. The downside of this change is that it will prevent the zfs arc meta data limits from being enforced. However, the current zfs implementation in this regard is already flawed and needs to be reworked. If the arc needs to enforce a meta data limit it will need to be extended to coordinate directly with the zpl. This will allow us to drop all this compatibility code and get more fine grained control over the cache management. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #52	2011-11-09 19:36:30 -08:00
Brian Behlendorf	12ff95ff57	Linux 3.1 compat, kern_path_parent() Prior to Linux 3.1 the kern_path_parent symbol was exported for use by kernel modules. As of Linux 3.1 it is no longer easily available. To handle this case the spl will now dynamically look up the address of the missing symbol at module load time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #52	2011-11-09 16:51:25 -08:00
Brian Behlendorf	b8b6e4c453	Fix NULL deref in balance_pgdat() Be careful not to unconditionally clear the PF_MEMALLOC bit in the task structure. It may have already been set when entering kv_alloc() in which case it must remain set on exit. In particular the kswapd thread will have PF_MEMALLOC set in order to prevent it from entering direct reclaim. By clearing it we allow the following NULL deref to potentially occur. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #287	2011-11-03 09:50:22 -07:00
Gunnar Beutner	f3989ed322	vn_rdwr() didn't properly advance the file position This would cause problems when using 'zfs send' with a file as the target (rather than a pipe or a socket as is usually the case) as for each write the destination offset in the file would be 0. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #391	2011-10-18 16:51:35 -07:00
Brian Behlendorf	ecc3981007	Fix various typos in comments Just clean up some of the typos and spelling mistakes in the comments of spl-kmem.c. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:32:49 -07:00
Gunnar Beutner	8d177c181f	Fixed typo in spl_slab_alloc() The typo did not have any effect (apart from a negligible performance impact) because skc->skc_flags * KMC_OFFSLAB is always non-null when at least one bit in skc->skc_flags is set. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:03:43 -07:00
Gunnar Beutner	64c075c3f4	Properly destroy work items in spl_kmem_cache_destroy() In a non-debug build the ASSERT() would be optimized away which could cause pending work items to not be cancelled. We must also use cancel_delayed_work_sync() rather than just cancel_delayed_work() to actually wait until work items have completed. Otherwise they might accidentally access free'd memory. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS bugs #279, #62, #363, #418	2011-10-11 09:59:19 -07:00
Gunnar Beutner	763b2f3b57	Fixed invalid resource re-use in file_find() File descriptors are a per-process resource. The same descriptor in different processes can refer to different files. file_find() incorrectly assumed that file descriptors are globally unique. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #386	2011-10-11 09:51:51 -07:00
Brian Behlendorf	6b3b569df3	Remove /etc/hostid missing warning No longer print the following warning to the console when the /etc/hostid file is missing. This is the expected default behavior. Keeping the hostid in sync with the initramfs is now accomplished by creating the /etc/hostid in the initramfs not on the system. SPL: The /etc/hostid file is not found. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-06 14:58:09 -07:00
Brian Behlendorf	e80cd06b8e	Fix 'make install' overly broad 'rm' When running 'make install' without DESTDIR set the module install rules would mistakenly destroy the 'modules.*' files for ALL of your installed kernels. This could lead to a non-functional system for the alternate kernels because 'depmod -a' will only be run for the kernel which was compiled against. This issue would not impact anyone using the 'make <deb|rpm|pkg>' build targets to build and install packages. The fix for this issue is to only remove extraneous build products when DESTDIR is set. This almost exclusively indicates we are building packages and installing the build products into a temporary staging location. Additionally, limit the removal of the unneeded build products to the target kernel version. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #328	2011-07-20 09:37:41 -07:00
Darik Horn	0d54dcb566	Read the /etc/hostid file directly. Deprecate the /usr/bin/hostid call by reading the /etc/hostid file directly. Add the spl_hostid_path parameter to override the default /etc/hostid path. Rename the set_hostid() function to hostid_exec() to better reflect actual behavior and complement the new hostid_read() function. Use HW_INVALID_HOSTID as the spl_hostid sentinel value because zero seems to be a valid gethostid() result on Linux.	2011-06-24 09:58:03 -07:00
Brian Behlendorf	bf0c60c060	Add linux compatibility tests While the splat tests were originally designed to stress test the Solaris primitives, I am extending them to include some kernel compatibility tests. Certain linux APIs have changed frequently. These tests ensure that added compatibility is working properly and no unnoticed regressions have slipped in. Test 1 and 2 add basic regression tests for shrink_icache_memory and shrink_dcache_memory. These are simply functional tests to ensure we can call these functions safely. Checking for correct behavior is more difficult since other running processes will influence the behavior. However, these functions are provided by the kernel so if we can successfully call them we assume they are working correctly. Test 3 checks that shrinker functions are being registered and called correctly. As of Linux 3.0 the shrinker API has changed four different times so I felt the need to add a trivial test case to ensure each variant works as expected.	2011-06-21 14:02:46 -07:00
Brian Behlendorf	a55bcaad18	Linux 3.0: Shrinker compatibility Update the wrapper macros for the memory shrinker to handle this 4th API change. The callback function now takes a shrink_control structure. This is certainly a step in the right direction but it's annoying to have to accommodate yet another version of the API.	2011-06-21 14:02:39 -07:00
Brian Behlendorf	372c257233	Add TASKQ_NORECLAIM flag It has become necessary to be able to optionally disable direct memory reclaim for certain taskqs. To support this the TASKQ_NORECLAIM flag has been added which sets the PF_MEMALLOC bit for all threads in the taskq.	2011-05-06 15:23:58 -07:00
Darik Horn	c95b308d12	Correct typos in the spl proc handler. Correct a format typo that causes /proc/sys/kernel/spl/hostid to return a decimal number instead of a hexadecimal number.	2011-04-24 20:56:07 -05:00
Darik Horn	5b8f76ea16	Make the SPL kernel messages consistent with ZFS. Change the SPL kernel messages for module loading and module unloading so that they are similar to the ZFS kernel messages. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-04-21 09:41:13 -07:00
Darik Horn	ad35b6a6e9	Remove the gawk dependency. This reverts commit `1814251453`. Demote the gawk call back to awk and ensure that stderr is attached. GNU gawk tolerates a missing stderr handle, but many utilities do not, which could be why a regular awk call was inexplicably failing on some systems. Use argv[0] instead of sh_path for consistency internally and with other Linux drivers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-04-21 09:41:09 -07:00
Darik Horn	fa6f7d8f9d	Import spl_hostid as a module parameter. Provide a call_usermodehelper() alternative by letting the hostid be passed as a module parameter like this: $ modprobe spl spl_hostid=0x12345678 Internally change the spl_hostid variable to unsigned long because that is the type that the coreutils /usr/bin/hostid returns. Move the hostid command into GET_HOSTID_CMD for consistency with the similar GET_KALLSYMS_ADDR_CMD invocation. Use argv[0] instead of sh_path for consistency internally and with other Linux drivers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-04-21 09:41:01 -07:00
Brian Behlendorf	3dfc591ac4	Linux 2.6.39 compat, zlib_deflate_workspacesize() The function zlib_deflate_workspacesize() now takes 2 arguments. This was done to avoid always having to allocate the maximum size workspace (268K). The caller can now specify the windowBits and memLevel compression parameters to get a smaller workspace. For our purposes we introduce a spl_zlib_deflate_workspacesize() wrapper which accepts both arguments. When the two argument version of zlib_deflate_workspacesize() is available the arguments are passed through. When it's not we assume the worst case and a maximally sized workspace is used.	2011-04-20 14:39:15 -07:00
Brian Behlendorf	b1cbc4610c	Linux 2.6.39 compat, kern_path_parent() The path_lookup() function has been renamed to kern_path_parent() and the flags argument has been removed. The only behavior now offered is that of LOOKUP_PARENT. The spl already always passed this flag so dropping the flag does not impact us.	2011-04-20 12:30:17 -07:00
Brian Behlendorf	83c623aa1a	Linux 2.6.39 compat, DEFINE_SPINLOCK() This is a long over due compatibility change. Way, way, way back in 2007 there was a push to remove all consumers of SPIN_LOCK_UNLOCKED. Finally, in 2011 with 2.6.39 all the consumers have been updated and SPIN_LOCK_UNLOCKED was removed. It's about time we use the new API as well, this change does exactly that. DEFINE_SPINLOCK() was available as far back as 2.6.12 so there doesn't need to be any additional autoconf-foo for this change.	2011-04-20 12:01:11 -07:00
Brian Behlendorf	98e2afd1c5	Fix unused variable Flagged by the default -Wunused-but-set-variable gcc option when running under Fedora 15. Since it's correct this variable is entirely unused this commit removes it.	2011-04-19 09:45:36 -07:00
Brian Behlendorf	9b0f9079d2	Linux 2.6.39 compat, invalidate_inodes() To resolve a potential filesystem corruption issue a second argument was added to invalidate_inodes(). This argument controls whether dirty inodes are dropped or treated as busy when invalidating a super block. When only the legacy API is available the second argument will be dropped for compatibility.	2011-04-19 09:08:08 -07:00
Brian Behlendorf	e76f4bf11d	Add dnlc_reduce_cache() support Provide the dnlc_reduce_cache() function which attempts to prune cached entries from the dcache and icache. After the entries are pruned any slabs which they may have been using are reaped. Note the API takes a reclaim percentage but we don't have easy access to the total number of cache entries to calculate the reclaim count. However, in practice this doesn't need to be exactly correct. We simply need to reclaim some useful fraction (but not all) of the cache. The caller can determine if more needs to be done.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	3336e29cc2	Add slab usage summaries to /proc One of the most common things you want to know when looking at the slab is how much memory is being used. This information was available in /proc/spl/kmem/slab but only on a per-slab basis. This commit adds the following /proc/sys/kernel/spl/kmem/slab* entries to make total slab usage easily available at a glance. slab_kmem_total - Total kmem slab size slab_kmem_avail - Alloc'd kmem slab size slab_kmem_max - Max observed kmem slab size slab_vmem_total - Total vmem slab size slab_vmem_avail - Alloc'd vmem slab size slab_vmem_max - Max observed vmem slab size NOTE: The slab_*_max values are expected to over report because they show maximum values since boot, not current values.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	d0a1038ff3	Update /proc/spl/kmem/slab output The 'slab_fail', 'slab_create', and 'slab_destroy' columns in the slab output have been removed because they are virtually always zero and not very useful. The much more useful 'size' and 'alloc' columns have been added which show the total slab size and how much of the total size has been allocated to objects. Finally, the formatting has been updated to be much more human readable while still being friendly for tools like awk to parse.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	495bd532ab	Linux shrinker compat The Linux shrinker has gone through three API changes since 2.6.22. Rather than force every caller to understand all three APIs this change consolidates the compatibility code into the mm-compat.h header. The caller can then use a single spl provided shrinker API which does the right thing for your kernel. SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask); SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks); spl_register_shrinker(&shrinker_struct); spl_unregister_shrinker(&shrinker_struct); spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);	2011-04-06 20:06:03 -07:00
Brian Behlendorf	734fcac78d	Add crgetfsuid()/crgetfsgid() helpers Solaris credentials don't have an fsuid/fsgid field but Linux credentials do. To handle this case the Solaris API is being modestly extended to include the crgetfsuid()/crgetfsgid() helper functions. Additionally, because the crget*() helpers are implemented identically regardless of HAVE_CRED_STRUCT they have been moved outside the #ifdef to common code. This simplification means we only have one version of the helper to keep up to date.	2011-03-22 12:18:44 -07:00
Brian Behlendorf	2092cf68d8	Disable vmalloc() direct reclaim As part of vmalloc() a __pte_alloc_kernel() allocation may occur. This internal allocation does not honor the gfp flags passed to vmalloc(). This means even when vmalloc(GFP_NOFS) is called it is possible that a synchronous reclaim will occur. This reclaim can trigger file IO which can result in a deadlock. This issue can be avoided by explicitly setting PF_MEMALLOC on the process to subvert synchronous reclaim when vmalloc() is called with !__GFP_FS. An example stack of the deadlock can be found here (1), along with the upstream kernel bug (2), and the original bug discussion on the linux-mm mailing list (3). This code can be properly autoconf'ed when the upstream bug is fixed. 1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133 2) http://bugzilla.kernel.org/show_bug.cgi?id=30702 3) http://marc.info/?l=linux-mm&m=128942194520631&w=4	2011-03-20 15:12:08 -07:00
Brian Behlendorf	47995fa691	Remove xvattr support The xvattr support in the spl has always simply consisted of defining a couple structures and a few #defines. This was enough to enable compilation of code which just passed xvattr types around but not enough to effectively manipulate them. This change removes even this minimal support leaving it up to packages which leverage the spl to provide the full xvattr support. By removing it from the spl we ensure no conflict with the higher level packages. This just leaves minimal vnode support for basic manipulation of files. This code does have the proper support functions in the spl and a set of regression tests. Additionally, this change removes the unused 'caller_context_t *' type and replaces it with a 'void *'.	2011-03-02 11:34:46 -08:00
Brian Behlendorf	19c1eb829d	Add zlib regression test A zlib regression test has been added to verify the correct behavior of z_compress_level() and z_uncompress(). The test case simply takes a 128k buffer, it compresses the buffer, it then uncompresses the buffer, and finally it compares the buffers after the transform. If the buffers match then everything is fine and no data was lost. It performs this test for all 9 zlib compression levels.	2011-02-25 16:56:46 -08:00
Brian Behlendorf	5c1967ebe2	Fix zlib compression While portions of the code needed to support z_compress_level() and z_uncompress() were in place, in reality the current implementation was non-functional, it just was compilable. The critical missing component was to setup a workspace for the compress/uncompress stream structures to use. A kmem_cache was added for the workspace area because we require a large chunk of memory. This avoids the need to continually alloc/free this memory and vmap() the pages which is very slow. Several objects will reside in the per-cpu kmem_cache making them quick to acquire and release. A further optimization would be to adjust the implementation to additionally ensure the memory is local to the cpu. Currently that may not be the case.	2011-02-25 16:56:22 -08:00
Brian Behlendorf	914b063133	Linux compat 2.6.37, invalidate_inodes() In the 2.6.37 kernel the function invalidate_inodes() is no longer exported for use by modules. This memory management functionality is needed to invalidate the inodes attached to a super block without unmounting the filesystem. Because this function still exists in the kernel and the prototype is available in a common header all we strictly need is the symbol address. The address is obtained using spl_kallsyms_lookup_name() and assigned to the variable invalidate_inodes_fn. Then a #define is used to replace all instances of invalidate_inodes() with a call to the acquired address. All the complexity is hidden behind HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual. Long term we should try to get this, or another, interface made available to modules again.	2011-02-23 12:44:32 -08:00
Brian Behlendorf	d599e4fa79	Block in cv_destroy() on all waiters Previously we would ASSERT in cv_destroy() if it was ever called with active waiters. However, I've now seen several instances in OpenSolaris code where they do the following: cv_broadcast(); cv_destroy(); This leaves no time for active waiters to be woken up and scheduled and we trip the ASSERT. This has not been observed to be an issue on OpenSolaris because their cv_destroy() basically does nothing. They still do run the risk of the memory being free'd after the cv_destroy() and hitting a bad paging request. But in practice this race is so small and unlikely it either doesn't happen, or is so unlikely when it does happen the root cause has not yet been identified. Rather than risk the same issue in our code this change updates cv_destroy() to block until all waiters have been woken and scheduled. This may take some time because each waiter must acquire the mutex. This change may have an impact on performance for frequently created and destroyed condition variables. That however is a price worth paying to avoid crashing your system. If performance issues are observed they can be addressed by the caller.	2011-02-04 14:09:08 -08:00
Brian Behlendorf	647fa73cf3	Remove VN_HOLD/VN_RELE/VOP_PUTPAGE Previously these were defined to noops but rather than give the misleading impression that these are actually implemented I'm removing the type entirely for clarity.	2011-01-12 11:38:05 -08:00
Brian Behlendorf	a5b40eed17	Make vn_cache|vn_file_cache kmem caches Both of these caches were previously allowed to be either a vmem or kmem cache based on the size of the object involved. Since we know the objects won't be too large and performance is much better for a kmem cache, force them to be kmem backed.	2011-01-12 11:38:05 -08:00
Brian Behlendorf	dcd9cb5a17	Clean vattr_t and vsecattr_t types Minor cleanup for the vattr_t and vsecattr_t types.	2011-01-12 11:38:04 -08:00
Brian Behlendorf	4295b530ee	Add vn_mode_to_vtype/vn_vtype_to_mode helpers Add simple helpers to convert a vnode->v_type to an inode->i_mode. These should be used sparingly but they are handy to have.	2011-01-12 11:38:04 -08:00
Neependra Khare	3f688a8c38	Add cv_timedwait_interruptible() function The cv_timedwait() function by definition must wait unconditionally for cv_signal()/cv_broadcast() before waking. This causes processes to go in the D state which increases the load average. The load average is the summation of processes in D state and run queue. To avoid this it can be desirable to sleep interruptibly. These processes do not count against the load average but may be woken by a signal. It is up to the caller to determine why the process was woken; it may be for one of three reasons. 1) cv_signal()/cv_broadcast() 2) the timeout expired 3) a signal was received Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-01-11 12:14:48 -08:00
Brian Behlendorf	6bf4d76f47	Linux Compat: inode->i_mutex/i_sem Create spl_inode_lock/spl_inode_unlock compatibility macros to simplify access to the inode mutex/sem. This avoids the need to have to ugly up the code with the required #define's at every call site. At the moment the SPL only uses this in one place but higher layers can benefit from the macro.	2011-01-11 12:14:48 -08:00
Brian Behlendorf	b7dc313837	Add Thread Specific Data (TSD) Regression Test To validate the correct behavior of the TSD interfaces it's important that we add a regression test. This test is designed to minimally exercise the fundamental TSD behavior, it does not attempt to validate all potential corner cases. The test will first create 32 keys via tsd_create() and register a common destructor. Next 16 wait threads will be created each of which set/verify a random value for all 32 keys, then block waiting to be released by the control thread. Meanwhile the control thread verifies that none of the destructors have been run prematurely. The next phase of the test is to create 16 exit threads which set/verify a random value for all 32 keys. They then immediately exit. This is designed to verify tsd_exit() which will be called via thread_exit(). This must result in all registered destructors being run and the memory for the tsd being free'd. After this tsd_destroy() is verified by destroying all 32 keys. Once again we must see the expected number of destructors run and the tsd memory free'd. At this point the blocked threads are released and they exit calling tsd_exit() which should do very little since all the tsd has already been destroyed. If this all goes off without a hitch the test passes. To ensure no memory has been leaked, I have manually verified that after spl module unload no memory is reported leaked.	2010-12-07 10:02:44 -08:00
Brian Behlendorf	9fe45dc1ac	Add Thread Specific Data (TSD) Implementation Thread specific data has been implemented using a hash table, this avoids the need to add a member to the task structure and allows maximum portability between kernels. This implementation has been optimized to keep the tsd_set() and tsd_get() times as small as possible. The majority of the entries in the hash table are for specific tsd entries. These entries are hashed by the product of their key and pid because by design the key and pid are guaranteed to be unique. Their product also has the desirable property that it will be uniformly distributed over the hash bins providing neither the pid nor key is zero. Under linux the zero pid is always the init process and thus won't be used, and this implementation is careful to never assign a zero key. By default the hash table is sized to 512 bins which is expected to be sufficient for light to moderate usage of thread specific data. The hash table contains two additional types of entries. The first type of entry is called a 'key' entry and it is added to the hash during tsd_create(). It is used to store the address of the destructor function and it is used as an anchor point. All tsd entries which use the same key will be linked to this entry. This is used during tsd_destroy() to quickly call the destructor function for all tsd associated with the key. The 'key' entry may be looked up with tsd_hash_search() by passing the key you wish to lookup and DTOR_PID constant as the pid. The second type of entry is called a 'pid' entry and it is added to the hash the first time a process sets a key. The 'pid' entry is also used as an anchor and all tsd for the process will be linked to it. This list is used during tsd_exit() to ensure all registered destructors are run for the process. The 'pid' entry may be looked up with tsd_hash_search() by passing the PID_KEY constant as the key, and the process pid. 
Note that tsd_exit() is called by thread_exit() so if you're using the Solaris thread API you should not need to call tsd_exit() directly.	2010-12-07 10:02:32 -08:00
Brian Behlendorf	058de03caa	Clear cv->cv_mutex when not in use For debugging purposes the condition variables keep track of the mutex used during a wait. The idea is to validate that all callers always use the same mutex. Unfortunately, we have seen cases where the caller reuses the condition variable with a different mutex but in a way which is known to be safe. My reading of the man pages suggests you should not do this and always cv_destroy()/cv_init() a new mutex. However, there is overhead in doing this and it does appear to be allowed under Solaris. To accommodate this behavior cv_wait_common() and __cv_timedwait() have been modified to clear the associated mutex when the last waiter is dropped. This ensures that while the condition variable is in use the incorrect mutex case is detected. It also allows the condition variable to be safely recycled without requiring the overhead of a cv_destroy()/cv_init() as long as it isn't currently in use. Finally, spin lock cv->cv_lock was removed because it is not required. When the condition variable is used properly the caller will always be holding the mutex so the spin lock is redundant. The lock was originally added because I expected to need to protect more than just the cv->cv_mutex. It turns out that was not the case. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 11:02:34 -08:00
Brian Behlendorf	8655ce492f	Linux 2.6.36 compat, use fops->unlocked_ioctl() As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has been removed and thus fops->ioctl() has also been removed. The replacement hook is fops->unlocked_ioctl() which has existed in kernel since 2.6.12. Since the SPL only contains support back to 2.6.18 vintage kernels, I'm not adding an autoconf check for this and simply moving everything to use fops->unlocked_ioctl().	2010-11-10 13:16:12 -08:00
Brian Behlendorf	9b2048c26b	Linux 2.6.36 compat, fs_struct->lock type change In the linux-2.6.36 kernel the fs_struct lock was changed from a rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd() symbol by default this would not have caused us any issues, but they don't. So we're forced to add a new autoconf check which sets the HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can then correctly use either spin_lock or write_lock in our custom set_fs_pwd() implementation.	2010-11-09 13:29:47 -08:00
Brian Behlendorf	1e18307b61	Fix incorrect krw_type_t type Flagged by the default compile options on archlinux 2010.05, we should be using the krw_t type not the krw_type_t type in the private data. module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’: module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in enumerated type ‘krw_type_t’	2010-11-09 10:18:01 -08:00
Brian Behlendorf	23aa63cbf5	Fix 2.6.35 shrinker callback API change As of linux-2.6.35 the shrinker callback API now takes an additional argument. The shrinker struct is passed to the callback so that users can embed the shrinker structure in private data and use container_of() to access it. This removes the need to always use global state for the shrinker. To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf check to properly detect the API. Then we simply setup a callback function with the correct number of arguments. For now we do not make use of the new 3rd argument.	2010-10-22 14:51:26 -07:00
Brian Behlendorf	a7958f7eef	Support custom build directories One of the neat tricks an autoconf style project is capable of is to allow configuration/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to be verified on each of those supported distributions preferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz tar -xzf spl-x.y.z.tar.gz cd spl-x.y.z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This is something the project has almost supported for a long time but finishing this support should save me lots of time.	2010-09-05 21:49:05 -07:00
Brian Behlendorf	2b3543025c	Stub out kmem cache defrag API At some point we are going to need to implement the kmem cache move callbacks to allow for kmem cache defragmentation. This commit simply lays a small part of the API ground work, it does not actually implement any of this feature. This is safe for now because the move callbacks are just an optimization. Even if they are registered we don't ever really have to call them.	2010-08-27 14:23:42 -07:00
Li Wei	4be55565fe	Fix stack overflow in vn_rdwr() due to memory reclaim Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp mask we may enter memory reclaim during IO. In this case shrink_slab() entered another file system which is notoriously hungry for stack. This additional stack usage may cause a stack overflow. This patch removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The original mask is then restored at vn_close() time. Hats off to the loop driver which does something similar for the same reason. [...] shrink_slab+0xdc/0x153 try_to_free_pages+0x1da/0x2d7 __alloc_pages+0x1d7/0x2da do_generic_mapping_read+0x2c9/0x36f file_read_actor+0x0/0x145 __generic_file_aio_read+0x14f/0x19b generic_file_aio_read+0x34/0x39 do_sync_read+0xc7/0x104 vfs_read+0xcb/0x171 :spl:vn_rdwr+0x2b8/0x402 :zfs:vdev_file_io_start+0xad/0xe1 [...] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-12 09:34:33 -07:00
Ricardo M. Correia	26f7245c7c	Fix taskq code to not drop tasks when TQ_SLEEP is used. When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the number of pending tasks is above tq->tq_maxalloc. This semantic is similar to KM_SLEEP in kmem allocations, which also always succeed. However, we cannot block forever otherwise there is a risk of deadlock. Therefore, we still allow the number of pending tasks to go above tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task dispatch, thereby throttling the task dispatch rate. One of the existing splat tests was also augmented to test for this scenario. The test would fail with the previous implementation but now it succeeds. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-02 11:20:31 -07:00
Brian Behlendorf	41f84a8d56	Strfree() should call kfree() not kmem_free() Using kmem_free() results in deducting X bytes from the memory accounting when --enable-debug is set. Unfortunately, currently the counterpart kmem_asprintf() and friends do not properly account for memory allocated, so we must do the same on free. If we don't then we end up with a negative number of lost bytes reported when the module is unloaded. A better long term fix would be to add the accounting in to the allocation side but that's a project for another day.	2010-07-30 22:20:58 -07:00
Brian Behlendorf	099dc9c2d2	Add uninstall Makefile targets Extend the Makefiles with an uninstall target to cleanly remove a package which was installed with 'make install'. Additionally, ensure a 'depmod -a' is run as part of the install to update the module dependency information.	2010-07-28 14:55:32 -07:00
Brian Behlendorf	10129680f8	Ensure kmem_alloc() and vmem_alloc() never fail The Solaris semantics for kmem_alloc() and vmem_alloc() are that they must never fail when called with KM_SLEEP. They may only fail if called with KM_NOSLEEP otherwise they must block until memory is available. This is quite different from how the Linux memory allocators work, under Linux a memory allocation failure is always possible and must be dealt with. At one point in the past the kmem code did properly implement this behavior, however as the code evolved this behavior was overlooked in places. This patch goes through all three implementations of the kmem/vmem allocation functions and ensures that they will all block in the KM_SLEEP case when memory is not available. They may still fail in the KM_NOSLEEP case in which case the caller is responsible for handling the failure. Special care is taken in vmalloc_nofail() to avoid thrashing the system on the virtual address space spin lock. The down side of course is if you do see a failure here, which is unlikely for 64-bit systems, your allocation will delay for an entire second. Still this is preferable to locking up your system and it is the best we can do given the constraints. Additionally, the code was cleaned up to be much more readable and comments were added to describe the various kmem-debug-* configure options. The default configure options remain: "--enable-debug-kmem --disable-debug-kmem-tracking"	2010-07-26 15:47:55 -07:00
Brian Behlendorf	849c50e7f2	Fix two minor compiler warnings In cmd/splat.c there was a comparison between an __u32 and an int. To resolve the issue simply use a __u32 and strtoul() when converting the provided user string. In module/spl/spl-vnode.c we should explicitly cast nd->last.name to a const char * which is what is expected by the prototype.	2010-07-26 10:24:26 -07:00
Brian Behlendorf	8b0eb3f0dc	Remove deadcode caused by removal of format1 arg Commit `55abb0929e` removed the never used format1 argument of spl_debug_msg(). That in turn resulted in some deadcode which should be removed since it's now useless.	2010-07-21 16:31:42 -07:00
Ricardo M. Correia	81672c0122	Display DEBUG keyword during module load when --enable-debug is used. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 15:31:03 -07:00
Ricardo M. Correia	2c762de830	Fix buggy kmem_{v}asprintf() functions When the kvasprintf() call fails they should reset the arguments by calling va_start()/va_copy() and va_end() inside the loop, otherwise they'll try to read more arguments rather than starting over and reading them from the beginning. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 13:51:46 -07:00
Brian Behlendorf	b17edc10a9	Prefix all SPL debug macros with 'S' To avoid conflicts with symbols defined by dependent packages all debugging symbols have been prefixed with a 'S' for SPL. Any dependent package needing to integrate with the SPL debug should include the spl-debug.h header and use the 'S' prefixed macros. They must also build with DEBUG defined.	2010-07-20 13:30:40 -07:00
Brian Behlendorf	55abb0929e	Split <sys/debug.h> header To avoid symbol conflicts with dependent packages the debug header must be split in to several parts. The <sys/debug.h> header now only contains the Solaris macros such as ASSERT and VERIFY. The spl-debug.h header contains the spl specific debugging infrastructure and should be included by any package which needs to use the spl logging. Finally the spl-trace.h header contains internal data structures only used for the log facility and should not be included by anything but spl-debug.c. This way dependent packages can include the standard Solaris headers without picking up any SPL debug macros. However, if the dependent package wants to integrate with the SPL debugging subsystem they can then explicitly include spl-debug.h. Along with this change I have dropped the CHECK_STACK macros because the upstream Linux kernel now has much better stack depth checking built in and we don't need this complexity. Additionally SBUG has been replaced with PANIC and provided as part of the Solaris macro set. While the Solaris version is really panic() that conflicts with the Linux kernel so we'll just have to make do with PANIC. It should rarely be called directly, the preferred usage would be an ASSERT or VERIFY. There's lots of change here but this cleanup was overdue.	2010-07-20 13:29:35 -07:00
Ned Bass	8f813bb168	Proposed fix for oops on SIGINT in splat atomic:64-bit test. The threads in the splat atomic:64-bit test share the data structure atomic_priv_t ap, which lives on the kernel stack of the splat user-space utility. If splat terminates before the threads, accesses to that memory location by the other threads become invalid. Splat synchronizes with the threads with the call: wait_event_interruptible(ap.ap_waitq, splat_atomic_test1_cond(&ap, i)); Apparently, the SIGINT wakes and terminates splat prematurely, so that GPFs or other bad things happen when the threads subsequently access ap. This commit prevents this by using the uninterruptible form: wait_event(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));	2010-07-15 12:50:15 -07:00
Brian Behlendorf	d0bd694ca9	Fix -Werror=format-security compiler option Noticed under Ubuntu kernel builds we should be passing a format specifier and the string, not just the string.	2010-07-14 11:53:57 -07:00
Brian Behlendorf	f0ff89fc86	Linux 2.6.35 compat: filp_fsync() dropped 'struct dentry *' The prototype for filp_fsync() dropped the unused argument 'struct dentry *'. I've fixed this by adding the needed autoconf check and moving all of those filp related functions to file_compat.h. This will simplify handling any further API changes in the future.	2010-07-14 11:40:55 -07:00
Brian Behlendorf	a4bfd8ea1b	Add __divdi3(), remove __udivdi3() kernel dependency Up until now no SPL consumer attempted to perform signed 64-bit division so there was no need to support this. That has now changed so I am adding 64-bit division support for 32-bit platforms. The signed implementation is based on the unsigned version. Since there have been several bug reports in the past concerning correct 64-bit division on 32-bit platforms I added some long overdue regression tests. Much to my surprise the unsigned 64-bit division regression tests failed. This was surprising because __udivdi3() was implemented by simply calling div64_u64() which is provided by the kernel. This meant that the Linux kernel's 64-bit division algorithm on 32-bit platforms was flawed. After some investigation this turned out to be exactly the case. Because of this I was forced to abandon the kernel helper and instead to fully implement 64-bit division in the spl. There are several published implementations out there on how to do this properly and I settled on one proposed in the book Hacker's Delight. Their proposed algorithm is freely available without restriction and I have just modified it to be Linux kernel friendly. The updated implementation now passes all the unsigned and signed regression tests. This should be functional, but not fast, which is good enough for our purposes. If you want fast too I'd strongly suggest you upgrade to a 64-bit platform. I have also reported the kernel bug and we'll see if we can't get it fixed upstream.	2010-07-13 16:44:02 -07:00
Brian Behlendorf	1814251453	Require gawk the usermode helper fails with awk For some reason when awk is invoked by the usermode helper the command always fails. Interestingly gawk does not suffer from this problem which is why I never observed this failure since the distros I tested with all had gawk installed instead of awk. Anyway, the simplest thing to do here is to just make gawk mandatory. I've added a configure check for gawk specifically and have updated the command to call gawk not awk.	2010-07-01 16:38:08 -07:00
Brian Behlendorf	7119bf7044	Add configure check for user_path_dir() I didn't notice at the time but user_path_dir() was not introduced at the same time as set_fs_pwd() change. I had lumped the two together but in fact user_path_dir() was introduced in 2.6.27 and set_fs_pwd() taking 2 args was introduced in 2.6.25. This means builds against 2.6.25-2.6.26 kernels were broken. To fix this I've added a check for user_path_dir() and no longer assume that if set_fs_pwd() takes 2 args then user_path_dir() is also available.	2010-07-01 13:53:26 -07:00
Ned Bass	55f10ae5e9	Implementation of a regression test for TQ_FRONT. Use 3 threads and 8 tasks. Dispatch the final 3 tasks with TQ_FRONT. The first three tasks keep the worker threads busy while we stuff the queues. Use msleep() to force a known execution order, assuming TQ_FRONT is properly honored. Verify that the expected completion order occurs. The splat_taskq_test5_order() function may be useful in more than one test. This commit generalizes it by renaming the function to splat_taskq_test_order() and adding a name argument instead of assuming SPLAT_TASKQ_TEST5_NAME as the test name. The documentation for splat taskq regression test #5 swaps the two required completion orders in the diagram. This commit corrects the error. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-01 10:59:52 -07:00
Ned Bass	1a73940d39	Initialize the /dev/splatctl device buffer On open() and initialize the buffer with the SPL version string. The user space splat utility expects to find the SPL version string when it opens and reads from /dev/splatctl. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-01 10:59:46 -07:00
Ned Bass	f0d8bb26b4	Implementation of the TQ_FRONT flag. Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker threads pull tasks from this high priority queue before the default pending queue. Executing tasks out of FIFO order potentially breaks taskq_lowest_id() if we do not preserve the ordering of the work list by taskqid. Therefore, instead of always appending to the work list, we search for the appropriate place to insert a task. The common case is to append to the list, so we make this operation efficient by searching the work list in reverse order. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-01 10:59:38 -07:00
Brian Behlendorf	79a3bf130b	Linux-2.6.33 compat, .ctl_name removed from struct ctl_table As of linux-2.6.33 the ctl_name member of the ctl_table struct has been entirely removed. The upstream code has been updated to depend entirely on the procname member. To handle this all references to ctl_name are wrapped in a CTL_NAME macro which simply expands to nothing for newer kernels. Older kernels are supported by having it expand to .ctl_name = X just as before.	2010-06-30 12:49:12 -07:00
Brian Behlendorf	ede0bdffb6	Treat mutex->owner as volatile When HAVE_MUTEX_OWNER is defined and we are directly accessing mutex->owner treat it as volatile with the ACCESS_ONCE() helper. Without this you may get a stale cached value when accessing it from different cpus. This can result in incorrect behavior from mutex_owned() and mutex_owner(). This is not a problem for the !HAVE_MUTEX_OWNER case because in this case all the accesses are covered by a spin lock which similarly guarantees we will not be accessing stale data. Secondly, check CONFIG_SMP before allowing access to mutex->owner. I see that for non-SMP setups the kernel does not track the owner so we cannot rely on it. Thirdly, check CONFIG_MUTEX_DEBUG; when this is defined and HAVE_MUTEX_OWNER is defined, surprisingly the mutex->owner will not be cleared on mutex_exit(). When this is the case the SPL needs to make sure to do it to ensure MUTEX_HELD() behaves as expected or you will certainly assert in mutex_destroy(). Finally, improve the mutex regression tests. For mutex_owned() we now minimally check that it behaves correctly when checked from the owner thread or the non-owner thread. This subtle behaviour has bitten me before and I'd like to catch it early next time if it reappears. As for the mutex_owned() regression test, additionally verify that mutex->owner is always cleared on mutex_exit().	2010-06-28 16:02:57 -07:00
Brian Behlendorf	616df2dd8b	Fix subtle race in threads test case The call to wake_up() must be moved under the spin lock because once we drop the lock 'tp' may no longer be valid because the creating thread has exited. This basic thread implementation was correct, this was simply a flaw in the test case.	2010-06-28 12:34:20 -07:00
Brian Behlendorf	e6de04b73c	Add kmem_vasprintf function We might as well have both asprintf() variants. This allows us to safely pass a va_list through several levels of the stack using va_copy() instead of va_start().	2010-06-24 09:41:59 -07:00
Brian Behlendorf	438683c0a9	Revert "Support TQ_FRONT flag used by taskq_dispatch()" This reverts commit `eb12b3782c`.	2010-06-21 10:19:44 -07:00
Brian Behlendorf	3cb77549d1	Update warnings in kmem debug code This fix was long overdue. Most of the ground work was laid long ago to include the exact function and line number in the error message when there was an issue with a memory allocation call. However, probably due to lack of time at the moment that information never made it in to the error message. This patch fixes that and tries to standardize the kmem debug messages as well.	2010-06-16 16:01:16 -07:00
Brian Behlendorf	eb12b3782c	Support TQ_FRONT flag used by taskq_dispatch() Allow taskq_dispatch() to insert work items at the head of the queue instead of just the tail by passing the TQ_FRONT flag.	2010-06-11 15:57:25 -07:00
Brian Behlendorf	b868e22f05	Add kmem_asprintf(), strfree(), strdup(), and minor cleanup. This patch adds three missing Solaris functions: kmem_asprintf(), strfree(), and strdup(). They are all implemented as a thin layer which just calls their Linux counterparts. As part of this an autoconf check for kvasprintf was added because it does not appear in older kernels. If the kernel does not provide it then spl-generic implements it. Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean things up and make the kmem.h a little more readable.	2010-06-11 15:57:25 -07:00
Brian Behlendorf	ae4c36adce	Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process) Under Linux the proc.h header is for the /proc filesystem, and under Solaris the proc.h header is for processes. This patch correctly moves the Linux proc functionality in to a linux/proc_compat.h header and leaves the sys/proc.h for use by Solaris. Minor updates were required to all the call sites where it was included of course.	2010-06-11 15:57:25 -07:00
Alex Zhuravlev	1b4ad25e2f	Stack overflow on 64-bit modulus operations on 32-bit architectures. Running 'zpool create' on a 32-bit machine with an SPL compiled with gcc 4.4.4 led to a stack overflow. This turned out to be due to some sort of 'optimization' by gcc: uint64_t __umoddi3(uint64_t dividend, uint64_t divisor) { return dividend - divisor * (dividend / divisor); } This code was supposed to be using __udivdi3 to implement /, but gcc instead implemented it via __umoddi3 itself. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-06-03 09:06:55 -07:00
Brian Behlendorf	8a1c9a02fb	Minor 32-bit fix cast to hrtime_t before the multiply. It's important to cast to hrtime_t before doing the multiply because the ts.tv_sec type is only 32-bits and we need to promote it to 64-bits.	2010-05-23 09:51:17 -07:00
Brian Behlendorf	32f5faff69	Simplify rwlock implementation. Remove RW_COUNT() from the rwlock implementation. The idea was that it could be used as a generic wrapper for getting at the internal state of a rwlock. While a good idea it's proven problematic to keep it correct for multiple archs and internal implementation changes. In short it hasn't been worth the trouble. With that and simplicity in mind things have been updated to use the rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD() functions. As for rw_upgrade() it remains only implemented for the generic rwsem implementation. It remains to be determined if it's worth the effort of adding a custom implementation for each arch.	2010-05-20 14:20:34 -07:00
Brian Behlendorf	23d91792ef	Use KM_NODEBUG macro in preference to __GFP_NOWARN.	2010-05-20 14:16:59 -07:00
Brian Behlendorf	3626ae6a70	Disable spl_debug_panic_on_bug by default. While I may prefer to have the system panic on an SBUG and to get a crash dump for analysis, I suspect most people's systems are not configured for crash dumps and the best thing to do is to simply halt the thread and print an error to the console. This way they have a good chance of actually saving the stack trace and debug log.	2010-05-20 10:15:51 -07:00
Brian Behlendorf	e0dcb22e4e	Adjust 'large' object sizes in kmem:slab_large test. 64K objects are large for a kmem based slab (2M slabs) 1M objects are large for a vmem based slab (32M slabs)	2010-05-20 09:52:37 -07:00
Brian Behlendorf	5198ea0e71	Remove kmem_set_warning() interface replace with __GFP_NOWARN flag. Replace the kmem_set_warning() hack used by the kmem-splat regression tests with a per-allocation flag called __GFP_NOWARN. This matches the lower level Linux flag of similar but slightly different function. The idea is you can then explicitly set this flag on requests where you know you're breaking the max 8k rule but you need/want to do it anyway. This is currently used by the regression tests where we intentionally push things to the limit but don't want the log noise. Additionally, we are forced to use it in spl_kmem_cache_create() because by default NR_CPUS is very large and there's no easy way to handle that. Finally, I've added a stack_dump() call to the warning when it is triggered to make it clear exactly where the allocation is taking place.	2010-05-19 16:53:13 -07:00
Brian Behlendorf	627a74972c	Set default debug log path to /tmp/spl-log. Using /tmp/ is a preferable default, it can always be overridden using the module option on a case-by-case basis. Additionally standardize some log messages based on the same default log level used by the kernel.	2010-05-19 16:17:06 -07:00
Brian Behlendorf	716154c592	Public Release Prep Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added standardized headers to all source file to clearly indicate the copyright, license, and to give credit where credit is due.	2010-05-17 15:18:00 -07:00
Brian Behlendorf	6020190e8f	Use do_posix_clock_monotonic_gettime() as described by comment. While this does incur slightly more overhead we should be using do_posix_clock_monotonic_gettime() for gethrtime() as described by the existing comment.	2010-05-14 09:31:22 -07:00
Brian Behlendorf	f752b46eb3	Add cv_wait_interruptible() function. This is a minor extension to the condition variable API to allow for reasonable signal handling on Linux. The cv_wait() function by definition must wait unconditionally for cv_signal()/cv_broadcast() before waking. This makes it impossible to be woken by a signal such as SIGTERM. The cv_wait_interruptible() function was added to handle this case. It behaves identically to cv_wait() with the exception that it waits interruptibly allowing a signal to wake it up. This means you do need to be careful and check issig() after waking.	2010-05-14 09:24:51 -07:00
Brian Behlendorf	97f8f6d789	Dump log from current process when required When dumping a debug log first check that it is safe to create a new thread and block waiting for it. If we are in an atomic context or irqs are disabled it is not safe to sleep and we must write out the debug log from the current process.	2010-04-23 15:55:02 -07:00
Brian Behlendorf	d05ec4b45f	Assume TQ_SLEEP when not explicitly specified.	2010-04-23 14:39:47 -07:00
Ricardo Correia	663e02a135	Handle the FAPPEND option in vn_rdwr(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-04-23 14:39:42 -07:00
Brian Behlendorf	82a358d9c0	Update vn_set_pwd() to allow user|kernel address for filename During module init spl_setup()->vn_set_pwd("/") was failing with -EFAULT because user_path_dir() and __user_walk() both expect 'filename' to be a user space address and it's not in this case. To handle this the data segment size is increased to ensure strncpy_from_user() does not fail with -EFAULT. Additionally, I've added a printk() warning to catch this and log it to the console if it ever reoccurs. I thought everything was working properly here because the consequences of this failing are subtle and usually non-critical.	2010-04-22 12:53:58 -07:00
Brian Behlendorf	16b719f006	Allow spl_config.h to be included by dependent packages (updated) We need dependent packages to be able to include spl_config.h to build properly. This was partially solved in commit `0cbaeb1` by using AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which autoconf always adds and cannot be easily removed. This solution works as long as the spl_config.h is included before your project's config.h. That turns out to be easier said than done. In particular, this is a problem when your package includes its config.h using the -include gcc option which ensures the first thing included is your config.h. To handle all cases cleanly I have removed the AH_BOTTOM hack and replaced it with an AC_CONFIG_HEADERS command. This command runs immediately after spl_config.h is written and with a little awk-foo it strips the offending #defines from the file. This eliminates the problem entirely and makes the header safe for inclusion. Also in this change I have removed the few places in the code where spl_config.h is included. It is now added to the gcc compile line to ensure the config results are always available. Finally, I have also disabled the verbose kernel builds. If you want them back you can always build with 'make V=1'. Since things are working now they don't need to be on by default.	2010-03-22 14:45:33 -07:00
Brian Behlendorf	aa600d8a38	Reduce max kmem based slab size Allowing MAX_ORDER-1 sized allocations for kmem based slabs has been observed to result in deadlocks. To help prevent this limit max kmem based slab size to MAX_ORDER-3. Just for the record callers should not be creating slabs like this, but if they do we should still handle it as safely as we can.	2010-03-18 13:39:51 -07:00
Brian Behlendorf	21006d08af	Remove Module.markers and Module.symver{s} in clean target Split 'modules' and 'clean' Makefile targets to allow us to cleanly remove the Module.* build products with a 'make clean'.	2010-03-08 13:39:57 -08:00
Brian Behlendorf	3977f8370f	Linux 2.6.32 compat, proc_handler() API change As of linux-2.6.32 the 'struct file *filp' argument was dropped from the proc_handler() prototype. It was apparently unused _almost_ everywhere in the kernel and this was simply cleanup. I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and the proper compat macros to correctly define the prototypes and some helper functions. It's not pretty but API compat changes rarely are.	2010-03-04 12:14:56 -08:00
Ricardo M. Correia	694921bc49	sun-misc-gitignore Add .gitignore files. Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>	2010-01-08 09:37:54 -08:00
Ricardo M. Correia	f7e8739c94	sun-fix-whitespace Whitespace fixes. Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>	2010-01-08 09:37:54 -08:00
Ricardo M. Correia	b520b14305	sun-fix-panic-str Fix panic() string, which was being used as a format string, instead of an already-formatted string. Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>	2010-01-08 09:37:54 -08:00
Brian Behlendorf	5562e5d105	Added splat taskq task ordering test case. This test case verifies the correct behavior of taskq_wait_id(). In particular it ensures that the following two cases are handled properly: 1) Task ids larger than the waited for task id can run and complete as long as there is an available worker thread. 2) All task ids lower than the waited one must complete before unblocking even if the waited task id itself has completed.	2010-01-05 13:34:09 -08:00
Brian Behlendorf	82387586af	Optimize lowest outstanding taskqid calculation in taskq_lowest_id() In the initial version of taskq_lowest_id() the entire pending and work list was locked under the tq->tq_lock to determine the lowest outstanding taskqid. At the time this was done because I was rushed and wanted to make sure it was right... fast was secondary. Well now fast is important too so I carefully thought through the pending and work list management and convinced myself it is safe and correct to simply check the first entry. I added a large comment to the source to explain this. But basically as long as we are careful to ensure the pending and work list stay sorted this is safe and fast. The motivation for this change was that I was observing as much as 10% of the total CPU time go to waiting on the tq->tq_lock when the pending list was long. This resolves that problem and frees up that CPU time for something useful.	2010-01-04 15:52:26 -08:00
Brian Behlendorf	ef1c7a0691	Strip __GFP_ZERO from kmalloc it is not available for older kernels. This is needed to avoid a BUG_ON() on RHEL5.4 kernel 2.6.18-164.6.1, since __GFP_ZERO is not a valid flag for kmalloc().	2009-12-23 12:57:10 -08:00
Brian Behlendorf	641bebe35f	Fix kmem:slab_overcommit regression test locking This regression test could crash in splat_kmem_cache_test_reclaim() due to a race between the slab reclaim and the normal exiting of the thread. Specifically, the kct structure could be freed by the thread performing the allocations while the reclaim function was also working on that thread's kct structure. The simplest fix is to extend the kcp->kcp_lock over the reclaim to prevent the kct from being freed. A better fix would be to ref count these structures, but since this is just a regression test this locking change is enough. Surprisingly this was only observed commonly under RHEL5.4 but all platforms could have hit this.	2009-12-23 12:46:11 -08:00
Brian Behlendorf	242f539a2e	Add skc_flags and full header to /proc/spl/kmem/slab.	2009-12-11 11:20:08 -08:00
Brian Behlendorf	f60a5f5221	Splat vnode tests must return negative error codes. I must have been in a hurry when I wrote the vnode regression tests because the error code handling is not correct. The Solaris vnode API returns positive errno's, these need to be converted to negative errno's for Linux before being passed back to user space. Otherwise the test harness will report the failure but errno will not be set with the correct error code. Additionally tests 3, 4, 6, and 7 may fail if the test file already exists. To avoid false positives a user mode helper has been added to remove the test files in /tmp/ before running the actual test.	2009-12-10 15:06:07 -08:00
Brian Behlendorf	d04c8a563c	Atomic64 compatibility for 32-bit systems without kernel support. This patch is another step towards updating the code to handle the 32-bit kernels which I have not been regularly testing. These changes do not really impact the common case I'm expecting which is the latest kernel running on an x86_64 arch. Until the linux-2.6.31 kernel the x86 arch did not have support for 64-bit atomic operations. Additionally, the new atomic_compat.h support for this case was wrong because it embedded a spinlock in the atomic variable which must always and only be 64-bits total. To handle these 32-bit issues we now simply fall back to the --enable-atomic-spinlock implementation if the kernel does not provide the 64-bit atomic funcs. The second issue this patch addresses is the DEBUG_KMEM assumption that there will always be atomic64 funcs available. On 32-bit archs this may not be true, and actually that's just fine. In that case the kernel will never be able to allocate more than 32-bits worth anyway. So just check if atomic64 funcs are available, if they are not it means this is a 32-bit machine and we can safely use atomic_t's instead.	2009-12-04 15:54:12 -08:00
Brian Behlendorf	db1aa22297	Correctly handle division on 32-bit RHEL5 systems by returning dividend.	2009-12-01 15:53:28 -08:00
Brian Behlendorf	4e5691faf6	Only run the kmem overcommit test on 64-bit systems.	2009-12-01 11:40:47 -08:00
Brian Behlendorf	6ff686c44d	Type long expected explicitly cast for 32-bit systems.	2009-12-01 10:14:01 -08:00

415 Commits