Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	0d3ac5e735	Linux 2.6.29 compat, credentials The .sync_fs fix as applied did not use the updated SPL credential API. This broke builds on Debian Lenny, this change applies the needed fix to use the portable API. The original credential changes are part of commit `81e97e2187`.	2011-04-07 14:27:09 -07:00
Brian Behlendorf	eec8164771	Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Disable the normal reclaim path for the txg_sync thread. This ensures the thread will never enter dmu_tx_assign() which can otherwise occur due to direct reclaim. If this is allowed to happen the system can deadlock. Direct reclaim call path: ->shrink_icache_memory->prune_icache->dispose_list-> clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign	2011-04-07 09:52:16 -07:00
Brian Behlendorf	7cb67b45f3	Add direct+indirect ARC reclaim Under OpenSolaris all memory reclaim is done asynchronously. Under Linux memory reclaim is done asynchronously _and_ synchronously. When a process allocates memory with GFP_KERNEL it explicitly allows the kernel to do reclaim on its behalf to satisfy the allocation. If that GFP_KERNEL allocation fails the kernel may take more drastic measures to reclaim the memory such as killing user space processes. This was observed to happen with ZFS because the ARC could consume a large fraction of the system memory but no synchronous reclaim could be performed on it. The result was GFP_KERNEL allocations could fail resulting in OOM events, and only moments later the arc_reclaim thread would free unused memory from the ARC. This change leaves the arc_thread in place to manage the fundamental ARC behavior. But it adds a synchronous (direct) reclaim path for the ARC which can be called when memory is badly needed. It also adds an asynchronous (indirect) reclaim path which is called much more frequently to prune the ARC slab caches.	2011-04-07 09:52:10 -07:00
Brian Behlendorf	1834f2d8b7	Add missing arcstats The following useful values were missing from the arcstats. This change adds them to provide greater visibility into the ARC's behavior. arc_no_grow 4 0 arc_tempreserve 4 0 arc_loaned_bytes 4 0 arc_meta_used 4 624774592 arc_meta_limit 4 400785408 arc_meta_max 4 625594176	2011-04-07 09:52:05 -07:00
Brian Behlendorf	c85b224faf	Call d_instantiate before unlocking inode Under Linux a dentry referencing an inode must be instantiated before the inode is unlocked. To accomplish this without overly modifying the core ZFS code the dentry is passed via the vattr_t. There are cases such as replay when a dentry is not available. In which case it is obviously not initialized at inode creation time, if a dentry is needed it will be spliced in when required via d_lookup().	2011-04-07 09:51:57 -07:00
Brian Behlendorf	d433c20651	Fix `make distclean` for `./configure --with-config=user` Making distclean in module make[1]: Entering directory `/zfs/module' make -C SUBDIRS=`pwd` clean make: Entering an unknown directory make: *** SUBDIRS=/zfs/module: No such file or directory. Stop. When using --with-config=user the 'distclean' target would fail because it assumes the kernel configuration infrastructure is set up. This is not the case, nor does it need to be, because the '--with-config=user' option will prune the entire ./module subtree from SUBDIRS. This prevents most build rules from operating in the ./module directory. However, the 'dist*' rules will still traverse this directory because it is listed in DIST_SUBDIRS. This is correct because we need to ensure the dist rules package the directory contents regardless of the configuration for the 'dist' rule. The correct way to handle this is to only invoke the kernel build system as part of the 'clean' rule when CONFIG_KERNEL_TRUE is set. Initial fix provided by Darik Horn <dajhorn@vanadac.com>. This commit is a slightly refined form of the original.	2011-04-05 13:33:28 -07:00
Brian Behlendorf	bfd214af01	Fix inflated load average Kernel threads which sleep uninterruptibly on Linux are marked in the (D) state. These threads are usually in the process of performing IO and are thus counted against the load average. The txg_quiesce and txg_sync threads were always sleeping uninterruptibly and thus inflating the load average. This change makes them sleep interruptibly. Some care is required however because these threads may now be woken early by signals. In this case the callers are all careful to check that the required conditions are met after waking up. If we're woken early due to a signal they will simply go back to sleep. In this case these changes are safe. Closes #175	2011-03-31 17:07:12 -07:00
Brian Behlendorf	7a1cdc0775	Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29. Since these hooks are currently unused they are being removed to allow support of older kernels.	2011-03-22 12:17:24 -07:00
Brian Behlendorf	81e97e2187	Linux 2.6.29 compat, credentials As of Linux 2.6.29 a clean credential API was added to the Linux kernel. Previously the credential was embedded in the task_struct. Because the SPL already has considerable support for handling this API change the ZPL code has been updated to use the Solaris credential API.	2011-03-22 12:15:54 -07:00
Brian Behlendorf	d6bd8eaae4	Fix evict() deadlock Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility of synchronous reclaim deadlocks. These deadlocks never existed in the original OpenSolaris code because all memory reclaim on Solaris is done asynchronously. Linux does both synchronous (direct) and asynchronous (indirect) reclaim. This commit addresses a deadlock caused by inode eviction. A KM_SLEEP allocation may trigger direct memory reclaim and shrink the inode cache. This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()-> zfs_zinactive() call path the same mutex may be reacquired resulting in a deadlock. To avoid this deadlock the process must not reacquire the mutex when it is already holding it. This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD mutex locking should be reevaluated. This infrastructure already prevents us from ever using the Linux lock dependency analysis tools, and it may limit scalability.	2011-03-22 12:14:55 -07:00
Brian Behlendorf	691f6ac4c2	Use KM_PUSHPAGE instead of KM_SLEEP It used to be the case that all KM_SLEEP allocations were GFP_NOFS. Unfortunately this often resulted in the kernel being unable to reclaim the ARC, inode, and dentry caches in a timely manner. The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL. However, this increases the possibility of deadlocking the system on a zfs write thread. If a zfs write thread attempts to perform an allocation it may trigger synchronous reclaim. This reclaim may attempt to flush dirty data/inode to disk to free memory. Unfortunately, this write cannot finish because the write thread which would handle it is holding the previous transaction open. Deadlock. To avoid this all allocations in the zfs write thread path must use KM_PUSHPAGE which prohibits synchronous reclaim for that thread. In this way forward progress is ensured. The risk with this change is I missed updating an allocation for the write threads leaving an increased possibility of deadlock. If any deadlocks remain they will be unlikely but we'll have to make sure they all get fixed.	2011-03-22 12:14:55 -07:00
Brian Behlendorf	0de19dad9c	Register .remount_fs handler Register the missing .remount_fs handler. This handler isn't strictly required because the VFS does a pretty good job updating most of the MS_* flags. However, there's no harm in using the hook to call the registered zpl callback for various MS_* flags. Additionally, this allows us to lay the groundwork for more complicated argument parsing in the future.	2011-03-15 13:33:29 -07:00
Brian Behlendorf	03f9ba9d99	Register .sync_fs handler Register the missing .sync_fs handler. This is a noop in most cases because the usual requirement is that sync just be initiated. As part of the DMU's normal transaction processing txgs will be frequently synced. However, when the 'wait' flag is set the requirement is that .sync_fs must not return until the data is safe on disk. With the addition of the .sync_fs handler this is now properly implemented.	2011-03-15 13:33:29 -07:00
Brian Behlendorf	04516a45b2	Don't set I/O Scheduler for Partitions ZFS should only change the i/o scheduler for a disk when it has ownership of the whole disk. This is basically the same logic as adjusting the write cache behavior on a disk. This change updates the vdev disk code to skip partitions when setting the i/o scheduler. Closes #152	2011-03-10 13:34:17 -08:00
Brian Behlendorf	adf2e8778e	Fix O_APPEND Corruption Due to an uninitialized variable files opened with O_APPEND may overwrite the start of the file rather than append to it. This was introduced accidentally when I removed the Solaris vnodes. The zfs_range_lock_writer() function used to key off zf->z_vnode to determine if a znode_t was for a zvol or zpl object. With the removal of vnodes this was replaced by the flag zp->z_is_zvol. This flag was used to control the append behavior for range locks. Unfortunately, this value was never properly initialized after the vnode removal. However, because most of memory is usually zeros it happened to be set correctly most of the time making the bug appear racy. Properly initializing zp->z_is_zvol to zero completely resolves the problem with O_APPEND. Closes #126	2011-03-09 13:31:00 -08:00
Brian Behlendorf	17c37660a1	Conserve stack in zfs_setattr() Move 'bulk' and 'xattr_bulk' from the stack to the heap to minimize stack space usage. These two arrays consumed 448 bytes on the stack and have been replaced by two 8 byte pointers for a total stack space saving of 432 bytes. The zfs_setattr() path had been previously observed to overrun the stack in certain circumstances.	2011-03-09 13:30:03 -08:00
Brian Behlendorf	450dc149bd	Range lock performance improvements The original range lock implementation had to be modified by commit `8926ab7` because it was unsafe on Linux. In particular, calling cv_destroy() immediately after cv_broadcast() is dangerous because the waiters may still be asleep. Thus the following cv_destroy() will free memory which may still be in use. This was fixed by updating cv_destroy() to block on waiters but this in turn introduced a deadlock. The deadlock was resolved with the use of a taskq to move the offending free outside the range lock. This worked well but using the taskq for the free resulted in a serious performance hit. This is somewhat ironic because at the time I felt using the taskq might improve things by making the free asynchronous. This patch refines the original fix and moves the free from the taskq to a private free list. Then items which must be free'd are simply inserted into the list. When the range lock is dropped it's safe to free the items. The list is walked and all rl_t entries are freed. This change improves small cached read performance by 26x. This was expected because for small reads the number of locking calls goes up significantly. More surprisingly this change significantly improves large cache read performance. This is probably attributable to better cpu/memory locality. Very likely the same processor which allocated the memory is now freeing it. bs ext3 zfs zfs+fix faster ---------------------------------------------- 512 435 3 79 26x 1k 820 7 160 22x 2k 1536 14 305 21x 4k 2764 28 572 20x 8k 3788 50 1024 20x 16k 4300 86 1843 21x 32k 4505 138 2560 18x 64k 5324 252 3891 15x 128k 5427 276 4710 17x 256k 5427 413 5017 12x 512k 5427 497 5324 10x 1m 5427 521 5632 10x Closes #142	2011-03-08 12:44:06 -08:00
Brian Behlendorf	126400a1ca	Add zfs_open()/zfs_close() In the original implementation the zfs_open()/zfs_close() hooks were dropped for simplicity. This was functional but not 100% correct with the expected ZFS semantics. Updating and re-adding the zfs_open()/zfs_close() hooks resolves the following issues. 1) The ZFS_APPENDONLY file attribute is once again honored. While there are still no Linux tools to set/clear these attributes once there are it should behave correctly. 2) Minimal virus scan file attribute hooks were added. Once again this support is disabled but the infrastructure is back in place. 3) Most importantly correctly handle assigning files which were opened synchronously to the intent log. Without this change O_SYNC modifications could be lost during a system crash even though they were marked synchronous.	2011-03-08 11:04:51 -08:00
Brian Behlendorf	53cf50e081	Set stat->st_dev and statfs->f_fsid Filesystems like ZFS must use what the kernel calls an anonymous super block. Basically, this is just a filesystem which is not backed by a single block device. Normally this block device's dev_t is stored in the super block. For anonymous super blocks a unique reserved dev_t is assigned as part of get_sb(). This sb->s_dev must then be set in the returned stat structures as stat->st_dev. This allows userspace utilities to easily detect the boundaries of a specific filesystem. Tools such as 'du' depend on this for proper accounting. Additionally, under OpenSolaris the statfs->f_fsid is set to the device id. To preserve consistency with OpenSolaris we also set the fsid to the device id. Other Linux filesystems (ext) set the fsid to a unique value determined by the filesystem's uuid. This value is unique but maintains no relationship to the device id. This may be desirable when exporting NFS filesystems because it minimizes the chance of a client observing the same fsid from two different servers. Closes #140	2011-03-07 16:06:22 -08:00
Brian Behlendorf	6742abf9ec	Use Linux ATTR_ versions The AT_ versions of these macros are used on Solaris and while they map to their Linux equivalents the code has been updated to use the ATTR_ versions.	2011-03-03 11:29:15 -08:00
Brian Behlendorf	f4ea75d492	Conserve stack in zfs_setattr() Move 'tmpxvattr' from the stack to the heap to minimize stack space usage. This is enough to get us below the 1024 byte stack frame warning. That however is still a large stack frame and it should be further reduced by moving the 'bulk' and 'xattr_bulk' sa_bulk_attr_t variables to the heap in a future patch.	2011-03-02 14:18:58 -08:00
Brian Behlendorf	5484965ab6	Drop HAVE_XVATTR macros When I began work on the Posix layer it immediately became clear to me that to integrate cleanly with the Linux VFS certain Solaris specific things would have to go. One of these things was to eliminate as many Solaris specific types from the ZPL layer as possible. They would be replaced with their Linux equivalents. This would not only be good for performance, but for the general readability and health of the code. The Solaris and Linux VFS are different beasts and should be treated as such. Most of the code remains common for constructing transactions and such, but there are subtle and important differences which need to be respected. This policy went quite far for certain types such as the vnode_t, and it initially seemed to be working out well for the vattr_t. There was a relatively small amount of related xvattr_t code I was forced to comment out with HAVE_XVATTR. But it didn't look that hard to come back soon and replace it all with a native Linux type. However, after going down this path with xvattr some distance it became clear that this code was woven in the ZPL more deeply than I thought. In particular its hooks went very deep into the ZPL replay code and replacing it would not be as easy as I originally thought. Rather than continue pursuing replacing and removing this code I've taken a step back and reevaluated things. This commit reverts many of my previous commits which removed xvattr related code. It restores much of the code to its original upstream state and now relies on improved xvattr_t support in the zfs package itself. The result of this is that much of the code which I had commented out, which accidentally broke things like replay, is now back in place and working. However, there may be a small performance impact for getattr/setattr operations because they now require a translation from native Linux to Solaris types. For now that's a price I'm willing to pay. Once everything is completely functional we can revisit the issue of removing the vattr_t/xvattr_t types. Closes #111	2011-03-02 11:44:34 -08:00
Brian Behlendorf	9623f736d9	Remove caller_context_t Remove the remaining callers of caller_context_t. This type has been removed because it is not needed for the Linux port.	2011-03-02 11:35:35 -08:00
Darik Horn	a23cc0a443	Add the zpool and filesystem versions Print the supported zpool and filesystem versions at module load time. This change removes an ambiguity and adds information that system administrators care about. The phrase "ZFS pool version %s" is the same as zpool upgrade -v so that the operator is familiar with the message. ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5 ZFS: Unloaded module v0.6.0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-02-28 09:46:23 -08:00
Brian Behlendorf	fdcd952b4d	Fix set block scheduler warnings There were two cases when attempting to set the vdev block device scheduler which would cause console warnings. The first case was when the vdev used a loop, ram, dm, or other such device which doesn't support a configurable scheduler. In these cases attempting to set a scheduler is pointless and can be safely skipped. The second case is slightly more troubling. We were seeing transient cases where setting the elevator would return -EFAULT. On retry everything is fine so there appears to be a small window where this is possible. To handle that case we silently retry up to three times before reporting the warning. In all of the above cases the warning is harmless and at worst you may see slightly different performance characteristics from one or more of your vdevs.	2011-02-25 11:37:11 -08:00
Fajar A. Nugraha	4c0d8e50b9	Use udev to create /dev/zvol/[dataset_name] links This commit allows zvols with names longer than 32 characters, which fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102. Changes include: - use /dev/zd* device names for zvol, where * is the device minor (include/sys/fs/zfs.h, module/zfs/zvol.c). - add BLKZNAME ioctl to get dataset name from userland (include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id). - add udev rule to create /dev/zvol/[dataset_name] and the legacy /dev/[dataset_name] symlink. For partitions on zvol, it will create /dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules, cmd/zvol_id). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-02-25 09:43:19 -08:00
Brian Behlendorf	dc1d7665c5	Remove rdev packing Remove custom code to pack/unpack dev_t's. Under Linux all dev_t's are an unsigned 32-bit value even on 64-bit platforms. The lower 20 bits are used for the minor number and the upper 12 for the major number. This means if you're importing a pool from Solaris you may get strange major/minor numbers. But it doesn't really matter because even if we add compatibility code to translate the encoded Solaris major/minor they won't do you any good under Linux. You will still need to recreate the dev_t with a major/minor which maps to reserved major numbers used under Linux. Dropping this code also resolves 32-bit builds by removing the offending 32-bit compatibility code.	2011-02-23 15:13:03 -08:00
Brian Behlendorf	99c564bc48	Use correct ASSERT3* variant ASSERT3P should be used instead of ASSERT3U when comparing pointers. Using ASSERT3U with the cast causes a compiler warning for 32-bit builds which is fatal with --enable-debug.	2011-02-23 15:03:30 -08:00
Brian Behlendorf	05ff35c602	Increase fragment size to block size The underlying storage pool actually uses multiple block sizes. Under Solaris frsize (fragment size) is reported as the smallest block size we support, and bsize (block size) as the filesystem's maximum block size. Unfortunately, under Linux the fragment size and block size are often used interchangeably. Thus we are forced to report both of them as the filesystem's maximum block size. Closes #112	2011-02-23 14:00:06 -08:00
Brian Behlendorf	f6dcdf13f8	Fix 'statement with no effect' warning Because the secpolicy_* macros are all currently defined to (0), and because the caller of this function does not check the return code, the compiler complains that this statement has no effect, which is correct and OK. To suppress the warning explicitly cast the result to (void).	2011-02-23 13:03:19 -08:00
Brian Behlendorf	a31a70bbd1	Fix enum compiler warning Generally it's a good idea to use enums for switch statements, but in this case it causes warnings because the enum is really a set of flags. These flags are OR'ed together in some cases resulting in values which are not part of the original enum. This causes compiler warnings such as this about invalid cases. error: case value ‘33’ not in enumerated type ‘zprop_source_t’ To handle this we simply cast the enum to an int for the switch statement. This leaves all other enum type checking in place and effectively disables these warnings.	2011-02-23 12:52:51 -08:00
Brian Behlendorf	61e909608d	Linux 2.6.x compat, blkdev_compat.h For legacy reasons the zvol.c and vdev_disk.c Linux compatibility code ended up in sys/blkdev.h and sys/vdev_disk.h headers. While there are worse places for this code to live it should be in a linux/blkdev_compat.h header. This change moves this block device Linux compatibility code in to the linux/blkdev_compat.h header and updates all the correct #include locations. This is not a functional change or bug fix, it is just code cleanup.	2011-02-23 12:29:38 -08:00
Brian Behlendorf	5d0265c0dd	Merge branch 'zpl'	2011-02-18 09:31:25 -08:00
Brian Behlendorf	037849f854	Use provided uid/gid for setattr When changing the uid/gid of a file via zfs_setattr() use the Posix id passed in iattr->ia_uid/gid. While the zfs_fuid_create() code already had the fuid support disabled for Linux it was returning the uid/gid from the credential. With this change the 'chown' command which relies on setxattr is now working properly. Also remove a little stray white space which was in front of zfs_update_inode() call and the end of zfs_setattr().	2011-02-17 14:23:48 -08:00
Brian Behlendorf	efd1832bc6	Fix symlink(2) inode reference count Under Linux sys_symlink(2) should result in an inode being created with one reference for the inode itself, and a second reference on the inode which is held by the new dentry. Under Solaris this appears not to be the case. Their zfs_symlink() handler drops the inode reference before returning. The result of this under Linux is that the reference count for symlinks is always one smaller than it should have been. This results in a BUG() when the symlink is unlinked. To handle this the Linux port now keeps the inode reference which differs from the Solaris behavior. This results in correct reference counts. Closes #96	2011-02-17 11:34:47 -08:00
Brian Behlendorf	5095000169	Use -zfs_readlink() error The zfs_readlink() function returns a Solaris positive error value and that needs to be converted to a Linux negative error value. While in this case nothing would actually go wrong, it's still incorrect and should be fixed if for no other reason than clarity.	2011-02-17 09:48:06 -08:00
Brian Behlendorf	8b4f9a2d55	Fix readlink(2) This patch addresses three issues related to symlinks. 1) Revert the zfs_follow_link() function to a modified version of the original zfs_readlink(). The only changes from the original OpenSolaris version relate to using Linux types. For the moment this means no vnode's and no zfsvfs_t. The caller zpl_follow_link() was also updated accordingly. This change was reverted because it was slightly gratuitous. 2) Update zpl_follow_link() to use local variables for the link buffer. I'd forgotten that iov.iov_base is updated by uiomove() so after the call to zfs_readlink() it can no longer be used. We need our own private copy of the link pointer. 3) Allocate MAXPATHLEN instead of MAXPATHLEN+1. By default MAXPATHLEN is 4096 bytes which is a full page, adding one to it pushes it slightly over a page. That means you'll likely end up allocating 2 pages which is wasteful of memory and possibly slightly slower.	2011-02-16 15:54:55 -08:00
Ricardo M. Correia	54a179e7b8	Add API to wait for pending commit callbacks This adds an API to wait for pending commit callbacks of already-synced transactions to finish processing. This is needed by the DMU-OSD in Lustre during device finalization when some callbacks may still not be called, this leads to non-zero reference count errors. See lustre.org bug 23931.	2011-02-16 11:20:06 -08:00
Brian Behlendorf	a6695d83b7	Add get/setattr, get/setxattr hooks While the attr/xattr hooks were already in place for regular files these hooks can also apply to directories and special files. While they aren't typically used in this way, it should be supported. This patch registers these additional callbacks for both directory and special inode types.	2011-02-16 09:55:53 -08:00
Brian Behlendorf	d8fd10545b	Fix FIFO and socket handling Under Linux when creating a fifo or socket type device in the ZFS filesystem it's critical that the rdev is stored in a SA. This was already being correctly done for character and block devices, but that logic needed to be extended to include FIFOs and sockets. This patch takes care of device creation but a follow on patch may still be required to verify that the dev_t is being correctly packed/unpacked from the SA.	2011-02-16 09:51:44 -08:00
Brian Behlendorf	d567444809	Create minors for all zvols It was noticed that when you have zvols in multiple datasets not all of the zvol devices are created at module load time. Fajarnugraha did the leg work to identify that the root cause of this bug is a non-zero return value from zvol_create_minors_cb(). Returning a non-zero value from the dmu_objset_find_spa() callback function results in aborting processing the remaining children in a dataset. Since we want to ensure that the callback is run on all children regardless of error simply unconditionally return zero from the zvol_create_minors_cb(). This callback function is solely used for this purpose so suppressing the error is safe. Closes #96	2011-02-16 09:50:06 -08:00
Brian Behlendorf	2c395def27	Linux 2.6.36 compat, sops->evict_inode() The new preferred interface for evicting an inode from the inode cache is the ->evict_inode() callback. It replaces both the ->delete_inode() and ->clear_inode() callbacks which were previously used for this.	2011-02-11 13:47:51 -08:00
Brian Behlendorf	f9637c6c8b	Linux 2.6.33 compat, get/set xattr callbacks The xattr handler prototypes were sanitized with the idea being that the same handlers could be used for multiple methods. The result of this was the inode type was changed to a dentry, and both the get() and set() hooks had a handler_flags argument added. The list() callback was similarly affected but no autoconf check was added because we do not use the list() callback.	2011-02-11 10:41:00 -08:00
Brian Behlendorf	7268e1bec8	Linux 2.6.35 compat, fops->fsync() The fsync() callback in the file_operations structure used to take 3 arguments. The callback now only takes 2 arguments because the dentry argument was determined to be unused by all consumers. To handle this a compatibility prototype was added to ensure the right prototype is used. Our implementation never used the dentry argument either so it's just a matter of using the right prototype.	2011-02-11 09:05:51 -08:00
Brian Behlendorf	777d4af891	Linux 2.6.35 compat, const struct xattr_handler The const keyword was added to the 'struct xattr_handler' in the generic Linux super_block structure. To handle this we define an appropriate xattr_handler_t typedef which can be used. This was the preferred solution because it keeps the code clean and readable.	2011-02-10 16:29:00 -08:00
Brian Behlendorf	6839eed23e	Use 'noop' IO Scheduler Initial testing has shown that the right IO scheduler to use under Linux is noop. This strikes the ideal balance by allowing the zfs elevator to do all request ordering and prioritization, while allowing the Linux elevator to do the maximum front/back merging allowed by the physical device. This yields the largest possible requests for the device with the lowest total overhead. While 'noop' should be right for your system you can choose a different IO scheduler with the 'zfs_vdev_scheduler' option. You may set this value to any of the standard Linux schedulers: noop, cfq, deadline, anticipatory. In addition, if you choose 'none' zfs will not attempt to change the IO scheduler for the block device.	2011-02-10 09:27:22 -08:00
Brian Behlendorf	4db77a74a6	Suppress large kmem_alloc() warning The following warning was observed under normal operation. It's not fatal but it's something to be addressed long term. Flag the offending allocation with KM_NODEBUG to suppress the warning and flag the call site. SPL: Showing stack for process 21761 Pid: 21761, comm: iozone Tainted: P ---------------- 2.6.32-71.14.1.el6.x86_64 #1 Call Trace: [<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl] [<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl] [<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs] [<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs] [<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs] [<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs] [<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs] [<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs] [<ffffffff8116d905>] vfs_read+0xb5/0x1a0 [<ffffffff8116da41>] sys_read+0x51/0x90 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b	2011-02-10 09:27:22 -08:00
Brian Behlendorf	ceb43b935d	Invalidate dcache and inode cache When performing a 'zfs rollback' it's critical to invalidate the previous dcache and inode cache. If we don't there will be stale cache entries which when accessed will result in EIOs.	2011-02-10 09:27:22 -08:00
Brian Behlendorf	8926ab7a50	Move cv_destroy() outside zp->z_range_lock() With the recent SPL change (`d599e4fa`) that forces cv_destroy() to block until all waiters have been woken, it is now unsafe to call cv_destroy() under the zp->z_range_lock() because it is used as the condition variable mutex. If there are waiters cv_destroy() will block until they wake up and acquire the mutex. However, they will never acquire the mutex because cv_destroy() will not return allowing its caller to drop the lock. Deadlock. To avoid this cv_destroy() is now run asynchronously in a taskq. This solves two problems: 1) It is no longer run under the zp->z_range_lock so no deadlock. 2) Since cv_destroy() may now block we don't want this slowing down zfs_range_unlock() and throttling the system. This was not as much of an issue under OpenSolaris because their cv_destroy() implementation does not do anything. They do however risk a bad paging request if cv_destroy() returns, the memory holding the condition variable is free'd, and then the waiters wake up and try to reference it. It's a very small unlikely race, but it is possible.	2011-02-10 09:27:21 -08:00
Brian Behlendorf	c0d35759c5	Add mmap(2) support It's worth taking a moment to describe how mmap is implemented for zfs because it differs considerably from other Linux filesystems. However, this issue is handled the same way under OpenSolaris. The issue is that by design zfs bypasses the Linux page cache and leaves all caching up to the ARC. This has been shown to work well for the common read(2)/write(2) case. However, mmap(2) is a problem because it relies on being tightly integrated with the page cache. To handle this we cache mmap'ed files twice, once in the ARC and a second time in the page cache. The code is careful to keep both copies synchronized. When a file with an mmap'ed region is written to using write(2) both the data in the ARC and existing pages in the page cache are updated. For a read(2) data will be read first from the page cache then the ARC if needed. Neither a write(2) nor a read(2) will ever result in new pages being added to the page cache. New pages are added to the page cache only via .readpage() which is called when the vfs needs to read a page off disk to back the virtual memory region. These pages may be modified without notifying the ARC and will be written out periodically via .writepage(). This will occur due to either a sync or the usual page aging behavior. Note because a read(2) of a mmap'ed file will always check the page cache first even when the ARC is out of date correct data will still be returned. While this implementation ensures correct behavior it does have some drawbacks. The most obvious of which is that it increases the required memory footprint when accessing mmap'ed files. It also adds additional complexity to the code keeping both caches synchronized. Longer term it may be possible to cleanly resolve this wart by mapping page cache pages directly on to the ARC buffers. The Linux address space operations are flexible enough to allow selection of which pages back a particular index. The trick would be working out the details of which subsystem is in charge, the ARC, the page cache, or both. It may also prove helpful to move the ARC buffers to scatter-gather lists rather than a vmalloc'ed region. Additionally, zfs_write/read_common() were used in the readpage and writepage hooks because it was fairly easy. However, it would be better to update zfs_fillpage and zfs_putapage to be Linux friendly and use them instead.	2011-02-10 09:27:21 -08:00
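
The dev_t layout described in commit `dc1d7665c5` (a 32-bit value with the lower 20 bits holding the minor number and the upper 12 bits the major number) can be modelled roughly as below. This is an illustrative Python sketch, not code from the repository: the helper names `make_dev`, `dev_major`, and `dev_minor` are hypothetical, and only the 12/20 bit split is taken from the commit message.

```python
# Hypothetical model of the 32-bit dev_t layout from commit dc1d7665c5.
# Only the 12-bit major / 20-bit minor split comes from the commit message;
# the helper names are made up for this illustration.
MINOR_BITS = 20
MAJOR_BITS = 12
MINOR_MASK = (1 << MINOR_BITS) - 1
MAJOR_MASK = (1 << MAJOR_BITS) - 1


def make_dev(major: int, minor: int) -> int:
    """Pack a major/minor pair into one 32-bit device number."""
    if not 0 <= major <= MAJOR_MASK:
        raise ValueError("major does not fit in 12 bits")
    if not 0 <= minor <= MINOR_MASK:
        raise ValueError("minor does not fit in 20 bits")
    return (major << MINOR_BITS) | minor


def dev_major(dev: int) -> int:
    """Extract the major number from the upper 12 bits."""
    return (dev >> MINOR_BITS) & MAJOR_MASK


def dev_minor(dev: int) -> int:
    """Extract the minor number from the lower 20 bits."""
    return dev & MINOR_MASK
```

A value packed with a different split (as the commit suggests can happen with a pool imported from Solaris) would decode to unrelated major/minor numbers under this layout, which is why the commit message warns about "strange" numbers.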

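
Commit `8b4f9a2d55` observes that allocating MAXPATHLEN+1 bytes pushes a 4096-byte buffer just past a single page. The arithmetic can be sketched as follows, assuming a 4096-byte page (the commit message states that 4096 bytes is a full page). The helper `pages_needed` is hypothetical, and a real kernel allocator may round requests differently.

```python
# Rough sketch of the page-rounding argument in commit 8b4f9a2d55.
# PAGE_SIZE = 4096 is an assumption based on the commit message;
# pages_needed() is a hypothetical helper, not a kernel function.
PAGE_SIZE = 4096
MAXPATHLEN = 4096


def pages_needed(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Number of whole pages required to hold nbytes (ceiling division)."""
    return (nbytes + page_size - 1) // page_size


one_page = pages_needed(MAXPATHLEN)       # fits exactly in one page
two_pages = pages_needed(MAXPATHLEN + 1)  # one extra byte spills into a second page
```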

177 Commits