Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	82a37189aa	Implement SA based xattrs The current ZFS implementation stores xattrs on disk using a hidden directory. In this directory a file name represents the xattr name and the file contexts are the xattr binary data. This approach is very flexible and allows for arbitrarily large xattrs. However, it also suffers from a significant performance penalty. Accessing a single xattr can requires up to three disk seeks. 1) Lookup the dnode object. 2) Lookup the dnodes's xattr directory object. 3) Lookup the xattr object in the directory. To avoid this performance penalty Linux filesystems such as ext3 and xfs try to store the xattr as part of the inode on disk. When the xattr is to large to store in the inode then a single external block is allocated for them. In practice most xattrs are small and this approach works well. The addition of System Attributes (SA) to zfs provides us a clean way to make this optimization. When the dataset property 'xattr=sa' is set then xattrs will be preferentially stored as System Attributes. This allows tiny xattrs (~100 bytes) to be stored with the dnode and up to 64k of xattrs to be stored in the spill block. If additional xattr space is required, which is unlikely under Linux, they will be stored using the traditional directory approach. This optimization results in roughly a 3x performance improvement when accessing xattrs which brings zfs roughly to parity with ext4 and xfs (see table below). When multiple xattrs are stored per-file the performance improvements are even greater because all of the xattrs stored in the spill block will be cached. However, by default SA based xattrs are disabled in the Linux port to maximize compatibility with other implementations. If you do enable SA based xattrs then they will not be visible on platforms which do not support this feature. ---------------------------------------------------------------------- Time in seconds to get/set one xattr of N bytes on 100,000 files ------+--------------------------------+------------------------------ \| setxattr \| getxattr bytes \| ext4 xfs zfs-dir zfs-sa \| ext4 xfs zfs-dir zfs-sa ------+--------------------------------+------------------------------ 1 \| 2.33 31.88 21.50 4.57 \| 2.35 2.64 6.29 2.43 32 \| 2.79 30.68 21.98 4.60 \| 2.44 2.59 6.78 2.48 256 \| 3.25 31.99 21.36 5.92 \| 2.32 2.71 6.22 3.14 1024 \| 3.30 32.61 22.83 8.45 \| 2.40 2.79 6.24 3.27 4096 \| 3.57 317.46 22.52 10.73 \| 2.78 28.62 6.90 3.94 16384 \| n/a 2342.39 34.30 19.20 \| n/a 45.44 145.90 7.55 65536 \| n/a 2941.39 128.15 131.32* \| n/a 141.92 256.85 262.12* Legend: * ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'. * xfs - Stock RHEL6.1 xfs mounted with default options. * zfs-dir - Directory based xattrs only. * zfs-sa - Prefer SAs but spill in to directories as needed, a trailing * indicates overflow in to directories occured. NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file. NOTE: XFS and ZFS have no limit on xattr name/value pairs per file. NOTE: Linux limits individual name/value pairs to 65536 bytes. NOTE: All setattr/getattr's were done after dropping the cache. NOTE: All tests were run against a single hard drive. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #443	2011-11-28 15:45:51 -08:00
Brian Behlendorf	5547c2f1bf	Simplify BDI integration Update the code to use the bdi_setup_and_register() helper to simplify the bdi integration code. The updated code now just registers the bdi during mount and destroys it during unmount. The only complication is that for 2.6.32 - 2.6.33 kernels the helper wasn't available so in these cases the zfs code must provide it. Luckily the bdi_setup_and_register() function is trivial. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #367	2011-11-08 10:19:03 -08:00
Brian Behlendorf	ae6ba3dbe6	Improve meta data performance Profiling the system during meta data intensive workloads such as creating/removing millions of files, revealed that the system was cpu bound. A large fraction of that cpu time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size. For large objects virtual memory is usually preferable and for small object physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object. For the following caches it turns out they were just over the line: * dnode_cache * zio_cache * zio_link_cache * zio_buf_512_cache * zfs_data_buf_512_cache Now because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized. We can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance since unrelated processes could also block of this same global lock. Finally, we may see less cpu time being burned in the arc_reclaim and txg_sync_threads. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #258	2011-11-03 10:19:21 -07:00
Alexander Stetsenko	8d35c1499d	Illumos #755 : dmu_recv_stream builds incomplete guid_to_ds_map An incomplete guid_to_ds_map would cause restore_write_byref() to fail while receiving a de-duplicated backup stream. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D`Amore <garrett@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/755 - https://github.com/illumos/illumos-gate/commit/ec5cf9d53a Signed-off-by: Gunnar Beutner <gunnar@beutner.name> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #372	2011-10-18 11:18:14 -07:00
Brian Behlendorf	86f35f34f4	Export symbols for the VFS API Export all symbols already marked extern in the zfs_vfsops.h header. Several non-static symbols have also been added to the header and exportewd. This allows external modules to more easily create and manipulate properly created ZFS filesystem type datasets. Rename zfsvfs_teardown() to zfs_sb_teardown and export it. This is done simply for consistency with the rest of the code base. All other zfsvfs_* functions have already been renamed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:25:59 -07:00
Brian Behlendorf	e45aa45298	Export symbols for the full SA API Export all the symbols for the system attribute (SA) API. This allows external module to cleanly manipulate the SAs associated with a dnode. Documention for the SA API can be found in the module/zfs/sa.c source. This change also removes the zfs_sa_uprade_pre, and zfs_sa_uprade_post prototypes. The functions themselves were dropped some time ago. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-05 15:59:56 -07:00
Brian Behlendorf	dee28b0700	Export symbols for the full ZAP API Export all the symbols for the ZAP API. This allows external modules to cleanly interface with ZAP type objects. Previously only a subset of the functionality was exposed. Documention for the ZAP API can be found in the sys/zap.h header. This change also removes a duplicate zap_increment_int() prototype. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-27 16:12:36 -07:00
Zachary Bedell	7a0232735d	Make libefi-created GPT compatible with gptfdisk GPT's created by libefi set the HeaderSize attribute in the GPT header to 512 -- size of the GPT header INCLUDING the 420 padding bytes at the end. Most other tools set the size to 92 -- size of the actual header itself excluding the padding. Most tools check the recorded HeaderSize when verifying CRC, but gptfdisk hardcodes 92 and thus reports CRC verification problems for full-disk vdevs created IE with `zpool create pool sdc`. This commit changes libefi's behavior for GPT creation and also fixes several edge cases where libefi's behavior was similar (though in an incompatible manner) to gptfdisk. Libefi assumed HeaderSize was always 512 even if the GPT recorded a different value. Sanity checks of the GPT headersize read from disk were added before applying checksum calculation -- this will prevent segfault in cases of bogus on-disk values. Zpools created with the resuling libefi are verified as correct both by parted and gptfdisk. Also pool have been tested to import correctly on ZFS on Linux as well as Solaris Express 11 livecd. Signed-off-by: Zachary Bedell <zac@thebedells.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #344	2011-09-26 09:44:43 -07:00
Brian Behlendorf	de0a1c099b	Autogen refresh for udev changes Run autogen.sh using the same autotools versions as upstream: * autoconf-2.63 * automake-1.11.1 * libtool-2.2.6b	2011-08-08 16:30:27 -07:00
Brian Behlendorf	76659dc110	Add backing_device_info per-filesystem For a long time now the kernel has been moving away from using the pdflush daemon to write 'old' dirty pages to disk. The primary reason for this is because the pdflush daemon is single threaded and can be a limiting factor for performance. Since pdflush sequentially walks the dirty inode list for each super block any delay in processing can slow down dirty page writeback for all filesystems. The replacement for pdflush is called bdi (backing device info). The bdi system involves creating a per-filesystem control structure each with its own private sets of queues to manage writeback. The advantage is greater parallelism which improves performance and prevents a single filesystem from slowing writeback to the others. For a long time both systems co-existed in the kernel so it wasn't strictly required to implement the bdi scheme. However, as of Linux 2.6.36 kernels the pdflush functionality has been retired. Since ZFS already bypasses the page cache for most I/O this is only an issue for mmap(2) writes which must go through the page cache. Even then adding this missing support for newer kernels was overlooked because there are other mechanisms which can trigger writeback. However, there is one critical case where not implementing the bdi functionality can cause problems. If an application handles a page fault it can enter the balance_dirty_pages() callpath. This will result in the application hanging until the number of dirty pages in the system drops below the dirty ratio. Without a registered backing_device_info for the filesystem the dirty pages will not get written out. Thus the application will hang. As mentioned above this was less of an issue with older kernels because pdflush would eventually write out the dirty pages. This change adds a backing_device_info structure to the zfs_sb_t which is already allocated per-super block. It is then registered when the filesystem mounted and unregistered on unmount. It will not be registered for mounted snapshots which are read-only. This change will result in flush-<pool> thread being dynamically created and destroyed per-mounted filesystem for writeback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #174	2011-08-04 13:37:38 -07:00
Brian Behlendorf	3c0e5c0f45	Cleanup mmap(2) writes While the existing implementation of .writepage()/zpl_putpage() was functional it was not entirely correct. In particular, it would move dirty pages in to a clean state simply after copying them in to the ARC cache. This would result in the pages being lost if the system were to crash enough though the Linux VFS believed them to be safe on stable storage. Since at the moment virtually all I/O, except mmap(2), bypasses the page cache this isn't as bad as it sounds. However, as hopefully start using the page cache more getting this right becomes more important so it's good to improve this now. This patch takes a big step in that direction by updating the code to correctly move dirty pages through a writeback phase before they are marked clean. When a dirty page is copied in to the ARC it will now be set in writeback and a completion callback is registered with the transaction. The page will stay in writeback until the dmu runs the completion callback indicating the page is on stable storage. At this point the page can be safely marked clean. This process is normally entirely asynchronous and will be repeated for every dirty page. This may initially sound inefficient but most of these pages will end up in a few txgs. That means when they are eventually written to disk they should be nicely batched. However, there is room for improvement. It may still be desirable to batch up the pages in to larger writes for the dmu. This would reduce the number of callbacks and small 4k buffer required by the ARC. Finally, if the caller requires that the I/O be done synchronously by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set. Then the I/O will trigger a zil_commit() to flush the data to stable storage. At which point the registered callbacks will be run leaving the date safe of disk and marked clean before returning from .writepage. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-02 10:34:55 -07:00
Alexander Stetsenko	0b7936d5c2	Illumos #278 : get rid zfs of python and pyzfs dependencies Remove all python and pyzfs dependencies for consistency and to ensure full functionality even in a mimimalist environment. Reviewed by: gordon.w.ross@gmail.com Reviewed by: trisk@opensolaris.org Reviewed by: alexander.r.eremin@gmail.com Reviewed by: jerry.jelinek@joyent.com Approved by: garrett@nexenta.com References to Illumos issue and patch: - https://www.illumos.org/issues/278 - https://github.com/illumos/illumos-gate/commit/1af68beac3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340 Issue #160 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-01 12:09:36 -07:00
Matt Ahrens	f5fc4acaa7	Illumos #1092 : zfs refratio property Add a "REFRATIO" property, which is the compression ratio based on data referenced. For snapshots, this is the same as COMPRESSRATIO, but for filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie, includes blocks in children, but not blocks shared with the origin). This is needed to figure out how much space a filesystem would use if it were not compressed (ignoring snapshots). Reviewed by: George Wilson <George.Wilson@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Mark Musante <Mark.Musante@oracle.com> Reviewed by: Garrett D'Amore <garrett@nexenta.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/1092 - https://github.com/illumos/illumos-gate/commit/187d6ac08a Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
George Wilson	6d974228ef	Illumos #1051 : zfs should handle imbalanced luns Today zfs tries to allocate blocks evenly across all devices. This means when devices are imbalanced zfs will use lots of CPU searching for space on devices which tend to be pretty full. It should instead fail quickly on the full LUNs and move onto devices which have more availability. Reviewed by: Eric Schrock <Eric.Schrock@delphix.com> Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/510 - https://github.com/illumos/illumos-gate/commit/5ead3ed965 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Kyle Fuller	615ab66d18	Provide a rc.d script for archlinux Unlike most other Linux distributions archlinux installs its init scripts in /etc/rc.d insead of /etc/init.d. This commit provides an archlinux rc.d script for zfs and extends the build infrastructure to ensure it get's installed in the correct place. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #322	2011-07-11 14:12:23 -07:00
Brian Behlendorf	057e8eee35	Improve fstat(2) performance There is at most a factor of 3x performance improvement to be had by using the Linux generic_fillattr() helper. However, to use it safely we need to ensure the values in a cached inode are kept rigerously up to date. Unfortunately, this isn't the case for the blksize, blocks, and atime fields. At the moment the authoritative values are still stored in the znode. This patch introduces an optimized zfs_getattr_fast() call. The idea is to use the up to date values from the inode and the blksize, block, and atime fields from the znode. At some latter date we should be able to strictly use the inode values and further improve performance. The remaining overhead in the zfs_getattr_fast() call can be attributed to having to take the znode mutex. This overhead is unavoidable until the inode is kept strictly up to date. The the careful reader will notice the we do not use the customary ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to ensure the filesystem is not torn down in the middle of an operation. However, in this case the VFS is holding a reference on the active inode so we know this is impossible. =================== Performance Tests ======================== This test calls the fstat(2) system call 10,000,000 times on an open file description in a tight loop. The test results show the zfs stat(2) performance is now only 22% slower than ext4. This is a 2.5x improvement and there is a clear long term plan to get to parity with ext4. filesystem \| test-1 test-2 test-3 \| average \| times-ext4 --------------+-------------------------+---------+----------- ext4 \| 7.785s 7.899s 7.284s \| 7.656s \| 1.000x zfs-0.6.0-rc4 \| 24.052s 22.531s 23.857s \| 23.480s \| 3.066x zfs-faststat \| 9.224s 9.398s 9.485s \| 9.369s \| 1.223x The second test is to run 'du' of a copy of the /usr tree which contains 110514 files. The test is run multiple times both using both a cold cache (/proc/sys/vm/drop_caches) and a hot cache. As expected this change signigicantly improved the zfs hot cache performance and doesn't quite bring zfs to parity with ext4. A little surprisingly the zfs cold cache performance is better than ext4. This can probably be attributed to the zfs allocation policy of co-locating all the meta data on disk which minimizes seek times. By default the ext4 allocator will spread the data over the entire disk only co-locating each directory. filesystem \| cold \| hot --------------+---------+-------- ext4 \| 13.318s \| 1.040s zfs-0.6.0-rc4 \| 4.982s \| 1.762s zfs-faststat \| 4.933s \| 1.345s	2011-07-11 09:11:22 -07:00
Brian Behlendorf	2cf7f52bc4	Linux compat 2.6.39: mount_nodev() The .get_sb callback has been replaced by a .mount callback in the file_system_type structure. When using the new interface the caller must now use the mount_nodev() helper. Unfortunately, the new interface no longer passes the vfsmount down to the zfs layers. This poses a problem for the existing implementation because we currently save this pointer in the super block for latter use. It provides our only entry point in to the namespace layer for manipulating certain mount options. This needed to be done originally to allow commands like 'zfs set atime=off tank' to work properly. It also allowed me to keep more of the original Solaris code unmodified. Under Solaris there is a 1-to-1 mapping between a mount point and a file system so this is a fairly natural thing to do. However, under Linux they many be multiple entries in the namespace which reference the same filesystem. Thus keeping a back reference from the filesystem to the namespace is complicated. Rather than introduce some ugly hack to get the vfsmount and continue as before. I'm leveraging this API change to update the ZFS code to do things in a more natural way for Linux. This has the upside that is resolves the compatibility issue for the long term and fixes several other minor bugs which have been reported. This commit updates the code to remove this vfsmount back reference entirely. All modifications to filesystem mount options are now passed in to the kernel via a '-o remount'. This is the expected Linux mechanism and allows the namespace to properly handle any options which apply to it before passing them on to the file system itself. Aside from fixing the compatibility issue, removing the vfsmount has had the benefit of simplifying the code. This change which fairly involved has turned out nicely. Closes #246 Closes #217 Closes #187 Closes #248 Closes #231	2011-07-01 13:36:39 -07:00
Brian Behlendorf	5c03efc379	Linux compat 2.6.39: security_inode_init_security() The security_inode_init_security() function now takes an additional qstr argument which must be passed in from the dentry if available. Passing a NULL is safe when no qstr is available the relevant security checks will just be skipped. Closes #246 Closes #217 Closes #187	2011-07-01 12:40:08 -07:00
Brian Behlendorf	e2e7aa2df8	Add ZFS specific mmap() checks Under Linux the VFS handles virtually all of the mmap() access checks. Filesystem specific checks are left to be handled in the .mmap() hook and normally there arn't any. However, ZFS provides a few attributes which can influence the mmap behavior and should be honored. Note, currently the code to modify these attributes has not been implemented under Linux. * ZFS_IMMUTABLE \| ZFS_READONLY \| ZFS_APPENDONLY: when any of these attributes are set a file may not be mmaped with write access. * ZFS_AV_QUARANTINED: when set a file file may not be mmaped with read or exec access. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-01 12:23:46 -07:00
Prasad Joshi	dde471ef5a	MMAP Optimization Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions. The functions have been modified to make them Linux friendly. ZFS uses these functions to read/write the mmapped pages. Using them from readpage/writepage results in clear code. The patch also adds readpages and writepages interface functions to read/write list of pages in one function call. The code change handles the first mmap optimization mentioned on https://github.com/behlendorf/zfs/issues/225 Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov> Issue #255	2011-07-01 12:22:52 -07:00
Prasad Joshi	b312979252	Tear down and flush the mmap region The inode eviction should unmap the pages associated with the inode. These pages should also be flushed to disk to avoid the data loss. Therefore, use truncate_setsize() in evict_inode() to release the pagecache. The API truncate_setsize() was added in 2.6.35 kernel. To ensure compatibility with the old kernel, the patch defines its own truncate_setsize function. Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Closes #255	2011-06-27 09:59:19 -07:00
Christian Kohlschütter	df30f56639	Add "ashift" property to zpool create Some disks with internal sectors larger than 512 bytes (e.g., 4k) can suffer from bad write performance when ashift is not configured correctly. This is caused by the disk not reporting its actual sector size, but a sector size of 512 bytes. The drive may behave this way for compatibility reasons. For example, the WDC WD20EARS disks are known to exhibit this behavior. When creating a zpool, ZFS takes that wrong sector size and sets the "ashift" property accordingly (to 9: 1<<9=512), whereas it should be set to 12 for 4k sectors (1<<12=4096). This patch allows an adminstrator to manual specify the known correct ashift size at 'zpool create' time. This can significantly improve performance in certain cases. However, it will have an impact on your total pool capacity. See the updated ashift property description in the zpool.8 man page for additional details. Valid values for the ashift property range from 9 to 17 (512B-128KB). Additionally, you may set the ashift to 0 if you wish to auto-detect the sector size based on what the disk reports, this is the default behavior. The most common ashift values are 9 and 12. Example: zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd Closes #280 Original-patch-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-06-17 16:35:49 -07:00
Brian Behlendorf	96801d2906	Linux 2.6.37 compat, WRITE_FLUSH_FUA The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been introduced as a replacement for WRITE_BARRIER. This was done to allow richer semantics to be expressed to the block layer. It is the block layers responsibility to choose the correct way to implement these semantics. This change simply updates the bio's to use the new kernel API which should be absolutely safe. However, since ZFS depends entirely on this working as designed for correctness we do want to be careful. Closes #281	2011-06-17 14:37:26 -07:00
Brian Behlendorf	2e08aedba4	Always check -Wno-unused-but-set-variable gcc support The previous commit `8a7e1ceefa` wasn't quite right. This check applies to both the user and kernel space build and as such we must make sure it runs regardless of what the --with-config option is set too. For example, if --with-config=kernel then the autoconf test does not run and we generate build warnings when compiling the kernel packages.	2011-06-14 16:40:35 -07:00
Brian Behlendorf	8a7e1ceefa	Check for -Wno-unused-but-set-variable gcc support Gcc versions 4.3.2 and earlier do not support the compiler flag -Wno-unused-but-set-variable. This can lead to build failures on older Linux platforms such as Debian Lenny. Since this is an optional build argument this changes add a new autoconf check for the option. If it is supported by the installed version of gcc then it is used otherwise it is omited. See commit's `12c1acde76` and `79713039a2` for the reason the -Wno-unused-but-set-variable options was originally added.	2011-06-14 14:43:22 -07:00
Brian Behlendorf	21ade34764	Disable direct reclaim for z_wr_* threads The direct reclaim path in the z_wr_* threads must be disabled to ensure forward progress is always maintained for txg processing. This ensures that a txg will never get stuck waiting on itself because it entered the following memory reclaim callpath. ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive() ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open() It would be preferable to target this exact code path but the kernel offers no way to do this without custom patches. To avoid this we are forced to disable all reclaim for these threads. It should not be necessary to do this for other other z_* threads because they will not hold a txg open. Closes #232	2011-05-06 15:26:26 -07:00
Brian Behlendorf	3117dd0b90	Handle NULL in nfsd .fsync() hook How nfsd handles .fsync() has been changed a couple of times in the recent kernels. But basically there are three cases we need to consider. Linux 2.6.12 - 2.6.33 * The .fsync() hook takes 3 arguments * The nfsd will call .fsync() with a NULL file struct pointer. Linux 2.6.34 * The .fsync() hook takes 3 arguments * The nfsd no longer calls .fsync() but instead used sync_inode() Linux 2.6.35 - 2.6.x * The .fsync() hook takes 2 arguments * The nfsd no longer calls .fsync() but instead used sync_inode() For once it looks like we've gotten lucky. The first two cases can actually be collased in to one if we stop using the file struct pointer entirely. Since the dentry is still passed in both cases this is possible. The last case can then be safely handled by unconditionally using the dentry in the file struct pointer now that we know the nfsd caller has been removed. Closes #230	2011-05-06 12:33:45 -07:00
Brian Behlendorf	c409e4647f	Add missing ZFS tunables This commit adds module options for all existing zfs tunables. Ideally the average user should never need to modify any of these values. However, in practice sometimes you do need to tweak these values for one reason or another. In those cases it's nice not to have to resort to rebuilding from source. All tunables are visable to modinfo and the list is as follows: $ modinfo module/zfs/zfs.ko filename: module/zfs/zfs.ko license: CDDL author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory description: ZFS srcversion: 8EAB1D71DACE05B5AA61567 depends: spl,znvpair,zcommon,zunicode,zavl vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions parm: zvol_major:Major number for zvol device (uint) parm: zvol_threads:Number of threads for zvol device (uint) parm: zio_injection_enabled:Enable fault injection (int) parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int) parm: zio_delay_max:Max zio millisec delay before posting event (int) parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool) parm: zil_replay_disable:Disable intent logging replay (int) parm: zfs_nocacheflush:Disable cache flushes (bool) parm: zfs_read_chunk_size:Bytes to read per chunk (long) parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int) parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int) parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int) parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int) parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int) parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int) parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int) parm: zfs_vdev_scheduler:I/O scheduler (charp) parm: zfs_vdev_cache_max:Inflate reads small than max (int) parm: zfs_vdev_cache_size:Total size of the per-disk cache (int) parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int) parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int) parm: zfs_recover:Set to attempt to recover from fatal errors (int) parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp) parm: zfs_zevent_len_max:Max event queue length (int) parm: zfs_zevent_cols:Max event column width (int) parm: zfs_zevent_console:Log events to the console (int) parm: zfs_top_maxinflight:Max I/Os per top-level (int) parm: zfs_resilver_delay:Number of ticks to delay resilver (int) parm: zfs_scrub_delay:Number of ticks to delay scrub (int) parm: zfs_scan_idle:Idle window in clock ticks (int) parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int) parm: zfs_free_min_time_ms:Min millisecs to free per txg (int) parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int) parm: zfs_no_scrub_io:Set to disable scrub I/O (bool) parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool) parm: zfs_txg_timeout:Max seconds worth of delta per txg (int) parm: zfs_no_write_throttle:Disable write throttling (int) parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int) parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int) parm: zfs_write_limit_min:Min tgx write limit (ulong) parm: zfs_write_limit_max:Max tgx write limit (ulong) parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong) parm: zfs_write_limit_override:Override tgx write limit (ulong) parm: zfs_prefetch_disable:Disable all ZFS prefetching (int) parm: zfetch_max_streams:Max number of streams per zfetch (uint) parm: zfetch_min_sec_reap:Min time before stream reclaim (uint) parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint) parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong) parm: zfs_pd_blks_max:Max number of blocks to prefetch (int) parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int) parm: zfs_arc_min:Min arc size (ulong) parm: zfs_arc_max:Max arc size (ulong) parm: zfs_arc_meta_limit:Meta limit for arc size (ulong) parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int) parm: zfs_arc_grow_retry:Seconds before growing arc size (int) parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int) parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)	2011-05-04 10:02:37 -07:00
Brian Behlendorf	df554c148e	Fix 'zfs set volsize=N pool/dataset' This change fixes a kernel panic which would occur when resizing a dataset which was not open. The objset_t stored in the zvol_state_t will be set to NULL when the block device is closed. To avoid this issue we pass the correct objset_t as the third arg. The code has also been updated to correctly notify the kernel when the block device capacity changes. For 2.6.28 and newer kernels the capacity change will be immediately detected. For earlier kernels the capacity change will be detected when the device is next opened. This is a known limitation of older kernels. Online ext3 resize test case passes on 2.6.28+ kernels: $ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023 $ zpool create tank /tmp/zvol $ zfs create -V 500M tank/zd0 $ mkfs.ext3 /dev/zd0 $ mkdir /mnt/zd0 $ mount /dev/zd0 /mnt/zd0 $ df -h /mnt/zd0 $ zfs set volsize=800M tank/zd0 $ resize2fs /dev/zd0 $ df -h /mnt/zd0 Original-patch-by: Fajar A. Nugraha <github@fajar.net> Closes #68 Closes #84	2011-05-02 08:54:40 -07:00
Gunnar Beutner	055656d4f4	Implemented NFS export_operations. Implemented the required NFS operations for exporting ZFS datasets using the in-kernel NFS daemon.	2011-04-29 12:36:13 -07:00
Brian Behlendorf	e88b041ed6	Fix libzpool cv_* build error This build failure was accidentally introduced by previous commit `bfd214a` which fixed the load average. Unfortunately, the wrapper for cv_wait_interruptible was not available in the zfs_context.h user compatibility code. I failed to notice this because I didn't rebuild everything cleanly before committing. undefined reference to `cv_wait_interruptible' collect2: ld returned 1 exit status Closes #181	2011-03-31 12:20:53 -07:00
Fajar A. Nugraha	a5729f7b22	Fixes to enable zvol symlink creation This commit fixes issue on https://github.com/behlendorf/zfs/issues/#issue/172 Changes: - update BLKZNAME to use _IOR instead of _IO. Kernel 2.6.32 allows read parameters (copy_to_user) with _IO, while newer kernels (tested Archlinux's 2.6.37 kernel) enforces _IOR (which is correct) - fix return code and message on error Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-03-24 11:48:18 -07:00
Brian Behlendorf	bdf4328b04	Linux 2.6.28 compat, insert_inode_locked() Added insert_inode_locked() helper function, prior to this most callers used insert_inode_hash(). The older method doesn't check for collisions in the inode_hashtable but it still acceptible for use. Fallback to using insert_inode_hash() when insert_inode_locked() is unavailable.	2011-03-22 12:15:54 -07:00
Brian Behlendorf	3517f0b7e9	Linux 2.6.27 compat, blk_queue_stackable() The blk_queue_stackable() queue flag was added in 2.6.27 to handle dm stacking drivers. Prior to this request stacking drivers were detected by checking (q->request_fn == NULL), for earlier kernels we revert to this legacy behavior.	2011-03-22 12:15:54 -07:00
Brian Behlendorf	d6bd8eaae4	Fix evict() deadlock Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility of synchronous reclaim deadlocks. These deadlocks never existed in the original OpenSolaris code because all memory reclaim on Solaris is done asyncronously. Linux does both synchronous (direct) and asynchronous (indirect) reclaim. This commit addresses a deadlock caused by inode eviction. A KM_SLEEP allocation may trigger direct memory reclaim and shrink the inode cache. This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()-> zfs_zinactive() call path the same mutex may be reacquired resulting in a deadlock. To avoid this deadlock the process must not reacquire the mutex when it is already holding it. This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD mutex locking should be reevaluated. This infrastructure already prevents us from ever using the Linux lock dependency analysis tools, and it may limit scalability.	2011-03-22 12:14:55 -07:00
Brian Behlendorf	01c0e61da0	Add init scripts To support automatically mounting your zfs on filesystem on boot a basic init script is needed. Unfortunately, every distribution has their own idea of the _right_ way to do things. Rather than write one very complicated portable init script, which would be invariably replaced by the distributions own anyway. I have instead added support to provide multiple distribution specific init scripts. The correct init script for your distribution will be selected by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT. During 'make install' the correct script for your system will be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the usual /etc/init.d/zfs location. Currently, there is zfs.fedora and a more generic zfs.lsb init script. Hopefully, the distribution maintainers who know best how they want their init scripts to function will feedback their approved versions to be included in the project. This change does not consider upstart jobs but I'm not at all opposed to add that sort of thing.	2011-03-17 16:51:54 -07:00
Brian Behlendorf	0de19dad9c	Register .remount_fs handler Register the missing .remount_fs handler. This handler isn't strictly required because the VFS does a pretty good job updating most of the MS_* flags. However, there's no harm in using the hook to call the registered zpl callback for various MS_* flags. Additionaly, this allows us to lay the ground work for more complicated argument parsing in the future.	2011-03-15 13:33:29 -07:00
Brian Behlendorf	03f9ba9d99	Register .sync_fs handler Register the missing .sync_fs handler. This is a noop in most cases because the usual requirement is that sync just be initiated. As part of the DMU's normal transaction processing txgs will be frequently synced. However, when the 'wait' flag is set the requirement is that .sync_fs must not return until the data is safe on disk. With the addition of the .sync_fs handler this is now properly implemented.	2011-03-15 13:33:29 -07:00
Brian Behlendorf	9ac97c2a93	Print mount/umount errors Because we are dependent of the system mount/umount utilities to ensure correct mtab locking, we should not suppress their error output. During a successful mount/umount they will be silent, but during a failure the error message they print is the only sure way to know why a mount failed. This is because the (u)mount(8) return code does not contain the result of the system call issued. The only way to clearly idenify why thing failed is to rely on the error message printed by the tool. Longer term once libmount is available we can issue the mount/umount system calls within the tool and still be ensured correct mtab locking. Closed #107	2011-03-09 15:26:48 -08:00
Brian Behlendorf	450dc149bd	Range lock performance improvements The original range lock implementation had to be modified by commit `8926ab7` because it was unsafe on Linux. In particular, calling cv_destroy() immediately after cv_broadcast() is dangerous because the waiters may still be asleep. Thus the following cv_destroy() will free memory which may still be in use. This was fixed by updating cv_destroy() to block on waiters but this in turn introduced a deadlock. The deadlock was resolved with the use of a taskq to move the offending free outside the range lock. This worked well but using the taskq for the free resulted in a serious performace hit. This is somewhat ironic because at the time I felt using the taskq might improve things by making the free asynchronous. This patch refines the original fix and moves the free from the taskq to a private free list. Then items which must be free'd are simply inserted in to the list. When the range lock is dropped it's safe to free the items. The list is walked and all rl_t entries are freed. This change improves small cached read performance by 26x. This was expected because for small reads the number of locking calls goes up significantly. More surprisingly this change significantly improves large cache read performance. This probably attributable to better cpu/memory locality. Very likely the same processor which allocated the memory is now freeing it. bs ext3 zfs zfs+fix faster ---------------------------------------------- 512 435 3 79 26x 1k 820 7 160 22x 2k 1536 14 305 21x 4k 2764 28 572 20x 8k 3788 50 1024 20x 16k 4300 86 1843 21x 32k 4505 138 2560 18x 64k 5324 252 3891 15x 128k 5427 276 4710 17x 256k 5427 413 5017 12x 512k 5427 497 5324 10x 1m 5427 521 5632 10x Closes #142	2011-03-08 12:44:06 -08:00
Brian Behlendorf	126400a1ca	Add zfs_open()/zfs_close() In the original implementation the zfs_open()/zfs_close() hooks were dropped for simplicity. This was functional but not 100% correct with the expected ZFS sematics. Updating and re-adding the zfs_open()/zfs_close() hooks resolves the following issues. 1) The ZFS_APPENDONLY file attribute is once again honored. While there are still no Linux tools to set/clear these attributes once there are it should behave correctly. 2) Minimal virus scan file attribute hooks were added. Once again this support in disabled but the infrastructure is back in place. 3) Most importantly correctly handle assigning files which were opened syncronously to the intent log. Without this change O_SYNC modifications could be lost during a system crash even though they were marked synchronous.	2011-03-08 11:04:51 -08:00
Brian Behlendorf	5484965ab6	Drop HAVE_XVATTR macros When I began work on the Posix layer it immediately became clear to me that to integrate cleanly with the Linux VFS certain Solaris specific things would have to go. One of these things was to elimate as many Solaris specific types from the ZPL layer as possible. They would be replaced with their Linux equivalents. This would not only be good for performance, but for the general readability and health of the code. The Solaris and Linux VFS are different beasts and should be treated as such. Most of the code remains common for constructing transactions and such, but there are subtle and important differenced which need to be repsected. This policy went quite for for certain types such as the vnode_t, and it initially seemed to be working out well for the vattr_t. There was a relatively small amount of related xvattr_t code I was forced to comment out with HAVE_XVATTR. But it didn't look that hard to come back soon and replace it all with a native Linux type. However, after going doing this path with xvattr some distance it clear that this code was woven in the ZPL more deeply than I thought. In particular its hooks went very deep in to the ZPL replay code and replacing it would not be as easy as I originally thought. Rather than continue persuing replacing and removing this code I've taken a step back and reevaluted things. This commit reverts many of my previous commits which removed xvattr related code. It restores much of the code to its original upstream state and now relies on improved xvattr_t support in the zfs package itself. The result of this is that much of the code which I had commented out, which accidentally broke things like replay, is now back in place and working. However, there may be a small performance impact for getattr/setattr operations because they now require a translation from native Linux to Solaris types. For now that's a price I'm willing to pay. Once everything is completely functional we can revisting the issue of removing the vattr_t/xvattr_t types. Closes #111	2011-03-02 11:44:34 -08:00
Brian Behlendorf	321a498b95	Add xvattr support With the removal of the minimal xvattr support from the spl this support needs to be replaced in the zfs package. This is fairly easily accomplished by directly adding portions of the sys/vnode.h header from OpenSolaris. These xvattr additions have been placed in the sys/xvattr.h header file and included as needed where simply a sys/vnode.h was included before. In additon to the xvattr types and helper macros two functions were also included. The xva_init() and xva_getxoptattr() functions were included as static inline functions in xvattr.h. They are simple enough and it was simpler to place them here rather than in their own .c file.	2011-03-02 11:43:50 -08:00
Fajar A. Nugraha	4c0d8e50b9	Use udev to create /dev/zvol/[dataset_name] links This commit allows zvols with names longer than 32 characters, which fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102. Changes include: - use /dev/zd* device names for zvol, where * is the device minor (include/sys/fs/zfs.h, module/zfs/zvol.c). - add BLKZNAME ioctl to get dataset name from userland (include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id). - add udev rule to create /dev/zvol/[dataset_name] and the legacy /dev/[dataset_name] symlink. For partitions on zvol, it will create /dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules, cmd/zvol_id). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-02-25 09:43:19 -08:00
Darik Horn	61da501f9d	Add the new blkdev_compat.h header to the DIST target. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-02-24 09:40:06 -08:00
Brian Behlendorf	45066d1f20	Linux 2.6.38 compat, blkdev_get_by_path() The open_bdev_exclusive() function has been replaced (again) by the more generic blkdev_get_by_path() function. Additionally, the counterpart function close_bdev_exclusive() has been replaced by blkdev_put(). Because these functions are more generic versions of the functions they replaced the compatibility macro must add the FMODE_EXCL mask to ensure they are exclusive. Closes #114	2011-02-23 12:29:38 -08:00
Brian Behlendorf	61e909608d	Linux 2.6.x compat, blkdev_compat.h For legacy reasons the zvol.c and vdev_disk.c Linux compatibility code ended up in sys/blkdev.h and sys/vdev_disk.h headers. While there are worse places for this code to live it should be in a linux/blkdev_compat.h header. This change moves this block device Linux compatibility code in to the linux/blkdev_compat.h header and updates all the correct #include locations. This is not a functional change or bug fix, it is just code cleanup.	2011-02-23 12:29:38 -08:00
Brian Behlendorf	5d0265c0dd	Merge branch 'zpl'	2011-02-18 09:31:25 -08:00
Ricardo M. Correia	54a179e7b8	Add API to wait for pending commit callbacks This adds an API to wait for pending commit callbacks of already-synced transactions to finish processing. This is needed by the DMU-OSD in Lustre during device finalization when some callbacks may still not be called, this leads to non-zero reference count errors. See lustre.org bug 23931.	2011-02-16 11:20:06 -08:00
Brian Behlendorf	2c395def27	Linux 2.6.36 compat, sops->evict_inode() The new prefered inteface for evicting an inode from the inode cache is the ->evict_inode() callback. It replaces both the ->delete_inode() and ->clear_inode() callbacks which were previously used for this.	2011-02-11 13:47:51 -08:00

1 2

81 Commits