Update dRAID pool creation syntax.

Brian Behlendorf 2020-03-04 15:48:56 -08:00
parent 1fb5f10f2b
commit 01d0da995b
1 changed file with 39 additions and 97 deletions

@@ -24,127 +24,72 @@ The dRAID vdev must shuffle its child drives in a way that regardless of which d
Parity declustering (the fancy term for shuffling drives) has been an active research topic, and many papers have been published in this area. The [Permutation Development Data Layout](http://www.cse.scu.edu/~tschwarz/TechReports/hpca.pdf) is a good paper to begin with. The dRAID vdev driver uses a shuffling algorithm loosely based on the mechanism described in this paper.
# Using dRAID
First get the code [here](https://github.com/openzfs/zfs/pull/10102), build zfs with _configure --enable-debug_, and install. Then load the zfs kernel module with the following options, which help dRAID rebuild performance:
* zfs_vdev_scrub_max_active=10
* zfs_vdev_async_write_min_active=4
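For example (a sketch only; the `zfs-draid.conf` file name is arbitrary and not part of the PR), the options can be passed at module load time, or persisted so they apply on every boot:
```
# modprobe zfs zfs_vdev_scrub_max_active=10 zfs_vdev_async_write_min_active=4
# echo "options zfs zfs_vdev_scrub_max_active=10 zfs_vdev_async_write_min_active=4" > /etc/modprobe.d/zfs-draid.conf
```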
## Create a dRAID vdev
Similar to a raidz vdev, a dRAID vdev can be created using the `zpool create` command:
```
# zpool create <pool> draid[1,2,3] <vdevs...>
```
Unlike raidz, additional options may be provided as part of the `draid` vdev type to specify an exact dRAID layout. When unspecified, reasonable defaults will be chosen:
```
# zpool create <pool> draid[1,2,3][:<groups>g][:<spares>s][:<data>d][:<iterations>] <vdevs...>
```
* groups - Number of redundancy groups (default: 1 group per 12 vdevs)
* spares - Number of distributed hot spares (default: 1)
* data - Number of data devices per group (default: determined by the number of groups)
* iterations - Number of iterations to perform when generating a valid dRAID mapping (default: 3)
_Notes_:
* The default values are not set in stone and may change.
* For the majority of common configurations we intend to provide pre-computed balanced dRAID mappings.
* When _data_ is specified, then (draid_children - spares) % (parity + data) == 0 must hold, otherwise pool creation will fail (see the example below).
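As a sketch (pool and drive names are illustrative, and option handling may differ in the PR branch), an explicit layout for the 17-drive dRAID1 example used later on this page, three (4 data + 1 parity) redundancy groups plus 2 distributed spares, satisfies the constraint above since (17 - 2) % (1 + 4) == 0:
```
# zpool create tank draid1:3g:2s:4d sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt
```
Omitting the colon-separated options (`zpool create tank draid1 ...` with the same 17 drives) would instead fall back to the defaults listed above.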
Now the dRAID vdev is online and ready for IO:
```
# zpool status
  pool: tank
 state: ONLINE
config:

        NAME                   STATE     READ WRITE CKSUM
        tank                   ONLINE       0     0     0
          draid2:4g:2s-0       ONLINE       0     0     0
            L0                 ONLINE       0     0     0
            L1                 ONLINE       0     0     0
            L2                 ONLINE       0     0     0
            L3                 ONLINE       0     0     0
            ...
            L50                ONLINE       0     0     0
            L51                ONLINE       0     0     0
            L52                ONLINE       0     0     0
        spares
          s0-draid2:4g:2s-0    AVAIL
          s1-draid2:4g:2s-0    AVAIL

errors: No known data errors
```
There are two logical hot spare vdevs shown above at the bottom:
* The names begin with `s<id>-` followed by the name of the parent dRAID vdev.
* These hot spares are logical, made from reserved blocks on all the 53 child drives of the dRAID vdev.
* Unlike traditional hot spares, a distributed spare can only replace a drive in its parent dRAID vdev (an example follows below).
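For illustration only (the exact command and any rebuild-specific flags are covered in the rebuild section below), attaching a distributed spare in place of a failed child would use the familiar `zpool replace` form with the spare's logical name:
```
# zpool replace tank L5 s0-draid2:4g:2s-0
```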
The dRAID vdev behaves just like a raidz vdev of the same parity level: you can do IO to/from it, scrub it, or fail a child drive, and it will operate in degraded mode.
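For example (device names are illustrative), a child drive can be taken offline temporarily to exercise degraded operation, and the pool scrubbed as usual:
```
# zpool offline -t tank L3
# zpool status tank
# zpool scrub tank
# zpool online tank L3
```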
## Rebuild to distributed spare
When there's a failed/offline child drive, the dRAID vdev supports a completely new mechanism to reconstruct lost data/parity, in addition to resilver. First of all, resilver is still supported: if a failed drive is replaced by another physical drive, the resilver process is used to reconstruct lost data/parity to the new replacement drive, which is the same as a resilver in a raidz vdev.
But if a child drive is replaced with a distributed spare, a new process called rebuild is used instead of resilver: But if a child drive is replaced with a distributed spare, a new process called rebuild is used instead of resilver:
```
@@ -341,7 +286,4 @@ The dRAID1 vdev in this example shuffles three (4 data + 1 parity) redundancy gr
# Troubleshooting
Please report bugs to [the dRAID PR](https://github.com/zfsonlinux/zfs/pull/10102) until the code is merged upstream.