OpenZFS on Linux and FreeBSD
Go to file
Richard Yao 0b43696e66 random_get_pseudo_bytes() need not provide cryptographic strength entropy
Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:

http://xorshift.di.unimi.it/#speed

The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.

Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.

We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:

1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.

2.  Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.

3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.

4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.

5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.

6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPU == 1`).  In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #372
2016-02-17 09:49:09 -08:00
cmd Add a script to display SPL slab cache statistics 2015-12-02 14:08:08 -08:00
config Skip GPL-only symbols test when cross-compiling 2015-12-14 10:40:33 -08:00
include random_get_pseudo_bytes() need not provide cryptographic strength entropy 2016-02-17 09:49:09 -08:00
lib Remove autotools products 2012-08-27 11:46:23 -07:00
man Allow kicking a taskq to spawn more threads 2016-02-05 14:08:31 -08:00
module random_get_pseudo_bytes() need not provide cryptographic strength entropy 2016-02-17 09:49:09 -08:00
rpm Create spl-kmod-debuginfo rpm with redhat spec file 2016-01-21 11:22:02 -08:00
scripts Support parallel build trees (VPATH builds) 2015-07-17 12:53:11 -07:00
.gitignore Ignore *.{deb,rpm,tar.gz} files in the top directory. 2013-04-24 16:18:14 -07:00
AUTHORS Refresh AUTHORS 2012-12-19 09:40:18 -08:00
COPYING Public Release Prep 2010-05-17 15:18:00 -07:00
DISCLAIMER Public Release Prep 2010-05-17 15:18:00 -07:00
META Tag spl-0.6.5 2015-09-10 12:33:51 -07:00
Makefile.am Support parallel build trees (VPATH builds) 2015-07-17 12:53:11 -07:00
README.markdown Document how to run SPLAT 2013-10-09 13:52:59 -07:00
autogen.sh build: do not call boilerplate ourself 2013-04-02 11:08:46 -07:00
configure.ac Add a script to display SPL slab cache statistics 2015-12-02 14:08:08 -08:00
copy-builtin Ensure spl/ only occurs once in core-y 2016-01-26 11:54:24 -08:00
spl.release.in Move spl.release generation to configure step 2012-07-12 12:13:47 -07:00

README.markdown

The Solaris Porting Layer (SPL) is a Linux kernel module which provides many of the Solaris kernel APIs. This shim layer makes it possible to run Solaris kernel code in the Linux kernel with relatively minimal modification. This can be particularly useful when you want to track upstream Solaris development closely and do not want the overhead of maintaining a large patch which converts Solaris primitives to Linux primitives.

To build packages for your distribution:

$ ./configure
$ make pkg

If you are building directly from the git tree and not an officially released tarball you will need to generate the configure script. This can be done by executing the autogen.sh script after installing the GNU autotools for your distribution.

To copy the kernel code inside your kernel source tree for builtin compilation:

$ ./configure --enable-linux-builtin --with-linux=/usr/src/linux-...
$ ./copy-builtin /usr/src/linux-...

The SPL comes with an automated test suite called SPLAT. The test suite is implemented in two parts. There is a kernel module which contains the tests and a user space utility which controls which tests are run. To run the full test suite:

$ sudo insmod ./module/splat/splat.ko
$ sudo ./cmd/splat --all

Full documentation for building, configuring, testing, and using the SPL can be found at: http://zfsonlinux.org