It's possible for a hardware failure to occur in such a way that the ZIL
block writes appear to succeed, but the flush fails.
Because flush errors were being ignored, the lwb chain would finish with
a zero error code, which would result in zil_commit() returning and thus
fsync() returning success to the caller, even though the data was not
recorded in the ZIL.
If the ZIL is on the main pool (no SLOG device), the pool would typically
suspend around the same time. If that happened before the txg committed,
then those writes would be lost entirely: neither on the pool nor in the
ZIL.
zil_lwb_flush_vdevs_done() has the necessary code to deal with this
situation, but zio_flush() would never return failure, so that code never
saw the error. This change allows flushes to report failure, so we no
longer miss a failed ZIL write.
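
For illustration only, a minimal standalone model of the failure mode.
This is not the ZFS code: struct model_lwb and the model_* functions are
hypothetical names, and the report_flush_errors switch is an assumption
used to contrast the old and new behaviour in one place.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one log write block; not the real lwb_t. */
struct model_lwb {
	int write_error;	/* result of the block write */
	int flush_error;	/* result of the cache flush that follows */
};

/*
 * Completion of one block. A write error is always reported. A flush
 * error is reported only if report_flush_errors is true; passing false
 * models the old behaviour, where the flush result was dropped.
 */
int
model_flush_done(const struct model_lwb *lwb, bool report_flush_errors)
{
	assert(lwb != NULL);
	if (lwb->write_error != 0)
		return (lwb->write_error);
	return (report_flush_errors ? lwb->flush_error : 0);
}

/*
 * Walk the chain and return the first error seen. The return value
 * models what the fsync() caller would be told.
 */
int
model_commit(const struct model_lwb *chain, size_t n,
    bool report_flush_errors)
{
	for (size_t i = 0; i < n; i++) {
		int err = model_flush_done(&chain[i], report_flush_errors);
		if (err != 0)
			return (err);
	}
	return (0);
}
```

With a chain whose last block has write_error == 0 and flush_error ==
EIO, model_commit() returns 0 when flush errors are dropped and EIO when
they are reported.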
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit d9db5dccc56b551d0bf66bc9022b6c19a659b7e1)