Commit Graph

24 Commits

Author SHA1 Message Date
Brian Behlendorf 6b95031f56
zed: Add deadman-slot_off.sh zedlet
Optionally turn off disk's enclosure slot if an I/O is hung
triggering the deadman.

It's possible for outstanding I/O to a misbehaving SCSI disk to
neither promptly complete or return an error.  This can occur due
to retry and recovery actions taken by the SCSI layer, driver, or
disk.  When it occurs the pool will be unresponsive even though
there may be sufficient redundancy configured to proceeded without
this single disk.

When a hung I/O is detected by the kmods it will be posted as a
deadman event.  By default an I/O is considered to be hung after
5 minutes.  This value can be changed with the zfs_deadman_ziotime_ms
module parameter.  If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set
the disk's enclosure slot will be powered off causing the outstanding
I/O to fail.  The ZED will then handle this like a normal disk failure.
By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set.

As part of this change `zfs_deadman_events_per_second` is added
to control the ratelimitting of deadman events independantly of
delay events.  In practice, a single deadman event is sufficient
and more aren't particularly useful.

Alphabetize the zfs_deadman_* entries in zfs.4.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16226
2024-05-29 10:46:41 -07:00
gofaster a382e21194
Add Gotify notification support to ZED
This commit adds the zed_notify_gotify() function and hooks it
into zed_notify(). This will allow ZED to send notifications
to a self-hosted Gotify service, which can be received
on a desktop or mobile device. It is configured with ZED_GOTIFY_URL,
ZED_GOTIFY_APPTOKEN and ZED_GOTIFY_PRIORITY variables in zed.rc.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gofaster <felix.gofaster@gmail.com>
Closes #15693
2024-01-09 09:49:30 -08:00
Mauricio Faria de Oliveira 3c7650491b
zed: fix typo in variable ZED_POWER_OFF_ENCLO*US*RE_SLOT_ON_FAULT
Replace ENCLO_US_RE with ENCLO_SU_RE in the name of the variable.

Note this changes the user-visible string in zed.rc, thus might
break current users with the wrong string, but it's ~2 months
since zfs-2.2.0 tag is out, thus should not be widespread yet.

Mechanical change:

    $ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT
    cmd/zed/zed.d/zed.rc
    cmd/zed/zed.d/statechange-slot_off.sh

    $ sed -i 's/ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT/<linebreak>
                ZED_POWER_OFF_ENCLOSURE_SLOT_ON_FAULT/g' \
      cmd/zed/zed.d/zed.rc \
      cmd/zed/zed.d/statechange-slot_off.sh

    $ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT
    $

Fixes 11fbcacf37
("zed: Add zedlet to power off slot when drive is faulted")

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes #15651
2023-12-08 16:32:35 -08:00
Dex Wood 014265f4e6
Add Ntfy notification support to ZED
This commit adds the zed_notify_ntfy() function and hooks it
into zed_notify(). This will allow ZED to send notifications
to ntfy.sh or a self-hosted Ntfy service, which can be received
on a desktop or mobile device. It is configured with ZED_NTFY_TOPIC,
ZED_NTFY_URL, and ZED_NTFY_ACCESS_TOKEN variables in zed.rc.

Reviewed-by: @classabbyamp 
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dex Wood <slash2314@gmail.com>
Closes #15584
2023-12-01 15:25:17 -08:00
Tony Hutter 11fbcacf37
zed: Add zedlet to power off slot when drive is faulted
If ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT is enabled in zed.rc, then
power off the drive's slot in the enclosure if it becomes FAULTED.
This can help silence misbehaving drives.  This assumes your drive
enclosure fully supports slot power control via sysfs.

Reviewed-by: @AllKind
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15200
2023-08-24 11:59:03 -07:00
heeplr 08b32c6fa9
zed: support subject as header in zed_notify_email()
Some minimal MUAs don't support passing the subjects as cmdline option.
This commit checks if "@SUBJECT@" is missing in ZED_EMAIL_OPTS and then
prepends a subject header to the notification message.
Also set a default for ${subject}.

Reviewed-by: Ahelenia Ziemia<C5><84>ska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Daniel Hiepler <d-git@coderdu.de>
Closes #13440
2022-05-18 10:27:53 -07:00
наб ed715283de
cmd: zed: rc: drop "should be owned by root and 0600"
It doesn't matter, 0600 are Weird Permissions, and it's even weirder to
spec them for no reason ‒ it's perfectly fine if it's the usual 0:0 644,
or literally anything else, so long as unprivileged users can't edit it
(which (a) 644 accomplishes and (b) is at the administrator's
 discretion, it's not unheard of to have adm users and having it
 be 664 in that case is just as good; it's not our place to say)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12544
Closes #13276
2022-04-05 13:50:55 -07:00
Damian Szuberski c1d3be19d7
Add ShellCheck's `--enable=all` inside `cmd/`
The only exception is `cmd/vdev_id/vdev_id` which might be a subject of
refactoring (see #12084)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #12912
2022-01-06 16:07:54 -08:00
shodanshok bfb3b0d77a
zed: send notification email by default
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #12806
2021-12-21 16:24:05 -08:00
Tony Hutter ae70d628ff
zed: Control NVMe fault LEDs
The ZED code currently can only turn on the fault LED for
a faulted disk in a JBOD enclosure.  This extends support
for faulted NVMe disks as well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #12648
Closes #12695
2021-11-09 16:50:18 -08:00
Scott Colby da124ad8ec
zed: Add Pushover notifier
Add zed_notify_pushover to zed-functions.sh, along with the necessary
configuration variables in zed.rc.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Scott Colby <scott@scolby.com>
Closes #12012
2021-05-13 10:02:24 -07:00
Don Brady 13d65987a9
zed syslog entries drop important info
ZED will log zevents summaries to the syslog, however the log entries 
tend to drop event details that can be useful for diagnosis. This is 
especially true for ereport events, like io, checksum, and delay.

Update the all-syslog.sh script to log additional event information.

Add an optional config option, ZED_SYSLOG_DISPLAY_GUIDS, to zed.rc
for choosing GUIDs over names for pool and vdev.

Change the default ZED_SYSLOG_SUBCLASS_EXCLUDE to exclude history_event 
events. These events tend to be frequent, convey no meaningful info, 
and are already logged in the zpool history.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #10967
2020-10-19 11:01:00 -07:00
Ben McGough 3768db24ab Adding slack notifier
Allow ZED notification via slack incoming webhook.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Ben McGough <bmcgough@fredhutch.org>
Closes #9076
Closes #9350
2019-09-26 09:52:10 -07:00
kpande 98310e5d1a Update commented zed.rc values to defaults
Update zed.rc values reflect their default value.  This helps
avoid confusion if a user expects functionality to be enabled.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #8498
2019-03-14 09:53:34 -07:00
Tony Hutter 639b18944a Allow to limit zed's syslog chattiness
Some usage patterns like send/recv of replication streams can
produce a large number of events. In such a case, the current
all-syslog.sh zedlet will hold up to its name, and flood the
logs with mostly redundant information. Two mitigate this
situation, this changeset introduces to new variables
ZED_SYSLOG_SUBCLASS_INCLUDE and ZED_SYSLOG_SUBCLASS_EXCLUDE
to zed.rc that give more control over which event classes end
up in the syslog.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes #6886 
Closes #7260
2018-03-06 15:41:52 -08:00
Tony Hutter bf95a000c4 Add scrub after resilver zed script
* Add a zed script to kick off a scrub after a resilver.  The script is
disabled by default.

* Add a optional $PATH (-P) option to zed to allow it to use a custom
$PATH for its zedlets.  This is needed when you're running zed under
the ZTS in a local workspace.

* Update test scripts to not copy in all-debug.sh and all-syslog.sh by
default.  They can be optionally copied in as part of zed_setup().
These scripts slow down zed considerably under heavy events loads and
can cause events to be dropped or their delivery delayed. This was
causing some sporadic failures in the 'fault' tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4662 
Closes #7086
2018-02-23 11:38:05 -08:00
Don Brady 0df15db98f Add a statechange notify zedlet
Now that ZED has internal fault diagnosis and the statechange event
is generated for faulted states, we can replace the io-notify and
checksum-notify zedlets with one based on statechange.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #5383
2016-11-10 13:52:59 -08:00
Tony Hutter 6078881aa1 Multipath autoreplace, control enclosure LEDs, event rate limiting
1. Enable multipath autoreplace support for FMA.

This extends FMA autoreplace to work with multipath disks.  This
requires libdevmapper to be installed at build time.

2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online

Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED.  Your enclosure must
be supported by the Linux SES driver for this to work.  The enclosure LED
scripts work for multipath devices as well.  The scripts will clear the LED
when the fault is cleared.

3. Rate limit ZIO delay and checksum events so as not to flood ZED

ZIO delay and checksum events are rate limited to 5/sec in the zfs module.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2449 
Closes #3017 
Closes #5159
2016-10-19 12:55:59 -07:00
Chris Dunlap 69520d6855 Rework zed_notify_email for configurable PROG/OPTS
This commit reworks the zed_notify_email() function to allow
configuration of the mail executable and command-line arguments.

ZED_EMAIL_PROG specifies the name or path of the executable responsible
for sending notifications via email.  This variable defaults to "mail".

ZED_EMAIL_OPTS specifies command-line options passed to ZED_EMAIL_PROG.
The following keyword substitutions are performed:
- @ADDRESS@ is replaced with the recipient email address(es)
- @SUBJECT@ is replaced with the notification subject
This variable defaults to "-s '@SUBJECT@' @ADDRESS@".

ZED_EMAIL_ADDR replaces ZED_EMAIL (although the latter is retained
for backward compatibility).  This variable can contain multiple
addresses as long as they are delimited by whitespace.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3634
Closes #3631
2015-07-30 11:52:56 -07:00
Chris Dunlap a0d065fa92 Add support for Pushbullet notifications
This commit adds the zed_notify_pushbullet() function and hooks
it into zed_notify(), thereby integrating it with the existing
"notify" ZEDLETs.  This enables ZED to push notifications to your
desktop computer and/or mobile device(s).  It is configured with the
ZED_PUSHBULLET_ACCESS_TOKEN and ZED_PUSHBULLET_CHANNEL_TAG variables
in zed.rc.

  https://www.pushbullet.com/

The Makefile install-data-local target has been replaced with
install-data-hook.  With the "-local" target, there is no particular
guarantee of execution order.  But with the zed.rc now potentially
containing sensitive information (i.e., the Pushbullet access token),
the recommended permissions have changed to 0600.  The "-hook" target
is always executed after the main rule's work is done; thus, the
chmod will always take place after the zed.rc file has been installed.

  https://www.gnu.org/software/automake/manual/automake.html#Extending

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
2015-04-27 12:08:08 -07:00
Chris Dunlap 20967ff1a4 Replace "email" ZEDLETs with "notify" ZEDLETs
Several ZEDLETs already exist for sending email in reponse to a
particular zevent.  While email is ubiquitous, alternative methods may
be better suited for some configurations.  Instead of duplicating the
"email" ZEDLETs for every future notification method, it is preferable
to abstract the notification method into a function.  This has the
added benefit of reducing the amount of code duplicated between
ZEDLETs, and allowing related bugs to be fixed in a single location.

This commit replaces the existing "email" ZEDLETs with corresponding
"notify" ZEDLETs.  In addition, the ZEDLET code for sending an
email message has been moved into the zed_notify_email() function.
And this zed_notify_email() has been added to a generic zed_notify()
function for sending notifications via all available methods that
have been configured.

This commit also changes a couple of related zed.rc variables.
ZED_EMAIL_INTERVAL_SECS is changed to ZED_NOTIFY_INTERVAL_SECS,
and ZED_EMAIL_VERBOSE is changed to ZED_NOTIFY_VERBOSE.  Note that
ZED_EMAIL remains unchanged as its use is solely for the email
notification method.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
2015-04-27 12:08:07 -07:00
Chris Dunlap aded9a6814 Cleanup ZEDLETs
This commit factors out several common ZEDLET code blocks into
zed-functions.sh.  This shortens the length of the scripts, thereby
(hopefully) making them easier to understand and maintain.

In addition, this commit revamps the coding style used by the
scripts to be more consistent and (again, hopefully) maintainable.
It now mostly follows the Google Shell Style Guide.  I've tried to
assimilate the following resources:

  Google Shell Style Guide
  https://google-styleguide.googlecode.com/svn/trunk/shell.xml

  Dash as /bin/sh
  https://wiki.ubuntu.com/DashAsBinSh

  Filenames and Pathnames in Shell: How to do it Correctly
  http://www.dwheeler.com/essays/filenames-in-shell.html

  Common shell script mistakes
  http://www.pixelbeat.org/programming/shell_script_mistakes.html

Finally, this commit updates the exit codes used by the ZEDLETs to be
more consistent with one another.

All scripts run cleanly through ShellCheck <http://www.shellcheck.net/>.
All scripts have been tested on bash and dash.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
2015-04-27 12:08:01 -07:00
Brian Behlendorf 904ea2763e Add automatic hot spare functionality
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.

To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev.  This covers both io and checksum zevents but
may be used but other scripts.

In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:

  vdev_spare_paths  - List of vdev paths for all hot spares.
  vdev_spare_guids  - List of vdev guids for all hot spares.
  vdev_read_errors  - Read errors for the problematic vdev
  vdev_write_errors - Write errors for the problematic vdev
  vdev_cksum_errors - Checksum errors for the problematic vdev.

By default the required hot spare scripts are installed but this
functionality is disabled.  To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.

These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system.  It also requires that the
autoexpand policy be ported from Illumos, see:

  https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c

Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
2014-04-02 13:10:08 -07:00
Chris Dunlap 9e246ac3d8 Initial implementation of zed (ZFS Event Daemon)
zed monitors ZFS events.  When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent.  The zevent
nvpairs are passed to the scripts as environment variables.

Events are processed synchronously by the single thread, and there is
no maximum timeout for script execution.  Consequently, a misbehaving
script can delay (or forever block) the processing of subsequent
zevents.  Plans are to address this in future commits.

Initial scripts have been developed to log events to syslog
and send email in response to checksum/data/io errors and
resilver.finish/scrub.finish events.  By default, email will only
be sent if the ZED_EMAIL variable is configured in zed.rc (which is
serving as a config file of sorts until a proper configuration file
is implemented).

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
2014-04-02 13:10:03 -07:00