watchdog failure at shutdown sometimes prevents reboot


Dear maintainers,

I have my sensors configured to reboot every night via a user cron job (0 0 * * * /sbin/reboot); 14 sensors do this without a problem. Some months ago I fixed the NanoPi NEO Plus2 reboot from NAND (by reusing FriendlyARM's first-stage U-Boot).

 

I just stumbled upon a failed reboot on one of my NanoPi NEO Plus2 nodes, after two successful reboots. According to /var/log.hdd/syslog, it got stuck in the shutdown procedure when the watchdog reported a failure. The extracts from /var/log.hdd/syslog.1 below show the watchdog starting, and later the watchdog stopping and failing. After the failure the system did not come back up; it needed a power cycle, which is quite inconvenient since it is installed at a remote, hard-to-access location.

Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Starting watchdog daemon...
Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Reached target Graphical Interface.
Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: starting daemon (5.15):
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: int=1s realtime=yes sync=no load=0,0,0 soft=no
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: memory not checked
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: ping: no machine to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: file: no file to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: pidfile: no server process to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: interface: no interface to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: temperature: no sensors to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: no test binary files
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: no repair binary files
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: error retry time-out = 60 seconds
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: repair attempts = 1
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: alive=[none] heartbeat=[none] to=root no_act=no force=no
Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Started watchdog daemon.
...
Sep 19 00:00:01 EnexisVT2-1 CRON[6188]: (dennis) CMD (/sbin/reboot)
...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping Authorization Manager...
...
Sep 19 00:00:02 EnexisVT2-1 watchdog[2212]: stopping daemon (5.15)
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping watchdog daemon...
...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Control process exited, code=exited, status=1/FAILURE
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Failed with result 'exit-code'.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopped watchdog daemon.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Triggering OnFailure= dependencies.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Requested transaction contradicts existing jobs: Transaction for wd_keepalive.service/start is destructive (armbian-zram-config.service has 'stop' job queued, but 'start' is included in transaction).
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Failed to enqueue OnFailure= job, ignoring: Transaction for wd_keepalive.service/start is destructive (armbian-zram-config.service has 'stop' job queued, but 'start' is included in transaction).
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopped target Multi-User System.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping rng-tools.service...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping OpenBSD Secure Shell server...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons)...
Sep 19 00:00:02 EnexisVT2-1 ntpd[1396]: ntpd exiting on signal 15 (Terminated)

... cold reboot

Sep 19 00:00:09 EnexisVT2-1 kernel: [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
Sep 19 00:00:09 EnexisVT2-1 fake-hwclock[406]: Sat 19 Sep 2020 12:00:03 AM UTC
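
For anyone who wants to check their own nodes for the same pattern, the watchdog lines can be pulled out of a rotated syslog roughly like this. It is only a minimal sketch: the function names are made up for illustration, and the match patterns are simply taken from the messages quoted above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Armbian or the watchdog package.

# Print every line of the given log file that mentions the watchdog
# daemon, watchdog.service or wd_keepalive.service.
watchdog_events() {
    grep -E 'watchdog|wd_keepalive' "$1"
}

# Of those lines, keep only the ones where systemd reports a failure.
watchdog_failures() {
    watchdog_events "$1" | grep -E 'FAILURE|Failed'
}
```

Example call, assuming the log location used in this post: `watchdog_failures /var/log.hdd/syslog.1`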

 

After this the system did not boot anymore, and we had to cold-boot it manually. So I have stopped and disabled the watchdog for now. I also had to set run_wd_keepalive=0 in /etc/default/watchdog, since the watchdog also fails to stop cleanly from the command line (on other systems as well):

Sep 23 11:33:59 EnexisVT2-1 systemd[1]: Starting watchdog daemon...
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: starting daemon (5.15):
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: int=1s realtime=yes sync=no load=0,0,0 soft=no
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: memory not checked
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: ping: no machine to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: file: no file to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: pidfile: no server process to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: interface: no interface to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: temperature: no sensors to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: no test binary files
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: no repair binary files
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: error retry time-out = 60 seconds
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: repair attempts = 1
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: alive=[none] heartbeat=[none] to=root no_act=no force=no
Sep 23 11:33:59 EnexisVT2-1 systemd[1]: Started watchdog daemon.

...
Sep 23 11:34:03 EnexisVT2-1 watchdog[3236]: stopping daemon (5.15)
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: Stopping watchdog daemon...
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: watchdog.service: Control process exited, code=exited, status=1/FAILURE
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: watchdog.service: Failed with result 'exit-code'.
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: Stopped watchdog daemon.
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: watchdog.service: Triggering OnFailure= dependencies.
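
The workaround described above (stop and disable the watchdog, and set run_wd_keepalive=0) can be sketched as follows. The helper name is hypothetical, and the script assumes /etc/default/watchdog uses a plain `run_wd_keepalive=<value>` line and that GNU sed is available; treat it as an illustration rather than a tested fix.

```shell
#!/bin/sh
# Hypothetical helper: force run_wd_keepalive=0 in the given defaults file.
# Rewrites an existing assignment, or appends one if none is present.
disable_wd_keepalive() {
    file="$1"
    if grep -q '^run_wd_keepalive=' "$file"; then
        sed -i 's/^run_wd_keepalive=.*/run_wd_keepalive=0/' "$file"
    else
        printf 'run_wd_keepalive=0\n' >> "$file"
    fi
}

# On the real system (as root) the full workaround would be roughly:
#   systemctl stop watchdog.service
#   systemctl disable watchdog.service
#   disable_wd_keepalive /etc/default/watchdog
```

The systemctl calls are left as comments so the snippet can be sourced and tried against a copy of the file without touching the running service.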

 

Note that I froze the Armbian upgrades on all these sensors at Armbian 20.02.7, to avoid having to recompile my kernel modules on every upstream update. I noticed that the systemd package recently got an update; I am unsure whether this update mitigates the problem.

systemd-sysv/stable 241-7~deb10u4 arm64 [upgradable from: 241-7~deb10u3]
systemd/stable 241-7~deb10u4 arm64 [upgradable from: 241-7~deb10u3]


dennis@EnexisVT2-1:~$ dpkg -l "*current*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                     Version      Architecture Description
+++-========================================-============-============-============================================================
ii  linux-buster-root-current-nanopineoplus2 20.02.1      arm64        Armbian tweaks for buster on nanopineoplus2 (current branch)
hi  linux-dtb-current-sunxi64                20.02.7      arm64        Linux DTB, version 5.4.28-sunxi64
hi  linux-headers-current-sunxi64            20.02.7      arm64        Linux kernel headers for 5.4.28-sunxi64 on arm64
hi  linux-image-current-sunxi64              20.02.7      arm64        Linux kernel, version 5.4.28-sunxi64
hi  linux-u-boot-nanopineoplus2-current      20.02.1      arm64        Uboot loader 2019.10
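
In the listing above, a first column of `hi` means the package is installed and on hold. To list just the held packages from such output, a small filter like this should do; the function name is made up for illustration, and `apt-mark showhold` gives the same information directly.

```shell
#!/bin/sh
# Hypothetical helper: read `dpkg -l` output on stdin and print the names
# of packages whose desired state is "hold" (status column starts with h).
held_packages() {
    awk '/^h/ { print $2 }'
}
```

Example call: `dpkg -l "*current*" | held_packages`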

 
