
watchdog failure at shutdown sometimes prevents reboot


Dennboy



Dear maintainers,

My sensors are configured to reboot every night via a user cron job (0 0 * * * /sbin/reboot); 14 sensors do this without a problem. Some months ago I fixed rebooting the NanoPi NEO+2 from NAND by reusing the FriendlyARM first-stage U-Boot.

 

I just stumbled upon a failed reboot on one of my NanoPi NEO+2 nodes, after two successful reboots. According to the syslog, it got stuck in the shutdown procedure when the watchdog reported a failure. The extracts from /var/log.hdd/syslog.1 below show the watchdog starting, then being stopped and failing. After the failure the system did not come back up and needed a power cycle, which is quite inconvenient since it is installed at a hard-to-access remote location.

Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Starting watchdog daemon...
Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Reached target Graphical Interface.
Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: starting daemon (5.15):
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: int=1s realtime=yes sync=no load=0,0,0 soft=no
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: memory not checked
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: ping: no machine to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: file: no file to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: pidfile: no server process to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: interface: no interface to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: temperature: no sensors to check
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: no test binary files
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: no repair binary files
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: error retry time-out = 60 seconds
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: repair attempts = 1
Sep 18 00:03:26 EnexisVT2-1 watchdog[2212]: alive=[none] heartbeat=[none] to=root no_act=no force=no
Sep 18 00:03:26 EnexisVT2-1 systemd[1]: Started watchdog daemon.
...
Sep 19 00:00:01 EnexisVT2-1 CRON[6188]: (dennis) CMD (/sbin/reboot)
...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping Authorization Manager...
...
Sep 19 00:00:02 EnexisVT2-1 watchdog[2212]: stopping daemon (5.15)
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping watchdog daemon...
...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Control process exited, code=exited, status=1/FAILURE
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Failed with result 'exit-code'.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopped watchdog daemon.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Triggering OnFailure= dependencies.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Requested transaction contradicts existing jobs: Transaction for wd_keepalive.service/start is destructive (armbian-zram-config.service has 'stop' job queued, but 'start' is included in transaction).
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: watchdog.service: Failed to enqueue OnFailure= job, ignoring: Transaction for wd_keepalive.service/start is destructive (armbian-zram-config.service has 'stop' job queued, but 'start' is included in transaction).
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopped target Multi-User System.
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping rng-tools.service...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping OpenBSD Secure Shell server...
Sep 19 00:00:02 EnexisVT2-1 systemd[1]: Stopping LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons)...
Sep 19 00:00:02 EnexisVT2-1 ntpd[1396]: ntpd exiting on signal 15 (Terminated)

... cold reboot

Sep 19 00:00:09 EnexisVT2-1 kernel: [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
Sep 19 00:00:09 EnexisVT2-1 fake-hwclock[406]: Sat 19 Sep 2020 12:00:03 AM UTC
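
To check whether other nodes logged the same failure, a search along these lines might help. This is only a sketch: the function name find_watchdog_failures is made up for illustration, the default path is the rotated log quoted in this post, and the patterns are copied from the systemd messages above.

```shell
#!/bin/sh
# Hypothetical helper: print the systemd messages that show the
# watchdog unit failing while it is being stopped.
find_watchdog_failures() {
    # $1: syslog file to scan (defaults to the rotated log from this post)
    log="${1:-/var/log.hdd/syslog.1}"
    grep -E 'watchdog\.service: (Control process exited|Failed with result|Failed to enqueue OnFailure)' "$log"
}
```

Calling find_watchdog_failures /var/log.hdd/syslog on a node prints the matching lines; grep returns a non-zero status when nothing matches.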

 

After this the system didn't boot anymore, and we had to cold-boot it manually. So I've stopped and disabled the watchdog for now. I also had to set run_wd_keepalive=0 in /etc/default/watchdog, since the watchdog failed to stop from the command line as well (on other systems too):

Sep 23 11:33:59 EnexisVT2-1 systemd[1]: Starting watchdog daemon...
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: starting daemon (5.15):
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: int=1s realtime=yes sync=no load=0,0,0 soft=no
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: memory not checked
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: ping: no machine to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: file: no file to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: pidfile: no server process to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: interface: no interface to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: temperature: no sensors to check
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: no test binary files
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: no repair binary files
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: error retry time-out = 60 seconds
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: repair attempts = 1
Sep 23 11:33:59 EnexisVT2-1 watchdog[3236]: alive=[none] heartbeat=[none] to=root no_act=no force=no
Sep 23 11:33:59 EnexisVT2-1 systemd[1]: Started watchdog daemon.

...
Sep 23 11:34:03 EnexisVT2-1 watchdog[3236]: stopping daemon (5.15)
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: Stopping watchdog daemon...
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: watchdog.service: Control process exited, code=exited, status=1/FAILURE
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: watchdog.service: Failed with result 'exit-code'.
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: Stopped watchdog daemon.
Sep 23 11:34:03 EnexisVT2-1 systemd[1]: watchdog.service: Triggering OnFailure= dependencies.
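
For anyone wanting to apply the same workaround, the config change can be sketched roughly as follows. The helper name set_wd_keepalive_off is hypothetical, it assumes GNU sed (as shipped with Debian buster), and it only edits the file; it does not touch the service itself.

```shell
#!/bin/sh
# Hypothetical helper: force run_wd_keepalive=0 in the watchdog defaults file.
set_wd_keepalive_off() {
    # $1: defaults file to edit (defaults to the path named in this post)
    conf="${1:-/etc/default/watchdog}"
    if grep -q '^run_wd_keepalive=' "$conf"; then
        # Rewrite the existing setting in place (GNU sed -i).
        sed -i 's/^run_wd_keepalive=.*/run_wd_keepalive=0/' "$conf"
    else
        # No such setting yet: append it.
        echo 'run_wd_keepalive=0' >> "$conf"
    fi
}
```

On a node this would be run as root, together with systemctl stop watchdog and systemctl disable watchdog. As the log above shows, the stop itself may still exit with status 1.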

 

Note that I froze the Armbian upgrades on all these sensors at Armbian 20.02.7, to avoid having to recompile my kernel modules on every upstream update. I noticed that the systemd package got an update recently; I'm unsure whether this update mitigates the problem.

systemd-sysv/stable 241-7~deb10u4 arm64 [upgradable from: 241-7~deb10u3]
systemd/stable 241-7~deb10u4 arm64 [upgradable from: 241-7~deb10u3]


dennis@EnexisVT2-1:~$ dpkg -l "*current*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                     Version      Architecture Description
+++-========================================-============-============-============================================================
ii  linux-buster-root-current-nanopineoplus2 20.02.1      arm64        Armbian tweaks for buster on nanopineoplus2 (current branch)
hi  linux-dtb-current-sunxi64                20.02.7      arm64        Linux DTB, version 5.4.28-sunxi64
hi  linux-headers-current-sunxi64            20.02.7      arm64        Linux kernel headers for 5.4.28-sunxi64 on arm64
hi  linux-image-current-sunxi64              20.02.7      arm64        Linux kernel, version 5.4.28-sunxi64
hi  linux-u-boot-nanopineoplus2-current      20.02.1      arm64        Uboot loader 2019.10
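
The "hi" in the first column above means the package is on hold and installed. A small filter like the one below could pull just those rows out of dpkg -l output. It is a sketch only, and list_held is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read `dpkg -l` output on stdin and print the
# name and version of every package whose status column is "hi"
# (desired state: hold, current state: installed).
list_held() {
    awk '$1 == "hi" { print $2, $3 }'
}
```

Usage would be dpkg -l "*current*" | list_held. The stock apt-mark showhold command lists held package names as well, without versions.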

 
