martinayotte Posted February 5, 2017 Posted February 5, 2017 Let's first recap from the other thread : I found a strange bug and seen it twice in a week : After few days running, my PineA64 had wrong date, some thing like "Tue Mar 13 20:50:11 EDT 2153". Trying to restart NTP didn't fix it. Trying to set it manually give me an error :root@pine64:~# date -s "2017-02-03 00:00:01"date: cannot set date: Invalid argumentDoing an "strace" with it reveal that it can not write system clock :settimeofday({1486098000, 0}, NULL) = -1 EINVAL (Invalid argument)openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3Doing search on the net, I've found http://nerdbynature.de/s9y/2009/07/22/cannot-set-date-Invalid-argumentand http://www.mail-archive.com/bug-coreutils@gnu.org/msg14103.html, but not real answers other than it is really the kernel that are f*k*ing up with the system clock. And the issue is gone if I simply reboot the board. Really strange ... Especially that it happened only on PineA64 ... As someone got the similar issue ? And here is Zador's first reply : Is there support for the hardware RTC in mainline (assuming you are testing the mainline)? If yes, then it may be related to wrong RTC settings like external or internal oscillator. If not - then maybe it's related to the arch timer bug? https://github.com/longsleep/linux-pine64/issues/44#issuecomment-263060276 I think kernel just prevents unsafe date/time changes, there are several -EINVAL returns in the date/time changing code. And my first reply : I was just looking at DT about this, because if I remember, I didn't had the issue up until recently. I'm seeing that in the old longsleep DT, it was using 'sun50i-rtc' and in the current it is using 'sun6i-a31-rtc'. Since I kept also several A64 armbian previous images, I will look into them too. (I will also check what is the frequency of the occurence, because maybe it happen every 24hrs by reading wrong value from a sync service call. Let see tomorrow ...) (Another thing I found : the H3 has /dev/rtc and clear trace in dmesg, but not with H5 or A64, no trace at all. But why I don't have the same issue with H5 ?) Now, here is my latest investigations : Both H5 and A64 doesn't probe RTC driver successfully, explaining why dmesg doesn't mentioned it (I hate driver which are not verbose on failures !) But, why H5 doesn't react the same as A64 ? I will have to wait until issue re-appear to dig further in /var/log/syslog , maybe tomorrow ! BTW, Zador, the -EINVAL seen on "date" command is only shown when symptom occurred, not after reboot... (let's wait tomorrow to see) As I said, I suspect "systemd" compatiblilty issue here. BTW, each time the issue occurred, the current SSH session has been kicked out, probably due to time elapsed been too large, 120 years ...
zador.blood.stained Posted February 5, 2017 Posted February 5, 2017 BTW, Zador, the -EINVAL seen on "date" command is only shown when symptom occurred, not after reboot... (let's wait tomorrow to see) As I said, I suspect "systemd" compatiblilty issue here. Because after the reboot time goes back to normal ftom NTP and/or fake-hwclock? And systemd is not the problem here. First, it's the kernel that rejects date/time change, and second - systemd issue affects only 32-bit systems.
martinayotte Posted February 5, 2017 Author Posted February 5, 2017 Maybe I've didn't explain it well : ntpdate or ntp server are also facing the same -EINVAL seen with date when symptoms kick in, but not after reboot. Something happened, probably at the 24th hour (I will see later today) which make the symtoms resurrect somehow, probably after a systemd or cron task job. I will check dmesg/syslog next time ... But, still, why PineA64 is having that trouble and not OPiPC2, they are both using the same kernel branch ?
zador.blood.stained Posted February 5, 2017 Posted February 5, 2017 But, still, why PineA64 is having that trouble and not OPiPC2, they are both using the same kernel branch ? No idea, the hardware is different and the arch timer bug affects A64 and not H5, but I'm only guessing that arch timer has something to do with the system clock time keeping.
martinayotte Posted February 5, 2017 Author Posted February 5, 2017 Unfortunately, the occurrence is not 24hrs ... So, I will have to wait more, maybe 48hrs, or maybe it is even random ... EDIT : after 48hrs, the issue didn't appeared yet, so I guest it is random ...
martinayotte Posted September 25, 2017 Author Posted September 25, 2017 Strangely, this randomness re-appeared using my newest build 4.13.3 ... (I will check if same thing will happened on OPiWin)
zador.blood.stained Posted September 25, 2017 Posted September 25, 2017 2 minutes ago, martinayotte said: Strangely, this randomness re-appeared using my newest build 4.13.3 ... Next branch? Coincidentally enough "enable-fsl-timer-errata.patch" was not copied to the next branch yet.
martinayotte Posted September 25, 2017 Author Posted September 25, 2017 Yes, it is a build from NEXT ... What this patch is doing exactly ? Sadly, I couldn't check previous history of that patch to see if it explains when previous issue disappeared...
martinayotte Posted September 26, 2017 Author Posted September 26, 2017 Grrrr ... Even with that patch, I've seen the same issue today with my OPiWin jumping 95 years in the future ! It seems to occur only on A64 ...
martinayotte Posted September 28, 2017 Author Posted September 28, 2017 Ok ! I probably find out why : defconfig for both next and dev don't have CONFIG_FSL_ERRATUM_A008585=y, but we had it when we were on 4.11.x ... But, of course, only testing 24/48/72 hrs will prove that it is now really fixed !
gchain Posted March 13, 2019 Posted March 13, 2019 @martinayotte did you fix the problem? I have a similar issue
martinayotte Posted March 13, 2019 Author Posted March 13, 2019 6 hours ago, gchain said: did you fix the problem? It has been fixed a while ago but having CONFIG_FSL_ERRATUM_A008585 into kernel def_config ... 6 hours ago, gchain said: I have a similar issue Which image/kernel are you using ?
Kalin Posted March 22, 2019 Posted March 22, 2019 I am experiencing the same issue on sopine 64 with quite recent kernel: 4.19.20-sunxi64
martinayotte Posted March 22, 2019 Author Posted March 22, 2019 7 hours ago, Kalin said: I am experiencing the same issue on sopine 64 with quite recent kernel: 4.19.20-sunxi64 Can you do a "grep ERRATUM /boot/config*" ?
Kalin Posted March 22, 2019 Posted March 22, 2019 Sure, here is the result grep ERRATUM /boot/config* CONFIG_ARM64_ERRATUM_826319=y CONFIG_ARM64_ERRATUM_827319=y CONFIG_ARM64_ERRATUM_824069=y CONFIG_ARM64_ERRATUM_819472=y # CONFIG_ARM64_ERRATUM_832075 is not set CONFIG_ARM64_ERRATUM_845719=y CONFIG_ARM64_ERRATUM_843419=y CONFIG_ARM64_ERRATUM_1024718=y # CONFIG_CAVIUM_ERRATUM_22375 is not set CONFIG_CAVIUM_ERRATUM_23144=y # CONFIG_CAVIUM_ERRATUM_23154 is not set # CONFIG_CAVIUM_ERRATUM_27456 is not set # CONFIG_CAVIUM_ERRATUM_30115 is not set # CONFIG_QCOM_FALKOR_ERRATUM_1003 is not set # CONFIG_QCOM_FALKOR_ERRATUM_1009 is not set # CONFIG_QCOM_QDF2400_ERRATUM_0065 is not set CONFIG_HISILICON_ERRATUM_161600802=y CONFIG_QCOM_FALKOR_ERRATUM_E1041=y CONFIG_FSL_ERRATUM_A008585=y # CONFIG_HISILICON_ERRATUM_161010101 is not set # CONFIG_ARM64_ERRATUM_858921 is not set
martinayotte Posted March 22, 2019 Author Posted March 22, 2019 1 hour ago, Kalin said: Sure, here is the result Oh ! The CONFIG_FSL_ERRATUM_A008585 is present ... Then, I don't know ... For me, I've never faced the issue again since when that config has been added more than 2 years ago ...
windysea Posted March 22, 2019 Posted March 22, 2019 I am seeing the same issue on a Pine64+, with 4.19.y, 4.20.y, and 5.0.y kernels. After some time of running OK, setting the date by any means fails as above. I also noticed that the actual date on the system suddenly goes way off (as in year 2114, for example). I've added this to my list in trying to get my pine64 back as an authoritative NTP server.
martinayotte Posted March 22, 2019 Author Posted March 22, 2019 BTW, some other fixes related to that in Mainline 5.0.y : CONFIG_SUN50I_A64_UNSTABLE_TIMER
windysea Posted March 22, 2019 Posted March 22, 2019 Thanks. This looks to be a busy weekend for me but I would like to start working on these a little deeper. There do seem to be timer issues post- 4.14.y and I haven't yet determined if that could be due to missing kernel configs or due to underlying kernel code changes. The former shouldn't be too hard to find. . .just tedious, but the latter might be more involved.
martinayotte Posted March 22, 2019 Author Posted March 22, 2019 14 minutes ago, windysea said: There do seem to be timer issues post- 4.14.y The timer issue is present since the A64 SoC introduction, it is a silicon bug ! If you look at first post of this thread, I've faced it the first time 2 years ago in Feb 2017, so at that time, it was 4.10.y at that time ... If people still facing the issue, even after few workarounds, maybe those don't cover all test cases ...
windysea Posted March 22, 2019 Posted March 22, 2019 I'm seeing other issues, such as not being able to configure a non-tickless kernel. If anything, using a constant-rate (non-tickless) should be more stable but with post-14.4.y it is wildly unstable.
windysea Posted March 25, 2019 Posted March 25, 2019 (edited) Earlier today ntpd became unable to set the clock again. The system time does seem to have suddenly jumped: root@pine64:/sys/kernel/debug# date Thu May 17 16:29:30 EDT 2114 However the RTC (hwclock) appears to be OK: root@pine64:/# cat /sys/class/rtc/rtc0/{date,time} 2019-03-25 22:23:10 root@pine64:/proc/sys# hwclock --get 2019-03-25 18:23:12.526484-0400 There is something attempting to set the RTC but apparently using a bad value: [157688.666308] sun6i-rtc 1f00000.rtc: rtc only supports year in range 1970 - 2033 Attempts to set the system clock fail pretty consistently via either settimeofday() or clock_settime(): root@pine64:/proc/sys# strace date 032518322019 2>& 1 | egrep sett clock_settime(CLOCK_REALTIME, {tv_sec=1553553120, tv_nsec=0}) = -1 EINVAL (Invalid argument) settimeofday({tv_sec=1553553120, tv_usec=0}, NULL) = -1 EINVAL (Invalid argument) root@pine64:/proc/sys# strace hwclock --hctosys 2>& 1 | egrep sett settimeofday({tv_sec=1553553315, tv_usec=0}, {tz_minuteswest=240, tz_dsttime=0}) = -1 EINVAL (Invalid argument) write(2, "settimeofday() failed", 21settimeofday() failed) = 21 tv_sec looks good, so issue appears to be with kernel. According to man page EINVAL from clock_settime() indicates the specified clock is not supported. That's not right, so some source perusal would be needed for more detail. According to the man page EINVAL from settimeofday() indicates timezone (or something else) is invalid. TZ should be NULL in modern implementations, and is valid so this must be "something else". Not very helpful Setting the RTC appears to succeed (confirmed via strace): root@pine64:/proc# hwclock --set --date="$(hwclock)" root@pine64:/proc# echo $? 0 The next steps will need more debugging enabled in the kernel. I've confirmed the previous applicable kernel configs & errata have all been applied, but perhaps the previous timer-related fixes have become less than effective with the latest kernels? Edit: I had forgotten to note that the "jump" is indeed 95 years, which suggests the bug with the A64 arch_timer does appear to be the cause: root@pine64:/# cat /sys/class/rtc/rtc0/since_epoch && date +%s 1553558900 4556039097 root@pine64:/# expr 4556039097 - 1553558900 3002480197 root@pine64:/# expr 3002480197 / 31536000 95 Edited March 26, 2019 by windysea Add more detail
brian redbeard Posted March 26, 2019 Posted March 26, 2019 Related to the other date time thread, I spent some time digging through patchwork and found that Andre Przywara has put together a useful utility for testing the issue: test_timer.c After pulling it down and compiling it: gcc -o test_timer test_timer.c Spoiler /tmp$ ./test_timer TAP version 13 # number of cores: 4 ok 1 same timer frequency on all cores # timer frequency is 24000000 Hz (24 MHz) # time1: 1b077aaff7, time2: 1b077aae00, diff: -503 # time1: 1b078a6ff7, time2: 1b078a6e00, diff: -503 # time1: 1b078afff7, time2: 1b078a3e00, diff: -49655 # time1: 1b07918ff7, time2: 1b07918e00, diff: -503 # time1: 1b0791eff7, time2: 1b0791ee00, diff: -503 # time1: 1b0797d7f7, time2: 1b0797d400, diff: -1015 # time1: 1b07a44ff7, time2: 1b07a44e00, diff: -503 # time1: 1b07a5cff7, time2: 1b07a5ce00, diff: -503 # time1: 1b07a65ff7, time2: 1b07a65e00, diff: -503 # time1: 1b07a803f7, time2: 1b07a80200, diff: -503 # time1: 1b07c6cff7, time2: 1b07c6ce00, diff: -503 # time1: 1b07c81ff7, time2: 1b07c81e00, diff: -503 # time1: 1b07d8cff7, time2: 1b07d8ce00, diff: -503 # time1: 1b07dbcff7, time2: 1b07dbce00, diff: -503 # time1: 1b07df8ff7, time2: 1b07df8e00, diff: -503 # time1: 1b07e40ff7, time2: 1b07e40e00, diff: -503 # too many errors, stopping reports not ok 2 native counter reads are monotonic # 321 errors # min: -49655, avg: 7, max: 49672 # diffs: -19625, -659667, -19666, -659667, -659708, -19666, -19666, -19625, -659625, -659709, -659667, -19667, -41000, -659708, -19625, -19625 # too many errors, stopping reports not ok 3 Linux counter reads are monotonic # 154 errors # min: -659709, avg: 1699, max: 663084 # core 0: counter value: 117082244621 => 4878 sec # core 0: offsets: back-to-back: 10, b-t-b synced: 9, b-t-b w/ delay: 13 # core 1: counter value: 117082246060 => 4878 sec # core 1: offsets: back-to-back: 10, b-t-b synced: 8, b-t-b w/ delay: 11 # core 2: counter value: 117082247344 => 4878 sec # core 2: offsets: back-to-back: 10, b-t-b synced: 9, b-t-b w/ delay: 11 # core 3: counter value: 117082248610 => 4878 sec # core 3: offsets: back-to-back: 10, b-t-b synced: 9, b-t-b w/ delay: 11 1..3 Digging into this further, but at least there's a utility to use with some patches to check to see if the issue gets resolved.
windysea Posted March 26, 2019 Posted March 26, 2019 Because the "jump" is by 95 years it strongly suggests the known hardware issue. It is a random occurrence with the SoC, and the existing kernel patches provide an attempt to detect and avoid this bug by doing multiple reads (only a single read would be "bad" with this bug, but that's all it takes) There isn't anything that can be done to detect if this has been truly "fixed" - when this occurs it will be a jump (forward or backward) by 95 years which is immediately apparent (the date will suddenly be in the year 2019 + 95 = 2114 currently. Otherwise there is no way to predict when/if this might occur. The existing fixes have worked with previous kernels and do seem present here. It will now just take some deeper review to identify why they appear to no longer effective. The actual commits have some decent descriptions of the problem and the fixes put in place. The new question I have is why the system clock can't be later corrected if this is indeed a one-time random bad read, as is noted in the bug.
Kalin Posted March 26, 2019 Posted March 26, 2019 I read somewhere that changing system time with a big difference is rejected by kernel, because it "will take too long". I have a question - are you saying that the 95 year jump happens at the place where hwclock looks or it is just in the " linux system time"? I have been running an older kernel (3.x) for few weeks and did not encounter this issue. The kernel I used has certain patches related to time, that are visible here: https://github.com/ayufan-pine64/linux-pine64/tree/my-hacks-1.2-with-drm/kernel/time I am not sure how to tell if these patches are in the armbian kernel.
martinayotte Posted March 26, 2019 Author Posted March 26, 2019 11 hours ago, brian redbeard said: Digging into this further, but at least there's a utility to use Interestingly, I've ran André's tool on 3 of my boards : PineA64, Olinuxino-A64 and OrangePiWin. The two formers don't show any errors while the OPiWin is, althought I've never seen the 95 years jump on it ...
windysea Posted March 26, 2019 Posted March 26, 2019 2 hours ago, martinayotte said: Interestingly, I've ran André's tool on 3 of my boards : PineA64, Olinuxino-A64 and OrangePiWin. The two formers don't show any errors while the OPiWin is, althought I've never seen the 95 years jump on it ... I haven't had a chance to try that tool yet but it looks like it is indeed to check for the instability in question here. This issue appears to be the one that should be resolved in the "fix" that is enabled with the earlier-referenced Kconfig item. When reading from the hardware architecture timer on the SoC due to instability some bits are erroneously set for that one cycle, but always in an identifiable pattern. The result is that instability can be (mis)interpreted as a jump approximately in a multiple of 95 years, specifically. The mitigation applied is to read from the timer multiple times in consecutive cycles if needed, as the problem is apparently only transient for a single cycle, and only return what should be a good value. This has been occurring for me every few days, and the clock has always jumped forward by 95 years. With a 4.14.y kernel I had never seen an issue, nor had I ever seen an issue with a 4.9.y kernel. I'm intrigued and wish I had more time to spend, but I'm still sticking with it to see if I can find what is happening. I noticed there were very recent commits related to handling the specific register(s) that are the culprit: CNTVCT and CNTPCT so I want to take a quick look at those to see if they could be related, and hopefully address, this particular issue.
windysea Posted March 26, 2019 Posted March 26, 2019 I noticed there is a new fix that appears to have been just brought back from 5.1.y to 5.0.y to replace the fix enabled by setting Kconfig 'SUN50I_A64_UNSTABLE_TIMER'. The replacement fix extends the check by an additional bit, adds additional coverage, and adds an additional mitigation for another register that was seen to be affected. The last item had been referenced in the SUN50I_A64_UNSTABLE_TIMER fix but was not addressed. The new Kconfig item from upstream is 'CONFIG_SUN50I_ERRATUM_UNKNOWN1' Not particularly important, but the SUN50I_A64_UNSTABLE_TIMER fix had been dropped in a recent previous commit but hadn't made it back to 5.0.y yet. This fix was still present in the kernels used in the armbian builds so that wouldn't have been the cause. I've built a new DEV (5.0.4) kernel with this latest update and will let it run for a while to see if the issue has been mitigated a little better.
Kalin Posted March 27, 2019 Posted March 27, 2019 Sharing some detailed explanation, available here https://lore.kernel.org/patchwork/patch/1031565/
windysea Posted March 27, 2019 Posted March 27, 2019 (edited) Yes that is the commit that is now part of the mainline that I referenced above. There are copies of it all over but you can also find it in Linus' own repository now as well as on the kernel.org repository. This has also been merged into the megous repo that is the source for the armbian builds. This has been running on my built DEV kernel for over 15 hours and has been solid so far. With an unpatched DEV kernel I had a date jump pretty consistently in less than four hours. With the NEXT kernel that has the patch for A64_UNSTABLE_TIMER the occurrences varied from a few hours to a few days. In other words this is the same known issue with a now more complete mitigation applied to the mainline kernel upstream. [Edit]: I've been running a 5.0.4 DEV kernel with the upstream fix for a bit over a day with no issues so far, but that is hardly definitive. The upstream solution adds logging if a stable timer read could not be obtained within 150 attempts (its bounding, which is also noted as 3 cycles/counter increments) so hopefully if there is a failure here it will be apparent via kernel logs. I've also submitted a PR for the defconfig update for DEV kernels which currently don't have the armbian '0081-arm64-arch_timer-Workaround-for-Allwinner-A64-timer-.patch' applied (by default) anyway. Edited March 28, 2019 by windysea Progress update
Recommended Posts