Jump to content

A64 date/time clock issue


Recommended Posts

Let's first recap from the other thread :

 

I found a strange bug and seen it twice in a week :

 

After few days running, my PineA64 had wrong date, some thing like "Tue Mar 13 20:50:11 EDT 2153".

Trying to restart NTP didn't fix it. Trying to set it manually give me an error :

root@pine64:~# date -s "2017-02-03 00:00:01"
date: cannot set date: Invalid argument

Doing an "strace" with it reveal that it can not write system clock :

settimeofday({1486098000, 0}, NULL) = -1 EINVAL (Invalid argument)
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3

Doing search on the net, I've found http://nerdbynature.de/s9y/2009/07/22/cannot-set-date-Invalid-argumentand http://www.mail-archive.com/bug-coreutils@gnu.org/msg14103.html, but not real answers other than it is really the kernel that are f*k*ing up with the system clock. And the issue is gone if I simply reboot the board. Really strange ... Especially that it happened only on PineA64 ...

 

As someone got the similar issue ?

 

And here is Zador's first reply :

 

Is there support for the hardware RTC in mainline (assuming you are testing the mainline)? If yes, then it may be related to wrong RTC settings like external or internal oscillator. If not - then maybe it's related to the arch timer bug? https://github.com/longsleep/linux-pine64/issues/44#issuecomment-263060276
 


I think kernel just prevents unsafe date/time changes, there are several -EINVAL returns in the date/time changing code.

 

And my first reply :

 

I was just looking at DT about this, because if I remember, I didn't had the issue up until recently.

I'm seeing that in the old longsleep DT, it was using 'sun50i-rtc' and in the current it is using 'sun6i-a31-rtc'.

Since I kept also several A64 armbian previous images, I will look into them too.

(I will also check what is the frequency of the occurence, because maybe it happen every 24hrs by reading wrong value from a sync service call. Let see tomorrow ...)

(Another thing I found : the H3 has /dev/rtc and clear trace in dmesg, but not with H5 or A64, no trace at all. But why I don't have the same issue with H5 ?)

 

Now, here is my latest investigations :

 

Both H5 and A64 doesn't probe RTC driver successfully, explaining why dmesg doesn't mentioned it (I hate driver which are not verbose on failures !)

But, why H5 doesn't react the same as A64 ?

I will have to wait until issue re-appear to dig further in /var/log/syslog , maybe tomorrow !

 

BTW, Zador, the -EINVAL seen on "date" command is only shown when symptom occurred, not after reboot... (let's wait tomorrow to see)

As I said, I suspect "systemd" compatiblilty issue here.

 

BTW, each time the issue occurred, the current SSH session has been kicked out, probably due to time elapsed been too large, 120 years ... :P

 

Link to comment
Share on other sites

BTW, Zador, the -EINVAL seen on "date" command is only shown when symptom occurred, not after reboot... (let's wait tomorrow to see)

As I said, I suspect "systemd" compatiblilty issue here.

Because after the reboot time goes back to normal ftom NTP and/or fake-hwclock? And systemd is not the problem here. First, it's the kernel that rejects date/time change, and second - systemd issue affects only 32-bit systems.

Link to comment
Share on other sites

Maybe I've didn't explain it well : ntpdate or ntp server are also facing the same -EINVAL seen with date when symptoms kick in, but not after reboot. Something happened, probably at the 24th hour (I will see later today) which make the symtoms resurrect somehow, probably after a systemd or cron task job. I will check dmesg/syslog next time ...

But, still, why PineA64 is having that trouble and not OPiPC2, they are both using the same kernel branch ?

Link to comment
Share on other sites

6 hours ago, gchain said:

did you fix the problem?

It has been fixed a while ago but having CONFIG_FSL_ERRATUM_A008585 into kernel def_config ...

6 hours ago, gchain said:

I have a similar issue

Which image/kernel are you using ?

 

Link to comment
Share on other sites

Sure, here is the result

 

grep ERRATUM /boot/config*
CONFIG_ARM64_ERRATUM_826319=y
CONFIG_ARM64_ERRATUM_827319=y
CONFIG_ARM64_ERRATUM_824069=y
CONFIG_ARM64_ERRATUM_819472=y
# CONFIG_ARM64_ERRATUM_832075 is not set
CONFIG_ARM64_ERRATUM_845719=y
CONFIG_ARM64_ERRATUM_843419=y
CONFIG_ARM64_ERRATUM_1024718=y
# CONFIG_CAVIUM_ERRATUM_22375 is not set
CONFIG_CAVIUM_ERRATUM_23144=y
# CONFIG_CAVIUM_ERRATUM_23154 is not set
# CONFIG_CAVIUM_ERRATUM_27456 is not set
# CONFIG_CAVIUM_ERRATUM_30115 is not set
# CONFIG_QCOM_FALKOR_ERRATUM_1003 is not set
# CONFIG_QCOM_FALKOR_ERRATUM_1009 is not set
# CONFIG_QCOM_QDF2400_ERRATUM_0065 is not set
CONFIG_HISILICON_ERRATUM_161600802=y
CONFIG_QCOM_FALKOR_ERRATUM_E1041=y
CONFIG_FSL_ERRATUM_A008585=y
# CONFIG_HISILICON_ERRATUM_161010101 is not set
# CONFIG_ARM64_ERRATUM_858921 is not set

 

Link to comment
Share on other sites

1 hour ago, Kalin said:

Sure, here is the result

Oh ! The CONFIG_FSL_ERRATUM_A008585 is present ...

Then, I don't know ... For me, I've never faced the issue again since when that config has been added more than 2 years ago ...

Link to comment
Share on other sites

I am seeing the same issue on a Pine64+, with 4.19.y, 4.20.y, and 5.0.y kernels.  After some time of running OK, setting the date by any means fails as above.  I also noticed that the actual date on the system suddenly goes way off (as in year 2114, for example).

 

I've added this to my list in trying to get my pine64 back as an authoritative NTP server.

Link to comment
Share on other sites

Thanks.  This looks to be a busy weekend for me but I would like to start working on these a little deeper.

 

There do seem to be timer issues post- 4.14.y and I haven't yet determined if that could be due to missing kernel configs or due to underlying kernel code changes.  The former shouldn't be too hard to find. . .just tedious, but the latter might be more involved.

Link to comment
Share on other sites

14 minutes ago, windysea said:

There do seem to be timer issues post- 4.14.y

The timer issue is present since the A64 SoC introduction, it is a silicon bug !

If you look at first post of this thread, I've faced it the first time 2 years ago in Feb 2017, so at that time, it was 4.10.y at that time ...

If people still facing the issue, even after few workarounds, maybe those don't cover all test cases ...

 

Link to comment
Share on other sites

Earlier today ntpd became unable to set the clock again.  The system time does seem to have suddenly jumped:

root@pine64:/sys/kernel/debug# date
Thu May 17 16:29:30 EDT 2114

 

However the RTC (hwclock) appears to be OK:

root@pine64:/# cat /sys/class/rtc/rtc0/{date,time}
2019-03-25
22:23:10
root@pine64:/proc/sys# hwclock --get
2019-03-25 18:23:12.526484-0400

 

There is something attempting to set the RTC but apparently using a bad value:

 

[157688.666308] sun6i-rtc 1f00000.rtc: rtc only supports year in range 1970 - 2033

 

Attempts to set the system clock fail pretty consistently via either settimeofday() or clock_settime():

root@pine64:/proc/sys# strace date 032518322019 2>& 1 | egrep sett 
clock_settime(CLOCK_REALTIME, {tv_sec=1553553120, tv_nsec=0}) = -1 EINVAL (Invalid argument)
settimeofday({tv_sec=1553553120, tv_usec=0}, NULL) = -1 EINVAL (Invalid argument)

root@pine64:/proc/sys# strace hwclock --hctosys 2>& 1 | egrep sett 
settimeofday({tv_sec=1553553315, tv_usec=0}, {tz_minuteswest=240, tz_dsttime=0}) = -1 EINVAL (Invalid argument)
write(2, "settimeofday() failed", 21settimeofday() failed)   = 21

tv_sec looks good, so issue appears to be with kernel.

 

According to man page EINVAL from clock_settime() indicates the specified clock is not supported.  That's not right, so some source perusal would be needed for more detail.

 

According to the man page EINVAL from settimeofday() indicates timezone (or something else) is invalid.  TZ should be NULL in modern implementations, and is valid so this must be "something else".  Not very helpful :(

 

Setting the RTC appears to succeed (confirmed via strace):

root@pine64:/proc# hwclock --set --date="$(hwclock)"
root@pine64:/proc# echo $?
0

 

The next steps will need more debugging enabled in the kernel.

 

I've confirmed the previous applicable kernel configs & errata have all been applied, but perhaps the previous timer-related fixes have become less than effective with the latest kernels?

 

Edit:

I had forgotten to note that the "jump" is indeed 95 years, which suggests the bug with the A64 arch_timer does appear to be the cause:

root@pine64:/# cat /sys/class/rtc/rtc0/since_epoch && date +%s
1553558900
4556039097
root@pine64:/# expr  4556039097 -  1553558900
3002480197
root@pine64:/# expr 3002480197 / 31536000
95

 

Edited by windysea
Add more detail
Link to comment
Share on other sites

Related to the other date time thread, I spent some time digging through patchwork and found that Andre Przywara has put together a useful utility for testing the issue:

 

test_timer.c

After pulling it down and compiling it:

gcc -o test_timer test_timer.c

 

Spoiler

/tmp$ ./test_timer 
TAP version 13
# number of cores: 4
ok 1 same timer frequency on all cores
# timer frequency is 24000000 Hz (24 MHz)
# time1: 1b077aaff7, time2: 1b077aae00, diff: -503
# time1: 1b078a6ff7, time2: 1b078a6e00, diff: -503
# time1: 1b078afff7, time2: 1b078a3e00, diff: -49655
# time1: 1b07918ff7, time2: 1b07918e00, diff: -503
# time1: 1b0791eff7, time2: 1b0791ee00, diff: -503
# time1: 1b0797d7f7, time2: 1b0797d400, diff: -1015
# time1: 1b07a44ff7, time2: 1b07a44e00, diff: -503
# time1: 1b07a5cff7, time2: 1b07a5ce00, diff: -503
# time1: 1b07a65ff7, time2: 1b07a65e00, diff: -503
# time1: 1b07a803f7, time2: 1b07a80200, diff: -503
# time1: 1b07c6cff7, time2: 1b07c6ce00, diff: -503
# time1: 1b07c81ff7, time2: 1b07c81e00, diff: -503
# time1: 1b07d8cff7, time2: 1b07d8ce00, diff: -503
# time1: 1b07dbcff7, time2: 1b07dbce00, diff: -503
# time1: 1b07df8ff7, time2: 1b07df8e00, diff: -503
# time1: 1b07e40ff7, time2: 1b07e40e00, diff: -503
# too many errors, stopping reports
not ok 2 native counter reads are monotonic # 321 errors
# min: -49655, avg: 7, max: 49672
# diffs: -19625, -659667, -19666, -659667, -659708, -19666, -19666, -19625, -659625, -659709, -659667, -19667, -41000, -659708, -19625, -19625
# too many errors, stopping reports
not ok 3 Linux counter reads are monotonic # 154 errors
# min: -659709, avg: 1699, max: 663084
# core 0: counter value: 117082244621 => 4878 sec
# core 0: offsets: back-to-back: 10, b-t-b synced: 9, b-t-b w/ delay: 13
# core 1: counter value: 117082246060 => 4878 sec
# core 1: offsets: back-to-back: 10, b-t-b synced: 8, b-t-b w/ delay: 11
# core 2: counter value: 117082247344 => 4878 sec
# core 2: offsets: back-to-back: 10, b-t-b synced: 9, b-t-b w/ delay: 11
# core 3: counter value: 117082248610 => 4878 sec
# core 3: offsets: back-to-back: 10, b-t-b synced: 9, b-t-b w/ delay: 11
1..3

 

Digging into this further, but at least there's a utility to use with some patches to check to see if the issue gets resolved. 

Link to comment
Share on other sites

Because the "jump" is by 95 years it strongly suggests the known hardware issue.  It is a random occurrence with the SoC, and the existing kernel patches provide an attempt to detect and avoid this bug by doing multiple reads (only a single read would be "bad" with this bug, but that's all it takes)

 

There isn't anything that can be done to detect if this has been truly "fixed" - when this occurs it will be a jump (forward or backward) by 95 years which is immediately apparent (the date will suddenly be in the year 2019 + 95 = 2114 currently.  Otherwise there is no way to predict when/if this might occur.

 

The existing fixes have worked with previous kernels and do seem present here.  It will now just take some deeper review to identify why they appear to  no longer effective.  The actual commits have some decent descriptions of the problem and the fixes put in place.  The new question I have is why the system clock can't be later corrected if this is indeed a one-time random bad read, as is noted in the bug.

Link to comment
Share on other sites

I read somewhere that changing system time with a big difference is rejected by kernel, because it "will take too long".

I have a question - are you saying that the 95 year jump happens at the place where hwclock looks or it is just in the " linux system time"?

I have been running an older kernel (3.x) for few weeks and did not encounter this issue. 

The kernel I used has certain patches related to time, that are visible here: https://github.com/ayufan-pine64/linux-pine64/tree/my-hacks-1.2-with-drm/kernel/time
I am not sure how to tell if these patches are in the armbian kernel.

 

Link to comment
Share on other sites

11 hours ago, brian redbeard said:

Digging into this further, but at least there's a utility to use

Interestingly, I've ran André's tool on 3 of my boards : PineA64, Olinuxino-A64 and OrangePiWin.

The two formers don't show any errors while the OPiWin is, althought I've never seen the 95 years jump on it ...

Link to comment
Share on other sites

2 hours ago, martinayotte said:

Interestingly, I've ran André's tool on 3 of my boards : PineA64, Olinuxino-A64 and OrangePiWin.

The two formers don't show any errors while the OPiWin is, althought I've never seen the 95 years jump on it ...

 

I haven't had a chance to try that tool yet but it looks like it is indeed to check for the instability in question here.

 

This issue appears to be the one that should be resolved in the "fix" that is enabled with the earlier-referenced Kconfig item.  When reading from the hardware architecture timer on the SoC due to instability some bits are erroneously set for that one cycle, but always in an identifiable pattern.  The result is that instability can be (mis)interpreted as a jump approximately in a multiple of 95 years, specifically.

 

The mitigation applied is to read from the timer multiple times in consecutive cycles if needed, as the problem is apparently only transient for a single cycle, and only return what should be a good value.

 

This has been occurring for me every few days, and the clock has always jumped forward by 95 years.  With a 4.14.y kernel I had never seen an issue, nor had I ever seen an issue with a 4.9.y kernel.  I'm intrigued and wish I had more time to spend, but I'm still sticking with it to see if I can find what is happening.

 

I noticed there were very recent commits related to handling the specific register(s) that are the culprit:  CNTVCT and CNTPCT so I want to take a quick look at those to see if they could be related, and hopefully address, this particular issue.

 

Link to comment
Share on other sites

I noticed there is a new fix that appears to have been just brought back from 5.1.y to 5.0.y to replace the fix enabled by setting Kconfig 'SUN50I_A64_UNSTABLE_TIMER'.  The replacement fix extends the check by an additional bit, adds additional coverage,  and adds an additional mitigation for another register that was seen to be affected.  The last item had been referenced in the SUN50I_A64_UNSTABLE_TIMER fix but was not addressed.

 

The new Kconfig item from upstream is 'CONFIG_SUN50I_ERRATUM_UNKNOWN1'

 

Not particularly important, but the SUN50I_A64_UNSTABLE_TIMER fix had been dropped in a recent previous commit but hadn't made it back to 5.0.y yet.  This fix was still present in the kernels used in the armbian builds so that wouldn't have been the cause.  I've built a new DEV (5.0.4) kernel with this latest update and will let it run for a while to see if the issue has been mitigated a little better.

 

Link to comment
Share on other sites

Yes that is the commit that is now part of the mainline that I referenced above.  There are copies of it all over but you can also find it in Linus' own repository now as well as on the kernel.org repository.  This has also been merged into the megous repo that is the source for the armbian builds.

 

This has been running on my built DEV kernel for over 15 hours and has been solid so far.  With an unpatched DEV kernel I had a date jump pretty consistently in less than four hours.  With the NEXT kernel that has the patch for A64_UNSTABLE_TIMER the occurrences varied from a few hours to a few days.

 

In other words this is the same known issue with a now more complete mitigation applied to the mainline kernel upstream.

 

[Edit]:

I've been running a 5.0.4 DEV kernel with the upstream fix for a bit over a day with no issues so far, but that is hardly definitive.  The upstream solution adds logging if a stable timer read could not be obtained within 150 attempts (its bounding, which is also noted as 3 cycles/counter increments) so hopefully if there is a failure here it will be apparent via kernel logs.

 

I've also submitted a PR for the defconfig update for DEV kernels which currently don't have the armbian '0081-arm64-arch_timer-Workaround-for-Allwinner-A64-timer-.patch' applied (by default) anyway.

 

 

Edited by windysea
Progress update
Link to comment
Share on other sites

Guest
This topic is now closed to further replies.
×
×
  • Create New...

Important Information

Terms of Use - Privacy Policy - Guidelines