windysea

Everything posted by windysea

  1. I haven't had a chance to try that tool yet, but it does look like it checks for the instability in question here. This appears to be the issue that should be resolved by the "fix" enabled with the earlier-referenced Kconfig item. Due to the instability, a read of the hardware architectural timer on the SoC can return a value with some bits erroneously set for that one cycle, but always in an identifiable pattern. The result is that the instability can be (mis)interpreted as a jump of approximately 95 years. The mitigation is to read the timer multiple times in consecutive cycles if needed (the problem is apparently transient, lasting only a single cycle) and return only what should be a good value. This has been occurring for me every few days, and the clock has always jumped forward by 95 years. I never saw the issue with a 4.14.y kernel, nor with a 4.9.y kernel. I'm intrigued and wish I had more time to spend on it, but I'm sticking with it to see if I can find what is happening. I noticed there were very recent commits related to handling the specific registers that are the culprit, CNTVCT and CNTPCT, so I want to take a quick look at those to see whether they could be related to, and hopefully address, this particular issue.
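As I understand the mainline mitigation (the sun50i workaround in drivers/clocksource/arm_arch_timer.c), the counter is simply re-read while its low bits are all zeros or all ones, since those are the patterns the unstable rollover can produce. A simplified model of that retry loop, with scripted values standing in for hardware reads (the 9-bit window and retry count are illustrative):

```python
def read_counter_stable(reads, retries=150):
    """Model of the A64 erratum workaround: re-read the counter while the
    low bits look like a bad (all-zeros / all-ones) rollover value."""
    it = iter(reads)
    val = next(it)
    for _ in range(retries):
        # Accept unless (val + 1) mod 2**9 is 0 or 1, i.e. low bits are
        # 0x1FF (all ones) or 0x000 (all zeros) -- the suspect patterns.
        if ((val + 1) & 0x1FF) > 1:
            return val
        val = next(it)
    return val  # give up after too many retries, as the kernel does

# 0x1FF (all-ones low bits) and 0x200 (all-zeros low bits) are rejected,
# so the third read wins:
print(hex(read_counter_stable([0x1FF, 0x200, 0x300])))  # 0x300
```

The key property is that a bad value only ever lasts one counter cycle, so a small bounded number of re-reads is enough to step over it.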
  2. Because the "jump" is by 95 years it strongly suggests the known hardware issue. It is a random occurrence with the SoC, and the existing kernel patches attempt to detect and avoid the bug by doing multiple reads (only a single read is "bad" with this bug, but that's all it takes). There isn't anything that can be done to verify this has been truly "fixed": when it occurs it will be a jump (forward or backward) of 95 years, which is immediately apparent (the date will suddenly be in the year 2019 + 95 = 2114, currently). Otherwise there is no way to predict when or if it might occur. The existing fixes have worked with previous kernels and do seem present here; it will now just take some deeper review to identify why they appear to no longer be effective. The actual commits have some decent descriptions of the problem and the fixes put in place. The new question I have is why the system clock can't be corrected afterward if this is indeed a one-time random bad read, as is noted in the bug.
  3. Earlier today ntpd became unable to set the clock again. The system time does seem to have suddenly jumped:

     root@pine64:/sys/kernel/debug# date
     Thu May 17 16:29:30 EDT 2114

     However the RTC (hwclock) appears to be OK:

     root@pine64:/# cat /sys/class/rtc/rtc0/{date,time}
     2019-03-25
     22:23:10
     root@pine64:/proc/sys# hwclock --get
     2019-03-25 18:23:12.526484-0400

     Something is attempting to set the RTC but apparently using a bad value:

     [157688.666308] sun6i-rtc 1f00000.rtc: rtc only supports year in range 1970 - 2033

     Attempts to set the system clock fail consistently via either settimeofday() or clock_settime():

     root@pine64:/proc/sys# strace date 032518322019 2>&1 | egrep sett
     clock_settime(CLOCK_REALTIME, {tv_sec=1553553120, tv_nsec=0}) = -1 EINVAL (Invalid argument)
     settimeofday({tv_sec=1553553120, tv_usec=0}, NULL) = -1 EINVAL (Invalid argument)
     root@pine64:/proc/sys# strace hwclock --hctosys 2>&1 | egrep sett
     settimeofday({tv_sec=1553553315, tv_usec=0}, {tz_minuteswest=240, tz_dsttime=0}) = -1 EINVAL (Invalid argument)
     write(2, "settimeofday() failed", 21settimeofday() failed) = 21

     tv_sec looks good, so the issue appears to be in the kernel. According to the man page, EINVAL from clock_settime() indicates the specified clock is not supported; that's not the case here, so some source perusal will be needed for more detail. According to the man page, EINVAL from settimeofday() indicates the timezone (or something else) is invalid; the timezone argument should be NULL in modern implementations and is valid here, so this must be "something else". Not very helpful.

     Setting the RTC appears to succeed (confirmed via strace):

     root@pine64:/proc# hwclock --set --date="$(hwclock)"
     root@pine64:/proc# echo $?
     0

     The next steps will need more debugging enabled in the kernel. I've confirmed the previously applicable kernel configs & errata have all been applied, but perhaps the earlier timer-related fixes have become less effective with the latest kernels?
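One likely source of the EINVAL, as I read kernel/time/timekeeping.c, is do_settimeofday64(): beyond basic timespec validation it also rejects any new time that would land before the kernel's monotonic offset, which would explain why nothing can move the wall clock back once it has jumped 95 years ahead. A quick probe of the syscall's error path (hedged: it deliberately uses an invalid negative time so it can never actually change the clock; unprivileged runs fail earlier with EPERM):

```python
import errno
import time

def probe_settime():
    """Attempt clock_settime with a deliberately invalid (negative) value.

    The kernel rejects negative seconds with EINVAL before touching the
    clock, so this probe is safe to run; without CAP_SYS_TIME the call
    fails earlier with EPERM instead.
    """
    try:
        time.clock_settime(time.CLOCK_REALTIME, -1.0)
        return None  # should never happen
    except OSError as e:
        return errno.errorcode[e.errno]

print("clock_settime:", probe_settime())  # EINVAL as root, EPERM otherwise
```

The man-page wording ("clock not supported") is misleading here: the same errno covers the kernel's internal sanity checks on the requested time.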
Edit: I had forgotten to note that the "jump" is indeed 95 years, which suggests the bug with the A64 arch_timer does appear to be the cause:

     root@pine64:/# cat /sys/class/rtc/rtc0/since_epoch && date +%s
     1553558900
     4556039097
     root@pine64:/# expr 4556039097 - 1553558900
     3002480197
     root@pine64:/# expr 3002480197 / 31536000
     95
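The same arithmetic as a one-liner, for anyone wanting to check their own readings (31536000 is seconds in a 365-day year):

```python
def jump_years(sys_epoch: int, rtc_epoch: int) -> int:
    """Whole 365-day years between the jumped system clock and the RTC."""
    return (sys_epoch - rtc_epoch) // 31536000

# Figures from above: RTC says March 2019, system clock says year 2114
print(jump_years(4556039097, 1553558900))  # 95
```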
  4. I have been running a "stock" DEV kernel (5.0.y) on my PineA64+ while waiting for the previously noted (by others) date-setting issue to occur. "Stock" being a 'compile.sh' build using BRANCH="dev" and the PineA64 defconfig (BOARD="pine64", with no config changes). A date/time setting problem just happened a little while ago, and as I was about to start figuring out where the kernel went wrong I got a process crash with an apparent CPU stall error:

     [154619.384626] rcu: INFO: rcu_sched self-detected stall on CPU
     [154619.390319] rcu: 2-...0: (1 GPs behind) idle=78a/1/0x4000000000000002 softirq=1062220/1062259 fqs=1174
     [154619.399880] rcu: (t=266523 jiffies g=2266349 q=1232)
     [154619.405109] Task dump for CPU 2:
     [154619.408421] systemd R running task 0 1 0 0x00000002
     [154619.415553] Call trace:
     [154619.418099] dump_backtrace+0x0/0x1b0
     [154619.421849] show_stack+0x14/0x20
     [154619.425252] sched_show_task+0x154/0x188
     [154619.429259] dump_cpu_task+0x40/0x50
     [154619.432919] rcu_dump_cpu_stacks+0xc4/0x104
     [154619.437186] rcu_check_callbacks+0x694/0x760
     [154619.441539] update_process_times+0x2c/0x58
     [154619.445807] tick_sched_handle.isra.5+0x30/0x48
     [154619.450420] tick_sched_timer+0x48/0x98
     [154619.454339] __hrtimer_run_queues+0xe4/0x1f0
     [154619.458695] hrtimer_interrupt+0xf4/0x2b0
     [154619.462793] arch_timer_handler_phys+0x28/0x40
     [154619.467322] handle_percpu_devid_irq+0x80/0x138
     [154619.471936] generic_handle_irq+0x24/0x38
     [154619.476029] __handle_domain_irq+0x5c/0xb0
     [154619.480208] gic_handle_irq+0x58/0xa8
     [154619.483954] el0_irq_naked+0x4c/0x54

     I've been seeing these occasionally for random processes, but mostly with 'swapper'. This one is particularly problematic as 'systemd' itself has crashed. Without systemd there is no equivalent 'init', and restarting a crashed systemd isn't always the best idea. . .but that's a different problem.
I'm going to leave the above aside for now, but I'll see if I can get the entire stack dump logged persistently (syslog to persistent storage) to look for any patterns, in case anyone else is seeing these. Enabling RCU tracing may be helpful, but that'll be for another time. The 'rcu' messages themselves do get logged currently, but not the call trace (something else to look into). In the meantime I'll try to dig into the date/time setting issue, which I'll document in a separate thread once I have more.
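One low-effort way to keep those traces across reboots, assuming a systemd-based image, is persistent journald storage (this only helps for messages that actually reached the log, of course):

```
# /etc/systemd/journald.conf -- keep the journal on disk across reboots
[Journal]
Storage=persistent
```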
  5. Some SBCs, such as the PineA64, provide a built-in RTC on-board while others do not. For boards that do have one, the kernel driver appears to be configured as a built-in rather than as a module, at least in the cases I checked (I only checked a few). This brings up a question: for boards that include a built-in RTC and also have the respective kernel driver configured as a built-in, should the kernel option to also set the system clock from that RTC (RTC_HCTOSYS) be enabled by default? There is a dependency that the RTC driver be built-in, as the kernel will try to read from the RTC quite early, before any loadable modules could have been loaded. That would potentially preclude this from being a common default, but doesn't necessarily need to. Where it can be done, it has the advantage of not requiring separate user-space accommodations. For distributions based on systemd there are sadly a very large number of discussions and proposals on how to set the system time from the RTC at boot, but also sadly no standard or common solution: /lib/udev/hwclock-set explicitly exits without setting the system clock if systemd is found, and systemd will not read the RTC itself, expecting to use a network time source (NTP) instead. Having the kernel set the system time from the RTC would happen earlier than any user-land option, which would result in consistent and correct timestamps on files created/modified during startup as well as in the various logs. It may not be a perfect or complete solution, but where possible does this seem like something that could/should be done?
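For a board like the PineA64 with the sun6i RTC, the combination being proposed would look something like this config fragment (option names from the mainline Kconfig; the device name assumes the on-board RTC registers as rtc0):

```
CONFIG_RTC_CLASS=y
CONFIG_RTC_DRV_SUN6I=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
```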
  6. I'm seeing other issues, such as not being able to configure a non-tickless kernel. If anything, a constant-rate (non-tickless) kernel should be more stable, but post-4.14.y it is wildly unstable.
  7. Thanks. This looks to be a busy weekend for me, but I would like to start working on these a little deeper. There do seem to be timer issues post-4.14.y, and I haven't yet determined whether that could be due to missing kernel configs or to underlying kernel code changes. The former shouldn't be too hard to find. . .just tedious, but the latter might be more involved.
  8. I am seeing the same issue on a Pine64+, with 4.19.y, 4.20.y, and 5.0.y kernels. After some time of running OK, setting the date by any means fails as above. I also noticed that the actual date on the system suddenly goes way off (as in year 2114, for example). I've added this to my list in trying to get my pine64 back as an authoritative NTP server.
  9. Thanks! I've submitted a PR with two separate commits, one each for -dev and -next, with the exact same change. I didn't know if you would accept PRs on -next and meant to create two PRs, but the second commit was merged into the first PR that had already been submitted. . .
  10. I had submitted a PR last year that was committed to the tree at https://github.com/armbian/sunxi-DT-overlays but this doesn't seem to have made it to the mainline, nor is it included in any pine64 pre-built kernels. It is a very simple one-line correction. For now I've been using a local user patch when rebuilding kernels. Would I need to submit a separate PR for the linux mainline as well? Or would a PR to add a bundled patch in the armbian build tree be better/preferred?
  11. Well, this went way off into the woods far too quickly. I'll need to spend time reviewing the upstream (linux mainline) changes to see what changed. For now, 4.19.y here on a pine64+ is suitable neither for PPS nor for an NTP reference clock, from what I've been able to work out so far:

     Changing the kernel timer (General -> timers) to a constant rate (not any dyntick configuration) results in an unstable system that ultimately freezes hard. This is only needed to enable CONFIG_NTP_PPS (IE: use "hardpps", aka kernel-discipline PPS), so it will be set aside for now.

     Using a standard serial GPS via GPIO on a pine64 results in significant delays (as much as 1/10 second) reading the NMEA sentences, with extremely high jitter. Jitter is two orders of magnitude higher (100x worse) than with a 4.14.y kernel using the same configuration on the same board. This itself indicates something may not be right with this kernel on this board. Without PPS, the above has no possibility of being tamed.

     Using the default kernel config and simply enabling CONFIG_PPS_CLIENT_GPIO (via the standard kernel configuration 'make config' as part of compile.sh) results in this at boot:

     [ 1.286759] pps_core: LinuxPPS API ver. 1 registered
     [ 1.286773] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
     [ 6.199310] sun50i-a64-pinctrl 1c20800.pinctrl: pin PH9 already requested by pps@0; cannot claim for 1c20800.pinctrl:233
     [ 6.218536] pps-gpio pps@0: failed to request GPIO 233
     [ 6.223806] pps-gpio: probe of pps@0 failed with error -22

     No kernel pps device is ever created. Before digging in to the PPS issue, I want to review the mainline commits. Previously PPS itself was configurable as a module, but now it is only configurable as a built-in, so something else there may have changed. I may also give 4.20.y a try.
  12. I figured as much, so I'll work on it. I'll review the upstream changes to make sure there isn't a specific reason, if that is where this change happened. I also now have a conflict between pinctrl and pps that didn't happen with the 4.14.y kernel, so I want to identify and (hopefully) correct that first.
  13. After a recent (long overdue) update on my pine64 I found that pps-gpio was no longer working. Further investigation found that the default kernel configuration with 4.19.x no longer includes CONFIG_PPS_CLIENT_GPIO (nor CONFIG_PPS_CLIENT_LDISC). Anyone wishing to use pps-gpio (via overlay) will need to configure and build a custom kernel.

     user@pine64:~$ cat /etc/armbian-release
     # PLEASE DO NOT EDIT THIS FILE
     BOARD=pine64
     BOARD_NAME="Pine64"
     BOARDFAMILY=sun50iw1
     VERSION=5.76.190218
     LINUXFAMILY=sunxi64
     BRANCH=next
     ARCH=arm64
     IMAGE_TYPE=nightly
     BOARD_TYPE=conf
     INITRD_ARCH=arm64
     KERNEL_IMAGE_TYPE=Image
     user@pine64:~$ uname -a
     Linux pine64 4.19.25-sunxi64 #5.76.190310 SMP Sun Mar 10 16:22:07 CET 2019 aarch64 GNU/Linux
     user@pine64:~$ gzip -d < /proc/config.gz | egrep -i pps
     CONFIG_PPS=y
     # CONFIG_PPS_DEBUG is not set
     # PPS clients support
     # CONFIG_PPS_CLIENT_KTIMER is not set
     # CONFIG_PPS_CLIENT_LDISC is not set
     # CONFIG_PPS_CLIENT_GPIO is not set
     # PPS generators support

     This appears likely to be from upstream changes in default kernel config options, though I haven't investigated this yet. Would it be possible to re-add these (CONFIG_PPS_CLIENT_LDISC and CONFIG_PPS_CLIENT_GPIO) to the default armbian configurations? I can look at submitting a pull request with the needed changes if needed.
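For reference, the config delta being asked for is small (building the two clients as modules is my assumption here; they could equally be built-ins):

```
CONFIG_PPS=y
CONFIG_PPS_CLIENT_LDISC=m
CONFIG_PPS_CLIENT_GPIO=m
```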
  14. I've noted that 'armbianmonitor -u' will obfuscate some detail, such as IPv4 addresses, prior to uploading, but it misses similar additional detail:

     - IPv6 addresses, specifically 'global' scope, are left intact. These appear in interface configurations and in resolv.conf at least.
     - 'domain' and 'search' directives in resolv.conf are left intact. These can (would) include local domain names, which may be sensitive outside of a given environment.
     - The username is exposed when showing group membership. This may be minor, but in some cases may be sensitive outside of a given environment; perhaps it could be shown anonymously, such as '### Group membership of (logged in user): <group1> <group2> <group3>'?

     It looks like the IPv4 obfuscation is very basic. Are there plans or thoughts to improve this? Would there be any reason not to obfuscate the additional items above, if someone wanted to give implementing this a shot?
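For the IPv6 part, a sketch of what the extra masking pass could look like. Everything here is hypothetical (the helper name, the placeholder text), and the pattern is deliberately crude: it only matches full, uncompressed addresses, so a real version would also need to handle '::' compression and link-local scopes:

```python
import re

# Crude matcher for uncompressed IPv6 addresses (8 hex groups). Illustrative
# only -- a production regex must also cover '::'-compressed forms.
IPV6 = re.compile(r'\b(?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}\b')

def scrub(text: str) -> str:
    """Replace global-scope-looking IPv6 addresses with a fixed placeholder."""
    return IPV6.sub('xxxx::xxxx', text)

print(scrub("inet6 2001:db8:aaaa:bbbb:cccc:dddd:eeee:1/64 scope global"))
```

The same substitution approach would extend naturally to the 'domain'/'search' directives and the username in the group-membership line.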
  15. This would be a function of sysfs. You'll note that you are able to write to /sys/class/gpio/export without error (after changing permissions). This triggers the creation of /sys/class/gpio/gpioXX using default sysfs permissions. You would need to do the chmod after writing to the export, not before (IE: during boot). The gpio sysfs module would need to support configuring ownership and permissions for newly created nodes to change this.
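One commonly used workaround, rather than racing a boot-time chmod, is a udev rule that fixes up permissions whenever a gpio device appears. This is a sketch under the assumption that your image has a 'gpio' group and that gpio add events fire on export (the file name and group are hypothetical):

```
# /etc/udev/rules.d/99-gpio-perms.rules (hypothetical) -- re-permission
# sysfs GPIO nodes as they are created by writes to .../export
SUBSYSTEM=="gpio", ACTION=="add", RUN+="/bin/sh -c 'chgrp -R gpio /sys%p && chmod -R g+rw /sys%p'"
```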
  16. Hmm. Now this looks interesting (yes, it appears I am talking to myself!):

     root@pine64:~# cat /sys/kernel/debug/gpio
     gpiochip1: GPIOs 0-255, parent: platform/1c20800.pinctrl, 1c20800.pinctrl:
     gpio-166 ( |cd ) in lo
     gpio-201 ( |pps-gpio ) in hi IRQ
     gpiochip0: GPIOs 352-383, parent: platform/1f02c00.pinctrl, 1f02c00.pinctrl:
     gpio-354 ( |reset ) out lo

     Note that 'pps-gpio' is attached to gpio-201 and not the proper gpio-233 (PH9 => gpio-233). The difference being exactly 32 makes it appear the wrong pin bank is being selected ('G' instead of 'H'). Now to try to find where this is going wrong. . .but if anyone else has suggestions I'm all ears.

     Edit: Sure enough, without changing anything else I connected the PPS line to the physical pin for PG9 instead of PH9 and now see the PPS assertions. This does look like an off-by-one error, so now to find where that is. . .

     Edit2: Found it! In the u-boot script /boot/dtb/allwinner/overlays/sun50i-a64-fixup.scr, pin bank 'H' is assigned value '6' when it should be '7'.
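The numbering that makes the off-by-one visible: on the main Allwinner pinctrl, each bank is 32 GPIOs wide and banks count up from PA = 0, so the global number is bank * 32 + pin. A tiny helper (hypothetical, for illustration) reproduces both numbers seen above:

```python
def gpio_number(pin: str) -> int:
    """Map an Allwinner pin name like 'PH9' to its global GPIO number.

    Banks on the main pinctrl are 32 wide and count up from PA = 0
    (separate controllers like PL are not handled here).
    """
    bank, num = pin[1], int(pin[2:])
    return (ord(bank.upper()) - ord('A')) * 32 + num

print(gpio_number('PH9'))  # bank H = 7 -> 233, the pin that was requested
print(gpio_number('PG9'))  # bank G = 6 -> 201, the pin actually claimed
```

So assigning bank 'H' the value 6 in the fixup script lands every PH pin exactly 32 GPIOs low, in bank G.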
  17. I have tried searching but there doesn't seem to be much info on PPS support. It is a standard part of modern kernels and should be pretty consistent. On a Pine64+ using the mainline kernel (debian-stretch-next) I chose an interrupt-capable pin (PH9), updated /boot/armbianEnv.txt with

     overlays=pps-gpio uart1 uart2
     param_pps_pin=PH9

     and rebooted. I have a device (a GPS) with its PPS output connected to the Pi2 bus pin 13, which corresponds to PH9, and during boot this appears to be configured as a new PPS source:

     [ 5.548061] pps pps0: new PPS source pps@0.-1
     [ 5.548135] pps pps0: Registered IRQ 145 as PPS source

     /dev/pps0 is indeed created by udev, but unfortunately it never sees any assertions beyond the first:

     root@pine64:~# ppstest /dev/pps0
     trying PPS source "/dev/pps0"
     found PPS source "/dev/pps0"
     ok, found 1 source(s), now start fetching data...
     time_pps_fetch() error -1 (Connection timed out)
     time_pps_fetch() error -1 (Connection timed out)
     time_pps_fetch() error -1 (Connection timed out)
     ^C
     root@pine64:~# cat /sys/class/pps/pps0/assert
     1478193403.873076084#1

     I have confirmed with a scope that the connected device raises this line for 10ms every 1000ms, and have confirmed this can be seen via the GPIO pin by using 'cat' in a while loop to watch the value change to 1 and back to 0 each second. It appears the interrupt is not being raised beyond the initial device creation, which would be the reason pulses are never seen by any clients (such as ppstest):

     root@pine64:~# egrep 145 /proc/interrupts
     145: 1 0 0 0 sunxi_pio_edge 41 Edge pps@0.-1

     Would anyone have any thoughts here? I may try the legacy kernel to see if there is any difference.