frod0r Posted February 1, 2021

Hello there,

I am using Armbian 20.11.6 Buster with Linux 4.4.213-rk3399 on a NanoPi M4V2. I have connected 3 SATA HDDs and use mdadm (for a RAID 5 array of those drives), LUKS, LVM and ext4 (in that order, more details below) to store my backups.

`armbianmonitor -u` link: http://ix.io/2NYv

Recently I have received errors in Borg backup about mismatching checksums. To confirm that this is not a Borg-specific error, I wrote this little zsh script to write data and generate checksums:

```sh
for ((i = 0; i < 100; i++)); do
    dd if=/dev/urandom bs=4M count=256 | tee $i | md5sum | sed '/^a/!s/.$/'"$i"'/' > $i.sum
done
```

To be more precise: in each loop iteration I acquire 1 GiB of random data using `dd`. `tee` writes it to the file system and to stdout, which is piped into `md5sum`, which generates the checksum. The `sed` directive removes the `-` in the checksum file and replaces it with the appropriate filename.

I then verify the checksums like so:

```sh
for ((i = 0; i < 100; i++)); do md5sum -c $i.sum; done
```

A lot of the checks fail.

My HDD structure looks as follows:

```
nanopim4v2:~:% lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                             8:0    1 931.5G  0 disk
└─md127                         9:127  0   1.8T  0 raid5
  └─md0_crypt                 253:0    0   1.8T  0 crypt
    └─vg00-lv00_nas           253:1    0   1.8T  0 lvm   /media/raid
sdb                             8:16   1 931.5G  0 disk
└─md127                         9:127  0   1.8T  0 raid5
  └─md0_crypt                 253:0    0   1.8T  0 crypt
    └─vg00-lv00_nas           253:1    0   1.8T  0 lvm   /media/raid
sdc                             8:32   1   3.7T  0 disk
├─sdc1                          8:33   1 931.5G  0 part
│ └─md127                       9:127  0   1.8T  0 raid5
│   └─md0_crypt               253:0    0   1.8T  0 crypt
│     └─vg00-lv00_nas         253:1    0   1.8T  0 lvm   /media/raid
└─sdc2                          8:34   1   2.7T  0 part
  └─sdc2_crypt                253:2    0   2.7T  0 crypt
    └─vg01_non_redundant-lv00 253:3    0   2.7T  0 lvm   /media/non_redundant
sdd                             8:48   1  29.8G  0 disk
└─sdd1                          8:49   1  29.5G  0 part  /home/frieder/prevfs
mmcblk0                       179:0    0  59.5G  0 disk
└─mmcblk0p1                   179:1    0  58.9G  0 part  /
zram0                         251:0    0   1.9G  0 disk  [SWAP]
zram1                         251:1    0    50M  0 disk  /var/log
```

So as you can see, I also use LVM on LUKS on a second partition of sdc. I tested the scripts on sdc2 and received no errors.

According to mdadm, my array is not in a broken state:

```
nanopim4v2:~:% cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md127 : active raid5 sdc1[3] sda[1] sdb[0]
      1953260928 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

unused devices: <none>
```

Also of note: I have tried check (`mdadm --action=check /dev/md127`) and repair (`mdadm --action=repair /dev/md127`) several times. I then saw a nonzero value (<200) in `mismatch_cnt` (`/sys/block/md127/md/mismatch_cnt`), which I was able to get down to zero by repeatedly calling repair, but the number went up again after writing data and calling check again.

I also tested the script on other hardware (with other drives, amd64 architecture, running Arch with kernel 5.10.11) and there I never received even a single failed checksum, so I have reason to believe the script does what it should.

To confirm that I don't just have a broken RAID, I connected the drives to an amd64 computer running a live distribution of Debian (debian-live-10.7.0-amd64-gnome.iso, kernel 4.19-lts) and repeated my checksum test, this time without failures. I also noted that the files and checksums I wrote from the NanoPi still mismatched on amd64, and the files and checksums I wrote from amd64 still matched on the NanoPi. So whatever is causing this error must only affect writing.

I did not forget to check the S.M.A.R.T. values, but I did not find anything suspicious there. To keep this post readable I attached the values via Pastebin:

sda: https://pastebin.com/S06RfAgC
sdb: https://pastebin.com/5TQYbWvu
sdc: https://pastebin.com/i4esRqit

As a final test, I booted a fresh image (Armbian_20.11.10_Nanopim4v2_buster_legacy_4.4.213_desktop.img), installed just the necessary tools (mdadm, cryptsetup and lvm2) and repeated the test. While I did not receive as many checksum mismatches as before, of 70 1 GiB blocks checked, 4 still failed.

Other possibly relevant outputs:

```
nanopim4v2:% sudo lvm version
  LVM version:     2.03.02(2) (2018-12-18)
  Library version: 1.02.155 (2018-12-18)
  Driver version:  4.34.0
  Configuration:   ./configure --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --libexecdir=${prefix}/lib/aarch64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/aarch64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/aarch64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync

nanopim4v2:% sudo cryptsetup --version
cryptsetup 2.1.0

nanopim4v2:% sudo cryptsetup status md0_crypt
/dev/mapper/md0_crypt is active and is in use.
  type:    LUKS2
  cipher:  aes-xts-plain64
  keysize: 512 bits
  key location: dm-crypt
  device:  /dev/md127
  sector size:  512
  offset:  32768 sectors
  size:    3906489088 sectors
  mode:    read/write

nanopim4v2:% sudo mdadm -V
mdadm - v4.1 - 2018-10-01
```

As a TL;DR: I believe there is something wrong with mdadm writes on the Armbian rk3399 legacy Buster build. Can someone reproduce this behavior? If not, do you have a(nother) idea where this problem could be coming from, and what steps I can take to get a reliably working RAID again?

Sidenote: I would prefer to stay on LTS for now, as WiringPi is AFAIK not supported on newer kernels, but of course, if it is the only way, I am willing to switch to newer kernels.

I appreciate any help and suggestions, cheers!
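P.S. In case anyone wants to reproduce the check/repair cycle without typing the commands by hand, here is a rough sketch. I ran the individual steps manually, so this loop itself is untested as written; `md127` matches my setup, adjust as needed:

```sh
#!/bin/sh
# Repeatedly check the array and repair until mismatch_cnt stays at zero.
# /sys/block/<md>/md/sync_action reads back "idle" once an action has finished.
md=md127
while :; do
    mdadm --action=check /dev/$md
    while [ "$(cat /sys/block/$md/md/sync_action)" != "idle" ]; do sleep 60; done
    count=$(cat /sys/block/$md/md/mismatch_cnt)
    echo "mismatch_cnt after check: $count"
    [ "$count" -eq 0 ] && break
    mdadm --action=repair /dev/$md
    while [ "$(cat /sys/block/$md/md/sync_action)" != "idle" ]; do sleep 60; done
done
```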
Igor Posted February 1, 2021

7 minutes ago, frod0r said:

I would prefer to stay on LTS for now, as WiringPi is AFAIK not supported on newer kernels, but of course, if it is the only way, I am willing to switch to newer kernels.

WiringPi is EOL: not supported by its authors and never officially supported by Armbian. It was provided as-is and is tied to the vendor's private kernel. The Armbian community came up with a universal solution which should support a modern hardware interface. Check the pinned posts in this subforum: https://forum.armbian.com/forum/40-reviews-tutorials-hardware-hacks/

Private legacy kernels are of much lower quality than modern ones. Their main (and only) advantage is that they cover most if not all hardware functions, but that is where it ends. This kernel is maintained only by Rockchip (arguing about its quality is a waste of time); a few things are added by board vendors, while the community mainly adds small additions and hacks here and there.

tl;dr: For things such as this, a modern kernel is our only hope.

15 minutes ago, frod0r said:

amd64 architecture, running

ARM single-board computers are not mainstream x86, even if in some functions they are not far behind. The distribution (Arch vs Debian vs Ubuntu) plays no role in this. https://docs.armbian.com/#what-is-armbian ... which you already confirmed by testing Debian on x64. Perhaps more tests are needed, but possible points of corruption are the PCI-to-SATA controller (chip or driver) and the PCI implementation. My 2c.
frod0r Posted February 3, 2021

First of all, thank you for your answer; I appreciate that you took time out of your day to reply to my thread. Some comments:

I am aware that WiringPi is EOL, however I was not yet aware of / had not yet looked into a replacement, thank you for the resource.

On 2/1/2021 at 4:32 PM, Igor said:

Private legacy kernels are of much lower quality than modern ones.

I am not quite sure I can follow. I installed the Buster desktop 4.4.y image from https://www.armbian.com/nanopi-m4-v2. Does this not include an open source kernel?

On 2/1/2021 at 4:32 PM, Igor said:

The distribution (Arch vs Debian vs Ubuntu) plays no role in this.

AFAIK Arch does not use the same kernel as Debian, and I was not entirely sure whether some of the relevant packages might be different, so I mentioned it, but I agree this should play no role.

On 2/1/2021 at 4:32 PM, Igor said:

which you already confirmed by testing Debian on x64

What I confirmed (or at least what I aimed to confirm) was that the error appears on armv8 but not on amd64 (both architectures are 64-bit), and that it is not an issue with the 4.19 base kernel per se, but appears in this specific setting.

On 2/1/2021 at 4:32 PM, Igor said:

possible points of corruption are the PCI-to-SATA controller (chip or driver) and the PCI implementation

But if that were the case, would I not get higher UDMA_CRC_Error_Count S.M.A.R.T. values? Also, this conflicts with the observation that the partition that is not managed by mdadm works fine with the otherwise same hardware and software setup. However, I also tested the Armbian_20.11.10_Nanopim4v2_buster_current_5.9.14 image, and of 70 written 1 GiB blocks, one was faulty, so you are probably right that there is some hardware issue somewhere in my setup.
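(For reference, the attribute I mean can be read per drive with smartctl; `/dev/sda` here is just an example:)

```sh
# Show the SMART vendor attributes and pick out the CRC error counter,
# which would normally increase on link-level (cable/controller) corruption
sudo smartctl -A /dev/sda | grep -i udma_crc_error_count
```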
frod0r Posted February 26, 2021

I did some more testing, and I am pretty certain now that I can exclude hardware faults.

Again with fresh Armbian_20.11.10_Nanopim4v2_buster_current_5.9.14 and Armbian_20.11.10_Nanopim4v2_buster_legacy_4.4.213_desktop installations, I received write errors on a drive that only has ext4 on it (no RAID or LVM layer). I tested every SATA port with 100 1 GiB blocks (see the sketch below); I only received 1 error each in two of the runs and none in the other two, but errors nonetheless.

I then also downloaded the proprietary FriendlyCore image (to be precise, rk3399-sd-friendlycore-focal-4.19-arm64-20201226.img). I tested with the same per-SATA-port method as described above and with 200 1 GiB blocks on the original RAID array, and I received no errors at all.

I really am at a loss what the problem could be here.
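The per-port test was essentially the loop from my first post run against a plain ext4 mount; as a sketch (device and mount point are examples, I repeated this once per SATA port):

```sh
# Write 100 x 1 GiB files with checksums onto one drive, then verify them.
mount /dev/sda1 /mnt/test && cd /mnt/test
for ((i = 0; i < 100; i++)); do
    dd if=/dev/urandom bs=4M count=256 | tee $i | md5sum | sed '/^a/!s/.$/'"$i"'/' > $i.sum
done
for ((i = 0; i < 100; i++)); do md5sum -c $i.sum; done
cd / && umount /mnt/test
```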
piter75 Posted February 26, 2021 (Solution)

1 hour ago, frod0r said:

I really am at a loss what the problem could be here

Could you test this image with your procedure? It should fix the issues with voltage scaling on the little cores cluster that made my units unstable and fail in the first loop of `memtester 3280M`. With the fix they ran 150 loops without failures.
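(In case you have not used it before: memtester takes the amount of memory to lock plus an optional loop count, so the test above was simply:)

```sh
# Lock ~3.2 GiB (leaving headroom for the OS on a 4 GB board) and run 150 passes
sudo memtester 3280M 150
```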
frod0r Posted February 27, 2021

Thank you! I tested the image you sent and have received no errors so far. As before, I have written and tested 100 1 GiB files over each of the 4 SATA ports. Next I will test writing on the RAID array.

Just out of interest: in the GitHub issue you describe that the limit on voltage change is required since the regulatory circuit of the M4V2 can't handle more. I am curious whether I understood correctly that this does not depend on the power input method (5V USB-C vs 12V ATX). If it is affected by that, I want to add the information that I am using the 12V ATX port with a 10A PSU.

Again, thank you for this image. It really seems to have fixed my problem, and I will post an update tomorrow when I have tested the RAID as well.
frod0r Posted February 27, 2021

I also tested the RAID array with 100 1 GiB blocks and got no errors, this is great!
piter75 Posted February 28, 2021

On 2/27/2021 at 2:18 AM, frod0r said:

whether I understood correctly that this does not depend on the power input method

You are right, it does not depend on the way the board is powered. My electronics education has long diminished after years in the software industry, but I suppose it's either an issue with the RK808 PMIC or with the low-pass filter in its output circuitry. The former is less likely, as there are quite a few rk3399/rk808 boards that do not exhibit these issues, so I guess it's more likely to be the output filter.
frod0r Posted March 3, 2021

Thanks for the clarification. Just to add to your test cases: I have tested your image with `memtester 3280M` for 15 loops (running half a day) without any errors. I then re-tested the Armbian_20.11.10_Nanopim4v2_buster_current_5.9.14.img image with `memtester 3280M` and received a failure in the first loop (at Bit Flip, after ~30 min of testing).

Edit: Oh, I just saw I am a bit late with this additional test, as your pull request got merged 16 hours ago. Sorry I took so long.
piter75 Posted March 3, 2021

1 hour ago, frod0r said:

Oh, I just saw I am a bit late with this additional test, as your pull request got merged 16 hours ago. Sorry I took so long.

No problem. Thanks for testing the mdadm scenarios. I decided this was good enough (plus my memtester tests) and also got some time to verify that the fix did not affect other boards. Since it is merged, v21.05 should finally run stable on the M4v2 in mainline... it may also come sooner if there is another revision of v21.02.
TRS-80 Posted March 4, 2021

On 3/3/2021 at 2:27 PM, piter75 said:

v21.05 should finally run stable on M4v2 in mainline

Big if true!