NanoPi M4V2 (Rockchip RK3399): RAID with mdadm appears to write wrong/corrupt data


frod0r
Solved by piter75

Hello there,

 

I am using Armbian 20.11.6 Buster with Linux 4.4.213-rk3399 on a NanoPi M4V2. I have connected 3 SATA HDDs and use mdadm (for a RAID 5 array of those drives), LUKS, LVM and ext4 (in that order, more details below) to store my backups.

`armbianmonitor -u` link: http://ix.io/2NYv

 

Recently I received errors from borg backup about mismatching checksums.

To confirm that this is not a borg-specific error, I wrote this little zsh script to write data and generate checksums.

for ((i = 0; i < 100; i++)); do dd if=/dev/urandom bs=4M count=256 | tee $i | md5sum | sed 's/-$/'"$i"'/' > $i.sum; done

To be more precise, in each loop iteration I acquire 1 GiB of random data using `dd`. `tee` writes it to the file system and to stdout, which is piped into md5sum, which generates the checksum. The `sed` directive removes the `-` in the checksum file and replaces it with the appropriate filename.

I then verify the checksums like so

for ((i = 0; i < 100; i++)); do md5sum -c $i.sum; done

A lot of the checks fail.
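For anyone who wants to repeat this test, the two loops above can be wrapped into reusable functions. This is only a sketch under a few assumptions: the function names `write_test_files` and `count_mismatches` are made up for illustration, the checksum file is written with `printf` instead of the `sed` substitution, and plain POSIX shell is used rather than zsh.

```shell
# Write COUNT files of SIZE_MIB MiB of random data into DIR and store an
# md5sum-compatible checksum file (<name>.sum) next to each one.
write_test_files() {
    dir=$1; count=$2; size_mib=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # tee stores the data while md5sum hashes the same stream, so the
        # checksum describes what was generated, not what ended up on disk.
        sum=$(dd if=/dev/urandom bs=1M count="$size_mib" 2>/dev/null \
              | tee "$dir/$i" | md5sum | cut -d' ' -f1)
        printf '%s  %s\n' "$sum" "$i" > "$dir/$i.sum"
        i=$((i + 1))
    done
}

# Re-read every file in DIR and print how many no longer match their checksum.
count_mismatches() {
    dir=$1; failed=0
    for sumfile in "$dir"/*.sum; do
        [ -e "$sumfile" ] || continue
        # md5sum -c resolves file names relative to the current directory,
        # so the check runs in a subshell that has changed into DIR.
        (cd "$dir" && md5sum -c "$(basename "$sumfile")" >/dev/null 2>&1) \
            || failed=$((failed + 1))
    done
    echo "$failed"
}
```

For the test described here this would be something like `write_test_files /media/raid 100 1024` followed by `count_mismatches /media/raid`. Any result other than 0 means at least one file no longer matches its checksum.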

 

My HDD structure looks as follows

nanopim4v2:~:% lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                             8:0    1 931.5G  0 disk  
└─md127                         9:127  0   1.8T  0 raid5 
  └─md0_crypt                 253:0    0   1.8T  0 crypt 
    └─vg00-lv00_nas           253:1    0   1.8T  0 lvm   /media/raid
sdb                             8:16   1 931.5G  0 disk  
└─md127                         9:127  0   1.8T  0 raid5 
  └─md0_crypt                 253:0    0   1.8T  0 crypt 
    └─vg00-lv00_nas           253:1    0   1.8T  0 lvm   /media/raid
sdc                             8:32   1   3.7T  0 disk  
├─sdc1                          8:33   1 931.5G  0 part  
│ └─md127                       9:127  0   1.8T  0 raid5 
│   └─md0_crypt               253:0    0   1.8T  0 crypt 
│     └─vg00-lv00_nas         253:1    0   1.8T  0 lvm   /media/raid
└─sdc2                          8:34   1   2.7T  0 part  
  └─sdc2_crypt                253:2    0   2.7T  0 crypt 
    └─vg01_non_redundant-lv00 253:3    0   2.7T  0 lvm   /media/non_redundant
sdd                             8:48   1  29.8G  0 disk  
└─sdd1                          8:49   1  29.5G  0 part  /home/frieder/prevfs
mmcblk0                       179:0    0  59.5G  0 disk  
└─mmcblk0p1                   179:1    0  58.9G  0 part  /
zram0                         251:0    0   1.9G  0 disk  [SWAP]
zram1                         251:1    0    50M  0 disk  /var/log

 

As you can see, I also use LVM on LUKS on a second partition of sdc. I ran the scripts on sdc2 as well and received no errors there.

According to mdadm my array is not in a broken state

nanopim4v2:~:% cat /proc/mdstat                            
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md127 : active raid5 sdc1[3] sda[1] sdb[0]
      1953260928 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

unused devices: <none>

 

Also of note: I have run check (`mdadm --action=check /dev/md127`) and repair (`mdadm --action=repair /dev/md127`) several times.

After the checks I saw nonzero values (<200) in mismatch_cnt (`/sys/block/md127/md/mismatch_cnt`). I was able to get the value down to zero by repeatedly calling repair, but it went up again after writing data and running check again.
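The check/repair cycle can also be driven through sysfs, which makes it easier to script. A minimal sketch, assuming the standard md sysfs layout where writing `check` or `repair` to `sync_action` starts the scrub and the file reads `idle` again once it has finished. The function names and the `MD_POLL_SECONDS` variable are hypothetical.

```shell
# Print the mismatch counter of an md array, given its sysfs directory
# (for the array in this thread: /sys/block/md127/md).
md_mismatch_cnt() {
    cat "$1/mismatch_cnt"
}

# Start a scrub ("check" or "repair"), wait until the array reports "idle"
# again, then print the resulting mismatch count.
md_scrub() {
    md_dir=$1; action=$2
    echo "$action" > "$md_dir/sync_action"
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        # Polling interval in seconds; 60 is an arbitrary default.
        sleep "${MD_POLL_SECONDS:-60}"
    done
    md_mismatch_cnt "$md_dir"
}
```

Run as root, `md_scrub /sys/block/md127/md check` would print the counter after a full pass. A value that keeps coming back after `repair` plus new writes is the pattern described above.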

 

 

I also tested the script on other hardware (with other drives, amd64 architecture, running Arch with kernel 5.10.11) and there I never received even a single failed checksum so I have reason to believe the script does what it should.

To confirm that I don't just have a broken RAID, I connected the drives to an amd64 computer running a live distribution of Debian (debian-live-10.7.0-amd64-gnome.iso, kernel 4.19-lts) and repeated my checksum test. There were no failures there either.

I also noted that the files and checksums I wrote from the NanoPi still mismatched on amd64, and the files and checksums I wrote from amd64 still matched on the NanoPi. So whatever is causing this error must only affect writing.

 

 

I did not forget to check the S.M.A.R.T. values, but I did not find anything suspicious there. To keep this post readable, I attached the values via pastebin:

sda: https://pastebin.com/S06RfAgC sdb: https://pastebin.com/5TQYbWvu sdc: https://pastebin.com/i4esRqit
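To compare a single attribute across drives without reading the full dumps, the attribute table can be filtered. This is a sketch that assumes the usual ten-column layout of `smartctl -A` output, with the attribute name in column 2 and the raw value in column 10. The helper name `smart_raw_value` is hypothetical.

```shell
# Print the raw value of one SMART attribute.
# Reads the attribute table produced by `smartctl -A` from stdin.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

Usage would be along the lines of `sudo smartctl -A /dev/sda | smart_raw_value UDMA_CRC_Error_Count`, repeated for sdb and sdc. A rising UDMA_CRC_Error_Count would point towards errors on the SATA link itself.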

 

As a final test, I booted a fresh image (Armbian_20.11.10_Nanopim4v2_buster_legacy_4.4.213_desktop.img), installed just the necessary tools (mdadm, cryptsetup and lvm2) and repeated the test.

While I did not receive as many checksum mismatches as before, 4 of the 70 1 GiB files I checked still failed.

 

 

Other possibly relevant outputs:

nanopim4v2:% sudo lvm version  
  LVM version:     2.03.02(2) (2018-12-18)
  Library version: 1.02.155 (2018-12-18)
  Driver version:  4.34.0
  Configuration:   ./configure --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --libexecdir=${prefix}/lib/aarch64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/aarch64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/aarch64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync

nanopim4v2:% sudo cryptsetup --version
cryptsetup 2.1.0

nanopim4v2:% sudo cryptsetup status  md0_crypt               
/dev/mapper/md0_crypt is active and is in use.
  type:    LUKS2
  cipher:  aes-xts-plain64
  keysize: 512 bits
  key location: dm-crypt
  device:  /dev/md127
  sector size:  512
  offset:  32768 sectors
  size:    3906489088 sectors
  mode:    read/write

nanopim4v2:% sudo mdadm -V                               
mdadm - v4.1 - 2018-10-01

 

 

As a TL;DR: I believe there is something wrong with mdadm writes on armbian rk3399 legacy buster build.

 

Can someone reproduce this behavior? If not, do you have any other idea where this problem could be coming from, and what steps I can take to get a reliably working RAID again?

Sidenote: I would prefer to stay on lts for now as wiringpi is AFAIK not supported on newer kernels but of course if it is the only way, I am willing to switch to newer kernels.

 

 

I appreciate any help and suggestions, cheers!


7 minutes ago, frod0r said:

I would prefer to stay on lts for now as wiringpi is AFAIK not supported on newer kernels but of course if it is the only way, I am willing to switch to newer kernels.

 

WiringPi is EOL. It is not supported by its authors and was never officially supported by Armbian. It was provided as-is and is tied to the vendor's private kernel. The Armbian community came up with a universal solution which should support modern hardware interfaces. Check the pinned posts in this subforum: https://forum.armbian.com/forum/40-reviews-tutorials-hardware-hacks/

 

Private legacy kernels are way lower quality than modern ones. Their main, if not only, advantage is that they cover most if not all hardware functions, but that is where it ends. This kernel is maintained only by Rockchip (it is pointless to lose time discussing its quality); a few things are added by board vendors, while the community mainly contributes small additions and hacks here and there. tl;dr: for things such as this, a modern kernel is our only hope.

 

15 minutes ago, frod0r said:

amd64 architecture, running

 

ARM single board computers are not mainstream x86, even if some of them are not far off in some functions. Distribution - the cheap difference (arch vs debian vs ubuntu) - plays no role in this. https://docs.armbian.com/#what-is-armbian ... which you already confirmed by testing Debian on x64. Perhaps more tests are needed, but points of possible corruptions are PCI to SATA controllers (chip or driver) and PCI implementation. My 2c.


First of all, thank you for your answer. I appreciate that you took time out of your day to reply in my thread.

 

Some comments:

I am aware that WiringPi is EOL; however, I was not yet aware of (and had not yet looked into) a replacement. Thank you for the resource.

On 2/1/2021 at 4:32 PM, Igor said:

Private legacy kernels are way lower quality than modern ones.

 

I am not quite sure I can follow. I installed the Buster desktop 4.4.y Image from https://www.armbian.com/nanopi-m4-v2. Does this not include an open source kernel?

On 2/1/2021 at 4:32 PM, Igor said:

Distribution - the cheap difference (arch vs debian vs ubuntu) - plays no role in this


AFAIK Arch does not use the same kernel as Debian, and I was not entirely sure whether some of the relevant packages might be different, so I mentioned it. But I agree this should play no role.

On 2/1/2021 at 4:32 PM, Igor said:

which you already confirmed by testing Debian on x64

What I confirmed (or at least aimed to confirm) was that the error appears on armv8 but not on amd64 (both architectures are 64-bit), and that it is not an issue with the 4.19 base kernel per se but appears to be specific to this setting.

On 2/1/2021 at 4:32 PM, Igor said:

but points of possible corruptions are PCI to SATA controllers (chip or driver) and PCI implementation

But if that were the case, would I not get higher UDMA_CRC_Error_Count SMART values?

It also conflicts with the observation that the partition that is not managed by mdadm works fine with the otherwise identical hardware and software setup.

 

However, I also tested the Armbian_20.11.10_Nanopim4v2_buster_current_5.9.14 image: of the 70 1 GiB files I wrote, one was faulty. So you are probably right that there is some hardware issue somewhere in my setup.

 

 

Edited by frod0r
Forgot / in a sentence that does not make sense otherwise

I did some more testing and am now pretty certain that I can exclude hardware faults:

 

Again with fresh Armbian_20.11.10_Nanopim4v2_buster_current_5.9.14 and Armbian_20.11.10_Nanopim4v2_buster_legacy_4.4.213_desktop installations, I received write errors on a drive that only has ext4 on it (no RAID or LVM layer).

I tested every SATA port with 100*1GiB blocks. I only received 1 error each in two of the runs and none in the other two, but errors nonetheless.

 

I then also downloaded the proprietary FriendlyCore image (to be precise, rk3399-sd-friendlycore-focal-4.19-arm64-20201226.img).

I tested it with the same per-SATA-port method as described above, plus 200*1GiB on the original RAID array, and received no errors at all.

 

I really am at a loss what the problem could be here

 


  • Solution
1 hour ago, frod0r said:

I really am at a loss what the problem could be here

Could you test this image with your procedure?

It should fix the issues with voltage scaling on the little cores cluster that made my units unstable and caused them to fail in the first loop of "memtester 3280M".

With the fix it ran 150 loops without failures.
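For reference, a run like this can be wrapped so that it ends with a clear verdict. A rough sketch, assuming memtester is installed, takes the amount of memory followed by the loop count as arguments, and exits with a nonzero status when a test fails. `run_memtester` and the `MEMTESTER_BIN` override are hypothetical names that only exist to allow a dry run.

```shell
# Run memtester on MEM (e.g. 3280M) for LOOPS loops and print a verdict.
# MEMTESTER_BIN is a hypothetical override so the wrapper can be tried
# without actually locking that much memory.
run_memtester() {
    mem=$1; loops=$2
    if "${MEMTESTER_BIN:-memtester}" "$mem" "$loops" >/dev/null; then
        echo "PASS: $loops loop(s) of $mem"
    else
        echo "FAIL: memtester reported errors or could not run"
        return 1
    fi
}
```

The test above would correspond to `sudo memtester 3280M 150` directly, or `run_memtester 3280M 150` from a root shell with the wrapper loaded.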

 

 


Thank you!

I tested the image you sent and have received no errors so far.

As before I have written and tested 100*1GiB files over each of the 4 sata ports.

Next I will test writing to the RAID array.

 

Just out of interest: in the GitHub issue you describe that the limit on voltage change is required since the regulatory circuit of the M4V2 can't handle more.

I am curious whether I understood it correctly that this is not dependent on the power input method (5V USB-C vs 12V ATX).

If it is affected by that, I want to add that I am using the 12V ATX port with a 10A PSU.

 

Again, thank you for this image. It really seems to have fixed my problem, and I will post an update tomorrow when I have tested the RAID as well.


On 2/27/2021 at 2:18 AM, frod0r said:

whether I understood it correctly that this is not dependent on the power input method

You are right. It does not depend on the way the board is powered.

 

My electronics education has long faded after years spent in the software industry, but I suppose it's either an issue with the RK808 PMIC or with the low-pass filter in its output circuitry.

The former is less likely, as there are quite a few RK3399/RK808 boards that do not exhibit these issues, so I guess it's more likely to be an output filter issue.


Thanks for the clarification.

 

Just to add to your test cases: I have tested your image with memtester 3280M, for 15 loops (running half a day) without any errors.

I then re-tested the Armbian_20.11.10_Nanopim4v2_buster_current_5.9.14.img image with memtester 3280M and received a failure in the first loop (at Bit Flip, after ~30min of testing).

 

Edit: Oh I just saw I am a bit late with this additional test, as your pull request got merged 16hours ago. Sorry I took so long

Edited by frod0r

1 hour ago, frod0r said:

Oh I just saw I am a bit late with this additional test, as your pull request got merged 16hours ago. Sorry I took so long

No problem ;) Thanks for testing the mdadm scenarios.
I decided this was good enough (together with my memtester tests) and also had some time to verify that the fix did not affect other boards.

 

Since it is merged it means that v21.05 should finally run stable on M4v2 in mainline... it may also be sooner if there is another revision of v21.02 ;)
