griefman

  1. So, after struggling for quite some time, I was finally able to get things back to normal without losing data (at least I think so). The problem was apparently a broken kernel upgrade to version 5.10.43, which left broken symlinks, among other things. In the end, what solved it was the following: I first put the latest stable Armbian on an SD card and booted from it. Then I upgraded that image to the latest kernel and copied all version-related files from /boot on the SD card to /boot on the MMC. I also copied the 5.10.45 modules from /lib/modules on the SD card to the MMC. I fixed all symlinks in /boot, and then the device finally booted. After that it was all about reinstalling kernel headers, cleaning up wrong ZFS versions and packages, and rebooting frequently enough in between. I hope that I didn't break too much and that this helps someone.
  2. Hi, I have been struggling with a very similar issue for the past few days. I did an update roughly a week ago and did not reboot afterwards. The Helios64 continued running for some time, until some days later my files stopped being available. Then I checked what was going on and found very similar startup logs. I used UMS mode to examine the situation, but nothing seemed to be wrong. I ran fsck /dev/sdb1 -f and it looked fine. When I ran fsck /dev/sdb, it told me that the device had a bad magic number in the super-block and that it had found a DOS partition table in /dev/sdb. I tried the same fsck on /dev/sdb, but with the superblocks that I found via mke2fs -n /dev/sdb, and it did not show anything else. Unfortunately I am running out of ideas now, and I would really appreciate any support you can offer. I am trying to avoid starting from scratch.
  3. I used a power cable that was provided by Kobol and brand-new SATA cables from Amazon. Also, I did not have any bad sectors written on the disks, only UDMA CRC errors. My drives were also brand new when I connected them.
  4. I have been following the discussion for a while and would like to report where I am at. I would also like to request that Kobol record this information in a systematic way, if this isn't already happening. I have the following setup:

           pool: mypool
          state: ONLINE
           scan: resilvered 12K in 00:00:00 with 0 errors on Tue Feb 9 17:22:03 2021
         config:

             NAME                                          STATE   READ WRITE CKSUM
             mypool                                        ONLINE     0     0     0
               raidz1-0                                    ONLINE     0     0     0
                 ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE     0     0     0
                 ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE     0     0     0
                 ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE     0     0     0
                 ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE     0     0     0
                 ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE     0     0     0

     Some weeks ago I noticed that the health of the zpool was DEGRADED. I checked, and one device had READ errors and was marked as FAULTED. UDMA CRC errors had also been recorded in the SMART stats for this drive. So I cleared the errors and ran a scrub to see what was going on:

         sudo zpool scrub mypool

     I monitored the scrub with:

         sudo watch zpool status

     I saw quite quickly that all drives had started to get READ errors. SMART also reported that all drives now had UDMA CRC errors. It was clear that something bad was going on, so I contacted Kobol and we started to debug the issue together.

     First I changed the SATA speed to 3 Gbps by adding the following line to /boot/armbianEnv.txt:

         extraargs=libata.force=3.0

     The results were similar, but I noticed that the errors started to show up a bit later in the scrubbing process. The UDMA CRC error count increased.

     Then I replaced the power and SATA cables with new ones, which unfortunately did not bring any improvement.

     Then I disabled NCQ by adding the following to /boot/armbianEnv.txt:

         extraargs=libata.force=noncq

     and reverted to SATA 6 Gbps by removing the 3 Gbps line introduced earlier. This had a positive result, and I was able to run the scrub without any errors.

     Then I went back, installed the old (original) cable harness again and retested: all good.

     While disabling NCQ has a positive impact on the errors, it also has a negative impact on the speed and, to some extent, on the disk drives' health. I have also tried reducing the NCQ depth to 31, which is the recommended value, but this did not have any impact.

     I hope that Kobol will use this information to try to reproduce the issue themselves, to see whether only certain boards are affected or every board is.
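
The tests in the last post were judged by whether the UDMA CRC error count kept climbing during a scrub. Here is a minimal sketch of how that counter could be compared before and after a scrub. It assumes smartmontools is installed and that the drive reports attribute 199 (UDMA_CRC_Error_Count) in the usual `smartctl -A` table layout, with the raw value in the tenth column. The helper name `crc_count` and the device `/dev/sda` are placeholders and not from the original post.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# value of SMART attribute 199 (UDMA_CRC_Error_Count), taken from column 10.
# Prints nothing if the attribute is not present in the input.
crc_count() {
    awk '$1 == 199 { print $10 }'
}

# Example usage (commented out; /dev/sda is a placeholder device):
#   before=$(sudo smartctl -A /dev/sda | crc_count)
#   sudo zpool scrub mypool    # returns immediately; wait until
#                              # `zpool status` reports the scrub as finished
#   after=$(sudo smartctl -A /dev/sda | crc_count)
#   echo "new CRC errors during scrub: $((after - before))"
```

Running the same comparison for each drive in the pool, once per configuration (3 Gbps, NCQ disabled, original harness), would give numbers that are easier to compare than watching `zpool status` alone.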