griefman

Reputation Activity

  1. Like
    griefman got a reaction from hartraft in Failing to Boot   
    So, after struggling for quite some time, I was finally able to get things back to normal without losing data (at least I think so).
     
    The problem was apparently a broken kernel upgrade to version 5.10.43. There were broken symlinks, among other problems.
     
    In the end what solved it was the following:
    I first put the latest stable Armbian image on an SD card and booted from it. Then I upgraded that image to the latest kernel and copied all version-related files from /boot on the SD card to /boot on the MMC. I also copied the 5.10.45 modules from /lib/modules on the SD card to the MMC.
    I then fixed all the symlinks in /boot, and the device finally booted. After that it was a matter of reinstalling the kernel headers, cleaning up the wrong ZFS versions and packages, and rebooting frequently in between.
     
    I hope that I didn't break too much and that this helps someone.
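The symlink step above can be checked mechanically. Below is a minimal sketch, not the exact commands used in the post: it assumes GNU find is available, and the function name `list_broken_symlinks` and the mount point `/mnt/mmc` are made up for illustration.

```shell
#!/bin/sh
# List dangling symlinks under a directory (default: /boot).
# Relies on GNU find: with -xtype l, a symlink matches only when its
# target cannot be resolved, i.e. the link is broken.
list_broken_symlinks() {
    dir="${1:-/boot}"
    find "$dir" -xtype l
}

# Example (hypothetical mount point for the MMC root filesystem):
#   list_broken_symlinks /mnt/mmc/boot
```

Each path printed is a link that needs to be re-pointed, for instance with `ln -sfn <target> <link>`. Empty output means no dangling links were found in that directory.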
  3. Like
    griefman got a reaction from gprovost in SATA issue, drive resets: ataX.00: failed command: READ FPDMA QUEUED   
    I have been following the discussion for a while and would like to report where I am at. I would also like to ask Kobol to record this information systematically, if that isn't already happening.
     
    I have the following setup:
      pool: mypool
     state: ONLINE
      scan: resilvered 12K in 00:00:00 with 0 errors on Tue Feb 9 17:22:03 2021
    config:

        NAME                                          STATE     READ WRITE CKSUM
        mypool                                        ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
    Some weeks ago I noticed that the health of the zpool was DEGRADED. I checked, and one device had READ errors and was marked as FAULTED. The drive's SMART stats had also recorded UDMA CRC errors. So I cleared the errors and ran a scrub to see what was going on.
     
    sudo zpool scrub mypool  
     
    I monitored the scrub with
    sudo watch zpool status  
    I saw quite quickly that all drives had started to get READ errors. SMART also reported that all drives now had UDMA CRC errors.
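The SMART counter mentioned above can be read per drive with smartctl. This is a rough sketch rather than the command used in the post: it assumes smartmontools is installed and that the drive reports the attribute under the name `UDMA_CRC_Error_Count` (attribute 199, which can be labelled differently for some drives). The function names and the device paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value (10th column)
# of the UDMA_CRC_Error_Count attribute.
crc_error_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Print the CRC error count for each device given as an argument.
# Needs root and smartmontools.
report_crc_errors() {
    for dev in "$@"; do
        printf '%s: %s\n' "$dev" "$(smartctl -A "$dev" | crc_error_count)"
    done
}

# Example (placeholder device names):
#   report_crc_errors /dev/sda /dev/sdb
```

The raw value is cumulative and does not reset, so what matters is whether it grows between two runs, for example before and after a scrub.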
     
    It was clear that something bad was going on, so I contacted Kobol and we started to debug the issue together.
     
    First I limited the SATA link speed to 3 Gbps by adding the following line to /boot/armbianEnv.txt:
    extraargs=libata.force=3.0
     
    The results were similar, but I noticed that the errors started to show up a bit later in the scrubbing process. The UDMA CRC error count had increased.
     
    Then I replaced the power and SATA cables with new ones, which unfortunately did not bring any improvement.
     
    Then I disabled NCQ by adding the following to /boot/armbianEnv.txt
    extraargs=libata.force=noncq  
    and reverted to SATA 6 Gbps by removing the 3 Gbps line added earlier.
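Switching between the two settings means editing the same `extraargs` line each time. Here is a minimal sketch of doing that with a small helper, assuming GNU sed and a value that contains neither `|` nor `&`. The function name `set_extraargs` is made up for illustration and is not part of Armbian.

```shell
#!/bin/sh
# Set or replace the extraargs= line in an armbianEnv.txt-style file so that
# exactly one such line remains. The value must not contain '|' or '&'.
set_extraargs() {
    file="$1"
    value="$2"
    if grep -q '^extraargs=' "$file"; then
        # Rewrite the existing line in place (GNU sed).
        sed -i "s|^extraargs=.*|extraargs=$value|" "$file"
    else
        # No extraargs line yet: append one.
        printf 'extraargs=%s\n' "$value" >> "$file"
    fi
}

# Example (as root; back up the file first):
#   set_extraargs /boot/armbianEnv.txt libata.force=noncq
# After a reboot, the active kernel command line can be checked with:
#   cat /proc/cmdline
```

Keeping a single `extraargs` line avoids any doubt about which one the boot script picks up. The kernel documents `libata.force` as a comma-separated list, so both restrictions could in principle be applied together as `libata.force=noncq,3.0Gbps`.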
     
    This had a positive result: I was able to run the scrub without any errors.
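One way to confirm a clean scrub without reading the table by eye is to filter the `zpool status` output for non-zero counters. This is a sketch that assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout shown earlier. The function name `zpool_error_devices` is hypothetical.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print the name of every pool,
# vdev or disk whose READ, WRITE or CKSUM counter is not "0".
zpool_error_devices() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header
        $1 == "errors:"               { in_config = 0 }        # end of table
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") { print $1 }
    '
}

# Example:
#   zpool status mypool | zpool_error_devices
```

Empty output means every counter in the config table is zero.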
     
    Then I reinstalled the old (original) cable harness and retested: all good.
     
    While disabling NCQ has a positive impact on the errors, it also has a negative impact on speed and, to some extent, on the drives' health.
     
    I also tried reducing the NCQ queue depth to 31, which is the recommended value, but this did not have any impact.
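The post does not say how the queue depth was changed. On Linux it is normally exposed per device in sysfs, so a sketch for inspecting it could look like the following. The function name `show_queue_depths` is made up, and the sysfs root is a parameter only so the function can be pointed at a test directory.

```shell
#!/bin/sh
# Print the current queue depth of every sd* block device.
# The first argument is the sysfs root (default: /sys).
show_queue_depths() {
    sys="${1:-/sys}"
    for f in "$sys"/block/sd*/device/queue_depth; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        dev=${f#"$sys"/block/}           # strip the leading path
        dev=${dev%%/*}                   # keep only the device name
        printf '%s: %s\n' "$dev" "$(cat "$f")"
    done
}

# Example:
#   show_queue_depths
# Changing the depth for one drive (as root, placeholder device name,
# not persistent across reboots):
#   echo 31 > /sys/block/sda/device/queue_depth
```

Writing a depth of 1 this way effectively turns NCQ off for that single drive, which allows per-drive testing without a reboot.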
     
    I hope that Kobol can use this information to try to reproduce the issue themselves, to see whether only certain boards are affected or every board.
     
     
     
     
     