alexcp Posted November 7, 2018 Posted November 7, 2018 On 12/27/2017 at 2:21 AM, gprovost said: Known Issues : During SATA heavy load, accessing SPI NOR Flash will generate ATA errors. Temporary fix : Disable SPI NOR flash. Hello, Is there an easy way of disabling SPI to get rid of the ATA errors? I am running OMV4 on a pre-compiled Armbian Stretch. When I try backing up the RAID array, either to a locally connected USB drive using rsync or over the network using SMB, I end up with ATA errors, segmentation faults, or system crashes after copying a few files.
gprovost Posted November 7, 2018 Author Posted November 7, 2018 (edited) @alexcp By default the SPI NOR Flash is already disabled. But just to be sure, can you execute the command lsblk and confirm you don't see the following block device: mtdblock0. If the SPI NOR Flash is confirmed to be disabled, what you are describing sounds more like a power budget issue. Can you tell me which model of HDD you are using? Also, have you tried doing your rsync over SMB without any device connected to the USB ports (in order to narrow down the issue)? Finally, can you execute armbianmonitor -u and post the output link here. Thanks. Edited November 7, 2018 by Igor armbian-monitor -> armbianmonitor
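For reference, a minimal way to run that check from a shell (no output from the grep means the SPI NOR flash block device is not present):

# List all block devices and look for the SPI NOR flash
lsblk
# Or check directly; silence means mtdblock0 does not exist
lsblk | grep mtdblock0
ls /dev/mtdblock0 2>/dev/null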
alexcp Posted November 7, 2018 Posted November 7, 2018 Thank you for the quick reply. I confirm there is no mtdblock0 device listed by lsblk. My Helios is fitted with 4x WD100EFAX HDDs, each rated for 5V/400mA, 12V/550mA. The Helios itself is powered by a 12V 8A brick. I tried copying files over SMB with no devices connected to the USB ports, with the same result: one or a few files can be copied without issues, however an attempt to copy a folder crashes the system. armbianmonitor -u output is here: http://ix.io/1reV
gprovost Posted November 8, 2018 Author Posted November 8, 2018 @alexcp That's useful information. Yes, we can rule out a power budget issue. OK, from the armbianmonitor log I can already see 2 serious issues: 1/ HDD /dev/sdc on port SATA 3 (U12) shows a lot of READ DMA errors, and even SMART commands are failing. So either it's a faulty HDD, or something is wrong with the SATA cable. I would advise first trying another SATA cable to see if it could be a cable issue. If the errors persist, then I'm afraid you have a faulty HDD. (Note: do a proper shutdown before changing the cable.) How long have you been running your rig? I guess your HDDs are still under warranty? If you run dmesg you will see a lot of the following errors, which show something is wrong with the HDD. [ 8.113934] sd 2:0:0:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 [ 8.113939] sd 2:0:0:0: [sdc] tag#0 Sense Key : 0x3 [current] [ 8.113943] sd 2:0:0:0: [sdc] tag#0 ASC=0x31 ASCQ=0x0 [ 8.113947] sd 2:0:0:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 04 8c 3f ff 80 00 00 00 08 00 00 [ 8.113951] print_req_error: I/O error, dev sdc, sector 19532873600 [ 9.005672] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [ 9.005677] ata3.00: irq_stat 0x40000001 [ 9.005685] ata3.00: failed command: READ DMA [ 9.005700] ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 12 dma 4096 in res 53/40:08:00:00:00/00:00:00:00:00/40 Emask 0x8 (media error) [ 9.005704] ata3.00: status: { DRDY SENSE ERR } [ 9.005709] ata3.00: error: { UNC } [ 9.008370] ata3.00: configured for UDMA/133 [ 9.008383] ata3: EH complete [ 60.347211] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [ 60.347215] ata3.00: irq_stat 0x40000001 [ 60.347220] ata3.00: failed command: SMART [ 60.347228] ata3.00: cmd b0/d8:00:01:4f:c2/00:00:00:00:00/00 tag 23 res 53/40:00:00:00:00/00:00:00:00:00/00 Emask 0x8 (media error) [ 60.347231] ata3.00: status: { DRDY SENSE ERR } [ 60.347233] ata3.00: error: { UNC } [ 60.349912] ata3.00: configured for UDMA/133 [ 60.349940] ata3: EH complete 2/ Your RAID is obviously degraded, but not only because of the /dev/sdc issue described above: /dev/sda has been removed from the array because mdadm considers it unclean. This could be the result of an ungraceful shutdown, which itself seems to be triggered by issue number 1. Anyway, the issue with /dev/sda can be fixed. But first, can you run the following command and post the output: sudo mdadm -D /dev/md127. I need to understand how your RAID layout is affected by those issues. [ 8.054175] md: kicking non-fresh sda from array! [ 8.065216] md/raid10:md127: active with 2 out of 4 devices NAME FSTYPE SIZE MOUNTPOINT UUID sda linux_raid_member 9.1T 16d26e7c-3c2a-eef9-ec7c-df93ca0fbfa5 sdb linux_raid_member 9.1T 16d26e7c-3c2a-eef9-ec7c-df93ca0fbfa5 └─md127 LVM2_member 18.2T kC0nGt-RYKe-innN-7sKk-PQHi-g9mo-r67ATF └─omv-public ext4 18.2T c80cb9a5-cd2d-4dbe-8a93-af4eebe85635 sdc 9.1T sdd linux_raid_member 9.1T 16d26e7c-3c2a-eef9-ec7c-df93ca0fbfa5 └─md127 LVM2_member 18.2T kC0nGt-RYKe-innN-7sKk-PQHi-g9mo-r67ATF └─omv-public ext4 18.2T c80cb9a5-cd2d-4dbe-8a93-af4eebe85635 mmcblk0 29.7G └─mmcblk0p1 ext4 29.4G / 078e5925-a184-4dc3-91fb-ff3ba64b1a81 zram0 50M /var/log Conclusion: this could explain the system crash. I see you have a dm-0 device; did you encrypt a partition? Ideally, send me by PM a copy of your log files (/var/log and /var/log.hdd).
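In addition to dmesg, a quick way to double-check the drive itself is to query SMART directly. This is only a sketch, assuming the smartmontools package is installed and the suspect drive is still /dev/sdc:

# Install smartmontools if it is not already present
sudo apt-get install smartmontools
# Overall health self-assessment and full SMART attributes / error log for the suspect drive
sudo smartctl -H /dev/sdc
sudo smartctl -a /dev/sdc
# Optionally start a short self-test, then re-run smartctl -a a few minutes later to read the result
sudo smartctl -t short /dev/sdc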
alexcp Posted November 9, 2018 Posted November 9, 2018 Lacking a spare SATA cable, I swapped sda and sdc cables. dmesg still shows errors for sdc, so it must be a faulty HDD - a first for me, ever. mdadm -D /dev/md127 gives the following: Quote /dev/md127: Version : 1.2 Creation Time : Sun Feb 4 18:42:03 2018 Raid Level : raid10 Array Size : 19532611584 (18627.75 GiB 20001.39 GB) Used Dev Size : 9766305792 (9313.88 GiB 10000.70 GB) Raid Devices : 4 Total Devices : 2 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Wed Nov 7 16:04:18 2018 State : clean, degraded Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Layout : near=2 Chunk Size : 512K Name : helios4:OMV (local to host helios4) UUID : 16d26e7c:3c2aeef9:ec7cdf93:ca0fbfa5 Events : 32849 Number Major Minor RaidDevice State - 0 0 0 removed 1 8 16 1 active sync set-B /dev/sdb - 0 0 2 removed 3 8 48 3 active sync set-B /dev/sdd I created the encrypted partition when I was originally setting up the Helios - it was part of the setting up instructions - but I never used it. I PMed you the logs. Also, I disconnected sdc and tried to rsync the RAID array to a locally connected USB drive as before. After copying a bunch of files, I got the following; this is the sort of message I was getting before: Quote Segmentation fault Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.340711] Internal error: Oops: 5 [#1] SMP THUMB2 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.438500] Process rsync (pid: 3050, stack limit = 0xed1da220) Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.444431] Stack: (0xed1dbd50 to 0xed1dc000) Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.448797] bd40: ed797a28 ed1dbdc4 c0713078 0000164a Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.456994] bd60: ed797a28 c0280791 e94dc4c0 ed797a28 ed1dbdc4 c0248ddf ed1dbdc4 ed1dbdc4 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.465190] bd80: ed797a28 e8f40920 ed1dbdc4 00000000 ed797a28 c025c043 95b4ce29 c0a03f88 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.473387] bda0: e8f40920 ed1dbdc4 ed186000 c025c1a1 00000000 01400040 c0a7e3c4 00000001 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.481584] bdc0: 00000000 e94dc4c0 00000000 00010160 00000001 00001702 ed1dbf08 95b4ce29 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.489780] bde0: eccae840 e8f40920 ed797a28 ed1dbe5c 00000000 ed1dbf08 e8f40a14 eccae840 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.497976] be00: 00000000 c025f70d 00000000 c0a03f88 ed1dbe20 00000801 e8f40920 c020a469 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.506172] be20: 00000001 e8f40920 00000001 ed1dbe5c 00000000 c01ff0d5 c01ff071 c0a03f88 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.514369] be40: e8f40920 ece81e10 c01ff071 c0200643 5be4e888 2ea0b032 e8f40a14 5be4e888 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.522564] be60: 2ea0b032 95b4ce29 00000000 00000029 00000000 ed1dbef0 00000000 c01a6ab5 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.530761] be80: 00000001 c074e390 e8f40920 00000001 0000226c 00000029 ffffe000 00000029 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.538957] bea0: eccae8a8 00000001 00000000 00000029 00080001 014000c0 00000004 eccae840 Message from syslogd@localhost at Nov 9 01:53:13 ... 
kernel:[ 504.547153] bec0: 00000000 00000000 00000000 c0a03f88 ed1dbf78 00000029 00000000 c01e9529 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.555349] bee0: 00000029 00000001 01218988 00000029 00000000 00000000 00000000 ed1dbef0 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.563545] bf00: 00000000 95b4ce29 eccae840 00000000 00000029 00000000 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.571740] bf20: 00000000 00000000 00000000 95b4ce29 00000000 00000000 01218988 eccae840 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.579936] bf40: ffffe000 ed1dbf78 00000029 c01eaf13 00000000 00000000 000003e8 c0a03f88 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.588133] bf60: eccae840 00000000 00000000 eccae840 01218988 c01eb24f 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.596329] bf80: 5a7fb644 95b4ce29 01a03b98 00000029 00000029 00000003 c01065c4 ed1da000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.604526] bfa0: 00000000 c01063c1 01a03b98 00000029 00000003 01218988 00000029 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.612722] bfc0: 01a03b98 00000029 00000029 00000003 00000000 00000000 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.620918] bfe0: 00000000 beecc234 004b82f9 b6f25a76 20000030 00000003 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.743011] Code: 2b00 d1d1 de02 6aa2 (6853) 3301 [ 791.473852] EXT4-fs (dm-0): error count since last fsck: 2 [ 791.479365] EXT4-fs (dm-0): initial error at time 1541609141: mb_free_blocks:1469: block 2153233564 [ 791.488465] EXT4-fs (dm-0): last error at time 1541609141: ext4_mb_generate_buddy:757
gprovost Posted November 9, 2018 Author Posted November 9, 2018 @alexcp First, as shown in your mdadm -D /dev/md127 output, unfortunately right now you are missing half of your RAID, the set-A mirror is gone... which is very bad. Number Major Minor RaidDevice State - 0 0 0 removed 1 8 16 1 active sync set-B /dev/sdb - 0 0 2 removed 3 8 48 3 active sync set-B /dev/sdd Let's cross fingers that /dev/sda is not too far out of sync. Can you try to re-add /dev/sda to the array: mdadm --manage /dev/md127 --re-add /dev/sda Hopefully it works. If yes, can you post the mdadm -D /dev/md127 output here again. If it cannot be re-added, run the following command mdadm --examine /dev/sd[abcd] >> raid.status and post the raid.status file here.
alexcp Posted November 9, 2018 Posted November 9, 2018 Re-add worked. Note sda is now the USB drive, so what was sda before is now sdb, etc. $ sudo mdadm --manage /dev/md127 --re-add /dev/sdb mdadm: re-added /dev/sdb $ sudo mdadm -D /dev/md127 /dev/md127: Version : 1.2 Creation Time : Sun Feb 4 18:42:03 2018 Raid Level : raid10 Array Size : 19532611584 (18627.75 GiB 20001.39 GB) Used Dev Size : 9766305792 (9313.88 GiB 10000.70 GB) Raid Devices : 4 Total Devices : 3 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Fri Nov 9 04:59:43 2018 State : clean, degraded, recovering Active Devices : 2 Working Devices : 3 Failed Devices : 0 Spare Devices : 1 Layout : near=2 Chunk Size : 512K Rebuild Status : 0% complete Name : helios4:OMV (local to host helios4) UUID : 16d26e7c:3c2aeef9:ec7cdf93:ca0fbfa5 Events : 32891 Number Major Minor RaidDevice State 0 8 16 0 spare rebuilding /dev/sdb 1 8 32 1 active sync set-B /dev/sdc - 0 0 2 removed 3 8 64 3 active sync set-B /dev/sde
gprovost Posted November 9, 2018 Author Posted November 9, 2018 @alexcp can you post the cat /proc/mdstat output.
gprovost Posted November 9, 2018 Author Posted November 9, 2018 Just to correct a wrong interpretation on my side. It seems the labels set-A and set-B don't correspond to the mirror sets but rather to the stripe sets... and even of that I'm not sure, I cannot find a clear statement of what set-ABC means. Actually, Linux MD RAID10 is not really a nested RAID1 in a RAID0 array like most of us would picture it, but when using the default settings (layout = near, copies = 2) while creating an MD RAID10, it fulfills the same characteristics and guarantees as nested RAID1+0. Layout : near=2 will write the data as follows (each chunk is repeated 2 times in a 4-way stripe array), so it's similar to a nested RAID1+0. | Device #0 | Device #1 | Device #2 | Device #3 | ------------------------------------------------------ 0x00 | 0 | 0 | 1 | 1 | 0x01 | 2 | 2 | 3 | 3 | : | : | : | : | : | : | : | : | : | : | So in your case @alexcp, you were lucky that the two surviving devices (RaidDevice #1 and #3) still held one copy of every chunk. That's why the state of the array was still showing clean even though degraded, and it still gives you a chance to back up or reinstate your RAID array by replacing the faulty disk. It's why it's important to set up mdadm to send an alert email as soon as something goes wrong, to avoid ending up with issues on 2 drives at the same time (see the sketch at the end of this post). We will need to add something to our wiki to explain how to configure your OS to trigger the System Fault LED when mdadm detects some errors. Also, I will never repeat it enough: if you have critical data on your NAS, always perform regular backups (either on an external HDD or in the cloud, e.g. BackBlaze B2). To be honest, I still wonder why your system crashes during rsync, because even a degraded but clean array (the faulty and unclean drives were tagged as removed) shouldn't cause such an issue. So once you are done with the resync of RaidDevice #0, try to re-do your rsync now that the faulty drive is physically removed. Also, I don't know what to make of the rsync error messages you posted in your previous post yet.
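As a rough sketch of what that mdadm alert setup can look like on Debian/Armbian (the address is a placeholder, and you also need a working local mail transport such as postfix or msmtp for the mail to actually leave the box):

# /etc/mdadm/mdadm.conf - tell the mdadm monitor where to send alerts
MAILADDR you@example.com

# Make sure the monitor daemon is enabled (on Debian it is started by the mdadm service;
# on some distros the unit is called mdmonitor instead)
sudo dpkg-reconfigure mdadm
sudo systemctl enable --now mdadm

# Manual test: send a test alert for each array to verify mail delivery works
sudo mdadm --monitor --scan --test --oneshot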
alexcp Posted November 9, 2018 Posted November 9, 2018 The array's re-building was completed overnight, see below. I will try rsync later today to see if the data can be copied. $ sudo mdadm -D /dev/md127 [sudo] password for alexcp: /dev/md127: Version : 1.2 Creation Time : Sun Feb 4 18:42:03 2018 Raid Level : raid10 Array Size : 19532611584 (18627.75 GiB 20001.39 GB) Used Dev Size : 9766305792 (9313.88 GiB 10000.70 GB) Raid Devices : 4 Total Devices : 3 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Fri Nov 9 05:34:14 2018 State : clean, degraded Active Devices : 3 Working Devices : 3 Failed Devices : 0 Spare Devices : 0 Layout : near=2 Chunk Size : 512K Name : helios4:OMV (local to host helios4) UUID : 16d26e7c:3c2aeef9:ec7cdf93:ca0fbfa5 Events : 32902 Number Major Minor RaidDevice State 0 8 16 0 active sync set-A /dev/sdb 1 8 32 1 active sync set-B /dev/sdc - 0 0 2 removed 3 8 64 3 active sync set-B /dev/sde $ cat /proc/mdstat Personalities : [raid10] [raid0] [raid1] [raid6] [raid5] [raid4] md127 : active raid10 sdb[0] sde[3] sdc[1] 19532611584 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U] bitmap: 21/146 pages [84KB], 65536KB chunk
alexcp Posted November 9, 2018 Posted November 9, 2018 No luck with rsync. With the faulty HDD physically disconnected, an attempt to rsync the array to either a local USB drive or a hard drive on another machine invariably ends in a segmentation fault and system crash as before. Would I be able to access the filesystem on the RAID if I connect the HDDs to an Intel-based machine running Debian and OMV? I have a little Windows desktop with four SATA ports. I should be able to set up Debian and OMV on a USB stick and use the SATA ports for the array.
gprovost Posted November 10, 2018 Author Posted November 10, 2018 @alexcp Have you tried a normal copy (cp)? The issue you are facing now with rsync seems to be more software related. Yup, you can hook up your HDDs to another rig. But why not first try a fresh Debian install on the Helios4. Prepare a new SD card with the latest Armbian Stretch release, then use the following command to redetect your arrays: mdadm --assemble --scan The new system should be able to detect your array. Check the md device number and status with cat /proc/mdstat, then mount it and try your rsync again (a sketch of the full sequence is below). If this fails again, then yes, try with your HDDs connected to another rig.
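A rough sketch of that sequence on the fresh install (the mount point and backup path are placeholders, and the volume group name omv is assumed from the omv-public device shown in the earlier lsblk output):

# Re-detect existing arrays from the superblocks on the member disks
sudo mdadm --assemble --scan
cat /proc/mdstat                      # note the md device number, e.g. md127

# The array holds an LVM physical volume, so activate the volume group as well
sudo vgscan
sudo vgchange -ay omv                 # volume group name is an assumption

# Mount the logical volume read-only first, then retry the copy
sudo mkdir -p /mnt/raid
sudo mount -o ro /dev/mapper/omv-public /mnt/raid
rsync -a /mnt/raid/ /path/to/backup/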
alexcp Posted November 10, 2018 Posted November 10, 2018 cp fails, as does SMB network access to the shared folders. A fresh Debian Stretch install behaves identically to the not-so-fresh one, and the previously installed OMV3 on Debian Jessie, whose SD card I still have around, shows the same "Internal error: Oops: 5 [#1] SMP THUMB2". At this point, I tend to believe this is a hardware issue of sorts, maybe something as simple as a faulty power brick. Too bad it's not the SPI NOR flash, the solution to which is known. Oh well. Over the next few days, I will assemble another rig and will try to salvage the data through it. @gprovost: thank you for helping out with this issue!
gprovost Posted November 12, 2018 Author Posted November 12, 2018 @alexcp It sounds more like a file system issue / corruption that might result from one of the kernel modules crashing when accessing the file system on the array. Have you done an fsck on your array? (A sketch of how to run one is at the end of this post.) Based on your log, some file system corruption is detected on dm-0, which is the logical volume you created with LVM. On 11/9/2018 at 9:16 AM, alexcp said: [ 791.473852] EXT4-fs (dm-0): error count since last fsck: 2 [ 791.479365] EXT4-fs (dm-0): initial error at time 1541609141: mb_free_blocks:1469: block 2153233564 [ 791.488465] EXT4-fs (dm-0): last error at time 1541609141: ext4_mb_generate_buddy:757 Something that puzzles me is that the size of the logical volume is 18.2T, which shouldn't be possible on Helios4 since it's a 32-bit architecture, therefore each logical volume can only be max 16TB. So most likely this is the issue that makes the kernel crash... but I don't understand how you were able to create and mount a partition of more than 16TB in the first place. Can you provide the output of sudo lvdisplay
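A minimal sketch of running that check, assuming the ext4 file system sits on the omv-public logical volume seen in the earlier lsblk output (the volume must be unmounted first; use the -n pass if you only want a report without any changes):

# Unmount the filesystem before checking it
sudo umount /dev/mapper/omv-public

# Report-only pass (answers "no" to every repair question, nothing is written)
sudo e2fsck -fn /dev/mapper/omv-public

# Actual repair pass, forcing a full check even if the fs is marked clean
sudo e2fsck -f /dev/mapper/omv-public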
alexcp Posted November 12, 2018 Posted November 12, 2018 The point about the 16TB limit is an interesting one; I remember being unable to create, via OMV, an array that would take all available physical space, and had to settle for the maximum offered by OMV. I also remember that ext4 is not limited to 16TB; the limit is in the tools. Here is lvdisplay: $ sudo lvdisplay [sudo] password for alexcp: --- Logical volume --- LV Path /dev/omv/public LV Name public VG Name omv LV UUID xGyIgi-U00p-MVVv-zlz8-0quc-ZJwh-tuWvRl LV Write Access read/write LV Creation host, time helios4, 2018-02-10 03:48:42 +0000 LV Status available # open 0 LV Size 18.19 TiB Current LE 4768703 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 4096 Block device 254:0
gprovost Posted November 13, 2018 Author Posted November 13, 2018 You will have to remember what operations you did exactly to set up this omv-public partition in order for us to understand what the issue is. 1. You shouldn't have been able to create an LV bigger than 16TiB 2. You shouldn't have been able to create an ext4 partition bigger than 16TiB But somehow you managed to create an LV > 16TiB, and it looks like you forced the 64bit feature on the ext4 fs in order to make it bigger than 16TiB. Maybe you can post sudo tune2fs -l /dev/mapper/omv-public (not 100% sure of the path, but it should be that). Couldn't sum it up better: Quote 32 bit kernels are limited to 16 TiB because the page cache entry index is only 32 bits. This is a kernel limitation, not a filesystem limitation! https://serverfault.com/questions/462029/unable-to-mount-a-18tb-raid-6/536758#536758
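A small sketch of how to check for that, assuming the same /dev/mapper/omv-public path (the 64bit flag in the feature list is what allows an ext4 filesystem to grow past 16TiB):

# Dump the superblock and look at the feature list
sudo tune2fs -l /dev/mapper/omv-public | grep -i 'features'

# Equivalent check with dumpe2fs (superblock header only)
sudo dumpe2fs -h /dev/mapper/omv-public | grep -i 'features'

# If "64bit" appears in the "Filesystem features" line, the fs was created or converted with -O 64bit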
alexcp Posted November 14, 2018 Posted November 14, 2018 Well, I have not been able to recover my data. Even though the RAID array was clean, the filesystem appeared damaged as you suspected. The curious 18TB filesystem on a 32-bit rig is no more, unfortunately; I cannot run any test on it anymore. The defective HDD was less than a year old and covered by a 3-year warranty, so it is on its way to the manufacturer; hopefully I will get a free replacement. I intend to keep the 4x 10TB drives on my Intel rig and rebuild the Helios with smaller, cheaper HDDs. To me, the incident is a reminder that a RAID array is not a complete solution for data safety and must be supported by other means, e.g. cloud or tape backups. I don't remember how I got the 18TB filesystem. I think I created a smaller one and then resized it up after deleting the encrypted partition, even though such resizing should be impossible according to your link above. Out of curiosity I just did the following: I assembled a RAID5 array from the remaining 3x 10TB disks and tried to create an 18TB filesystem on it via OMV. The result was the following error message: Failed to execute command 'export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin; export LANG=C; mkfs -V -t ext4 -b 4096 -m 0 -E lazy_itable_init=0,lazy_journal_init=0 -L 'public' '/dev/mapper/public-public' 2>&1' with exit code '1': mke2fs 1.43.4 (31-Jan-2017) Creating filesystem with 4883151872 4k blocks and 305197056 inodes Filesystem UUID: c731d438-7ccd-4d31-9277-c91b0ea62c72 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 2560000000, 3855122432 Allocating group tables: 0/149022 [...] done Writing inode tables: 0/149022 [...] 148884/149022 done Creating journal (262144 blocks): mkfs.ext4: Attempt to read block from filesystem resulted in short read while trying to create journal In the end, the filesystem was not created, but the error diagnostics above are not what I expected. I remember OMV3 told me there is a limit on the size of the filesystem; OMV4 did not. Perhaps there was (is?) a hole somewhere in the filesystem tools that allowed me to do a stupid thing and create a filesystem that was unsafe to use on a 32-bit system. "Short read" (see at the very end of the message above) was also the predominant mode of failure on the previous filesystem. Even so, whatever garbage I had on the RAID array should not have resulted in segmentation faults when trying to read files from the filesystem.
gprovost Posted November 16, 2018 Author Posted November 16, 2018 Sorry to hear that in the end you had to wipe out your array. Hope your experience will serve as a reminder to others. I highlighted in our wiki the 16TB partition size limit for 32-bit arch. On 11/14/2018 at 1:08 PM, alexcp said: To me, the incident is a reminder that a RAID array is not a complete solution for data safety and must be supported by other means, e.g. cloud or tape backups. You're right, and we should have emphasized it more. I will write a page on our wiki on how to set up backups: 1. How to use rsync and cron to back up to a USB drive (see the sketch at the end of this post). 2. How to use Duplicati to back up to the cloud (e.g. Backblaze B2). On 11/14/2018 at 1:08 PM, alexcp said: In the end, the filesystem was not created, but the error diagnostics above are not what I expected. I remember OMV3 told me there is a limit on the size of the filesystem; OMV4 did not. Yeah, I would have to check whether OMV4 lets you make this kind of mistake and will then have to highlight it to their team. On 11/14/2018 at 1:08 PM, alexcp said: Perhaps there was (is?) a hole somewhere in the filesystem tools that allowed me to do a stupid thing and create a filesystem that was unsafe to use on a 32-bit system. "Short read" (see at the very end of the message above) was also the predominant mode of failure on the previous filesystem. Even so, whatever garbage I had on the RAID array should not have resulted in segmentation faults when trying to read files from the filesystem. Maybe during an rsync or copy of some of your files, those files had chunks in a range of block addresses that goes beyond what the 32-bit kernel can handle. Therefore it's like reading an illegal memory allocation, which would trigger a segfault.
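A minimal sketch of the rsync-to-USB part (the mount point, source path, and schedule below are placeholders, not the wiki's final recipe; the source path follows the usual OMV /srv/dev-disk-by-label-... convention):

# /etc/cron.d/backup-usb  -  nightly backup of the shared folder to a USB drive at 02:30
# Assumes the USB drive is already mounted at /mnt/usb-backup
30 2 * * * root rsync -a --delete /srv/dev-disk-by-label-public/ /mnt/usb-backup/ >> /var/log/backup-usb.log 2>&1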
Igor Posted November 26, 2018 Posted November 26, 2018 @gprovost Is this https://github.com/armbian/build/commit/e71d1560f0429d9ecbc077ac457c6247735e3e9a tested enough to just rebuild images and push out an update?
gprovost Posted November 26, 2018 Author Posted November 26, 2018 @Igor Thanks for checking with us before going ahead ;-) Yes, you can now trigger the rebuild and push out the update. Note: we just added a commit to our u-boot 2018 repo, so in case you had already rebuilt before the last hour, you will need to re-trigger the build. Thanks.
Igor Posted November 26, 2018 Posted November 26, 2018 1 hour ago, gprovost said: @Igor Thanks for checking with us before goign ahead ;-) Yes you can trigger now the rebuild and push out the update. Note: we just added a commit to our u-boot 2018 repo, so in case you already had rebuilt before the last hour, you will need to re-trigger the build. Thanks. apt update and upgrade, since we reverted that boot script force upgrade due to other problems, resulted in: Spoiler U-Boot SPL 2018.11-armbian (Nov 26 2018 - 09:25:57 +0100) High speed PHY - Version: 2.0 Detected Device ID 6828 board SerDes lanes topology details: | Lane # | Speed | Type | -------------------------------- | 0 | 6 | SATA0 | | 1 | 5 | USB3 HOST0 | | 2 | 6 | SATA1 | | 3 | 6 | SATA3 | | 4 | 6 | SATA2 | | 5 | 5 | USB3 HOST1 | -------------------------------- High speed PHY - Ended Successfully mv_ddr: mv_ddr-armada-17.10.4 DDR3 Training Sequence - Switching XBAR Window to FastPath Window DDR Training Sequence - Start scrubbing DDR3 Training Sequence - End scrubbing mv_ddr: completed successfully Trying to boot from MMC1 U-Boot 2018.11-armbian (Nov 26 2018 - 09:25:57 +0100) SoC: MV88F6828-A0 at 1600 MHz DRAM: 2 GiB (800 MHz, 32-bit, ECC enabled) MMC: mv_sdh: 0 Loading Environment from MMC... *** Warning - bad CRC, using default environment Model: Helios4 Board: Helios4 SCSI: MVEBU SATA INIT SATA link 0 timeout. Target spinup took 0 ms. AHCI 0001.0000 32 slots 2 ports 6 Gbps 0x3 impl SATA mode flags: 64bit ncq led only pmp fbss pio slum part sxs Net: Warning: ethernet@70000 (eth1) using random MAC address - 1e:6c:e7:a2:f1:f4 eth1: ethernet@70000 Hit any key to stop autoboot: 0 switch to partitions #0, OK mmc0 is current device Scanning mmc 0:1... Found U-Boot script /boot/boot.scr 1979 bytes read in 102 ms (18.6 KiB/s) ## Executing script at 03000000 Boot script loaded from mmc load - load binary file from a filesystem Usage: load <interface> [<dev[:part]> [<addr> [<filename> [bytes [pos]]]]] - Load binary file 'filename' from partition 'part' on device type 'interface' instance 'dev' to address 'addr' in memory. 'bytes' gives the size to load in bytes. If 'bytes' is 0 or omitted, the file is read until the end. 'pos' gives the file byte position to start reading from. If 'pos' is 0 or omitted, the file is read from the start. load - load binary file from a filesystem Usage: load <interface> [<dev[:part]> [<addr> [<filename> [bytes [pos]]]]] - Load binary file 'filename' from partition 'part' on device type 'interface' instance 'dev' to address 'addr' in memory. 'bytes' gives the size to load in bytes. If 'bytes' is 0 or omitted, the file is read until the end. 'pos' gives the file byte position to start reading from. If 'pos' is 0 or omitted, the file is read from the start. 4712073 bytes read in 884 ms (5.1 MiB/s) 5450232 bytes read in 1041 ms (5 MiB/s) ## Loading init Ramdisk from Legacy Image at 02880000 ... Image Name: uInitrd Created: 2018-11-26 9:01:22 UTC Image Type: ARM Linux RAMDisk Image (gzip compressed) Data Size: 4712009 Bytes = 4.5 MiB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK Starting kernel ... Uncompressing Linux... done, booting the kernel. Error: unrecognized/unsupported machine ID (r1 = 0x00000000). 
Available machine support: ID (hex) NAME ffffffff Generic DT based system ffffffff Marvell Armada 39x (Device Tree) ffffffff Marvell Armada 380/385 (Device Tree) ffffffff Marvell Armada 375 (Device Tree) ffffffff Marvell Armada 370/XP (Device Tree) ffffffff Marvell Dove Please check your kernel config and/or bootloader. I deleted the u-boot package in the repository as a temporary solution. https://apt.armbian.com/pool/main/l/linux-u-boot-helios4-next/ Edit: after updating the boot script, things are fine.
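For anyone hitting the same "unrecognized machine ID" stop after the u-boot update, a rough sketch of refreshing the boot script by hand (not an official procedure; it assumes the updated boot commands are in /boot/boot.cmd, as on a standard Armbian install, so compare it against the new boot-marvell.cmd first):

# Install mkimage if it is missing
sudo apt-get install u-boot-tools

# Back up the current compiled script, then regenerate boot.scr from boot.cmd
sudo cp /boot/boot.scr /boot/boot.scr.bak
sudo mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr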
devman Posted November 26, 2018 Posted November 26, 2018 Thanks for catching this so quickly, Igor. I was mid-update when it gave me a 404 on that file.
gprovost Posted November 26, 2018 Author Posted November 26, 2018 @Igor We didn't foresee that, sorry. I thought the u-boot dpkg wasn't supposed to be updated by an apt-get upgrade anymore, to avoid those specific cases where, for example, the boot script needs to be updated. Update: now looking at what would be the best approach from the user's point of view. Quite hard without being able to do something with a postinstall script.
Igor Posted November 26, 2018 Posted November 26, 2018 21 minutes ago, gprovost said: I thought the u-boot dpkg wasn't supposed to be updated by an apt-get upgrade anymore, to avoid those specific cases where, for example, the boot script needs to be updated. We discussed that, but implementing it is another story. I also removed u-boot from the index now so nobody will have problems. 22 minutes ago, gprovost said: Update: now looking at what would be the best approach from the user's point of view. Quite hard without being able to do something with a postinstall script. One option is to force the boot script update at build time with an additional parameter? Put that code back under FORCE_BOOTSCRIPT_UPDATE="yes"?
gprovost Posted November 26, 2018 Author Posted November 26, 2018 15 minutes ago, Igor said: One option is to force the boot script update at build time with an additional parameter? Put that code back under FORCE_BOOTSCRIPT_UPDATE="yes"? Yes, I was thinking of something along those lines. We will do a PR.
gprovost Posted November 29, 2018 Author Posted November 29, 2018 @Igor we created PR 1169 to address the issue related to bootscript.
Igor Posted November 29, 2018 Posted November 29, 2018 23 minutes ago, gprovost said: @Igor we created PR 1169 to address the issue related to bootscript. Is it safe to rebuild and push the u-boot package to the repository?
gprovost Posted November 29, 2018 Author Posted November 29, 2018 1 hour ago, Igor said: Is it safe to rebuild and push the u-boot package to the repository? Yes it is, we have tested all the use cases. That's why we reverted to using the boot-marvell.cmd bootscript for branch next, in order to cover tricky use cases like u-boot in SPI. BTW, we will have to update the documentation to add the new build option FORCE_BOOTSCRIPT_UPDATE.
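For the documentation, passing such an option to the Armbian build script would look roughly like this (a sketch only; the exact option name and accepted values are whatever PR 1169 defines):

# Build a Helios4 image on branch next and force the boot script refresh behaviour
./compile.sh BOARD=helios4 BRANCH=next FORCE_BOOTSCRIPT_UPDATE="yes"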
gprovost Posted November 29, 2018 Author Posted November 29, 2018 Hi All, I added the following section to our wiki: https://wiki.kobol.io/mdadm/#configure-fault-led to explain how to set up mdadm to report array errors via the Red Fault LED (LED2). This way you have a visual cue if something goes wrong with your RAID.
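The wiki page is the reference; as a rough idea of the mechanism, mdadm can call an external program on every monitor event, and that program can drive an LED exposed through sysfs. The script path and the LED name (helios4:red:fault) below are assumptions, so check /sys/class/leds/ on your image for the real name:

#!/bin/sh
# /usr/local/sbin/mdadm-fault-led.sh
# Called by mdadm as: <event> <md-device> [component-device]
EVENT="$1"
case "$EVENT" in
  Fail*|DegradedArray|DeviceDisappeared)
    # Turn on the red fault LED; LED name is an assumption
    echo 1 > /sys/class/leds/helios4:red:fault/brightness
    ;;
esac

# And in /etc/mdadm/mdadm.conf, point the monitor at the script:
PROGRAM /usr/local/sbin/mdadm-fault-led.sh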
Zykr Posted December 7, 2018 Posted December 7, 2018 I recently did an apt upgrade which included a kernel update from Armbian, and now my device no longer boots. It hangs forever at "Uncompressing Linux... done, booting the kernel." I'm guessing that this is due to PR 1169, but I'm not sure how to fix it without re-flashing the image (which seems somewhat outdated by now; a new build would be appreciated). I'm not familiar with u-boot and I can't find a manual for this specific version, so I'm quite confused at this point. Can someone please walk me through getting my system booting again?