alexcp Posted November 7, 2018 Posted November 7, 2018 On 12/27/2017 at 2:21 AM, gprovost said: Known Issues : During SATA heavy load, accessing SPI NOR Flash will generate ATA errors. Temporary fix : Disable SPI NOR flash. Hello, Is there an easy way of disabling SPI to get rid of the ATA errors? I am running OMV4 on a pre-compiled Armbian Stretch. When I try backing up the RAID array, either to a locally connected USB drive using rsync or over the network using SMB, I end up with ATA errors, segmentation faults, or system crashes after copying a few files.
gprovost Posted November 7, 2018 Author Posted November 7, 2018 (edited) @alexcp By default the SPI NOR Flash is already disabled. But just to be sure, can you execute the command lsblk and confirm you don't see the following block device: mtdblock0. If the SPI NOR Flash is confirmed to be disabled, what you are describing sounds more like a power budget issue. Can you tell me which model of HDD you are using? Also, have you tried doing your rsync over SMB without any device connected to the USB ports (in order to narrow down the issue)? Finally, can you execute armbianmonitor -u and post the output link here. Thanks. Edited November 7, 2018 by Igor armbian-monitor -> armbianmonitor
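For reference, a minimal way to run that check from a shell (no output from the grep means the SPI NOR flash block device is not present):

# List all block devices and look for the SPI NOR flash
lsblk
# Or check directly; silence means mtdblock0 does not exist
lsblk | grep mtdblock0
ls /dev/mtdblock0 2>/dev/null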
alexcp Posted November 7, 2018 Posted November 7, 2018 Thank you for the quick reply. I confirm there is no mtdblock0 device listed by lsblk. My Helios is fitted with 4x WD100EFAX HDDs, each rated for 5V/400mA, 12V/550mA. The Helios itself is powered by a 12V 8A brick. I tried copying files over SMB with no devices connected to the USB ports, with the same result: one or a few files can be copied without issues, however an attempt to copy a folder crashes the system. armbianmonitor -u output is here: http://ix.io/1reV
gprovost Posted November 8, 2018 Author Posted November 8, 2018 @alexcp That's useful information. Yes, we can rule out a power budget issue. OK, from the armbianmonitor log I can already see 2 serious issues: 1/ HDD /dev/sdc on port SATA 3 (U12) shows a lot of READ DMA errors, and even SMART commands are failing. So either it's a faulty HDD, or something is wrong with the SATA cable. I would advise first trying another SATA cable to see if it could be a cable issue. If the errors persist, then I'm afraid you have a faulty HDD. (Note: do a proper shutdown before changing the cable.) How long have you been running your rig? I guess your HDDs are still under warranty? If you run dmesg you will see a lot of the following errors, which show something is wrong with the HDD. [ 8.113934] sd 2:0:0:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 [ 8.113939] sd 2:0:0:0: [sdc] tag#0 Sense Key : 0x3 [current] [ 8.113943] sd 2:0:0:0: [sdc] tag#0 ASC=0x31 ASCQ=0x0 [ 8.113947] sd 2:0:0:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 04 8c 3f ff 80 00 00 00 08 00 00 [ 8.113951] print_req_error: I/O error, dev sdc, sector 19532873600 [ 9.005672] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [ 9.005677] ata3.00: irq_stat 0x40000001 [ 9.005685] ata3.00: failed command: READ DMA [ 9.005700] ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 12 dma 4096 in res 53/40:08:00:00:00/00:00:00:00:00/40 Emask 0x8 (media error) [ 9.005704] ata3.00: status: { DRDY SENSE ERR } [ 9.005709] ata3.00: error: { UNC } [ 9.008370] ata3.00: configured for UDMA/133 [ 9.008383] ata3: EH complete [ 60.347211] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [ 60.347215] ata3.00: irq_stat 0x40000001 [ 60.347220] ata3.00: failed command: SMART [ 60.347228] ata3.00: cmd b0/d8:00:01:4f:c2/00:00:00:00:00/00 tag 23 res 53/40:00:00:00:00/00:00:00:00:00/00 Emask 0x8 (media error) [ 60.347231] ata3.00: status: { DRDY SENSE ERR } [ 60.347233] ata3.00: error: { UNC } [ 60.349912] ata3.00: configured for UDMA/133 [ 60.349940] ata3: EH complete 2/ Your RAID is obviously degraded, but not only because of the /dev/sdc issue described above: /dev/sda has been removed from the array because mdadm considers it unclean. This could be the result of an ungraceful shutdown, which itself seems to be triggered by issue number 1. Anyway, the issue with /dev/sda can be fixed. But first, can you run the following command and post the output: sudo mdadm -D /dev/md127. I need to understand how your RAID layout is affected by those issues. [ 8.054175] md: kicking non-fresh sda from array! [ 8.065216] md/raid10:md127: active with 2 out of 4 devices NAME FSTYPE SIZE MOUNTPOINT UUID sda linux_raid_member 9.1T 16d26e7c-3c2a-eef9-ec7c-df93ca0fbfa5 sdb linux_raid_member 9.1T 16d26e7c-3c2a-eef9-ec7c-df93ca0fbfa5 └─md127 LVM2_member 18.2T kC0nGt-RYKe-innN-7sKk-PQHi-g9mo-r67ATF └─omv-public ext4 18.2T c80cb9a5-cd2d-4dbe-8a93-af4eebe85635 sdc 9.1T sdd linux_raid_member 9.1T 16d26e7c-3c2a-eef9-ec7c-df93ca0fbfa5 └─md127 LVM2_member 18.2T kC0nGt-RYKe-innN-7sKk-PQHi-g9mo-r67ATF └─omv-public ext4 18.2T c80cb9a5-cd2d-4dbe-8a93-af4eebe85635 mmcblk0 29.7G └─mmcblk0p1 ext4 29.4G / 078e5925-a184-4dc3-91fb-ff3ba64b1a81 zram0 50M /var/log Conclusion: this could explain the system crash. I see you have a dm-0 device; did you encrypt a partition? Ideally, send me by PM a copy of your log files (/var/log and /var/log.hdd).
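In addition to dmesg, a quick way to double-check the drive itself is to query SMART directly. This is only a sketch, assuming the smartmontools package is installed and the suspect drive is still /dev/sdc:

# Install smartmontools if it is not already present
sudo apt-get install smartmontools
# Overall health self-assessment and full SMART attributes / error log for the suspect drive
sudo smartctl -H /dev/sdc
sudo smartctl -a /dev/sdc
# Optionally start a short self-test, then re-run smartctl -a a few minutes later to read the result
sudo smartctl -t short /dev/sdc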
alexcp Posted November 9, 2018 Posted November 9, 2018 Lacking a spare SATA cable, I swapped sda and sdc cables. dmesg still shows errors for sdc, so it must be a faulty HDD - a first for me, ever. mdadm -D /dev/md127 gives the following: Quote /dev/md127: Version : 1.2 Creation Time : Sun Feb 4 18:42:03 2018 Raid Level : raid10 Array Size : 19532611584 (18627.75 GiB 20001.39 GB) Used Dev Size : 9766305792 (9313.88 GiB 10000.70 GB) Raid Devices : 4 Total Devices : 2 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Wed Nov 7 16:04:18 2018 State : clean, degraded Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Layout : near=2 Chunk Size : 512K Name : helios4:OMV (local to host helios4) UUID : 16d26e7c:3c2aeef9:ec7cdf93:ca0fbfa5 Events : 32849 Number Major Minor RaidDevice State - 0 0 0 removed 1 8 16 1 active sync set-B /dev/sdb - 0 0 2 removed 3 8 48 3 active sync set-B /dev/sdd I created the encrypted partition when I was originally setting up the Helios - it was part of the setting up instructions - but I never used it. I PMed you the logs. Also, I disconnected sdc and tried to rsync the RAID array to a locally connected USB drive as before. After copying a bunch of files, I got the following; this is the sort of message I was getting before: Quote Segmentation fault Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.340711] Internal error: Oops: 5 [#1] SMP THUMB2 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.438500] Process rsync (pid: 3050, stack limit = 0xed1da220) Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.444431] Stack: (0xed1dbd50 to 0xed1dc000) Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.448797] bd40: ed797a28 ed1dbdc4 c0713078 0000164a Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.456994] bd60: ed797a28 c0280791 e94dc4c0 ed797a28 ed1dbdc4 c0248ddf ed1dbdc4 ed1dbdc4 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.465190] bd80: ed797a28 e8f40920 ed1dbdc4 00000000 ed797a28 c025c043 95b4ce29 c0a03f88 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.473387] bda0: e8f40920 ed1dbdc4 ed186000 c025c1a1 00000000 01400040 c0a7e3c4 00000001 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.481584] bdc0: 00000000 e94dc4c0 00000000 00010160 00000001 00001702 ed1dbf08 95b4ce29 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.489780] bde0: eccae840 e8f40920 ed797a28 ed1dbe5c 00000000 ed1dbf08 e8f40a14 eccae840 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.497976] be00: 00000000 c025f70d 00000000 c0a03f88 ed1dbe20 00000801 e8f40920 c020a469 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.506172] be20: 00000001 e8f40920 00000001 ed1dbe5c 00000000 c01ff0d5 c01ff071 c0a03f88 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.514369] be40: e8f40920 ece81e10 c01ff071 c0200643 5be4e888 2ea0b032 e8f40a14 5be4e888 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.522564] be60: 2ea0b032 95b4ce29 00000000 00000029 00000000 ed1dbef0 00000000 c01a6ab5 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.530761] be80: 00000001 c074e390 e8f40920 00000001 0000226c 00000029 ffffe000 00000029 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.538957] bea0: eccae8a8 00000001 00000000 00000029 00080001 014000c0 00000004 eccae840 Message from syslogd@localhost at Nov 9 01:53:13 ... 
kernel:[ 504.547153] bec0: 00000000 00000000 00000000 c0a03f88 ed1dbf78 00000029 00000000 c01e9529 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.555349] bee0: 00000029 00000001 01218988 00000029 00000000 00000000 00000000 ed1dbef0 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.563545] bf00: 00000000 95b4ce29 eccae840 00000000 00000029 00000000 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.571740] bf20: 00000000 00000000 00000000 95b4ce29 00000000 00000000 01218988 eccae840 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.579936] bf40: ffffe000 ed1dbf78 00000029 c01eaf13 00000000 00000000 000003e8 c0a03f88 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.588133] bf60: eccae840 00000000 00000000 eccae840 01218988 c01eb24f 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.596329] bf80: 5a7fb644 95b4ce29 01a03b98 00000029 00000029 00000003 c01065c4 ed1da000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.604526] bfa0: 00000000 c01063c1 01a03b98 00000029 00000003 01218988 00000029 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.612722] bfc0: 01a03b98 00000029 00000029 00000003 00000000 00000000 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.620918] bfe0: 00000000 beecc234 004b82f9 b6f25a76 20000030 00000003 00000000 00000000 Message from syslogd@localhost at Nov 9 01:53:13 ... kernel:[ 504.743011] Code: 2b00 d1d1 de02 6aa2 (6853) 3301 [ 791.473852] EXT4-fs (dm-0): error count since last fsck: 2 [ 791.479365] EXT4-fs (dm-0): initial error at time 1541609141: mb_free_blocks:1469: block 2153233564 [ 791.488465] EXT4-fs (dm-0): last error at time 1541609141: ext4_mb_generate_buddy:757
gprovost Posted November 9, 2018 Author Posted November 9, 2018 @alexcp First, as shown in your mdadm -D /dev/md127 output, unfortunately right now you are missing half of your RAID, the set-A mirror is gone... which is very bad. Number Major Minor RaidDevice State - 0 0 0 removed 1 8 16 1 active sync set-B /dev/sdb - 0 0 2 removed 3 8 48 3 active sync set-B /dev/sdd Let's cross fingers that /dev/sda is not too far out of sync. Can you try to re-add /dev/sda to the array: mdadm --manage /dev/md127 --re-add /dev/sda Hopefully it works. If yes, can you post the mdadm -D /dev/md127 output here again. If it cannot be re-added, run the following command mdadm --examine /dev/sd[abcd] >> raid.status and post the raid.status file here.
alexcp Posted November 9, 2018 Posted November 9, 2018 Re-add worked. Note sda is now the USB drive, so what was sda before is now sdb, etc. $ sudo mdadm --manage /dev/md127 --re-add /dev/sdb mdadm: re-added /dev/sdb $ sudo mdadm -D /dev/md127 /dev/md127: Version : 1.2 Creation Time : Sun Feb 4 18:42:03 2018 Raid Level : raid10 Array Size : 19532611584 (18627.75 GiB 20001.39 GB) Used Dev Size : 9766305792 (9313.88 GiB 10000.70 GB) Raid Devices : 4 Total Devices : 3 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Fri Nov 9 04:59:43 2018 State : clean, degraded, recovering Active Devices : 2 Working Devices : 3 Failed Devices : 0 Spare Devices : 1 Layout : near=2 Chunk Size : 512K Rebuild Status : 0% complete Name : helios4:OMV (local to host helios4) UUID : 16d26e7c:3c2aeef9:ec7cdf93:ca0fbfa5 Events : 32891 Number Major Minor RaidDevice State 0 8 16 0 spare rebuilding /dev/sdb 1 8 32 1 active sync set-B /dev/sdc - 0 0 2 removed 3 8 64 3 active sync set-B /dev/sde
gprovost Posted November 9, 2018 Author Posted November 9, 2018 @alexcp can you post the cat /proc/mdstat output.
gprovost Posted November 9, 2018 Author Posted November 9, 2018 Just to correct a wrong interpretation on my side. It seems the labels set-A and set-B don't correspond to the mirror sets but rather to the stripe sets... and even of that I'm not sure, I cannot find a clear statement of what set-ABC means. Actually, Linux MD RAID10 is not really a nested RAID1 in a RAID0 array like most of us would picture it, but when using the default settings (layout = near, copies = 2) while creating an MD RAID10, it fulfills the same characteristics and guarantees as nested RAID1+0. Layout : near=2 will write the data as follows (each chunk is repeated 2 times in a 4-way stripe array), so it's similar to a nested RAID1+0. | Device #0 | Device #1 | Device #2 | Device #3 | ------------------------------------------------------ 0x00 | 0 | 0 | 1 | 1 | 0x01 | 2 | 2 | 3 | 3 | : | : | : | : | : | : | : | : | : | : | So in your case @alexcp, you were lucky that the two surviving devices (RaidDevice #1 and #3) still held one copy of every chunk. That's why the state of the array was still showing clean even though degraded, and it still gives you a chance to back up or reinstate your RAID array by replacing the faulty disk. It's why it's important to set up mdadm to send an alert email as soon as something goes wrong, to avoid ending up with issues on 2 drives at the same time (see the sketch at the end of this post). We will need to add something to our wiki to explain how to configure your OS to trigger the System Fault LED when mdadm detects some errors. Also, I will never repeat it enough: if you have critical data on your NAS, always perform regular backups (either on an external HDD or in the cloud, e.g. BackBlaze B2). To be honest, I still wonder why your system crashes during rsync, because even a degraded but clean array (the faulty and unclean drives were tagged as removed) shouldn't cause such an issue. So once you are done with the resync of RaidDevice #0, try to re-do your rsync now that the faulty drive is physically removed. Also, I don't know what to make of the rsync error messages you posted in your previous post yet.
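As a rough sketch of what that mdadm alert setup can look like on Debian/Armbian (the address is a placeholder, and you also need a working local mail transport such as postfix or msmtp for the mail to actually leave the box):

# /etc/mdadm/mdadm.conf - tell the mdadm monitor where to send alerts
MAILADDR you@example.com

# Make sure the monitor daemon is enabled (on Debian it is started by the mdadm service;
# on some distros the unit is called mdmonitor instead)
sudo dpkg-reconfigure mdadm
sudo systemctl enable --now mdadm

# Manual test: send a test alert for each array to verify mail delivery works
sudo mdadm --monitor --scan --test --oneshot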
alexcp Posted November 9, 2018 Posted November 9, 2018 The array's re-building was completed overnight, see below. I will try rsync later today to see if the data can be copied. $ sudo mdadm -D /dev/md127 [sudo] password for alexcp: /dev/md127: Version : 1.2 Creation Time : Sun Feb 4 18:42:03 2018 Raid Level : raid10 Array Size : 19532611584 (18627.75 GiB 20001.39 GB) Used Dev Size : 9766305792 (9313.88 GiB 10000.70 GB) Raid Devices : 4 Total Devices : 3 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Fri Nov 9 05:34:14 2018 State : clean, degraded Active Devices : 3 Working Devices : 3 Failed Devices : 0 Spare Devices : 0 Layout : near=2 Chunk Size : 512K Name : helios4:OMV (local to host helios4) UUID : 16d26e7c:3c2aeef9:ec7cdf93:ca0fbfa5 Events : 32902 Number Major Minor RaidDevice State 0 8 16 0 active sync set-A /dev/sdb 1 8 32 1 active sync set-B /dev/sdc - 0 0 2 removed 3 8 64 3 active sync set-B /dev/sde $ cat /proc/mdstat Personalities : [raid10] [raid0] [raid1] [raid6] [raid5] [raid4] md127 : active raid10 sdb[0] sde[3] sdc[1] 19532611584 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U] bitmap: 21/146 pages [84KB], 65536KB chunk
alexcp Posted November 9, 2018 Posted November 9, 2018 No luck with rsync. With the faulty HDD physically disconnected, an attempt to rsync the array to either a local USB drive or a hard drive on another machine invariably ends in a segmentation fault and system crash as before. Would I be able to access the filesystem on the RAID if I connect the HDDs to an Intel-based machine running Debian and OMV? I have a little Windows desktop with four SATA ports. I should be able to set up Debian and OMV on a USB stick and use the SATA ports for the array.
gprovost Posted November 10, 2018 Author Posted November 10, 2018 @alexcp Have you tried a normal copy (cp)? The issue you are facing now with rsync seems to be more software related. Yup, you can hook up your HDDs to another rig. But why not first try a fresh Debian install on the Helios4. Prepare a new SD card with the latest Armbian Stretch release, then use the following command to redetect your arrays: mdadm --assemble --scan The new system should be able to detect your array. Check the md device number and status with cat /proc/mdstat, then mount it and try your rsync again (a sketch of the full sequence is below). If this fails again, then yes, try with your HDDs connected to another rig.
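A rough sketch of that sequence on the fresh install (the mount point and backup path are placeholders, and the volume group name omv is assumed from the omv-public device shown in the earlier lsblk output):

# Re-detect existing arrays from the superblocks on the member disks
sudo mdadm --assemble --scan
cat /proc/mdstat                      # note the md device number, e.g. md127

# The array holds an LVM physical volume, so activate the volume group as well
sudo vgscan
sudo vgchange -ay omv                 # volume group name is an assumption

# Mount the logical volume read-only first, then retry the copy
sudo mkdir -p /mnt/raid
sudo mount -o ro /dev/mapper/omv-public /mnt/raid
rsync -a /mnt/raid/ /path/to/backup/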
alexcp Posted November 10, 2018 Posted November 10, 2018 cp fails, as does SMB network access to the shared folders. A fresh Debian Stretch install behaves identically to the not-so-fresh one, and the previously installed OMV3 on Debian Jessie, whose SD card I still have around, shows the same "Internal error: Oops: 5 [#1] SMP THUMB2". At this point, I tend to believe this is a hardware issue of sorts, maybe something as simple as a faulty power brick. Too bad it's not the SPI NOR flash, the solution to which is known. Oh well. Over the next few days, I will assemble another rig and will try to salvage the data through it. @gprovost: thank you for helping out with this issue!
gprovost Posted November 12, 2018 Author Posted November 12, 2018 @alexcp It sounds more like a file system issue / corruption that might result from one of the kernel modules crashing when accessing the file system on the array. Have you done an fsck on your array? (A sketch of how to run one is at the end of this post.) Based on your log, some file system corruption is detected on dm-0, which is the logical volume you created with LVM. On 11/9/2018 at 9:16 AM, alexcp said: [ 791.473852] EXT4-fs (dm-0): error count since last fsck: 2 [ 791.479365] EXT4-fs (dm-0): initial error at time 1541609141: mb_free_blocks:1469: block 2153233564 [ 791.488465] EXT4-fs (dm-0): last error at time 1541609141: ext4_mb_generate_buddy:757 Something that puzzles me is that the size of the logical volume is 18.2T, which shouldn't be possible on Helios4 since it's a 32-bit architecture, therefore each logical volume can only be max 16TB. So most likely this is the issue that makes the kernel crash... but I don't understand how you were able to create and mount a partition of more than 16TB in the first place. Can you provide the output of sudo lvdisplay
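A minimal sketch of running that check, assuming the ext4 file system sits on the omv-public logical volume seen in the earlier lsblk output (the volume must be unmounted first; use the -n pass if you only want a report without any changes):

# Unmount the filesystem before checking it
sudo umount /dev/mapper/omv-public

# Report-only pass (answers "no" to every repair question, nothing is written)
sudo e2fsck -fn /dev/mapper/omv-public

# Actual repair pass, forcing a full check even if the fs is marked clean
sudo e2fsck -f /dev/mapper/omv-public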
alexcp Posted November 12, 2018 Posted November 12, 2018 The point about the 16TB limit is an interesting one; I remember being unable to create, via OMV, an array that would take all available physical space, and had to settle for the maximum offered by OMV. I also remember that ext4 is not limited to 16TB; the limit is in the tools. Here is lvdisplay: $ sudo lvdisplay [sudo] password for alexcp: --- Logical volume --- LV Path /dev/omv/public LV Name public VG Name omv LV UUID xGyIgi-U00p-MVVv-zlz8-0quc-ZJwh-tuWvRl LV Write Access read/write LV Creation host, time helios4, 2018-02-10 03:48:42 +0000 LV Status available # open 0 LV Size 18.19 TiB Current LE 4768703 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 4096 Block device 254:0
gprovost Posted November 13, 2018 Author Posted November 13, 2018 You will have to remember what operations you did exactly to set up this omv-public partition in order for us to understand what the issue is. 1. You shouldn't have been able to create an LV bigger than 16TiB 2. You shouldn't have been able to create an ext4 partition bigger than 16TiB But somehow you managed to create an LV > 16TiB, and it looks like you forced the 64bit feature on the ext4 fs in order to make it bigger than 16TiB. Maybe you can post sudo tune2fs -l /dev/mapper/omv-public (not 100% sure of the path, but it should be that). Couldn't sum it up better: Quote 32 bit kernels are limited to 16 TiB because the page cache entry index is only 32 bits. This is a kernel limitation, not a filesystem limitation! https://serverfault.com/questions/462029/unable-to-mount-a-18tb-raid-6/536758#536758
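A small sketch of how to check for that, assuming the same /dev/mapper/omv-public path (the 64bit flag in the feature list is what allows an ext4 filesystem to grow past 16TiB):

# Dump the superblock and look at the feature list
sudo tune2fs -l /dev/mapper/omv-public | grep -i 'features'

# Equivalent check with dumpe2fs (superblock header only)
sudo dumpe2fs -h /dev/mapper/omv-public | grep -i 'features'

# If "64bit" appears in the "Filesystem features" line, the fs was created or converted with -O 64bit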
alexcp Posted November 14, 2018 Posted November 14, 2018 Well, I have not been able to recover my data. Even though the RAID array was clean, the filesystem appeared damaged as you suspected. The curious 18TB filesystem on a 32-bit rig is no more, unfortunately; I cannot run any test on it anymore. The defective HDD was less than a year old and covered by a 3-year warranty, so it is on its way to the manufacturer; hopefully I will get a free replacement. I intend to keep the 4x 10TB drives on my Intel rig and rebuild the Helios with smaller, cheaper HDDs. To me, the incident is a reminder that a RAID array is not a complete solution for data safety and must be supported by other means, e.g. cloud or tape backups. I don't remember how I got the 18TB filesystem. I think I created a smaller one and then resized it up after deleting the encrypted partition, even though such resizing should be impossible according to your link above. Out of curiosity I just did the following: I assembled a RAID5 array from the remaining 3x 10TB disks and tried to create an 18TB filesystem on it via OMV. The result was the following error message: Failed to execute command 'export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin; export LANG=C; mkfs -V -t ext4 -b 4096 -m 0 -E lazy_itable_init=0,lazy_journal_init=0 -L 'public' '/dev/mapper/public-public' 2>&1' with exit code '1': mke2fs 1.43.4 (31-Jan-2017) Creating filesystem with 4883151872 4k blocks and 305197056 inodes Filesystem UUID: c731d438-7ccd-4d31-9277-c91b0ea62c72 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 2560000000, 3855122432 Allocating group tables: 0/149022 [...] done Writing inode tables: 0/149022 [...] 148884/149022 done Creating journal (262144 blocks): mkfs.ext4: Attempt to read block from filesystem resulted in short read while trying to create journal In the end, the filesystem was not created, but the error diagnostics above are not what I expected. I remember OMV3 told me there is a limit on the size of the filesystem; OMV4 did not. Perhaps there was (is?) a hole somewhere in the filesystem tools that allowed me to do a stupid thing and create a filesystem that was unsafe to use on a 32-bit system. "Short read" (see at the very end of the message above) was also the predominant mode of failure on the previous filesystem. Even so, whatever garbage I had on the RAID array should not have resulted in segmentation faults when trying to read files from the filesystem.
gprovost Posted November 16, 2018 Author Posted November 16, 2018 Sorry to hear that in the end you had to wipe out your array. Hope your experience will serve as a reminder to others. I highlighted in our wiki the 16TB partition size limit for 32-bit arch. On 11/14/2018 at 1:08 PM, alexcp said: To me, the incident is a reminder that a RAID array is not a complete solution for data safety and must be supported by other means, e.g. cloud or tape backups. You're right, and we should have emphasized it more. I will write a page on our wiki on how to set up backups: 1. How to use rsync and cron to back up to a USB drive (see the sketch at the end of this post). 2. How to use Duplicati to back up to the cloud (e.g. Backblaze B2). On 11/14/2018 at 1:08 PM, alexcp said: In the end, the filesystem was not created, but the error diagnostics above are not what I expected. I remember OMV3 told me there is a limit on the size of the filesystem; OMV4 did not. Yeah, I would have to check whether OMV4 lets you make this kind of mistake and will then have to highlight it to their team. On 11/14/2018 at 1:08 PM, alexcp said: Perhaps there was (is?) a hole somewhere in the filesystem tools that allowed me to do a stupid thing and create a filesystem that was unsafe to use on a 32-bit system. "Short read" (see at the very end of the message above) was also the predominant mode of failure on the previous filesystem. Even so, whatever garbage I had on the RAID array should not have resulted in segmentation faults when trying to read files from the filesystem. Maybe during an rsync or copy of some of your files, those files had chunks in a range of block addresses that goes beyond what the 32-bit kernel can handle. Therefore it's like reading an illegal memory allocation, which would trigger a segfault.
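A minimal sketch of the rsync-to-USB part (the mount point, source path, and schedule below are placeholders, not the wiki's final recipe; the source path follows the usual OMV /srv/dev-disk-by-label-... convention):

# /etc/cron.d/backup-usb  -  nightly backup of the shared folder to a USB drive at 02:30
# Assumes the USB drive is already mounted at /mnt/usb-backup
30 2 * * * root rsync -a --delete /srv/dev-disk-by-label-public/ /mnt/usb-backup/ >> /var/log/backup-usb.log 2>&1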
Igor Posted November 26, 2018 Posted November 26, 2018 @gprovost Is this https://github.com/armbian/build/commit/e71d1560f0429d9ecbc077ac457c6247735e3e9a tested enough to just rebuild images and push out an update?
gprovost Posted November 26, 2018 Author Posted November 26, 2018 @Igor Thanks for checking with us before going ahead ;-) Yes, you can now trigger the rebuild and push out the update. Note: we just added a commit to our u-boot 2018 repo, so in case you had already rebuilt before the last hour, you will need to re-trigger the build. Thanks.
Igor Posted November 26, 2018 Posted November 26, 2018 1 hour ago, gprovost said: @Igor Thanks for checking with us before goign ahead ;-) Yes you can trigger now the rebuild and push out the update. Note: we just added a commit to our u-boot 2018 repo, so in case you already had rebuilt before the last hour, you will need to re-trigger the build. Thanks. apt update and upgrade, since we reverted that boot script force upgrade due to other problems, resulted in: Spoiler U-Boot SPL 2018.11-armbian (Nov 26 2018 - 09:25:57 +0100) High speed PHY - Version: 2.0 Detected Device ID 6828 board SerDes lanes topology details: | Lane # | Speed | Type | -------------------------------- | 0 | 6 | SATA0 | | 1 | 5 | USB3 HOST0 | | 2 | 6 | SATA1 | | 3 | 6 | SATA3 | | 4 | 6 | SATA2 | | 5 | 5 | USB3 HOST1 | -------------------------------- High speed PHY - Ended Successfully mv_ddr: mv_ddr-armada-17.10.4 DDR3 Training Sequence - Switching XBAR Window to FastPath Window DDR Training Sequence - Start scrubbing DDR3 Training Sequence - End scrubbing mv_ddr: completed successfully Trying to boot from MMC1 U-Boot 2018.11-armbian (Nov 26 2018 - 09:25:57 +0100) SoC: MV88F6828-A0 at 1600 MHz DRAM: 2 GiB (800 MHz, 32-bit, ECC enabled) MMC: mv_sdh: 0 Loading Environment from MMC... *** Warning - bad CRC, using default environment Model: Helios4 Board: Helios4 SCSI: MVEBU SATA INIT SATA link 0 timeout. Target spinup took 0 ms. AHCI 0001.0000 32 slots 2 ports 6 Gbps 0x3 impl SATA mode flags: 64bit ncq led only pmp fbss pio slum part sxs Net: Warning: ethernet@70000 (eth1) using random MAC address - 1e:6c:e7:a2:f1:f4 eth1: ethernet@70000 Hit any key to stop autoboot: 0 switch to partitions #0, OK mmc0 is current device Scanning mmc 0:1... Found U-Boot script /boot/boot.scr 1979 bytes read in 102 ms (18.6 KiB/s) ## Executing script at 03000000 Boot script loaded from mmc load - load binary file from a filesystem Usage: load <interface> [<dev[:part]> [<addr> [<filename> [bytes [pos]]]]] - Load binary file 'filename' from partition 'part' on device type 'interface' instance 'dev' to address 'addr' in memory. 'bytes' gives the size to load in bytes. If 'bytes' is 0 or omitted, the file is read until the end. 'pos' gives the file byte position to start reading from. If 'pos' is 0 or omitted, the file is read from the start. load - load binary file from a filesystem Usage: load <interface> [<dev[:part]> [<addr> [<filename> [bytes [pos]]]]] - Load binary file 'filename' from partition 'part' on device type 'interface' instance 'dev' to address 'addr' in memory. 'bytes' gives the size to load in bytes. If 'bytes' is 0 or omitted, the file is read until the end. 'pos' gives the file byte position to start reading from. If 'pos' is 0 or omitted, the file is read from the start. 4712073 bytes read in 884 ms (5.1 MiB/s) 5450232 bytes read in 1041 ms (5 MiB/s) ## Loading init Ramdisk from Legacy Image at 02880000 ... Image Name: uInitrd Created: 2018-11-26 9:01:22 UTC Image Type: ARM Linux RAMDisk Image (gzip compressed) Data Size: 4712009 Bytes = 4.5 MiB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK Starting kernel ... Uncompressing Linux... done, booting the kernel. Error: unrecognized/unsupported machine ID (r1 = 0x00000000). 
Available machine support: ID (hex) NAME ffffffff Generic DT based system ffffffff Marvell Armada 39x (Device Tree) ffffffff Marvell Armada 380/385 (Device Tree) ffffffff Marvell Armada 375 (Device Tree) ffffffff Marvell Armada 370/XP (Device Tree) ffffffff Marvell Dove Please check your kernel config and/or bootloader. I deleted the u-boot package in the repository as a temporary solution. https://apt.armbian.com/pool/main/l/linux-u-boot-helios4-next/ Edit: after updating the boot script, things are fine.
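For anyone hitting the same "unrecognized machine ID" stop after the u-boot update, a rough sketch of refreshing the boot script by hand (not an official procedure; it assumes the updated boot commands are in /boot/boot.cmd, as on a standard Armbian install, so compare it against the new boot-marvell.cmd first):

# Install mkimage if it is missing
sudo apt-get install u-boot-tools

# Back up the current compiled script, then regenerate boot.scr from boot.cmd
sudo cp /boot/boot.scr /boot/boot.scr.bak
sudo mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr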
devman Posted November 26, 2018 Posted November 26, 2018 Thanks for catching this so quickly, Igor. I was mid-update when it gave me a 404 on that file.
gprovost Posted November 26, 2018 Author Posted November 26, 2018 @Igor We didn't foresee that, sorry. I thought the u-boot dpkg wasn't supposed to be updated by an apt-get upgrade anymore, to avoid those specific cases where, for example, the boot script needs to be updated. Update: now looking at what would be the best approach from the user's point of view. Quite hard without being able to do something with a postinstall script.
Igor Posted November 26, 2018 Posted November 26, 2018 21 minutes ago, gprovost said: I thought the u-boot dpkg wasn't supposed to be updated by an apt-get upgrade anymore, to avoid those specific cases where, for example, the boot script needs to be updated. We discussed that, but implementing it is another story. I also removed u-boot from the index now so nobody will have problems. 22 minutes ago, gprovost said: Update: now looking at what would be the best approach from the user's point of view. Quite hard without being able to do something with a postinstall script. One option is to force the boot script update at build time with an additional parameter? Put that code back under FORCE_BOOTSCRIPT_UPDATE="yes"?
gprovost Posted November 26, 2018 Author Posted November 26, 2018 15 minutes ago, Igor said: One option is to force the boot script update at build time with an additional parameter? Put that code back under FORCE_BOOTSCRIPT_UPDATE="yes"? Yes, I was thinking of something along those lines. We will do a PR.
gprovost Posted November 29, 2018 Author Posted November 29, 2018 @Igor we created PR 1169 to address the issue related to bootscript.
Igor Posted November 29, 2018 Posted November 29, 2018 23 minutes ago, gprovost said: @Igor we created PR 1169 to address the issue related to bootscript. Is it safe to rebuild and push the u-boot package to the repository?
gprovost Posted November 29, 2018 Author Posted November 29, 2018 1 hour ago, Igor said: Is it safe to rebuild and push the u-boot package to the repository? Yes it is, we have tested all the use cases. That's why we reverted to using the boot-marvell.cmd bootscript for branch next, in order to cover tricky use cases like u-boot in SPI. BTW, we will have to update the documentation to add the new build option FORCE_BOOTSCRIPT_UPDATE.
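For the documentation, passing such an option to the Armbian build script would look roughly like this (a sketch only; the exact option name and accepted values are whatever PR 1169 defines):

# Build a Helios4 image on branch next and force the boot script refresh behaviour
./compile.sh BOARD=helios4 BRANCH=next FORCE_BOOTSCRIPT_UPDATE="yes"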
gprovost Posted November 29, 2018 Author Posted November 29, 2018 Hi All, I added the following section to our wiki: https://wiki.kobol.io/mdadm/#configure-fault-led to explain how to set up mdadm to report array errors via the Red Fault LED (LED2). This way you have a visual cue if something goes wrong with your RAID.
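The wiki page is the reference; as a rough idea of the mechanism, mdadm can call an external program on every monitor event, and that program can drive an LED exposed through sysfs. The script path and the LED name (helios4:red:fault) below are assumptions, so check /sys/class/leds/ on your image for the real name:

#!/bin/sh
# /usr/local/sbin/mdadm-fault-led.sh
# Called by mdadm as: <event> <md-device> [component-device]
EVENT="$1"
case "$EVENT" in
  Fail*|DegradedArray|DeviceDisappeared)
    # Turn on the red fault LED; LED name is an assumption
    echo 1 > /sys/class/leds/helios4:red:fault/brightness
    ;;
esac

# And in /etc/mdadm/mdadm.conf, point the monitor at the script:
PROGRAM /usr/local/sbin/mdadm-fault-led.sh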
Zykr Posted December 7, 2018 Posted December 7, 2018 I recently did an apt upgrade which included a kernel update from Armbian, and now my device no longer boots. It hangs forever at "Uncompressing Linux... done, booting the kernel." I'm guessing that this is due to PR 1169, but I'm not sure how to fix it without re-flashing the image (which seems somewhat outdated by now; a new build would be appreciated). I'm not familiar with u-boot and I can't find a manual for this specific version, so I'm quite confused at this point. Can someone please walk me through getting my system booting again?