Btrfs scrub causes crash.


Tonat


I have the Helios64 with two 10TB disks mounted as a btrfs RAID 1 array. The setup is stable, although I have not stressed it much, except when I run a btrfs scrub.

It is running the 20.11.1 Buster image from an SD card. The disks hold about 900GB of files.

A btrfs scrub causes a crash with the following kernel output. Is this a known bug?
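For reference, the exact invocation is not shown in the post; a scrub on a mounted btrfs filesystem is typically started and checked roughly like this (`/srv/raid1` is a placeholder mount point, not the original path):

```shell
# Start a scrub on the mounted filesystem (runs in the background by default).
sudo btrfs scrub start /srv/raid1

# Check progress while it runs, or the result after it finishes.
sudo btrfs scrub status /srv/raid1

# Cancel a running scrub if needed.
sudo btrfs scrub cancel /srv/raid1
```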

 


[351424.849687] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[351424.850193] Modules linked in: rfkill governor_performance r8152 zstd snd_soc_hdmi_codec snd_soc_rockchip_i2s hantro_vpu(C) snd_soc_core rockchip_vdec(C) leds_pwm v4l2_h264 panfrost rockchip_rga snd_pcm_dmaengine videobuf2_dma_contig snd_pcm videobuf2_dma_sg rockchipdrm v4l2_mem2mem videobuf2_vmalloc snd_timer videobuf2_memops dw_mipi_dsi videobuf2_v4l2 gpio_charger fusb302 pwm_fan gpu_sched dw_hdmi snd analogix_dp tcpm videobuf2_common drm_kms_helper soundcore typec videodev sg cec mc rc_core drm drm_panel_orientation_quirks gpio_beeper cpufreq_dt zram ledtrig_netdev lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs adc_keys
[351424.856212] CPU: 5 PID: 912 Comm: kworker/u12:15 Tainted: G         C        5.9.11-rockchip64 #20.11.1
[351424.857046] Hardware name: Helios64 (DT)
[351424.857418] Workqueue: btrfs-scrub btrfs_work_helper
[351424.857870] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--)
[351424.858375] pc : do_undefinstr+0x2ec/0x310
[351424.858748] lr : do_undefinstr+0x1e0/0x310
[351424.859120] sp : ffff80001c233780
[351424.859424] x29: ffff80001c233780 x28: ffff0000e4900000
[351424.859905] x27: ffff0000aabcca00 x26: 0000000000000000
[351424.860385] x25: 0000000000001000 x24: ffff0000be747400
[351424.860865] x23: 0000000020000005 x22: ffff8000106dcd9c
[351424.861345] x21: ffff80001c233930 x20: ffff0000e4900000
[351424.861824] x19: ffff80001c2337f0 x18: ffff0000e63d0488
[351424.862304] x17: 0000000000000000 x16: 0000000000000200
[351424.862783] x15: 0000000000000100 x14: 0000000000000130
[351424.863263] x13: 000000000000035c x12: 0000000000000384
[351424.863741] x11: 0000000000000001 x10: 0000000000000a30
[351424.864220] x9 : ffff0000a81b5874 x8 : 0000000000000000
[351424.864699] x7 : 0000000000000001 x6 : 0000000000000001
[351424.865179] x5 : 00000000d5300000 x4 : ffff800011856258
[351424.865658] x3 : 000000009ac00000 x2 : 0000000000000001
[351424.866137] x1 : ffff0000e4900000 x0 : 0000000020000005
[351424.866618] Call trace:
[351424.866849]  do_undefinstr+0x2ec/0x310
[351424.867195]  el1_sync_handler+0x88/0x110
[351424.867553]  el1_sync+0x7c/0x100
[351424.867856]  __crc32c_le+0x9c/0xe8
[351424.868171]  crypto_shash_digest+0x34/0x58
[351424.868548]  scrub_checksum_data+0xac/0x130
[351424.868930]  scrub_bio_end_io_worker+0x168/0x2d0
[351424.869348]  btrfs_work_helper+0xf4/0x5d0
[351424.869715]  process_one_work+0x1c4/0x470
[351424.870082]  worker_thread+0x4c/0x420
[351424.870419]  kthread+0x118/0x150
[351424.870718]  ret_from_fork+0x10/0x34
[351424.871050] Code: f9401bf7 17ffff7d a9025bf5 f9001bf7 (d4210000)
[351424.871600] ---[ end trace f679134976c93019 ]---
[351424.872020] note: kworker/u12:15[912] exited with preempt_count 1

Message from syslogd@localhost at Dec  4 15:23:47 ...
 kernel:[351424.849687] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
 kernel:[351424.871050] Code: f9401bf7 17ffff7d a9025bf5 f9001bf7 (d4210000)

[351424.875630] ------------[ cut here ]------------
[351424.876415] WARNING: CPU: 5 PID: 0 at kernel/rcu/tree.c:630 rcu_eqs_enter.isra.0+0x150/0x168
[351424.877622] Modules linked in: rfkill governor_performance r8152 zstd snd_soc_hdmi_codec snd_soc_rockchip_i2s hantro_vpu(C) snd_soc_core rockchip_vdec(C) leds_pwm v4l2_h264 panfrost rockchip_rga snd_pcm_dmaengine videobuf2_dma_contig snd_pcm videobuf2_dma_sg rockchipdrm v4l2_mem2mem videobuf2_vmalloc snd_timer videobuf2_memops dw_mipi_dsi videobuf2_v4l2 gpio_charger fusb302 pwm_fan gpu_sched dw_hdmi snd analogix_dp tcpm videobuf2_common drm_kms_helper soundcore typec videodev sg cec mc rc_core drm drm_panel_orientation_quirks gpio_beeper cpufreq_dt zram ledtrig_netdev lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs adc_keys
[351424.883649] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G      D  C        5.9.11-rockchip64 #20.11.1
[351424.884430] Hardware name: Helios64 (DT)
[351424.884790] pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
[351424.885295] pc : rcu_eqs_enter.isra.0+0x150/0x168
[351424.885723] lr : rcu_eqs_enter.isra.0+0x1c/0x168
[351424.886140] sp : ffff800011bdbee0
[351424.886444] x29: ffff800011bdbee0 x28: 0000000000000000
[351424.886926] x27: ffff0000f6eae580 x26: 0000000000000000
[351424.887406] x25: 0000000000000000 x24: ffff80001129e970
[351424.887886] x23: ffff0000f6eae580 x22: ffff800011533838
[351424.888367] x21: ffff80001184a2a0 x20: ffff800011849980
[351424.888846] x19: ffff8000115358c0 x18: 000000000000000e
[351424.889325] x17: 0000000000000001 x16: 0000000000000019
[351424.889804] x15: 000000000000010c x14: 0000000000000135
[351424.890282] x13: 0000000000000003 x12: 0000000000000000
[351424.890762] x11: 0000000000000006 x10: 0000000000000a30
[351424.891241] x9 : ffff800011bdbe70 x8 : ffff0000f6eaf010
[351424.891721] x7 : 0000000000000004 x6 : 000007abd3179bab
[351424.892201] x5 : 00ffffffffffffff x4 : 000000000192043e
[351424.892681] x3 : ffff80001151f018 x2 : 4000000000000000
[351424.893160] x1 : 4000000000000002 x0 : ffff0000f77c88c0
[351424.893640] Call trace:
[351424.893873]  rcu_eqs_enter.isra.0+0x150/0x168
[351424.894274]  rcu_idle_enter+0x10/0x20
[351424.894612]  default_idle_call+0x2c/0x1b8
[351424.894981]  do_idle+0x210/0x260
[351424.895280]  cpu_startup_entry+0x24/0x60
[351424.895645]  secondary_start_kernel+0x148/0x180
[351424.896056] ---[ end trace f679134976c9301a ]---
[351484.867846] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[351484.868442] rcu:    4-...0: (46 ticks this GP) idle=42e/1/0x4000000000000000 softirq=625499/625500 fqs=5588
[351484.869322]         (detected by 2, t=15002 jiffies, g=4124489, q=273)


Today is your lucky day.  :D

 

Jokes aside, I have been meaning to write this for some time.  So finally I made time just now to make (my own, separate) thread about why I am not a fan of btrfs.  This has nothing to do with you personally, and I hope you don't think I am singling you out because I am not.

 

Maybe that has something to do with your issue, or maybe it doesn't.  In either case, I really do wish you the best of luck sorting it out.  Also, for what I hope are now obvious reasons, I am afraid I don't know enough about btrfs to help in any further detail.  Hopefully someone else will.

 

Cheers and happy Friday!  :beer:


Huh, that's strange.  I've been using btrfs on five 14TB drives in RAID 1.  I've done several scrubs without issue, including one where I purposely corrupted data on one of the drives while it was offline, to test the ability to recover from this.  I think I might have done that first test on the 5.8 kernel, but I've definitely had the scrub run from cron since upgrading to 5.9 (I'm currently on 5.9.11).

 

The only difference in my scenario is that my OS is installed to the eMMC, but I can't imagine how that would change anything.

 

I have also converted the root filesystem to btrfs.  I haven't done a scrub on it (without a RAID configuration a scrub can only detect corruption, not repair it), but it's been working fine for a couple of weeks.

 

Any btrfs messages in the kernel log?  Try running 'dmesg | grep -i btrfs'.  What are your btrfs mount options?  Here is what I have:

 

$ cat /etc/fstab
...
/dev/disk/by-label/helios64_btrfs_raid1         /srv/dev-disk-by-label-helios64_btrfs_raid1     btrfs   defaults,nofail,noatime,nodiratime,compress=zstd        0 2

 

Some info about my btrfs filesystem:

 

$ lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda            8:0    0 12.8T  0 disk 
└─sda1         8:1    0 12.8T  0 part /srv/dev-disk-by-label-helios64_btrfs_raid1
sdb            8:16   0 12.8T  0 disk 
└─sdb1         8:17   0 12.8T  0 part 
sdc            8:32   0 12.8T  0 disk 
└─sdc1         8:33   0 12.8T  0 part 
sdd            8:48   0 12.8T  0 disk 
└─sdd1         8:49   0 12.8T  0 part 
sde            8:64   0 12.8T  0 disk 
└─sde1         8:65   0 12.8T  0 part 
mmcblk1      179:0    0 14.6G  0 disk 
├─mmcblk1p1  179:1    0  512M  0 part /boot
└─mmcblk1p2  179:2    0   14G  0 part /mnt/btr_pool
mmcblk1boot0 179:32   0    4M  1 disk 
mmcblk1boot1 179:64   0    4M  1 disk 

 

$ sudo btrfs fi usage /srv/dev-disk-by-label-helios64_btrfs_raid1/
Overall:
    Device size:                  63.67TiB
    Device allocated:             18.19TiB
    Device unallocated:           45.47TiB
    Device missing:                  0.00B
    Used:                         18.19TiB
    Free (estimated):             22.74TiB      (min: 22.74TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,RAID1: Size:9.08TiB, Used:9.08TiB (99.97%)
   /dev/sda1       3.63TiB
   /dev/sdb1       3.63TiB
   /dev/sdc1       3.63TiB
   /dev/sdd1       3.64TiB
   /dev/sde1       3.63TiB

Metadata,RAID1: Size:12.00GiB, Used:11.61GiB (96.75%)
   /dev/sda1       6.00GiB
   /dev/sdb1       7.00GiB
   /dev/sdc1       4.00GiB
   /dev/sdd1       3.00GiB
   /dev/sde1       4.00GiB

System,RAID1: Size:8.00MiB, Used:1.25MiB (15.62%)
   /dev/sdd1       8.00MiB
   /dev/sde1       8.00MiB

Unallocated:
   /dev/sda1       9.09TiB
   /dev/sdb1       9.09TiB
   /dev/sdc1       9.09TiB
   /dev/sdd1       9.09TiB
   /dev/sde1       9.09TiB

 

$ dmesg | grep -i btrfs
[    0.000000] Kernel command line: root=UUID=db73782d-ba22-47ed-bc7d-60deedc591ec rootwait rootfstype=btrfs earlycon console=ttyS2,1500000 console=tty1 consoleblank=0 loglevel=7 ubootpart=794023b5-01 usb-storage.quirks=0x2537:0x1066:u,0x2537:0x1068:u rootflags=subvol=@  cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1
[    3.297371] Btrfs loaded, crc32c=crc32c-generic
[   14.920125] BTRFS: device label helios64_btrfs_raid1 devid 5 transid 7810 /dev/sde1 scanned by btrfs (362)
[   14.922109] BTRFS: device label helios64_btrfs_raid1 devid 4 transid 7810 /dev/sdd1 scanned by btrfs (362)
[   14.924064] BTRFS: device label helios64_btrfs_raid1 devid 3 transid 7810 /dev/sdc1 scanned by btrfs (362)
[   14.926025] BTRFS: device label helios64_btrfs_raid1 devid 2 transid 7810 /dev/sdb1 scanned by btrfs (362)
[   14.927975] BTRFS: device label helios64_btrfs_raid1 devid 1 transid 7810 /dev/sda1 scanned by btrfs (362)
[   14.929590] BTRFS: device fsid db73782d-ba22-47ed-bc7d-60deedc591ec devid 1 transid 10019 /dev/mmcblk1p2 scanned by btrfs (362)
[   15.002086] BTRFS info (device mmcblk1p2): disk space caching is enabled
[   15.002715] BTRFS info (device mmcblk1p2): has skinny extents
[   15.013757] BTRFS info (device mmcblk1p2): enabling ssd optimizations
[   16.477703] BTRFS info (device mmcblk1p2): use zstd compression, level 3
[   16.477726] BTRFS info (device mmcblk1p2): disk space caching is enabled
[   18.389379] BTRFS info (device sda1): use zstd compression, level 3
[   18.389985] BTRFS info (device sda1): disk space caching is enabled
[   18.390539] BTRFS info (device sda1): has skinny extents

 

$ sudo btrfs scrub status /srv/dev-disk-by-label-helios64_btrfs_raid1/
UUID:             41ded890-363f-4b35-9b75-b2048662bb38
Scrub started:    Tue Dec  1 23:15:01 2020
Status:           finished
Duration:         7:01:13
Total to scrub:   18.19TiB
Rate:             748.36MiB/s
Error summary:    no errors found

 

The only other difference is that I installed 'btrfs-progs' from the 'buster-backports' repository so that I would get the version matching the 5.9 kernel rather than the 4.19-era version.  This was a relatively recent change for me though, and I'm fairly sure my initial testing was with the older btrfs-progs.  It shouldn't matter, since the scrub is done by the kernel; 'btrfs-progs' merely tells the kernel to start it.
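For anyone wanting to do the same, installing btrfs-progs from backports on Buster looks roughly like this (a sketch; the exact sources.list entry is assumed, adjust to your mirror):

```shell
# Enable the buster-backports repository (skip if already configured).
echo 'deb http://deb.debian.org/debian buster-backports main' | \
  sudo tee /etc/apt/sources.list.d/backports.list
sudo apt update

# Install btrfs-progs from backports to better match a 5.9 kernel.
sudo apt install -t buster-backports btrfs-progs
```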


In the meantime I received a quite interesting reply on my above linked thread from none other than tkaiser himself.  It's quite long and full of lots of good info, but the TL;DR is that I am re-thinking my position on btrfs now, most especially on ARM (while overall, and on x86 especially I maintain my position).  But maybe you guys know all of that already and that's why you use btrfs in the first place.  :D

 

However if I am going to shit up someone else's thread, especially with (apparently wrong) information, the very least I feel I should do is return and issue a retraction / update.

 

Cheers!


Compared to m11k's system, my Helios64 has two 10TB IronWolf disks in slots 1 & 2; the other slots are empty.

The mount options are the same, except that I didn't use nodiratime (I believe it is redundant when noatime is used).
Apart from the btrfs scrub, I have not had any problems at all.

 

As an experiment, I tried the following to see if it would crash again:

  1. Do a scrub.
  2. Simultaneously create two compressed tar files of the data on the disks (to produce significant processor and disk activity).
  3. After about 6 hours, stop the tars and do another scrub.
  4. Simultaneously un-tar the files from step 2 (which finished with the expected errors, as the archives were incomplete).
  5. Do another scrub.

All the above worked correctly.
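The experiment above can be sketched as a script (the mount point, directories, and archive paths are placeholders, not the originals):

```shell
#!/bin/sh
# Stress test: run btrfs scrubs while generating heavy CPU and disk load.
MNT=/srv/raid1          # placeholder mount point

# Step 1: scrub alone (-B waits for completion instead of backgrounding).
sudo btrfs scrub start -B "$MNT"

# Steps 2-3: scrub again while two compressed tars run in parallel.
sudo btrfs scrub start "$MNT"
tar czf /tmp/test1.tar.gz "$MNT/dir1" &
tar czf /tmp/test2.tar.gz "$MNT/dir2" &
# ...after ~6 hours, kill the tar jobs and start another scrub...

# Steps 4-5: un-tar the truncated archives (expected to fail: incomplete),
# then do a final scrub.
mkdir -p /tmp/restore
tar xzf /tmp/test1.tar.gz -C /tmp/restore || true
sudo btrfs scrub start -B "$MNT"
```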

 

This makes me think the crash originally reported was a one-off event (but it would be good to know if there is an underlying cause).

I will let you know if I have any more problems.

