RAID 1 - Basic health check commands


#1

Hi,

What are the basic commands one could use to check the health of a RAID 1 array? What happens if one disk fails? Will the system trigger an error message, or does the user need to continually monitor log files?
thank you


#2

Check /proc/mdstat for a health check.
You can always create a cron job that mails you the status of the drives periodically.
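
For example, something along these lines; the array device and mail address are only placeholders, so adjust them to your setup:

# detailed state of one mirror (pick your own md device)
mdadm --detail /dev/md0

# let mdadm itself mail you when an array degrades
mdadm --monitor --scan --daemonise --mail=root@localhost

# or an /etc/crontab line that mails the current status once a day
0 7 * * *  root  cat /proc/mdstat | mail -s "RAID status" root@localhost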


#3

The reason I asked is that I frequently get this message at boot; it stops the system from booting, and I have to press ^D to continue or enter the root password and reboot.

dmesg | grep md3
[   12.545589] md/raid1:md3: active with 2 out of 2 mirrors
[   12.558315] created bitmap (15 pages) for device md3
[   12.568714] md3: bitmap initialized from disk: read 1/1 pages, set 0 of 29479 bits
[   12.617839] md3: detected capacity change from 0 to 1978285285376
[   12.688416]  md3: unknown partition table
[   14.869197] udevd[519]: '/sbin/blkid -o udev -p /dev/md3' [961] terminated by signal 15 (Terminated)

==> here it stops booting... ^D to continue to boot or enter root password

[   31.253200] systemd-fsck[1022]: /dev/md3: clean, 1979/120750080 files, 10850286/482979806 blocks
[   32.151784] EXT4-fs (md3): mounted filesystem with ordered data mode. Opts: acl,user_xattr
[   43.288203] EXT4-fs (md3): re-mounted. Opts: acl,user_xattr,commit=0

I already knew about the

cat /proc/mdstat

command, and it does not reveal anything out of the ordinary:

cat /proc/mdstat 
Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] 
md3 : active raid1 sdb5[1] sda5[0]
      1931919224 blocks super 1.0 [2/2] [UU]
      bitmap: 3/15 pages [12KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
      96344 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      1951884 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md2 : active raid1 sda3[0] sdb3[1]
      19534968 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

My system is openSUSE:

cat /etc/issue
Welcome to openSUSE 12.1 "Asparagus" RC 1  - Kernel \r (\l).

For completeness, here are the partition tables for sda and sdb -> md:

# fdisk -l /dev/sda /dev/sdb

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00043930

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *          63      192779       96358+  fd  Linux raid autodetect
/dev/sda2          192780     4096574     1951897+  fd  Linux raid autodetect
/dev/sda3         4096575    43166654    19535040   fd  Linux raid autodetect
/dev/sda4        43167744  3907028991  1931930624    f  W95 Ext'd (LBA)
/dev/sda5        43169792  3907008511  1931919360   fd  Linux raid autodetect

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0007b252

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *          63      192779       96358+  fd  Linux raid autodetect
/dev/sdb2          192780     4096574     1951897+  fd  Linux raid autodetect
/dev/sdb3         4096575    43166654    19535040   fd  Linux raid autodetect
/dev/sdb4        43167744  3907028991  1931930624    f  W95 Ext'd (LBA)
/dev/sdb5        43169792  3907008511  1931919360   fd  Linux raid autodetect

Can anyone give me some hints on what is going wrong?

thank you


#4

You can try to recreate your config file:


# run this in /etc (openSUSE keeps the file at /etc/mdadm.conf)
cd /etc
mv mdadm.conf mdadm.conf.old
mdadm --examine --scan >> mdadm.conf

I also suspect there might be a bug in blkid, so first update your openSUSE, since I noticed it is still at RC1.
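
On openSUSE that is roughly:

zypper refresh
zypper update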


#5

For comparison:

OLD:

 # cat mdadm.conf 
DEVICE containers partitions
ARRAY /dev/md/0 UUID=3be9cb66:3913cafa:a402a78b:84d5ca9a
ARRAY /dev/md/1 UUID=4ab789d5:54d23a90:b482cf0e:f587b941
ARRAY /dev/md/2 UUID=948af06b:caf993d4:a5887ee1:c7043c39
ARRAY /dev/md/3 UUID=206c6e19:7b14bb75:a351c4e4:c8e77d87

NEW:

 # mdadm --examine --scan
ARRAY /dev/md/0 metadata=1.0 UUID=3be9cb66:3913cafa:a402a78b:84d5ca9a name=linux.site:0
ARRAY /dev/md/1 metadata=1.0 UUID=4ab789d5:54d23a90:b482cf0e:f587b941 name=linux.site:1
ARRAY /dev/md/2 metadata=1.0 UUID=948af06b:caf993d4:a5887ee1:c7043c39 name=linux.site:2
ARRAY /dev/md/3 metadata=1.0 UUID=206c6e19:7b14bb75:a351c4e4:c8e77d87 name=linux.site:3

I’ll give it a few reboots and some time to see if this makes any difference…


#6

That did not help. I had my doubts that it would, but it was worth a try.

There is a little bit more going on:

# dmesg | grep -w md
[    0.000000] Kernel command line: root=/dev/disk/by-id/md-uuid-948af06b:caf993d4:a5887ee1:c7043c39 resume=/dev/disk/by-id/md-uuid-4ab789d5:54d23a90:b482cf0e:f587b941 splash=silent quiet vga=0x31a
[    1.422491] PM: Checking hibernation image partition /dev/disk/by-id/md-uuid-4ab789d5:54d23a90:b482cf0e:f587b941
[    3.772296] md: bind<sdb2>
[    3.775183] md: bind<sda3>
[    3.777703] md: bind<sdb3>
[    3.779887] md: raid1 personality registered for level 1
[    3.780100] md/raid1:md2: active with 2 out of 2 mirrors
[    3.791642] md: bind<sda2>
[    3.793490] md/raid1:md1: active with 2 out of 2 mirrors
[    4.250403] md: raid0 personality registered for level 0
[    4.253306] md: raid10 personality registered for level 10
[    4.952822] md: raid6 personality registered for level 6
[    4.952824] md: raid5 personality registered for level 5
[    4.952826] md: raid4 personality registered for level 4
[   10.053305] md: md0 stopped.
[   10.057398] md: bind<sdb1>
[   10.057564] md: bind<sda1>
[   10.073694] md/raid1:md0: active with 2 out of 2 mirrors
[   10.320274] boot.md[522]: Starting MD RAID mdadm: /dev/md/0 has been started with 2 drives.
[   10.609477] md: bind<sda5>
[   10.936257] systemd[1]: Job fsck@dev-disk-by\x2did-md\x2duuid\x2d206c6e19:7b14bb75:a351c4e4:c8e77d87.service/start failed with result 'dependency'.
[   11.028990] md: could not open unknown-block(8,21).
[   11.029060] md: md_import_device returned -16
[   11.255020] md: bind<sdb5>
[   11.723384] md/raid1:md3: active with 2 out of 2 mirrors
[   12.592269] systemd[1]: md.service: control process exited, code=exited status=3
[   12.688038] systemd[1]: Unit md.service entered failed state.
[   12.753286] boot.md[994]: Not shutting down MD RAID - reboot/halt scripts do this...missing
[   23.557887] boot.md[1048]: Starting MD RAID ..done

My last boot failed because of:

[   11.028990] md: could not open unknown-block(8,21).

However, the RAID seems to be functioning properly after a successful boot or when I hit ^D. Any other ideas on what to try?
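
For what it is worth, unknown-block(8,21) is major 8, minor 21, which normally maps to /dev/sdb5 (sd disks use major 8, and sdb covers minors 16 to 31), and -16 is EBUSY, i.e. the device was already in use. One way to double-check the mapping:

# the "8, 21" pair printed by ls should match the number in the error
ls -l /dev/sdb5

# sysfs also reports the device name behind 8:21
cat /sys/dev/block/8:21/uevent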


#7

First, did you update your system? Second, does the system boot successfully or not? Third, please paste your entire dmesg.


#8

Yes, the system is up to date. I can boot successfully, but sometimes I need to intervene and press ^D. The RAID looks healthy after boot. Next time I get the error message I will post the full dmesg.
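
In the meantime, one way to capture it right after a failed boot (the file name is just an example):

# dump the full kernel log before it scrolls away
dmesg > /root/dmesg-$(date +%Y%m%d-%H%M).txt

# on openSUSE the syslog copy normally lands in /var/log/messages as well
grep -w md /var/log/messages | tail -n 50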


#9

LiLo, when you use the cat /proc/mdstat command, what you are looking for is the two letters ‘[UU]’. If you see anything else there, there is a problem. On my server I see ‘[UUUU]’ for RAID 10. I tested this by removing a drive and then having the array rebuild itself. It was cool to see the machine boot up successfully without a drive, with all my data still there.
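
A quick way to script that check; this is only a sketch, and it simply looks for the ‘_’ marker that mdstat prints in place of ‘U’ for a missing member:

# any '_' inside the [UU...] status brackets means a mirror member is missing or failed
if grep -q '\[.*_.*\]' /proc/mdstat; then
    echo "WARNING: degraded md array"
else
    echo "all md arrays have every member up"
fi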


#10

Thanks for this. I guess no problem there:


cat /proc/mdstat 
Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid1 sdb1[1] sda1[0]
      96344 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sdb5[1] sda5[0]
      1931919224 blocks super 1.0 [2/2] [UU]
      bitmap: 3/15 pages [12KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      1951884 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md2 : active raid1 sda3[0] sdb3[1]
      19534968 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
