Some textfiles (XMLs to be exact) written by a background script, suddenly had 0 Bytes. Not all of them, but only the files written between ~00:30am and ~06:40am. After that, everything was fine again. WTF?
First I thought that my script was buggy, but I've checked a directory listing (ls -laR) of related folders - and *all* files written between 00:30 and 06:40 had 0 Bytes. W-T-F??
Then I checked the logs (/var/log/messages), and at the same date/time that the 0-Byte files were written, "rsyslogd" also had a gap:
No message between: "00:30 - 06:46" - while before 00:30, there were entries every few minutes caused by mounted CIFS shares.May 26 00:30:55 ferry-2 kernel: /source_i386_none/fs/cifs/cifssmb.c: write2 at 321216512 8192 bytes
May 26 06:46:55 ferry-2 rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="1533" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Previous /var/log/message entries show that rsyslogd is "HUPed" every morning at 06:25, but not on this day (=May 26th): the "HUPed"-good-morning line of rsyslogd is at 06:46 - 21 minutes too late.
Then, there's a whole day nothing (May 26th 06:46 - May 27th 06:41) - and then 2 blocks of HDD-related messages, which ended in an "Sense: Unrecovered read error - auto reallocate failed":
The "sense error" on /dev/sdb doesn't look good to me - so let's do some investigation.May 27 06:41:49 ferry-2 kernel: [6456056.487616] ata2.00: configured for UDMA/133
May 27 06:41:49 ferry-2 kernel: [6456056.487627] ata2: EH complete
May 27 06:41:51 ferry-2 kernel: [6456057.717753] ata2.00: configured for UDMA/133
May 27 06:41:51 ferry-2 kernel: [6456057.717762] ata2: EH complete
May 27 06:41:53 ferry-2 kernel: [6456059.970307] ata2.00: configured for UDMA/133
May 27 06:41:53 ferry-2 kernel: [6456059.970317] ata2: EH complete
May 27 06:42:13 ferry-2 kernel: [6456079.891853] ata2.00: configured for UDMA/133
May 27 06:42:13 ferry-2 kernel: [6456079.891858] ata2: EH complete
May 27 06:42:16 ferry-2 kernel: [6456083.526368] ata2.00: configured for UDMA/133
May 27 06:42:16 ferry-2 kernel: [6456083.526378] ata2: EH complete
May 27 06:42:18 ferry-2 kernel: [6456084.752490] ata2.00: configured for UDMA/133
May 27 06:42:18 ferry-2 kernel: [6456084.752501] ata2: EH complete
May 27 06:42:19 ferry-2 kernel: [6456085.998596] ata2.00: configured for UDMA/133
May 27 06:42:19 ferry-2 kernel: [6456085.998605] ata2: EH complete
May 27 06:42:21 ferry-2 kernel: [6456088.127777] ata2.00: configured for UDMA/133
May 27 06:42:21 ferry-2 kernel: [6456088.127786] ata2: EH complete
May 27 06:42:23 ferry-2 kernel: [6456089.596971] ata2.00: configured for UDMA/133
May 27 06:42:23 ferry-2 kernel: [6456089.596984] ata2: EH complete
May 27 06:42:24 ferry-2 kernel: [6456090.966915] ata2.00: configured for UDMA/133
May 27 06:42:24 ferry-2 kernel: [6456090.966926] ata2: EH complete
May 27 06:42:25 ferry-2 kernel: [6456092.205031] ata2.00: configured for UDMA/133
May 27 06:42:25 ferry-2 kernel: [6456092.205042] ata2: EH complete
May 27 06:42:26 ferry-2 kernel: [6456093.491101] ata2.00: configured for UDMA/133
May 27 06:42:26 ferry-2 kernel: [6456093.491113] ata2: EH complete
May 27 06:42:28 ferry-2 kernel: [6456094.721223] ata2.00: configured for UDMA/133
May 27 06:42:28 ferry-2 kernel: [6456094.721232] sd 1:0:0:0: [sdb] Unhandled sense code
May 27 06:42:28 ferry-2 kernel: [6456094.721233] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 27 06:42:28 ferry-2 kernel: [6456094.721235] sd 1:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
May 27 06:42:28 ferry-2 kernel: [6456094.721238] Descriptor sense data with sense descriptors (in hex):
May 27 06:42:28 ferry-2 kernel: [6456094.721239] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
May 27 06:42:28 ferry-2 kernel: [6456094.721243] 44 8b 4d 8d
May 27 06:42:28 ferry-2 kernel: [6456094.721244] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Since I'm running a kernel soft-raid for the system disk, let's start with checking the RAID's status.
There's a Software RAID HowTo on kernel.org's wiki, which is a good place to start.
Checking "/proc/mdstat" looks good:
...as well as the output of "mdadm --detail /dev/md0":Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
974607224 blocks super 1.2 [2/2] [UU]
unused devices: <none>
So, what the hell hit my system between 00:30 and 06:40? Alien abduction?/dev/md0:
Version : 1.2
Creation Time : Fri Jul 15 11:35:57 2011
Raid Level : raid1
Array Size : 974607224 (929.46 GiB 998.00 GB)
Used Dev Size : 974607224 (929.46 GiB 998.00 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Tue May 29 15:52:24 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : ferry-1:0
UUID : 842c663e:77698fd0:b72ca510:4b20c0cb
Events : 396
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
I've checked /var/log/messages again and there's one line indicating that the RAID's involved:
Then I've searched all logrotation files of "/var/log/messages*" for this message - and it only appeared 2 days before in a few-minutes-burst, then again this time - and that's it.May 27 06:42:40 ferry-2 kernel: [6456107.257422] raid1:md0: read error corrected (8 sectors at 1149978024 on sdb1)
Lucky me, but it looks like /dev/sdb is going down...
Let's do some digging on the "unrecovered read error" message:
I've found a few sites with log-entries similar to mine:
- http://www.howtoforge.com/forums/archiv ... 54399.html
- http://serverfault.com/questions/308048 ... r-re-added
- http://ubuntuforums.org/archive/index.p ... 21205.html
- ...and more...
- Show identity information for device:
Code: Select all
smartctl -i /dev/sdb
- Show device SMART vendor-specific Attributes and values:
Code: Select all
smartctl -A /dev/sdb
- Run short SMART test:
Code: Select all
smartctl -t short /dev/sdb
- Show device log of type "selftest":
Code: Select all
smartctl -l selftest /dev/sdb
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 7428 1150597603