RAID array fails as soon as I write to it: how do I tell whether the disks or the controller card is at fault?

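System: Debian 7.0.2 AMD64
Controller: LSI 9261-8i
Disks: 12 x Toshiba DT01ACA300 (3 TB)
Array: RAID 6
The first time I built the array, I started writing data right after creating it, while background initialization was still running. The writes failed, the array went straight offline, and on reboot the controller reported a RAID card memory error.
At the time I assumed this was caused by writing during initialization, so I ignored it and waited until initialization had finished before writing any data.
Then two weeks ago, when I went to write data again, the array went offline less than 5 minutes after the writes started. This time I contacted the seller of the RAID card directly and had it swapped for a new one (the swap alone took 10 days..... Orz).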

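With the new card, reads work fine, but writes still take the array offline within a few minutes.
I checked each disk with
smartctl -a /dev/sda -d sat+megaraid,{device id}
and found that 4 disks have errors (I forgot to copy the log out just now... the machine is currently running a disk check, so I can't view it).
I pulled those 4 disks and checked them with HDD Tune 5.5.0; the only thing flagged is C7, the interface CRC error count (value 200, threshold also 200).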
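
Since the smartctl output was lost once already, the per-disk check can be scripted so that every report is captured. This is only a rough sketch: it assumes Python 3.7+ is available, that the array is exposed as /dev/sda, and that the 12 disks are enumerated as device IDs 0 to 11. The real IDs depend on how the controller numbers its slots, and the helper names are made up for illustration.

```python
import subprocess


def smartctl_command(device_id, block_device="/dev/sda"):
    """Build the smartctl invocation for one physical disk behind the controller."""
    return ["smartctl", "-a", block_device, "-d", f"sat+megaraid,{device_id}"]


def collect_reports(device_ids, block_device="/dev/sda"):
    """Run smartctl once per device ID and return {device_id: report text}."""
    reports = {}
    for device_id in device_ids:
        # smartctl uses its exit status as a bitmask of findings, so a
        # non-zero status is not treated as a failure here (check=False).
        result = subprocess.run(
            smartctl_command(device_id, block_device),
            capture_output=True,
            text=True,
            check=False,
        )
        reports[device_id] = result.stdout
    return reports


def save_reports(reports, prefix="smart_"):
    """Write each report to <prefix><device_id>.txt so it survives a reboot."""
    for device_id, text in reports.items():
        with open(f"{prefix}{device_id}.txt", "w") as fh:
            fh.write(text)
```

Running `save_reports(collect_reports(range(12)))` as root would leave one text file per disk, which makes it easy to compare the 4 suspect disks against the other 8.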

The kernel log from when the array dropped offline is below:
[ 93.702174] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[ 97.693024] EXT4-fs (sdd1): mounted filesystem with ordered data mode. Opts: (null)
[173.945570] megaraid_sas: wait adp restart
[173.945577] megasas: moving cmd[0]:ffff880428d6f8c0:0:ffff880429aa7980 the defer queue as internal
[173.945854] megasas: moving cmd[157]:ffff880429be8f40:0:ffff8804298681c0 the defer queue as internal
[173.945855] megasas: fwState=f0000000, stage:1
[173.945876] megaraid_sas: FW detected to be in faultstate, restarting it...
[174.947303] ADP_RESET_GEN2: HostDiag=a0
[185.055067] RESET_GEN2: retry=0, hostdiag=a4
[289.052430] RESET_GEN2: retry=3e8, hostdiag=a4
[289.052434] megaraid_sas: FW restarted successfully,initiating next stage...
[289.052435] megaraid_sas: HBA recovery state machine,state 2 starting...
[319.171664] megasas: Waiting for FW to come to ready state
[355.246848] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[355.246852] megasas: HBA reset wait ...
[361.214884] jbd2/sda1-8 D ffff88043e633780 0 1393 2 0x00000000
[361.214889] ffff88042c56a2c0 0000000000000046 0000000000000000 ffff88042de4a0c0
[361.214895] 0000000000013780 ffff8804284ebfd8 ffff8804284ebfd8 ffff88042c56a2c0
[361.214899] 0000000000000246 000000018134f209 ffff88042a5038e0 ffff88042a503800
[361.214904] Call Trace:
[361.214919] [<ffffffffa013e007>] ? jbd2_journal_commit_transaction+0x1a6/0x10bf [jbd2]
[361.214926] [<ffffffff8100d02f>] ? load_TLS+0x7/0xa
[361.214930] [<ffffffff8100d69f>] ? __switch_to+0x133/0x258
[361.214936] [<ffffffff81039ac2>] ? finish_task_switch+0x88/0xb9
[361.214940] [<ffffffff81070fc1>] ? arch_local_irq_save+0x11/0x17
[361.214946] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.214951] [<ffffffff8134f247>] ? _raw_spin_unlock_irqrestore+0xe/0xf
[361.214958] [<ffffffffa0142156>] ? kjournald2+0xc0/0x20a [jbd2]
[361.214962] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.214969] [<ffffffffa0142096>] ? commit_timeout+0x5/0x5 [jbd2]
[361.214973] [<ffffffff8105f631>] ? kthread+0x76/0x7e
[361.214978] [<ffffffff81356374>] ? kernel_thread_helper+0x4/0x10
[361.214982] [<ffffffff8105f5bb>] ? kthread_worker_fn+0x139/0x139
[361.214986] [<ffffffff81356370>] ? gs_change+0x13/0x13
[361.215227] ext4lazyinit D ffff88043e633780 0 1395 2 0x00000000
[361.215231] ffff88042a764240 0000000000000046 0000000000000000 ffff88042de4a0c0
[361.215236] 0000000000013780 ffff88042bd29fd8 ffff88042bd29fd8 ffff88042a764240
[361.215240] ffffffff810135d2 0000000181066245 ffff88042aa61a60 ffff88043e633fd0
[361.215245] Call Trace:
[361.215249] [<ffffffff810135d2>] ? read_tsc+0x5/0x14
[361.215252] [<ffffffff8134e141>] ? io_schedule+0x59/0x71
[361.215257] [<ffffffff811995fe>] ? get_request_wait+0x105/0x18f
[361.215261] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.215266] [<ffffffff8119a553>] ? blk_queue_bio+0x17f/0x28c
[361.215270] [<ffffffff8119908a>] ? generic_make_request+0x90/0xcf
[361.215274] [<ffffffff8119919c>] ? submit_bio+0xd3/0xf1
[361.215279] [<ffffffff81120e66>] ? bio_alloc_bioset+0x69/0xb6
[361.215283] [<ffffffff8119deb6>] ? blkdev_issue_zeroout+0xff/0x147
[361.215293] [<ffffffffa0150bb4>] ? ext4_init_inode_table+0x19d/0x241 [ext4]
[361.215297] [<ffffffff8105263a>] ? del_timer_sync+0x34/0x3e
[361.215301] [<ffffffff81036628>] ? should_resched+0x5/0x23
[361.215313] [<ffffffffa016525c>] ? ext4_lazyinit_thread+0xf3/0x228 [ext4]
[361.215323] [<ffffffffa0165169>] ? ext4_register_li_request+0x207/0x207 [ext4]
[361.215328] [<ffffffff8105f631>] ? kthread+0x76/0x7e
[361.215332] [<ffffffff81356374>] ? kernel_thread_helper+0x4/0x10
[361.215336] [<ffffffff8105f5bb>] ? kthread_worker_fn+0x139/0x139
[361.215340] [<ffffffff81356370>] ? gs_change+0x13/0x13
[361.215585] flush-8:0 D ffff88043e613780 0 2489 2 0x00000000
[361.215588] ffff88042b8c1120 0000000000000046 0000000000000000 ffffffff8160d020
[361.215593] 0000000000013780 ffff88042bb01fd8 ffff88042bb01fd8 ffff88042b8c1120
[361.215597] ffff88042bb015d0 000000012bb015d0 ffff880428c15980 ffff88043e613fd0
[361.215601] Call Trace:
[361.215605] [<ffffffff8134e141>] ? io_schedule+0x59/0x71
[361.215609] [<ffffffff811995fe>] ? get_request_wait+0x105/0x18f
[361.215613] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.215618] [<ffffffff8119a553>] ? blk_queue_bio+0x17f/0x28c
[361.215622] [<ffffffff8119908a>] ? generic_make_request+0x90/0xcf
[361.215627] [<ffffffff811aea5b>] ? radix_tree_tag_set+0x64/0xbf
[361.215631] [<ffffffff8119919c>] ? submit_bio+0xd3/0xf1
[361.215636] [<ffffffff810bbc9a>] ? test_set_page_writeback+0xdc/0xeb
[361.215644] [<ffffffffa0156f5f>] ? ext4_io_submit+0x21/0x4a [ext4]
[361.215653] [<ffffffffa0157185>] ? ext4_bio_write_page+0x1fd/0x3bc [ext4]
[361.215658] [<ffffffff810b6994>] ? find_get_pages+0x82/0xd4
[361.215667] [<ffffffffa0153d86>] ? mpage_da_submit_io+0x2bd/0x36f [ext4]
[361.215676] [<ffffffffa0155ad0>] ? mpage_da_map_and_submit+0x2e3/0x2f9 [ext4]
[361.215684] [<ffffffffa0155dcd>] ? write_cache_pages_da+0x214/0x2c5 [ext4]
[361.215693] [<ffffffffa0156120>] ? ext4_da_writepages+0x2a2/0x45d [ext4]
[361.215700] [<ffffffff811183f3>] ? writeback_single_inode+0x11d/0x2cc
[361.215704] [<ffffffff81118873>] ? writeback_sb_inodes+0x16b/0x204
[361.215709] [<ffffffff81118979>] ? __writeback_inodes_wb+0x6d/0xab
[361.215714] [<ffffffff81118adf>] ? wb_writeback+0x128/0x21f
[361.215717] [<ffffffff81070fc1>] ? arch_local_irq_save+0x11/0x17
[361.215722] [<ffffffff81118f96>] ? wb_do_writeback+0x146/0x1a8
[361.215727] [<ffffffff8111907d>] ? bdi_writeback_thread+0x85/0x1e6
[361.215731] [<ffffffff81118ff8>] ? wb_do_writeback+0x1a8/0x1a8
[361.215736] [<ffffffff8105f631>] ? kthread+0x76/0x7e
[361.215739] [<ffffffff81356374>] ? kernel_thread_helper+0x4/0x10
[361.215744] [<ffffffff8105f5bb>] ? kthread_worker_fn+0x139/0x139
[361.215748] [<ffffffff81356370>] ? gs_change+0x13/0x13
[361.215987] cp D ffff88043e613780 0 2651 2554 0x00000000
[361.215991] ffff88042a0faa70 0000000000000082 ffff880400000000 ffffffff8160d020
[361.215996] 0000000000013780 ffff88042c5f3fd8 ffff88042c5f3fd8 ffff88042a0faa70
[361.216000] 0000000000000246 000000018134f209 ffff88042a503868 ffff88042a503800
[361.216004] Call Trace:
[361.216010] [<ffffffffa013c393>] ? start_this_handle+0x2ca/0x448 [jbd2]
[361.216014] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.216020] [<ffffffffa013c79c>] ? jbd2__journal_start+0x8a/0xce [jbd2]
[361.216028] [<ffffffffa01538ce>] ? ext4_da_write_begin+0xd4/0x1be [ext4]
[361.216040] [<ffffffffa016b6d8>] ? ext4_journal_start_sb+0x139/0x14f [ext4]
[361.216048] [<ffffffffa01538ce>] ? ext4_da_write_begin+0xd4/0x1be [ext4]
[361.216056] [<ffffffffa01557ac>] ? ext4_da_write_end+0x1f1/0x232 [ext4]
[361.216061] [<ffffffff810b4f8b>] ? generic_file_buffered_write+0x10f/0x259
[361.216067] [<ffffffff810b5dc8>] ? __generic_file_aio_write+0x248/0x278
[361.216071] [<ffffffff8104b2ae>] ? current_fs_time+0x31/0x37
[361.216075] [<ffffffff81036628>] ? should_resched+0x5/0x23
[361.216079] [<ffffffff810b5e55>] ? generic_file_aio_write+0x5d/0xb5
[361.216087] [<ffffffffa014ebd4>] ? ext4_file_write+0x1e1/0x235 [ext4]
[361.216092] [<ffffffff810fa054>] ? do_sync_write+0xb4/0xec
[361.216096] [<ffffffff81164649>] ? security_file_permission+0x16/0x2d
[361.216100] [<ffffffff810fa745>] ? vfs_write+0xa2/0xe9
[361.216104] [<ffffffff810fa922>] ? sys_write+0x45/0x6b
[361.216108] [<ffffffff81354212>] ? system_call_fastpath+0x16/0x1b
[535.962198] megasas: reset: Stopping HBA.
[535.962362] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[535.962510] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[535.962656] sd 0:2:0:0: Device offlined - not ready after error recovery
[535.963023] sd 0:2:0:0: Device offlined - not ready after error recovery
[535.963030] sd 0:2:0:0: [sda] Unhandled error code
[535.963032] sd 0:2:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[535.963036] sd 0:2:0:0: [sda] CDB: Write(10): 2a 00 00 be a8 00 00 01 00 00
[535.967839] EXT4-fs warning (device sda1): ext4_end_bio:250: I/O error writing to inode 424345603 (offset 5876219904 size 135168 starting block 1561888)
[535.967996] sd 0:2:0:0: [sda] Unhandled error code
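
For reference, the `CDB: Write(10)` line near the end of the log can be decoded by hand. Opcode 0x2a is the SCSI WRITE(10) command, which matches the `RESET cmd=2a` lines, so the commands that timed out were writes. The sketch below is only illustrative; the function name is made up, and it handles nothing but the 10-byte WRITE(10) layout (LBA in bytes 2 to 5, transfer length in bytes 7 to 8, both big-endian).

```python
def parse_write10_cdb(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # starting logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),  # number of blocks to write
    }


info = parse_write10_cdb("2a 00 00 be a8 00 00 01 00 00")
print(info)  # {'lba': 12494848, 'blocks': 256}
```

If the virtual drive reports 512-byte logical blocks (an assumption, not something the log states), 256 blocks corresponds to a 128 KiB write.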

fuxkcsdn

December 23, 2013

@Marble
By the way, for the disks that show the C7 warning under Windows, smartctl on Linux reports the error information below.
What does this information mean??

SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 72 hours (3 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 95 6b 4e 04 04 Error: ICRC, ABRT at LBA = 0x04044e6b = 67391083

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 00 00 4e 04 40 00 00:14:39.085 WRITE FPDMA QUEUED
61 00 00 00 4d 04 40 00 00:14:39.084 WRITE FPDMA QUEUED
61 00 00 00 47 04 40 00 00:14:39.082 WRITE FPDMA QUEUED
61 00 28 00 4c 04 40 00 00:14:39.078 WRITE FPDMA QUEUED
61 00 20 00 4b 04 40 00 00:14:39.078 WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 10 hours (0 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 d3 2d d8 5c 09 Error: ICRC, ABRT at LBA = 0x095cd82d = 157079597

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 00 00 d8 5c 40 00 01:05:23.837 WRITE FPDMA QUEUED
61 00 00 00 d7 5c 40 00 01:05:23.836 WRITE FPDMA QUEUED
61 00 00 00 d6 5c 40 00 01:05:23.835 WRITE FPDMA QUEUED
61 00 00 00 d5 5c 40 00 01:05:23.834 WRITE FPDMA QUEUED
61 00 00 00 d4 5c 40 00 01:05:23.833 WRITE FPDMA QUEUED
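
For reference, the register rows in the two error entries above can be decoded by hand. In the ATA error register, bit 7 is ICRC (interface CRC error) and bit 2 is ABRT (command aborted), so ER = 84 means a queued write was aborted because data arrived over the SATA link with a bad CRC. That is the same kind of event the C7 attribute counts, and it is commonly associated with cables, backplane, or connectors rather than the platters, although it does not prove which component is at fault. The sketch below uses the classic bit names and the 28-bit LBA layout; the function names are made up for illustration.

```python
# Bit names for the ATA error and status registers (bit number -> mnemonic).
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF", 3: "MCR", 2: "ABRT", 1: "NM"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC", 3: "DRQ", 0: "ERR"}


def flags(value, names):
    """Return the mnemonics of all set bits, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & (1 << bit)]


def decode_error_row(row):
    """Decode an 'ER ST SC SN CL CH DH' row from the smartctl error log."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in row.split()[:7])
    # 28-bit LBA: low nibble of DH, then CH, CL, SN from high to low.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {"error": flags(er, ERROR_BITS),
            "status": flags(st, STATUS_BITS),
            "lba": lba}


for row in ("84 51 95 6b 4e 04 04", "84 51 d3 2d d8 5c 09"):
    print(decode_error_row(row))
```

The decoded LBAs come out as 67391083 and 157079597, matching what smartctl printed for Error 2 and Error 1.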