RAID array fails as soon as I write to it: how do I tell whether the drives or the controller card is at fault?

2013-12-22 17:09:49 +08:00
 fuxkcsdn
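System: Debian 7.0.2 AMD64
Controller: LSI 9261-8i
Drives: 12 x Toshiba DT01ACA300 3 TB
Configured as RAID 6
The first time I built the array, I started writing data while the background initialization was still running. A write failed, the array went straight offline, and on reboot the RAID card reported a memory error.
I assumed that was caused by writing during initialization, so I ignored it and waited until initialization had finished before writing again.
Then two weeks ago I went to write data again, and less than 5 minutes in the array went offline once more. This time I contacted the seller of the RAID card and had it replaced with a new one (the swap took 10 days..... Orz)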

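With the new card, reads are fine, but writes still take the array offline within a few minutes.
I ran
smartctl -a /dev/sda -d sat+megaraid,{device id}
and found that 4 drives report errors (I forgot to copy the logs out... the machine is running a disk check right now, so I can't get at them).
I pulled those 4 drives and checked them with HD Tune 5.5.0; the only thing flagged was C7, interface CRC errors (value 200, threshold also 200).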

The kernel log from when the array dropped offline:
[ 93.702174] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[ 97.693024] EXT4-fs (sdd1): mounted filesystem with ordered data mode. Opts: (null)
[173.945570] megaraid_sas: wait adp restart
[173.945577] megasas: moving cmd[0]:ffff880428d6f8c0:0:ffff880429aa7980 the defer queue as internal
[173.945854] megasas: moving cmd[157]:ffff880429be8f40:0:ffff8804298681c0 the defer queue as internal
[173.945855] megasas: fwState=f0000000, stage:1
[173.945876] megaraid_sas: FW detected to be in faultstate, restarting it...
[174.947303] ADP_RESET_GEN2: HostDiag=a0
[185.055067] RESET_GEN2: retry=0, hostdiag=a4
[289.052430] RESET_GEN2: retry=3e8, hostdiag=a4
[289.052434] megaraid_sas: FW restarted successfully,initiating next stage...
[289.052435] megaraid_sas: HBA recovery state machine,state 2 starting...
[319.171664] megasas: Waiting for FW to come to ready state
[355.246848] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[355.246852] megasas: HBA reset wait ...
[361.214884] jbd2/sda1-8 D ffff88043e633780 0 1393 2 0x00000000
[361.214889] ffff88042c56a2c0 0000000000000046 0000000000000000 ffff88042de4a0c0
[361.214895] 0000000000013780 ffff8804284ebfd8 ffff8804284ebfd8 ffff88042c56a2c0
[361.214899] 0000000000000246 000000018134f209 ffff88042a5038e0 ffff88042a503800
[361.214904] Call Trace:
[361.214919] [<ffffffffa013e007>] ? jbd2_journal_commit_transaction+0x1a6/0x10bf [jbd2]
[361.214926] [<ffffffff8100d02f>] ? load_TLS+0x7/0xa
[361.214930] [<ffffffff8100d69f>] ? __switch_to+0x133/0x258
[361.214936] [<ffffffff81039ac2>] ? finish_task_switch+0x88/0xb9
[361.214940] [<ffffffff81070fc1>] ? arch_local_irq_save+0x11/0x17
[361.214946] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.214951] [<ffffffff8134f247>] ? _raw_spin_unlock_irqrestore+0xe/0xf
[361.214958] [<ffffffffa0142156>] ? kjournald2+0xc0/0x20a [jbd2]
[361.214962] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.214969] [<ffffffffa0142096>] ? commit_timeout+0x5/0x5 [jbd2]
[361.214973] [<ffffffff8105f631>] ? kthread+0x76/0x7e
[361.214978] [<ffffffff81356374>] ? kernel_thread_helper+0x4/0x10
[361.214982] [<ffffffff8105f5bb>] ? kthread_worker_fn+0x139/0x139
[361.214986] [<ffffffff81356370>] ? gs_change+0x13/0x13
[361.215227] ext4lazyinit D ffff88043e633780 0 1395 2 0x00000000
[361.215231] ffff88042a764240 0000000000000046 0000000000000000 ffff88042de4a0c0
[361.215236] 0000000000013780 ffff88042bd29fd8 ffff88042bd29fd8 ffff88042a764240
[361.215240] ffffffff810135d2 0000000181066245 ffff88042aa61a60 ffff88043e633fd0
[361.215245] Call Trace:
[361.215249] [<ffffffff810135d2>] ? read_tsc+0x5/0x14
[361.215252] [<ffffffff8134e141>] ? io_schedule+0x59/0x71
[361.215257] [<ffffffff811995fe>] ? get_request_wait+0x105/0x18f
[361.215261] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.215266] [<ffffffff8119a553>] ? blk_queue_bio+0x17f/0x28c
[361.215270] [<ffffffff8119908a>] ? generic_make_request+0x90/0xcf
[361.215274] [<ffffffff8119919c>] ? submit_bio+0xd3/0xf1
[361.215279] [<ffffffff81120e66>] ? bio_alloc_bioset+0x69/0xb6
[361.215283] [<ffffffff8119deb6>] ? blkdev_issue_zeroout+0xff/0x147
[361.215293] [<ffffffffa0150bb4>] ? ext4_init_inode_table+0x19d/0x241 [ext4]
[361.215297] [<ffffffff8105263a>] ? del_timer_sync+0x34/0x3e
[361.215301] [<ffffffff81036628>] ? should_resched+0x5/0x23
[361.215313] [<ffffffffa016525c>] ? ext4_lazyinit_thread+0xf3/0x228 [ext4]
[361.215323] [<ffffffffa0165169>] ? ext4_register_li_request+0x207/0x207 [ext4]
[361.215328] [<ffffffff8105f631>] ? kthread+0x76/0x7e
[361.215332] [<ffffffff81356374>] ? kernel_thread_helper+0x4/0x10
[361.215336] [<ffffffff8105f5bb>] ? kthread_worker_fn+0x139/0x139
[361.215340] [<ffffffff81356370>] ? gs_change+0x13/0x13
[361.215585] flush-8:0 D ffff88043e613780 0 2489 2 0x00000000
[361.215588] ffff88042b8c1120 0000000000000046 0000000000000000 ffffffff8160d020
[361.215593] 0000000000013780 ffff88042bb01fd8 ffff88042bb01fd8 ffff88042b8c1120
[361.215597] ffff88042bb015d0 000000012bb015d0 ffff880428c15980 ffff88043e613fd0
[361.215601] Call Trace:
[361.215605] [<ffffffff8134e141>] ? io_schedule+0x59/0x71
[361.215609] [<ffffffff811995fe>] ? get_request_wait+0x105/0x18f
[361.215613] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.215618] [<ffffffff8119a553>] ? blk_queue_bio+0x17f/0x28c
[361.215622] [<ffffffff8119908a>] ? generic_make_request+0x90/0xcf
[361.215627] [<ffffffff811aea5b>] ? radix_tree_tag_set+0x64/0xbf
[361.215631] [<ffffffff8119919c>] ? submit_bio+0xd3/0xf1
[361.215636] [<ffffffff810bbc9a>] ? test_set_page_writeback+0xdc/0xeb
[361.215644] [<ffffffffa0156f5f>] ? ext4_io_submit+0x21/0x4a [ext4]
[361.215653] [<ffffffffa0157185>] ? ext4_bio_write_page+0x1fd/0x3bc [ext4]
[361.215658] [<ffffffff810b6994>] ? find_get_pages+0x82/0xd4
[361.215667] [<ffffffffa0153d86>] ? mpage_da_submit_io+0x2bd/0x36f [ext4]
[361.215676] [<ffffffffa0155ad0>] ? mpage_da_map_and_submit+0x2e3/0x2f9 [ext4]
[361.215684] [<ffffffffa0155dcd>] ? write_cache_pages_da+0x214/0x2c5 [ext4]
[361.215693] [<ffffffffa0156120>] ? ext4_da_writepages+0x2a2/0x45d [ext4]
[361.215700] [<ffffffff811183f3>] ? writeback_single_inode+0x11d/0x2cc
[361.215704] [<ffffffff81118873>] ? writeback_sb_inodes+0x16b/0x204
[361.215709] [<ffffffff81118979>] ? __writeback_inodes_wb+0x6d/0xab
[361.215714] [<ffffffff81118adf>] ? wb_writeback+0x128/0x21f
[361.215717] [<ffffffff81070fc1>] ? arch_local_irq_save+0x11/0x17
[361.215722] [<ffffffff81118f96>] ? wb_do_writeback+0x146/0x1a8
[361.215727] [<ffffffff8111907d>] ? bdi_writeback_thread+0x85/0x1e6
[361.215731] [<ffffffff81118ff8>] ? wb_do_writeback+0x1a8/0x1a8
[361.215736] [<ffffffff8105f631>] ? kthread+0x76/0x7e
[361.215739] [<ffffffff81356374>] ? kernel_thread_helper+0x4/0x10
[361.215744] [<ffffffff8105f5bb>] ? kthread_worker_fn+0x139/0x139
[361.215748] [<ffffffff81356370>] ? gs_change+0x13/0x13
[361.215987] cp D ffff88043e613780 0 2651 2554 0x00000000
[361.215991] ffff88042a0faa70 0000000000000082 ffff880400000000 ffffffff8160d020
[361.215996] 0000000000013780 ffff88042c5f3fd8 ffff88042c5f3fd8 ffff88042a0faa70
[361.216000] 0000000000000246 000000018134f209 ffff88042a503868 ffff88042a503800
[361.216004] Call Trace:
[361.216010] [<ffffffffa013c393>] ? start_this_handle+0x2ca/0x448 [jbd2]
[361.216014] [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[361.216020] [<ffffffffa013c79c>] ? jbd2__journal_start+0x8a/0xce [jbd2]
[361.216028] [<ffffffffa01538ce>] ? ext4_da_write_begin+0xd4/0x1be [ext4]
[361.216040] [<ffffffffa016b6d8>] ? ext4_journal_start_sb+0x139/0x14f [ext4]
[361.216048] [<ffffffffa01538ce>] ? ext4_da_write_begin+0xd4/0x1be [ext4]
[361.216056] [<ffffffffa01557ac>] ? ext4_da_write_end+0x1f1/0x232 [ext4]
[361.216061] [<ffffffff810b4f8b>] ? generic_file_buffered_write+0x10f/0x259
[361.216067] [<ffffffff810b5dc8>] ? __generic_file_aio_write+0x248/0x278
[361.216071] [<ffffffff8104b2ae>] ? current_fs_time+0x31/0x37
[361.216075] [<ffffffff81036628>] ? should_resched+0x5/0x23
[361.216079] [<ffffffff810b5e55>] ? generic_file_aio_write+0x5d/0xb5
[361.216087] [<ffffffffa014ebd4>] ? ext4_file_write+0x1e1/0x235 [ext4]
[361.216092] [<ffffffff810fa054>] ? do_sync_write+0xb4/0xec
[361.216096] [<ffffffff81164649>] ? security_file_permission+0x16/0x2d
[361.216100] [<ffffffff810fa745>] ? vfs_write+0xa2/0xe9
[361.216104] [<ffffffff810fa922>] ? sys_write+0x45/0x6b
[361.216108] [<ffffffff81354212>] ? system_call_fastpath+0x16/0x1b
[535.962198] megasas: reset: Stopping HBA.
[535.962362] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[535.962510] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[535.962656] sd 0:2:0:0: Device offlined - not ready after error recovery
[535.963023] sd 0:2:0:0: Device offlined - not ready after error recovery
[535.963030] sd 0:2:0:0: [sda] Unhandled error code
[535.963032] sd 0:2:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[535.963036] sd 0:2:0:0: [sda] CDB: Write(10): 2a 00 00 be a8 00 00 01 00 00
[535.967839] EXT4-fs warning (device sda1): ext4_end_bio:250: I/O error writing to inode 424345603 (offset 5876219904 size 135168 starting block 1561888)
[535.967996] sd 0:2:0:0: [sda] Unhandled error code
16871 views
Node: DevOps
19 replies
niseter
2013-12-22 17:36:08 +08:00
If C7 is showing errors, OP should replace the data cables.

A quiet question: how does an 8i connect to 12 drives?
fuxkcsdn
2013-12-22 18:03:21 +08:00
@niseter
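I'm using a 24-bay server chassis.

I think the C7 errors are a result of the array failing, because I've noticed that whenever the array goes offline, one drive's LED stays lit solid.
Probably the array went offline just as that drive was being written to, and only one LED stays lit each time.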
niseter
2013-12-22 18:31:39 +08:00
@fuxkcsdn No, I seem to remember that the 9261-8i can only connect 8 drives?

Also, what kind of environment are OP's drives running in?
tititake
2013-12-22 18:36:30 +08:00
Check it with the megacli tool and see.
fuxkcsdn
2013-12-22 19:07:29 +08:00
@niseter
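My chassis has a backplane. Connecting the backplane to the RAID card is enough to let the 9261-8i drive 24 disks, or even 126 (the official 9261-8i documentation says it supports up to 126 drives).

The whole server is brand new. This problem showed up less than 2 weeks after I bought it (about a month if you count the time spent replacing the RAID card).
From the start until the failure I had only written about 5 TB of large files (just movies I had downloaded earlier....), and that was simply copying from other drives onto the array. After that I left it alone for about a week (shut down normally, powered off). Even during the copy the drives stayed at around 30 °C (the chassis's stock 8038 fans are very powerful, but also very loud, like an engine).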

@tititake
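When the array is offline, megacli reports that the hardware does not exist (or something like that; the drives are still being checked, so I can't run it right now).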
halfbloodrock
2013-12-22 20:20:12 +08:00
To me it looks a bit like insufficient power. The instantaneous current of 12 drives is quite high.
Marble
2013-12-22 23:30:38 +08:00
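The poster above has a point. Check the maximum current your backplane can supply, divide it by 12, and see whether that meets the drive's specification.
Also, a RAID array can be written to while it is still initializing.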
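The per-drive check suggested above is simple arithmetic. A minimal sketch follows, assuming the rail current is shared evenly across the drives. The function names are made up for illustration, and the figures in the comment are placeholders, not values from any backplane or Toshiba datasheet.

```python
def per_drive_budget(total_amps, n_drives):
    """Current available to each drive if the rail is shared evenly."""
    if n_drives <= 0:
        raise ValueError("n_drives must be positive")
    return total_amps / n_drives


def rail_is_sufficient(total_amps, n_drives, drive_peak_amps):
    """True if the even share covers the drive's peak (spin-up) current."""
    return per_drive_budget(total_amps, n_drives) >= drive_peak_amps


# Placeholder numbers only: substitute the backplane rating and the
# peak current from the drive's datasheet.
# rail_is_sufficient(total_amps=30.0, n_drives=12, drive_peak_amps=2.0)
```

Staggered spin-up, if the controller has it enabled, lowers the simultaneous peak, so a failing result here is a hint rather than proof.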
fuxkcsdn
2013-12-23 01:07:26 +08:00
@halfbloodrock
@Marble
The chassis is a Supermicro model made specifically for DIY storage servers, and the power supply is the 1400 W redundant unit that came with it.
Marble
2013-12-23 08:36:51 +08:00
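Upgrade the firmware of the RAID card and the drives to the latest versions where you can, to rule out compatibility problems. If that doesn't help, the only option left is to cross-test the hardware.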
fuxkcsdn
2013-12-23 11:24:47 +08:00
@Marble
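The RAID card firmware is already on the latest version. I haven't updated the drives, because they were all manufactured in October this year and I assumed they would already be current. I'll check in a bit.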
Marble
2013-12-23 15:27:39 +08:00
@fuxkcsdn Also, use the megacli tool that @tititake mentioned to look at the PHY error counts. That will show whether this is a signal-integrity problem:
megacli -PhyErrorCounters -a0
fuxkcsdn
2013-12-23 17:06:43 +08:00
@Marble

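It doesn't look like a signal-integrity problem either.
At first I thought it was a driver problem, because I never installed a RAID card driver on Debian 7. The system detected the card automatically after installation, so I didn't install one (the official driver only claims support for Debian 6.0.3 anyway).
But this morning I installed Windows 2008 R2 to test, with the latest official LSI driver, and the array still goes offline as soon as I write data.
When it is offline, even the RAID card itself disappears from MegaRAID Storage Manager, so surely this can't be a drive problem?? If it were a drive problem, the worst case should be that one array drops offline, right??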

C:\Users\Administrator>megacli -phyerrorcounters -a0

Adapter #0

================
Phy No: 0
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0

Phy No: 1
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0

Phy No: 2
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0

Phy No: 3
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0

Phy No: 4
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0

Phy No: 5
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0

Phy No: 6
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0

Phy No: 7
Invalid DWord Count : 0
Running Disparity Error Count : 0
Loss of DWord Synch Count : 0
Phy Reset problem Count : 0


Exit Code: 0x00
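With 8 PHYs and four counters each, a nonzero value is easy to miss by eye. A minimal sketch for parsing output shaped like the listing above and listing the PHYs with any nonzero counter. The parser assumes only the `Phy No:` / `name : value` layout shown here, and the function names are made up for illustration.

```python
import re


def parse_phy_counters(text):
    """Return {phy_number: {counter_name: value}} from -PhyErrorCounters output."""
    phys = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Phy No:\s*(\d+)", line)
        if m:
            current = int(m.group(1))
            phys[current] = {}
            continue
        # Counter lines look like "Invalid DWord Count : 0".
        m = re.match(r"(.+?)\s*:\s*(\d+)$", line)
        if m and current is not None:
            phys[current][m.group(1)] = int(m.group(2))
    return phys


def phys_with_errors(phys):
    """PHY numbers that have at least one nonzero counter."""
    return [n for n, counters in sorted(phys.items()) if any(counters.values())]
```

For the output above, `phys_with_errors` returns an empty list, which matches reading every counter as 0.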
fuxkcsdn
2013-12-23 17:32:19 +08:00
@Marble
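By the way, is there any way to check several drives for problems in one go?
This is already the second RAID card. Could I really be unlucky enough to get 2 faulty cards in a row??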
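One way to check several drives in one pass is to loop the smartctl command from the first post over the MegaRAID device IDs. A minimal sketch, assuming smartctl is on the PATH, the script runs as root, and the array is exposed as /dev/sda. The ID range in the comment is a guess; the real IDs are the "Device Id" values the controller reports for each physical drive.

```python
import subprocess


def smartctl_cmd(device_id, block_dev="/dev/sda"):
    """Build the smartctl command for one drive behind the MegaRAID card."""
    return ["smartctl", "-a", "-d", f"sat+megaraid,{device_id}", block_dev]


def check_all(device_ids, block_dev="/dev/sda"):
    """Run smartctl for each device ID and return {device_id: report_text}."""
    reports = {}
    for dev_id in device_ids:
        proc = subprocess.run(
            smartctl_cmd(dev_id, block_dev),
            capture_output=True,
            text=True,
        )
        reports[dev_id] = proc.stdout
    return reports


# Example (assumed ID range, adjust to the IDs the controller reports):
# for dev_id, report in check_all(range(12)).items():
#     print(dev_id, "ICRC" in report)
```

This only collects the reports. Searching them for `ICRC` or for the CRC attribute is left to the caller.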
Marble
2013-12-23 22:39:51 +08:00
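@fuxkcsdn Right, your system has an expander, so what you are seeing here is the link between the RAID card and the expander. To see the state of the HDDs behind the expander, you will have to find out which parameter to use.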
Marble
2013-12-23 22:43:52 +08:00
The RAID card disappeared because the driver detected an error and reset the HBA. You can see this in the Debian system log above.
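The reset sequence can be pulled out of a dmesg dump like the one in the first post. A minimal sketch, assuming the line format seen there (`[seconds] message`). The event labels and the set of patterns are picked for illustration and are not an exhaustive list of megaraid_sas messages.

```python
import re

# Labels are illustrative; the patterns match strings that appear in the posted log.
EVENT_PATTERNS = [
    ("firmware fault", re.compile(r"FW detected to be in fault\s*state")),
    ("firmware restarted", re.compile(r"FW restarted successfully")),
    ("HBA stopped", re.compile(r"reset: Stopping HBA")),
    ("device offlined", re.compile(r"Device offlined")),
]
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def controller_timeline(log_text):
    """Return [(timestamp_seconds, label)] for controller events in a dmesg dump."""
    events = []
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        for label, pattern in EVENT_PATTERNS:
            if pattern.search(msg):
                events.append((ts, label))
                break
    return events
```

Applied to the posted log, the firmware fault at 173.9 s comes first, and the device is offlined at about 535.96 s once the HBA is stopped.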
fuxkcsdn
2013-12-23 22:57:03 +08:00
@Marble
The expander is an LSI SAS2X36.

Enclosure 1:
Device ID : 16
Number of Slots : 24
Number of Power Supplies : 2
Number of Fans : 5
Number of Temperature Sensors : 1
Number of Alarms : 0
Number of SIM Modules : 0
Number of Physical Drives : 8
Status : Normal
Position : 1
Connector Name : Port 0 - 3
Enclosure type : SES
FRU Part Number : N/A
Enclosure Serial Number : N/A
ESM Serial Number : N/A
Enclosure Zoning Mode : N/A
Partner Device Id : 65535

Inquiry data :
Vendor Identification : LSI
Product Identification : SAS2X36
Product Revision Level : 0e12
Vendor Specific : x36-55.14.18.0
fuxkcsdn
2013-12-23 23:00:59 +08:00
@Marble
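By the way, the drives that showed the C7 warning under Windows produce this error message when checked with smartctl under Linux.
What does this message mean??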

SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 72 hours (3 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 95 6b 4e 04 04 Error: ICRC, ABRT at LBA = 0x04044e6b = 67391083

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 00 00 4e 04 40 00 00:14:39.085 WRITE FPDMA QUEUED
61 00 00 00 4d 04 40 00 00:14:39.084 WRITE FPDMA QUEUED
61 00 00 00 47 04 40 00 00:14:39.082 WRITE FPDMA QUEUED
61 00 28 00 4c 04 40 00 00:14:39.078 WRITE FPDMA QUEUED
61 00 20 00 4b 04 40 00 00:14:39.078 WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 10 hours (0 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 d3 2d d8 5c 09 Error: ICRC, ABRT at LBA = 0x095cd82d = 157079597

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 00 00 d8 5c 40 00 01:05:23.837 WRITE FPDMA QUEUED
61 00 00 00 d7 5c 40 00 01:05:23.836 WRITE FPDMA QUEUED
61 00 00 00 d6 5c 40 00 01:05:23.835 WRITE FPDMA QUEUED
61 00 00 00 d5 5c 40 00 01:05:23.834 WRITE FPDMA QUEUED
61 00 00 00 d4 5c 40 00 01:05:23.833 WRITE FPDMA QUEUED
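The register dump can be decoded by hand. In the ATA error register, bit 7 is ICRC (an interface CRC error during an Ultra DMA transfer) and bit 2 is ABRT (command aborted), so ER = 0x84 means both are set. The 28-bit LBA is assembled from the low nibble of DH followed by CH, CL and SN. A minimal sketch of that decoding, covering only the bits listed in the table below, with function names made up for illustration:

```python
# Subset of ATA error-register bits, using the names smartctl prints.
ERROR_BITS = [
    (0x80, "ICRC"),  # interface CRC error
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # requested sector ID not found
    (0x04, "ABRT"),  # command aborted
]


def decode_error_register(er):
    """Names of the known bits set in the ER byte."""
    return [name for mask, name in ERROR_BITS if er & mask]


def lba28(sn, cl, ch, dh):
    """Assemble the 28-bit LBA from the SN, CL, CH and DH registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


print(decode_error_register(0x84))       # ['ICRC', 'ABRT']
print(lba28(0x6B, 0x4E, 0x04, 0x04))     # 67391083, as in Error 2
```

ICRC is a link-level error between the drive and whatever it is attached to, the same kind of event the C7 attribute counts. On its own it does not say which end of the link is at fault.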
Marble
2013-12-26 16:20:38 +08:00
A SMART error makes the drives the more likely culprit. Try low-level formatting the problem drives; if that still doesn't work, my condolences -_-!!
fuxkcsdn
2013-12-26 16:43:27 +08:00
@Marble
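But this error is caused when the RAID card fails.
All my drives originally passed the smartctl check. Whenever the RAID card errors out, exactly one drive's LED stays lit solid, and if I then run smartctl again, that drive shows this error. In Windows, HD Tune reports it as a C7 error.

Yesterday I connected every drive, one at a time, to the motherboard's SATA ports, scanned for bad sectors, and also formatted each one and ran write and read tests. No problems at all.

I have now sent the RAID card back under warranty again. When it was replaced last time I never asked whether the card was actually faulty; this time I have asked them to give me an answer.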

https://www.v2ex.com/t/94151