
A question about smartd, the disk monitoring tool

    kyonn · 97 days ago · 614 views

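    After installing smartd, I plugged a disk that reports Current_Pending_Sector errors into the motherboard, but smartd does not report the error and sends no email.

    Here is what I have checked so far:

    1. I tested with the -M test directive in smartd.conf, and email delivery works.
    2. I confirmed that when smartd runs a short test, it does read the abnormal Current_Pending_Sector value, but it does not enter the failure-handling path, so no email is sent.
    3. This bad disk was detected correctly on OMV5, which did send alert emails. It has since been moved to a new motherboard, and the operating system is now Debian 12.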

    /etc/smartd.conf

    /dev/sdd -a -o on -S on -n standby,q -T permissive -s (S/../.././09|L/../01/./04) -W 0,50,55 -m [email protected] -M exec /usr/share/smartmontools/smartd-runner
    

    Messages in syslog when smartd runs the short test:

    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors
    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Offline uncorrectable sectors
    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], previous self-test completed with error (read test element)
    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], Self-Test Log error count increased from 9 to 10
    Jan 22 10:00:01 debian12 CRON[6148]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
    
    

    smartctl -l selftest /dev/sdd

    smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-17-amd64] (local build)
    Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       90%     21653         792577336
    # 2  Short offline       Completed: read failure       90%     21647         792577336
    # 3  Short offline       Completed: read failure       90%     21645         792577336
    # 4  Short offline       Completed: read failure       90%     21644         792577336
    # 5  Short offline       Completed: read failure       90%     21642         792577336
    # 6  Short offline       Completed: read failure       90%     21639         792577336
    # 7  Extended offline    Completed: read failure       90%     21639         792577336
    # 8  Short offline       Completed: read failure       90%     21639         792577336
    # 9  Short offline       Completed: read failure       90%     21586         792577336
    

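    Could anyone who has configured smartd suggest how to troubleshoot this, or recommend similar software that offers OMV-style SMART monitoring and email notifications? A full OMV install is too heavy, and I would rather not install it again.

    Full output of smartctl -a /dev/sdd: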

    smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-17-amd64] (local build)
    Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.14 (AF)
    Device Model:     ST2000DM001-1ER164
    Serial Number:    W56012C5
    LU WWN Device Id: 5 000c50 09b3a804f
    Firmware Version: CC26
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database 7.3/5577
    ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Mon Jan 22 10:45:42 2024 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x00)	Offline data collection activity
    					was never started.
    					Auto Offline Data Collection: Disabled.
    Self-test execution status:      ( 121)	The previous self-test completed having
    					the read element of the test failed.
    Total time to complete Offline 
    data collection: 		(   80) seconds.
    Offline data collection
    capabilities: 			 (0x73) SMART execute Offline immediate.
    					Auto Offline data collection on/off support.
    					Suspend Offline collection upon new
    					command.
    					No Offline surface scan supported.
    					Self-test supported.
    					Conveyance Self-test supported.
    					Selective Self-test supported.
    SMART capabilities:            (0x0003)	Saves SMART data before entering
    					power-saving mode.
    					Supports SMART auto save timer.
    Error logging capability:        (0x01)	Error logging supported.
    					General Purpose Logging supported.
    Short self-test routine 
    recommended polling time: 	 (   1) minutes.
    Extended self-test routine
    recommended polling time: 	 ( 207) minutes.
    Conveyance self-test routine
    recommended polling time: 	 (   2) minutes.
    SCT capabilities: 	       (0x1085)	SCT Status supported.
    
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   117   088   006    Pre-fail  Always       -       151185992
      3 Spin_Up_Time            0x0003   097   094   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1370
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       82712512
      9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21655
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1262
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   085   085   000    Old_age   Always       -       15
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       5 5 5
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   071   056   045    Old_age   Always       -       29 (Min/Max 14/30)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       187
    193 Load_Cycle_Count        0x0032   090   090   000    Old_age   Always       -       20532
    194 Temperature_Celsius     0x0022   029   044   000    Old_age   Always       -       29 (0 7 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       20310h+10m+19.082s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       8742325499
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       146195076257
    
    SMART Error Log Version: 1
    ATA Error Count: 15 (device log contains only the most recent five errors)
    	CR = Command Register [HEX]
    	FR = Features Register [HEX]
    	SC = Sector Count Register [HEX]
    	SN = Sector Number Register [HEX]
    	CL = Cylinder Low Register [HEX]
    	CH = Cylinder High Register [HEX]
    	DH = Device/Head Register [HEX]
    	DC = Device Command Register [HEX]
    	ER = Error register [HEX]
    	ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    
    Error 15 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:54:51.942  READ FPDMA QUEUED
      61 00 78 ff ff ff 4f 00  36d+21:54:49.230  WRITE FPDMA QUEUED
      60 00 48 ff ff ff 4f 00  36d+21:54:48.864  READ FPDMA QUEUED
      60 00 b0 ff ff ff 4f 00  36d+21:54:48.864  READ FPDMA QUEUED
      60 00 48 ff ff ff 4f 00  36d+21:54:48.862  READ FPDMA QUEUED
    
    Error 14 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:54:01.959  READ FPDMA QUEUED
      61 00 08 00 22 06 40 00  36d+21:54:01.958  WRITE FPDMA QUEUED
      ef 10 02 00 00 00 a0 00  36d+21:54:01.948  SET FEATURES [Enable SATA feature]
      27 00 00 00 00 00 e0 00  36d+21:54:01.921  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 00  36d+21:54:01.921  IDENTIFY DEVICE
    
    Error 13 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:53:58.059  READ FPDMA QUEUED
      61 00 08 ff ff ff 4f 00  36d+21:53:58.046  WRITE FPDMA QUEUED
      61 00 08 00 22 06 40 00  36d+21:53:58.035  WRITE FPDMA QUEUED
      61 00 10 ff ff ff 4f 00  36d+21:53:58.015  WRITE FPDMA QUEUED
      ef 10 02 00 00 00 a0 00  36d+21:53:58.005  SET FEATURES [Enable SATA feature]
    
    Error 12 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:53:54.205  READ FPDMA QUEUED
      61 00 10 ff ff ff 4f 00  36d+21:53:54.205  WRITE FPDMA QUEUED
      61 00 08 e8 21 06 40 00  36d+21:53:54.204  WRITE FPDMA QUEUED
      ef 10 02 00 00 00 a0 00  36d+21:53:54.193  SET FEATURES [Enable SATA feature]
      27 00 00 00 00 00 e0 00  36d+21:53:54.167  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
    
    Error 11 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 08 e8 21 06 40 00  36d+21:53:53.926  WRITE FPDMA QUEUED
      61 00 10 ff ff ff 4f 00  36d+21:53:53.671  WRITE FPDMA QUEUED
      60 00 08 ff ff ff 4f 00  36d+21:53:50.339  READ FPDMA QUEUED
      60 00 18 ff ff ff 4f 00  36d+21:53:50.279  READ FPDMA QUEUED
      60 00 20 ff ff ff 4f 00  36d+21:53:50.279  READ FPDMA QUEUED
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       90%     21653         792577336
    # 2  Short offline       Completed: read failure       90%     21647         792577336
    # 3  Short offline       Completed: read failure       90%     21645         792577336
    # 4  Short offline       Completed: read failure       90%     21644         792577336
    # 5  Short offline       Completed: read failure       90%     21642         792577336
    # 6  Short offline       Completed: read failure       90%     21639         792577336
    # 7  Extended offline    Completed: read failure       90%     21639         792577336
    # 8  Short offline       Completed: read failure       90%     21639         792577336
    # 9  Short offline       Completed: read failure       90%     21586         792577336
    #10  Extended offline    Completed: read failure       90%     21546         792577336
    #11  Short offline       Completed without error       00%     21418         -
    #12  Short offline       Completed without error       00%     21250         -
    #13  Short offline       Completed without error       00%     21082         -
    #14  Short offline       Completed without error       00%     20914         -
    #15  Short offline       Completed without error       00%     20746         -
    #16  Short offline       Completed without error       00%     20578         -
    #17  Short offline       Completed without error       00%     20410         -
    #18  Short offline       Completed without error       00%     20242         -
    #19  Short offline       Completed without error       00%     20074         -
    #20  Short offline       Completed without error       00%     19906         -
    #21  Short offline       Completed without error       00%     19738         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    
    
    9 replies    2024-01-25 20:38:45 +08:00
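    #1 · julyclyde · 97 days ago
    I think it only notifies when the state changes.
    As far as smartd is concerned, this disk's state has never changed since the first time it saw it: the disk has been bad all along.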
    #2 · julyclyde · 97 days ago
    if (oldc < newc) {
      // increase in error count
      PrintOut(LOG_CRIT, "Device: %s, Self-Test Log error count increased from %d to %d\n",
               name, oldc, newc);
      MailWarning(cfg, state, 3, "Device: %s, Self-Test Log error count increased from %d to %d",
                  name, oldc, newc);
      state.must_write = true;
    }
    else if (newc > 0 && oldh != newh) {
      // more recent error
      // a 'more recent' error might actually be a smaller hour number,
      // if the hour number has wrapped.
      // There's still a bug here. You might just happen to run a new test
      // exactly 32768 hours after the previous failure, and have run exactly
      // 20 tests between the two, in which case smartd will miss the
      // new failure.
      PrintOut(LOG_CRIT, "Device: %s, new Self-Test Log error at hour timestamp %d\n",
               name, newh);
      MailWarning(cfg, state, 3, "Device: %s, new Self-Test Log error at hour timestamp %d",
                  name, newh);
      state.must_write = true;
    }

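    This is what I found in the source code.

    Every place that calls MailWarning has a similar check.

    Maybe you could try deleting the state file and see what happens?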
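The change-only behaviour in the quoted snippet can be illustrated with a small Python sketch. It mirrors only the two conditions shown above; `SelfTestState` and `check_selftest_log` are hypothetical names standing in for the saved state and the check, not part of smartmontools, and how the real daemon initializes and stores these values is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SelfTestState:
    """Values remembered between checks (hypothetical stand-in for the state file)."""
    error_count: int
    last_error_hour: int


def check_selftest_log(state: SelfTestState, new_count: int, new_hour: int) -> Optional[str]:
    """Return a warning only when the self-test log differs from the remembered state."""
    message = None
    if state.error_count < new_count:
        message = (f"Self-Test Log error count increased "
                   f"from {state.error_count} to {new_count}")
    elif new_count > 0 and state.last_error_hour != new_hour:
        message = f"new Self-Test Log error at hour timestamp {new_hour}"
    # Remember what was seen so the same failure is not reported twice.
    state.error_count = new_count
    state.last_error_hour = new_hour
    return message


if __name__ == "__main__":
    state = SelfTestState(error_count=9, last_error_hour=21586)
    print(check_selftest_log(state, 9, 21586))   # unchanged log: nothing to report
    print(check_selftest_log(state, 10, 21653))  # one more failed test: warning
```

With the remembered count already at 9, a disk that keeps failing in the same way produces no message; only the step from 9 to 10 does, which matches the syslog line in the original post.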
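    #3 · kyonn (OP) · 97 days ago
    @julyclyde Thanks, it is most likely related to this condition. I tested for a while longer: when the error count increases, the failure path is triggered and the email is sent. What I still don't know is whether there is a setting that keeps sending email for as long as a few key attributes are abnormal, rather than only when something changes.

    Deleting the state file had no effect. I'll keep looking at the other options to see whether this behaviour can be changed without modifying the code.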
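One way to get "alert while abnormal" rather than "alert on change" is a stateless check outside smartd. The following is a minimal sketch, assuming the attribute table format shown in the smartctl output above (as printed by `smartctl -A`); the choice of watched attributes and the function name `nonzero_watched` are assumptions, and scheduling the script and mailing the result are left out.

```python
from typing import Dict

# Attributes to watch; this selection is an assumption, adjust as needed.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(smartctl_output: str) -> Dict[str, int]:
    """Return every watched attribute whose RAW_VALUE is non-zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit() or fields[1] not in WATCHED:
            continue
        if not fields[9].isdigit():
            continue  # skip raw values that are not a plain integer
        raw = int(fields[9])
        if raw != 0:
            found[fields[1]] = raw
    return found


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

Because nothing is remembered between runs, the same non-zero values are reported every time the check runs, which is the opposite trade-off from the change-only logic in smartd.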
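    #4 · julyclyde · 97 days ago
    @kyonn You still dare to keep testing...
    Back up your data right now.
    Figuring out how smartd works is secondary; keeping the data available and intact is the first priority.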
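    #5 · kyonn (OP) · 97 days ago
    @julyclyde The data on this bad disk was moved off long ago. I only keep it around as a test disk to verify that my smartd configuration is correct.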
    #6 · julyclyde · 97 days ago
    @kyonn Then take your time, and write it up once you have figured it out.
    Teach me!
    #7 · dbak · 95 days ago
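    #8 · kyonn (OP) · 94 days ago
    @julyclyde I have now verified that deleting the state file does work, but the smartd service must be stopped first and the file deleted afterwards, because the service saves its state once more before it stops...

    After deleting the state file and starting the smartd service again, it detects the bad disk and sends the email.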
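The order of operations described above (stop, delete, start) can be sketched as follows. The unit name `smartmontools` and the state path `/var/lib/smartmontools/smartd.*.state` are assumptions about a Debian install and should be checked on the actual system; `reset_smartd_state` is a hypothetical helper, and it only lists the steps unless `dry_run=False` is passed.

```python
import glob
import os
import subprocess
from typing import List

# Assumptions: Debian's unit name and state file location; verify both locally.
SERVICE = "smartmontools"
STATE_GLOB = "/var/lib/smartmontools/smartd.*.state"


def reset_smartd_state(state_glob: str = STATE_GLOB, dry_run: bool = True) -> List[str]:
    """Stop smartd, delete its state files, start it again; return the steps."""
    steps: List[str] = []

    def run(cmd: List[str]) -> None:
        steps.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

    # Stop first: the daemon saves its state when it shuts down, so a file
    # deleted while it is still running would simply be written back.
    run(["systemctl", "stop", SERVICE])
    for path in sorted(glob.glob(state_glob)):
        steps.append(f"rm {path}")
        if not dry_run:
            os.remove(path)
    run(["systemctl", "start", SERVICE])
    return steps


if __name__ == "__main__":
    # Dry run by default: prints the plan without touching the system.
    for step in reset_smartd_state():
        print(step)
```

Running it with `dry_run=False` requires root, and it discards whatever history smartd had recorded for every disk matched by the pattern, not just the one under test.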
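    #9 · kyonn (OP) · 94 days ago
    @dbak Thanks for the recommendation. At a glance it is a monitoring tool similar to netdata that does more than just disk monitoring. I have already solved this problem... I'll try that software next time I get the chance.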