A question about the smartd disk-monitoring tool.

113 days ago

After installing smartd, I connected a disk that reports Current_Pending_Sector errors to the motherboard, but smartd does not report the error and sends no email.


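  1. I tested with the -M test directive in smartd.conf, and email delivery works.
  2. I confirmed that when smartd runs a short test it does read the abnormal Current_Pending_Sector value, but the alert path is not triggered and no email is sent.
  3. This faulty disk was previously detected correctly on OMV5, which sent alert emails. It has now been moved to a new motherboard running debian12.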


/dev/sdd -a -o on -S on -n standby,q -T permissive -s (S/../.././09|L/../01/./04) -W 0,50,55 -m xx@xxxx.com -M exec /usr/share/smartmontools/smartd-runner

syslog messages while smartd runs the short test:

Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors
Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Offline uncorrectable sectors
Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], previous self-test completed with error (read test element)
Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], Self-Test Log error count increased from 9 to 10
Jan 22 10:00:01 debian12 CRON[6148]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)

smartctl -l selftest /dev/sdd

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-17-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     21653         792577336
# 2  Short offline       Completed: read failure       90%     21647         792577336
# 3  Short offline       Completed: read failure       90%     21645         792577336
# 4  Short offline       Completed: read failure       90%     21644         792577336
# 5  Short offline       Completed: read failure       90%     21642         792577336
# 6  Short offline       Completed: read failure       90%     21639         792577336
# 7  Extended offline    Completed: read failure       90%     21639         792577336
# 8  Short offline       Completed: read failure       90%     21639         792577336
# 9  Short offline       Completed: read failure       90%     21586         792577336

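Could anyone who has configured smartd suggest how to troubleshoot this, or recommend similar software that offers OMV-style SMART monitoring and email notifications? OMV as a whole is too heavy and I would rather not install it again.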

Full output of smartctl -a /dev/sdd:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-17-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1ER164
Serial Number:    W56012C5
LU WWN Device Id: 5 000c50 09b3a804f
Firmware Version: CC26
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5577
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jan 22 10:45:42 2024 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(   80) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 207) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x000f   117   088   006    Pre-fail  Always       -       151185992
  3 Spin_Up_Time            0x0003   097   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1370
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       82712512
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21655
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1262
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   085   085   000    Old_age   Always       -       15
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       5 5 5
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   056   045    Old_age   Always       -       29 (Min/Max 14/30)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       187
193 Load_Cycle_Count        0x0032   090   090   000    Old_age   Always       -       20532
194 Temperature_Celsius     0x0022   029   044   000    Old_age   Always       -       29 (0 7 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       20310h+10m+19.082s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       8742325499
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       146195076257

SMART Error Log Version: 1
ATA Error Count: 15 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 15 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  36d+21:54:51.942  READ FPDMA QUEUED
  61 00 78 ff ff ff 4f 00  36d+21:54:49.230  WRITE FPDMA QUEUED
  60 00 48 ff ff ff 4f 00  36d+21:54:48.864  READ FPDMA QUEUED
  60 00 b0 ff ff ff 4f 00  36d+21:54:48.864  READ FPDMA QUEUED
  60 00 48 ff ff ff 4f 00  36d+21:54:48.862  READ FPDMA QUEUED

Error 14 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  36d+21:54:01.959  READ FPDMA QUEUED
  61 00 08 00 22 06 40 00  36d+21:54:01.958  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  36d+21:54:01.948  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00  36d+21:54:01.921  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00  36d+21:54:01.921  IDENTIFY DEVICE

Error 13 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  36d+21:53:58.059  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  36d+21:53:58.046  WRITE FPDMA QUEUED
  61 00 08 00 22 06 40 00  36d+21:53:58.035  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00  36d+21:53:58.015  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  36d+21:53:58.005  SET FEATURES [Enable SATA feature]

Error 12 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  36d+21:53:54.205  READ FPDMA QUEUED
  61 00 10 ff ff ff 4f 00  36d+21:53:54.205  WRITE FPDMA QUEUED
  61 00 08 e8 21 06 40 00  36d+21:53:54.204  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  36d+21:53:54.193  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00  36d+21:53:54.167  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 11 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 e8 21 06 40 00  36d+21:53:53.926  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00  36d+21:53:53.671  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  36d+21:53:50.339  READ FPDMA QUEUED
  60 00 18 ff ff ff 4f 00  36d+21:53:50.279  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00  36d+21:53:50.279  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     21653         792577336
# 2  Short offline       Completed: read failure       90%     21647         792577336
# 3  Short offline       Completed: read failure       90%     21645         792577336
# 4  Short offline       Completed: read failure       90%     21644         792577336
# 5  Short offline       Completed: read failure       90%     21642         792577336
# 6  Short offline       Completed: read failure       90%     21639         792577336
# 7  Extended offline    Completed: read failure       90%     21639         792577336
# 8  Short offline       Completed: read failure       90%     21639         792577336
# 9  Short offline       Completed: read failure       90%     21586         792577336
#10  Extended offline    Completed: read failure       90%     21546         792577336
#11  Short offline       Completed without error       00%     21418         -
#12  Short offline       Completed without error       00%     21250         -
#13  Short offline       Completed without error       00%     21082         -
#14  Short offline       Completed without error       00%     20914         -
#15  Short offline       Completed without error       00%     20746         -
#16  Short offline       Completed without error       00%     20578         -
#17  Short offline       Completed without error       00%     20410         -
#18  Short offline       Completed without error       00%     20242         -
#19  Short offline       Completed without error       00%     20074         -
#20  Short offline       Completed without error       00%     19906         -
#21  Short offline       Completed without error       00%     19738         -

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

113 days ago
This disk's state has not changed since smartd first saw it; it has been bad the whole time.
113 days ago
if (oldc<newc) {
  // increase in error count
  PrintOut(LOG_CRIT, "Device: %s, Self-Test Log error count increased from %d to %d\n",
           name, oldc, newc);
  MailWarning(cfg, state, 3, "Device: %s, Self-Test Log error count increased from %d to %d",
              name, oldc, newc);
  state.must_write = true;
}
else if (newc > 0 && oldh != newh) {
  // more recent error
  // a 'more recent' error might actually be a smaller hour number,
  // if the hour number has wrapped.
  // There's still a bug here. You might just happen to run a new test
  // exactly 32768 hours after the previous failure, and have run exactly
  // 20 tests between the two, in which case smartd will miss the
  // new failure.
  PrintOut(LOG_CRIT, "Device: %s, new Self-Test Log error at hour timestamp %d\n",
           name, newh);
  MailWarning(cfg, state, 3, "Device: %s, new Self-Test Log error at hour timestamp %d",
              name, newh);
  state.must_write = true;
}


Each of the places that call MailWarning has a similar check.
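
The effect of the two quoted branches can be restated as a small shell function. This is only an illustration of the condition shown above, not smartd's own code: the function name `selftest_action` and its output strings are made up for this sketch. `oldc`/`newc` stand for the saved and current self-test error counts, `oldh`/`newh` for the hour stamps of the most recent failed test.

```shell
#!/bin/sh
# Illustrative restatement of the two quoted smartd.cpp branches.
# Usage: selftest_action OLDC NEWC OLDH NEWH
# Prints which branch would fire; "none" means neither quoted branch is taken.
selftest_action() {
    oldc=$1 newc=$2 oldh=$3 newh=$4
    if [ "$oldc" -lt "$newc" ]; then
        # first branch: the error count grew
        echo "mail: error count increased from $oldc to $newc"
    elif [ "$newc" -gt 0 ] && [ "$oldh" -ne "$newh" ]; then
        # second branch: same count, but a failure with a different hour stamp
        echo "mail: new error at hour timestamp $newh"
    else
        echo "none"
    fi
}

# Count went from 9 to 10, as in the syslog excerpt.
selftest_action 9 10 21647 21653
# Same count and same hour stamp as the saved values.
selftest_action 10 10 21653 21653
```

When the current values match what was saved, neither branch is taken, which is why a disk that was already bad when smartd first recorded it may never look "new" to this check.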

Maybe try deleting the state file and see what happens?
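113 days ago
@julyclyde Thanks, it is most likely related to that condition. I tested for a while longer: when the error count increases, the alert path is triggered and the email is sent. What I don't know is whether some setting can keep sending email for as long as a few key attributes are abnormal, rather than only when they change.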

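Deleting the state file had no effect. I will keep looking through the other settings to see whether this behavior can be changed without modifying the code.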
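
One way to get an alert for as long as a value stays non-zero, independent of smartd's change detection, would be a small cron-driven check on the raw attribute values. The sketch below assumes the `smartctl -A` column layout shown in the first post (raw value in the tenth column). The function name `bad_sector_attrs` is hypothetical, and the mail command in the usage comment is only a placeholder that assumes a working `mail` program.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print every watched attribute
# (197 Current_Pending_Sector, 198 Offline_Uncorrectable) whose raw value
# is non-zero. Exit status 1 means at least one such attribute was found.
bad_sector_attrs() {
    awk '($1 == 197 || $1 == 198) && $10 + 0 > 0 {
             print $2 "=" $10
             bad = 1
         }
         END { exit bad + 0 }'
}

# Possible use from cron (placeholder address):
#   out=$(smartctl -A /dev/sdd | bad_sector_attrs) || \
#       echo "$out" | mail -s "SMART warning /dev/sdd" xx@xxxx.com
```

Because the check looks at the current value instead of comparing against saved state, it would fire on every run while the pending-sector count is above zero.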
113 days ago
@kyonn You still dare to keep testing...
Figuring out how smartd works is secondary; keeping the data available and intact is the top priority.
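113 days ago
@julyclyde The data on this bad disk was moved off long ago. I am keeping it purely as a test disk to verify that my smartd configuration is correct.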
113 days ago
@kyonn Then take your time, and write it up once you have figured it out.
111 days ago
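109 days ago
@julyclyde After verification, deleting the state file does work, but the smartd service has to be stopped first and the file deleted afterwards, because the service saves its state once more before it stops.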

After deleting the state file and starting the smartd service again, the bad disk is detected and the email is sent.
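
That order of operations can be sketched as a shell function. The unit name `smartmontools` and the directory `/var/lib/smartmontools` are assumptions based on a stock Debian package, not something stated in this thread; check `systemctl status` and the `-s` option your smartd runs with before relying on them. The function name is hypothetical, and it only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Usage: reset_smartd_state [STATE_DIR] [UNIT]
# Stop smartd first (it saves its state on shutdown), then remove the
# saved state files, then start it again.
# Both defaults below are assumptions for a stock Debian install.
reset_smartd_state() {
    state_dir=${1:-/var/lib/smartmontools}
    unit=${2:-smartmontools}
    run=echo                        # default: only print the commands
    if [ "${DRY_RUN:-1}" = 0 ]; then
        run=                        # DRY_RUN=0: really execute them
    fi
    $run systemctl stop "$unit"
    for f in "$state_dir"/smartd.*.state; do
        [ -e "$f" ] || continue     # the glob matched nothing
        $run rm -f -- "$f"
    done
    $run systemctl start "$unit"
}
```

Calling `reset_smartd_state` with no arguments prints the three steps for review; running it as root with `DRY_RUN=0` set would execute them.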
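109 days ago
@dbak Thanks for the recommendation. At a glance it is a monitoring tool similar to netdata, and it does more than disk monitoring. This problem is already solved; I will try that software next time I get the chance.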

