Mr. Fixit's PC Upgrade and Repair
Have you heard of SMART hard drives? I am not surprised if you haven't. The drives aren't as smart as the name implies. The acronym is short for
Self-Monitoring, Analysis, and Reporting Technology, which was thought to be a tool to predict and warn the user of an impending drive failure. When a problem is detected, the SMART firmware in the HDD alerts
the user through the host software. The user can then copy the data to another drive before the drive fails and all data is lost.
Hard drive failures are classified into two categories. Predictive failures are determined by monitoring the performance of the drive: as the performance of the
drive diminishes, the data can be used to predict when the drive will fail. Unexpected drive failures are the result of sudden mechanical failure, which accounts
for 60% of drive failures.
Disk monitoring was first developed by IBM in 1992 to track key health parameters of the drive and determine whether failure was imminent. The earliest version,
called Predictive Failure Analysis (PFA), was used in IBM's AS/400 servers and was limited to a binary result: PASS or FAIL.
Compaq (now HP), Seagate, Quantum, and Conner developed IntelliSafe, which also monitored key factors to predict drive failures and could
report them to the OS. Compaq submitted the technology for standardization to the Small Form Factor committee in 1995, and it was accepted by IBM,
Seagate, Quantum, Conner, and Western Digital, which at the time didn't have any sort of failure prediction system. The committee liked the
IntelliSafe approach because of its flexibility. After the joint adoption of the IntelliSafe specification, the technology became known as SMART.
SMART went through three phases. The initial phase monitored 'Online' drive activities while the drive was in use. The second phase added the ability to monitor
additional operations while the drive was idle or 'Offline'. The final phase added failure prevention by having SMART relocate data from bad sectors to
good sectors, a process called remapping.
There are dozens of parameters that can be monitored by SMART, many of which are vendor-specific. The parameters range from read errors to free-fall
protection counts. The list below covers some of the parameters found in today's hard drives and their importance.
Power-On Hours (POH) is the total count of hours (or minutes, or seconds, depending on the manufacturer) the drive has spent in the power-on state.
The life expectancy of a HDD in perfect condition running 24 hours a day 7 days a week is 5 years or 43,800 hours. On some pre-2005 drives, this value may
Temperature is the HDD's current internal temperature in degrees Celsius (°C). The ideal operating temperature for HDDs is between 25°C (77°F) and 40°C
(104°F). Although HDDs can operate between 41°C (105°F) and 50°C (122°F), active cooling is highly recommended.
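The temperature ranges above can be turned into a simple check. The following is only a minimal sketch: the function names and the category labels are made up for illustration, and the cut-off values are the ones quoted in this article, not limits from any particular drive's datasheet.

```python
def classify_hdd_temperature(celsius: float) -> str:
    """Classify a drive temperature using the ranges quoted in the article."""
    if celsius < 25:
        return "below ideal range"
    if celsius <= 40:
        return "ideal"
    if celsius <= 50:
        # The drive can still operate here, but the article recommends cooling.
        return "operable, active cooling recommended"
    return "above stated operating range"


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32


if __name__ == "__main__":
    for reading in (22, 35, 46, 55):
        print(reading, celsius_to_fahrenheit(reading), classify_hdd_temperature(reading))
```

A monitoring script could call classify_hdd_temperature() on the raw Temperature value each time the SMART report is read and warn the user once the result is no longer "ideal".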
Read Error Rate stores data related to the rate of hardware read errors that occurred when reading data from the disk surface. The raw value has
different structures for different vendors and is often not meaningful as a decimal number; in general, though, the lower the better.
Throughput Performance is the overall performance of a hard disk drive. If the value of this attribute is decreasing, there is a high probability that
there is a problem with the disk.
Spin-Up Time is the average time in milliseconds (ms) it takes the spindle to go from 0 RPM to full operational speed (e.g. 7200 RPM). The lower the number
the better. This attribute is not used in SSDs.
Reallocated Sectors Count is a critical metric to monitor. It's the number of times the drive had to relocate data from a bad sector to a spare sector. This
process is known as remapping, and reallocated sectors are called "remaps". This allows a drive with bad sectors to continue operation; however, a drive
which has had any reallocations at all is significantly more likely to fail in the near future. It also affects performance. As the count of reallocated sectors
increases, the read/write speed becomes worse because the drive is forced to seek to the reserved area whenever a remap is accessed. The higher the value,
the sooner the drive will fail.
Spin Retry Count is also an important parameter. This attribute stores a total count of the spin start attempts to reach operational speed if the first attempt
failed. An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
Current Pending Sector Count is the number of "unstable" sectors waiting to be remapped, because of unrecoverable read errors. If an unstable sector is
subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately since the
correct value cannot be read and so the value to remap is not known, but might become readable later. Instead, the drive's firmware remembers that the sector
needs to be remapped, and will remap it the next time it's written. However, some drives will not immediately remap such sectors when written. The drive will
first attempt to write to the problem sector, and if the write operation is successful, the sector will be marked good, removing the need to remap it.
This is a serious problem: if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation,
the drive will never remap these problem sectors.
Uncorrectable Sector Count is the total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of
the disk surface and/or problems in the mechanical subsystem.
Seek Time Performance is the average performance of seek operations. Seek Time refers to the time it takes the drive to find data from the instant it
receives the request. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.
Command Timeout is the count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero. If the value is far above
zero, the most likely cause is a serious problem with the power supply or an oxidized data cable.
You may also find a Threshold Exceeded Condition (T.E.C.) date within the SMART report. This is an estimated date on which the drive is expected to fail. The firmware
in the drive tracks the rate at which errors occur and predicts when the drive may fail. If a date appears in the T.E.C. field, the drive has calculated that a failure is imminent. At
this time you should consider backing up your data. Keep in mind the date is only an estimate. The drive may fail before or after the predicted date.
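The article does not say how the estimate is calculated, and each vendor or tool may do it differently. As a rough illustration only, the sketch below assumes the simplest possible method: take two or more dated readings of an attribute's normalized value, assume it keeps dropping at the same average rate, and project the day it reaches the threshold. The function name and the sample numbers are hypothetical.

```python
import math
from datetime import date, timedelta
from typing import Optional, Sequence, Tuple


def estimate_tec_date(
    samples: Sequence[Tuple[date, int]], threshold: int
) -> Optional[date]:
    """Project the day a declining normalized value reaches its threshold.

    samples: chronological (day, normalized_value) readings.
    Returns None when no decline has been observed.
    This is a straight-line extrapolation, an assumption of this sketch.
    """
    if len(samples) < 2:
        return None
    first_day, first_value = samples[0]
    last_day, last_value = samples[-1]
    if last_value <= threshold:
        return last_day  # threshold already reached
    elapsed_days = (last_day - first_day).days
    drop = first_value - last_value
    if elapsed_days <= 0 or drop <= 0:
        return None  # value is not getting worse, nothing to project
    remaining = last_value - threshold
    # days_left = remaining / (drop / elapsed_days), rounded up to a whole day
    days_left = math.ceil(remaining * elapsed_days / drop)
    return last_day + timedelta(days=days_left)


if __name__ == "__main__":
    readings = [(date(2024, 1, 1), 100), (date(2024, 1, 31), 94)]
    print(estimate_tec_date(readings, threshold=70))
```

Because the projection is a straight line through only the first and last readings, it shares the weakness described above: the drive may fail well before or after the projected date.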
Some SMART drives have options to perform tests. These tests are often found in the BIOS or UEFI setup. They allow the user to run a SMART test on the drive
to determine if the drive is OK. This is worth doing when the OS crashes, because a problem with the drive may be the cause.
The SHORT test checks the electrical and mechanical performance as well as the read performance of the disk. Electrical tests might include a test of buffer
RAM, a read/write circuitry test, or a test of the read/write head elements. The mechanical test includes seeking data tracks to exercise the servo. The test also scans small parts
of the drive's surface (the area is vendor-specific and there is a time limit on the test) and checks the list of pending sectors that may have read errors. It usually
takes under two minutes.
The LONG/EXTENDED test is a more thorough version of the short self-test, scanning the entire disk surface with no time limit. This test usually takes several
hours, depending on the read/write speed of the drive and its storage capacity.
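Besides the BIOS or UEFI, self-tests can also be started from the operating system. The sketch below assumes the smartctl utility from the smartmontools package is installed, which this article does not otherwise cover. The device path /dev/sda is only an example, and the helper function names are made up for illustration. Starting a test normally requires Administrator or root rights.

```python
import shutil
import subprocess
from typing import List


def selftest_command(device: str, kind: str = "short") -> List[str]:
    """Build the smartctl command that starts a short or long self-test."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return ["smartctl", "-t", kind, device]


def selftest_log_command(device: str) -> List[str]:
    """Build the smartctl command that prints the self-test log."""
    return ["smartctl", "-l", "selftest", device]


def run_smartctl(command: List[str]) -> str:
    """Run a smartctl command and return its text output."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("smartctl not found; is smartmontools installed?")
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Example device path; adjust for your system before running.
    print(" ".join(selftest_command("/dev/sda", "short")))
    print(" ".join(selftest_log_command("/dev/sda")))
```

The test runs inside the drive itself, so the first command returns right away. The result is read afterwards with the log command, after a couple of minutes for a short test or several hours for a long one.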
When looking at a SMART report, you will see the name of the attribute, its threshold limit, its current value, its worst value, and its current health status.
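To show how those fields relate to each other, here is a minimal sketch of one row of a SMART report. It assumes the common convention that the current and worst values are normalized numbers where higher is healthier, and that an attribute has failed once its value is at or below the threshold. Treating a threshold of 0 as "never fails" is an assumption of this sketch, and the class name, status labels, and sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One row of a SMART report: name, threshold, current and worst value."""
    name: str
    threshold: int
    value: int
    worst: int

    def status(self) -> str:
        """Derive a health status from the value, worst value and threshold."""
        if self.threshold == 0:
            return "OK"  # assumption: no threshold means the attribute cannot fail
        if self.value <= self.threshold:
            return "FAILING NOW"
        if self.worst <= self.threshold:
            return "FAILED IN THE PAST"
        return "OK"


if __name__ == "__main__":
    row = SmartAttribute("Reallocated Sectors Count", threshold=36, value=100, worst=100)
    print(row.name, row.threshold, row.value, row.worst, row.status())
```

Because every manufacturer chooses its own thresholds and scaling, the same status on two different drives does not necessarily mean the same level of risk.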
Now you may ask, "Is the information accurate and can it be trusted?" The answer is NO. SMART has no standards or specifications except at the protocol
level, which covers how the data is transmitted to the system. This has led to a wide range of interpretations of SMART codes and thresholds. The use of SMART varies a great
deal between hard drive manufacturers. Every manufacturer has its own set of rules to determine if its drive is failing. The technology was meant to warn
the user of an imminent drive failure so they can back up their data before the failure. Manufacturers use the SMART data to determine the cause of the failure
to help build better drives.
There are no industry-wide software or hardware standards for S.M.A.R.T. data interchange. A manufacturer only has to monitor one parameter in order to
'legally' call the HDD a SMART-compatible drive. Some SMART-enabled motherboards and related software may not be able to correctly communicate with
the SMART firmware in the HDD over some interfaces. There are many ways to connect an HDD to your PC, which makes it difficult to know if SMART reports
can be accessed correctly by the PC. For example, many external HDDs using USB, FireWire, and eSATA cannot send SMART data to the PC. Even
operating systems like Windows may not be able to see SMART on the HDD if it is part of a RAID system. Some programs require an Administrator account to
query the HDD's SMART report. SMART is implemented by only a handful of drive manufacturers, with only a few aspects being standardized to permit
compatibility.
The technology may not be as reliable as you want, but SMART can provide clues of a pending drive failure.