Storage and Filesystem
Michael Tsai 2016/03/16
Storage Hardware
• Magnetic Tape: 185 TB (Sony)
• Optical Disks (CD, DVD, Blu-ray): 25/50/100/128 GB
• Hard Drive: 8 TB (Seagate)
• SSD: 1 TB (SanDisk)
HD vs. SSD

                     HD                           SSD
Size                 Terabytes                    Gigabytes
Random access time   8 ms                         0.25 ms
Sequential read      100 MB/s                     560 MB/s
Random read          75-100 IOPS (7,200 rpm),     100,000 4K IOPS,
                     175-210 IOPS (15,000 rpm),   > 30 MB/s
                     ~2 MB/s
Cost                 USD 0.05 / GB                USD 0.38 / GB
Limited writes       No                           Yes
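The random-read rows can be cross-checked by multiplying IOPS by the request size; a small Python sketch (the helper name is made up for illustration):

```python
# Random-read throughput = IOPS x request size, using numbers from the table.

def random_read_mb_per_s(iops, block_kib):
    """Throughput in MB/s (decimal megabytes) at a given IOPS and block size."""
    return iops * block_kib * 1024 / 1_000_000

# 7,200 rpm hard drive, ~100 IOPS with 4 KiB requests:
hd = random_read_mb_per_s(100, 4)       # ~0.4 MB/s; larger requests approach ~2 MB/s
# SSD rated at 100,000 4K IOPS:
ssd = random_read_mb_per_s(100_000, 4)  # ~410 MB/s in theory; sustained rates are lower
```

The theoretical SSD product far exceeds the conservative "> 30 MB/s" in the table; sustained random-read rates depend heavily on queue depth and the controller.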
Hard Drive
• History:
• 60 MB HD = $1,000 USD (~1990) —> 8 TB HD = $399 USD (2015)
• Sequential read: 500 kB/s —> 100 MB/s
Hard Drive
• Delay: seek delay and rotational delay
• With frequent seeks, effective throughput drops: > 10 MB/s —> < 5 MB/s
HD: other information
• Unit comparison:
• Disk: Gigabyte = 1,000,000,000 bytes
• Memory: Gigabyte = 2^30 bytes (~7% difference)
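The 7% gap can be computed directly:

```python
# Disk vendors count a gigabyte as 10^9 bytes; memory sizing uses 2^30.

GB  = 10**9   # decimal gigabyte (disk)
GiB = 2**30   # binary gigabyte (memory)

difference = (GiB - GB) / GB   # ~0.074, i.e. about 7%
```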
• Failure statistics (from the 2007 Google Labs study):
• ~6% average annual failure rate over the first 2 years; after 5 years, less than 75% of drives survive
• Operating temperature and drive activity show surprisingly little correlation with failure rate
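A rough way to connect the two statistics above, assuming a constant annualized failure rate and independence between years (a simplification of the real failure curve):

```python
# Survival probability after n years at a constant annualized failure rate (AFR).

def survival_probability(afr, years):
    return (1 - afr) ** years

two_year  = survival_probability(0.06, 2)  # ~0.88
five_year = survival_probability(0.06, 5)  # ~0.73, i.e. under 75% survive 5 years
```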
raw numbers, are likely to be good indicators of something really bad with the drive. Filtering for spurious values reduced the sample set size by less than 0.1%.
3 Results
We now analyze the failure behavior of our fleet of disk drives using detailed monitoring data collected over a nine-month observation window. During this time we recorded failure events as well as all the available environmental and activity data and most of the SMART parameters from the drives themselves. Failure information spanning a much longer interval (approximately five years) was also mined from an older repairs database.
All the results presented here were tested for their statistical significance using the appropriate tests.
3.1 Baseline Failure Rates
Figure 2 presents the average Annualized Failure Rates (AFR) for all drives in our study, aged zero to 5 years, and is derived from our older repairs database. The data are broken down by the age a drive was when it failed.
Note that this implies some overlap between the sample sets for the 3-month, 6-month, and 1-year ages, because a drive can reach its 3-month, 6-month and 1-year age all within the observation period. Beyond 1-year there is no more overlap.
While it may be tempting to read this graph as strictly failure rate with drive age, drive model factors are strongly mixed into these data as well. We tend to source a particular drive model only for a limited time (as new, more cost-effective models are constantly being introduced), so it is often the case that when we look at sets of drives of different ages we are also looking at a very different mix of models. Consequently, these data are not directly useful in understanding the effects of disk age on failure rates (the exception being the first three data points, which are dominated by a relatively stable mix of disk drive models). The graph is nevertheless a good way to provide a baseline characterization of failures across our population. It is also useful for later studies in the paper, where we can judge how consistent the impact of a given parameter is across these diverse drive model groups. A consistent and noticeable impact across all groups indicates strongly that the signal being measured has a fundamentally powerful correlation with failures, given that it is observed across widely varying ages and models.
Figure 2: Annualized failure rates broken down by age groups
The observed range of AFRs (see Figure 2) varies from 1.7%, for drives that were in their first year of operation, to over 8.6%, observed in the 3-year old population. The higher baseline AFR for 3 and 4 year old drives is more strongly influenced by the underlying reliability of the particular models in that vintage than by disk drive aging effects. It is interesting to note that our 3-month, 6-months and 1-year data points do seem to indicate a noticeable influence of infant mortality phenomena, with 1-year AFR dropping significantly from the AFR observed in the first three months.
3.2 Manufacturers, Models, and Vintages
Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18]. Our results do not contradict this fact. For example, Figure 2 changes significantly when we normalize failure rates per each drive model. Most age-related results are impacted by drive vintages. However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.
Interestingly, this does not change our conclusions. In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix. None of our SMART data results change significantly when normalized by drive model. The only exception is seek error rate, which is dependent on one specific drive manufacturer, as we discuss in section 3.5.5.
3.3 Utilization
The literature generally refers to utilization metrics by employing the term duty cycle which unfortunately has no consistent and precise definition, but can be roughly characterized as the fraction of time a drive is active out of the total powered-on time. What is widely reported in the literature is that higher duty cycles affect disk drives negatively [4, 21].
It is difficult for us to arrive at a meaningful numerical utilization metric given that our measurements do not provide enough detail to derive what 100% utilization might be for any given disk model. We choose instead to measure utilization in terms of weekly averages of read/write bandwidth per drive. We categorize utilization in three levels: low, medium and high, corresponding respectively to the lowest 25th percentile, 50-75th percentiles and top 75th percentile. This categorization is performed for each drive model, since the maximum bandwidths have significant variability across drive families. We note that using number of I/O operations and bytes transferred as utilization metrics provide very similar results. Figure 3 shows the impact of utilization on AFR across the different age groups.
Overall, we expected to notice a very strong and consistent correlation between high utilization and higher failure rates. However our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.
One possible explanation for this behavior is the survival of the fittest theory. It is possible that the failure modes that are associated with higher utilization are more prominent early in the drive’s lifetime. If that is the case, the drives that survive the infant mortality phase are the least susceptible to that failure mode, and result in a population that is more robust with respect to variations in utilization levels.
Another possible explanation is that previous observations of high correlation between utilization and failures have been based on extrapolations from manufacturers’ accelerated life experiments. Those experiments are likely to better model early life failure characteristics, and as such they agree with the trend we observe for the young age groups. It is possible, however, that longer term population studies could uncover a less pronounced effect later in a drive’s lifetime.
When we look at these results across individual models we again see a complex pattern, with varying patterns of failure behavior across the three utilization levels. Taken as a whole, our data indicate a much weaker correlation between utilization levels and failures than previous work has suggested.
Figure 3: Utilization AFR
3.4 Temperature
Temperature is often quoted as the most important environmental factor affecting disk drive reliability. Previous studies have indicated that temperature deltas as low as 15C can nearly double disk drive failure rates [4]. Here we take temperature readings from the SMART records every few minutes during the entire 9-month window of observation and try to understand the correlation between temperature levels and failure rates.
We have aggregated temperature readings in several different ways, including averages, maxima, fraction of time spent above a given temperature value, number of times a temperature threshold is crossed, and last temperature before failure. Here we report data on averages and note that other aggregation forms have shown similar trends and therefore suggest the same conclusions.
We first look at the correlation between average temperature during the observation period and failure. Figure 4 shows the distribution of drives with average temperature in increments of one degree and the corresponding annualized failure rates. The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.
Figure 5 looks at the average temperatures for different age groups. The distributions are in sync with Figure 4 showing a mostly flat failure rate at mid-range temperatures and a modest increase at the low end of the temperature distribution. What stands out are the 3 and 4-year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced.
Figure 4: Distribution of average temperatures and failure rates.
Figure 5: AFR for average drive temperature.
Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.
3.5 SMART Data Analysis
We now look at the various self-monitoring signals that are available from virtually all of our disk drives through the SMART standard interface. Our analysis indicates that some signals appear to be more relevant to the study of failures than others. We first look at those in detail, and then list a summary of our findings for the remaining ones. At the end of this section we discuss our results and reason about the usefulness of SMART parameters in obtaining predictive models for individual disk drive failures.
We present results in three forms. First we compare the AFR of drives with zero and non-zero counts for a given parameter, broken down by the same age groups as in figures 2 and 3. We also find it useful to plot the probability of survival of drives over the nine-month observation window for different ranges of parameter values. Finally, in addition to the graphs, we devise a single metric that could relay how relevant the values of a given SMART parameter are in predicting imminent failures. To that end, for each SMART parameter we look for thresholds that increased the probability of failure in the next 60 days by at least a factor of 10 with respect to drives that have zero counts for that parameter. We report such Critical Thresholds whenever we are able to find them with high confidence (> 95%).
3.5.1 Scan Errors
Drives typically scan the disk surface in the background and report errors as they discover them. Large scan error counts can be indicative of surface defects, and therefore are believed to be indicative of lower reliability. In our population, fewer than 2% of the drives show scan errors and they are nearly uniformly spread across various disk models.
Figure 6 shows the AFR values of two groups of drives, those without scan errors and those with one or more. We plot bars across all age groups in which we have statistically significant data. We find that the group of drives with scan errors are ten times more likely to fail than the group with no errors. This effect is also noticed when we further break down the groups by disk model.
From Figure 8 we see a drastic and quick decrease in survival probability after the first scan error (left graph).
A little over 70% of the drives survive the first 8 months after their first scan error. The dashed lines represent the 95% confidence interval. The middle plot in Figure 8 separates the population in four age groups (in months), and shows an effect that is not visible in the AFR plots. It appears that scan errors affect the survival probability of young drives more dramatically very soon after the first scan error occurs, but after the first month the curve flattens out. Older drives, however, continue to see a steady decline in survival probability throughout the 8-month period. This behavior could be another manifestation of infant mortality phenomenon. The right graph in figure 8 looks at the effect of multiple scan errors. While drives with one error are more likely to fail than those with none, drives with multiple errors fail even more quickly.
SSD: solid state disks
• Read and write across banks of flash memory cells
• Individually slower than HD, but can use parallelism
• Write cycles limitation: 100,000 (typical)
• Firmware spreads the write across all pages
• Erasure is required before write (and slower than write)
• Clusters of adjacent pages HAVE TO be erased together
• Q: why does an SSD get slower as it is used more?
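One way to see the answer: a toy model of erase-block bookkeeping (deliberately simplified; a real flash translation layer is far more involved):

```python
# Flash pages can only be written after their whole erase block is erased.
# Rewriting one page in a block full of live data forces the firmware to
# relocate the other live pages first (write amplification).

PAGES_PER_BLOCK = 4  # toy value; real erase blocks hold many more pages

def page_writes_to_rewrite_one_page(live_pages_in_block):
    """Page writes needed: relocate the other live pages, then write the new data.
    (The block erase itself is extra work, and is slower than a write.)"""
    relocations = live_pages_in_block - 1
    return relocations + 1

fresh_drive = page_writes_to_rewrite_one_page(1)                # 1: nothing to relocate
well_used   = page_writes_to_rewrite_one_page(PAGES_PER_BLOCK)  # 4: plus an erase
```

As blocks fill with live data over time, every overwrite drags in more relocation work, which is why performance degrades with use.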
SSD: other information
• Filesystem needs to be “SSD-aware”
• Let it know what blocks are no longer used (erased)
• Alignment: 512 byte blocks (SSD) vs 1~8 KiB (FS)
• SSD can only read/write 4 KiB pages
• Need to align the boundaries
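A quick alignment check, with illustrative values:

```python
# A partition is page-aligned when its byte offset is a multiple of the SSD
# page size. Classic DOS partitioning started at sector 63, which is not.

SSD_PAGE = 4096  # bytes

def is_aligned(start_sector, sector_size=512):
    return (start_sector * sector_size) % SSD_PAGE == 0

legacy = is_aligned(63)    # False: filesystem blocks straddle two SSD pages
modern = is_aligned(2048)  # True: modern tools start partitions at 1 MiB
```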
• Write cycle limitation: when will it run out?
• Writing continuously at 100 MB/s, a 150 GB SSD lasts about 4-5 years
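The figure above can be reproduced with a back-of-the-envelope calculation, assuming 100,000 write cycles per cell and perfect wear leveling:

```python
# How long until a 150 GB SSD wears out under continuous 100 MB/s writes?

capacity   = 150 * 10**9     # bytes
cycles     = 100_000         # write cycles per cell (typical figure from above)
write_rate = 100 * 10**6     # bytes per second

total_writable = capacity * cycles          # assumes perfect wear leveling
years = total_writable / write_rate / (365 * 24 * 3600)   # ~4.8 years
```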
Hardware Interface
• (P)ATA (Advanced Technology Attachment) or IDE (Integrated Drive Electronics): 33, 66, 100, 133 MB/s
• SATA (Serial ATA): 1.5, 3, 6 Gb/s (150, 300, 600 MB/s)
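The Gb/s-to-MB/s conversion reflects SATA's 8b/10b line encoding, which puts 10 bits on the wire for every data byte:

```python
# SATA link rate to usable bandwidth: divide the line rate by 10 bits/byte.

def sata_mb_per_s(gbps):
    return gbps * 10**9 / 10 / 10**6

rates = [sata_mb_per_s(g) for g in (1.5, 3, 6)]   # [150.0, 300.0, 600.0]
```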
Other interfaces
• SCSI (5, 20, 40 MB/s) / SAS (serial attached SCSI) (3, 6 Gbps)
• Fibre Channel (1-40 Gb/s)
• USB: 2.0: 480 Mb/s / 3.0: 5 Gb/s / 3.1: 10 Gb/s
• FireWire (IEEE 1394): 400 and 800 Mb/s
• Thunderbolt: 1: 10 Gb/s / 2: 20 Gb/s
• PCI-Express: (2, 4, 8, 16 Gb/s), e.g., OCZ RevoDrive 350
Storage management layers
Exhibit B Storage management layers
• A partition is a fixed-size subsection of a storage device. Each partition has its own device file and acts much like an independent storage device. For efficiency, the same driver that handles the underlying device usually implements partitioning. Most partitioning schemes consume a few blocks at the start of the device to record the ranges of blocks that make up each partition.
Partitioning is becoming something of a vestigial feature. Linux and Solaris drag it along primarily for compatibility with Windows-partitioned disks. HP-UX and AIX have largely done away with it in favor of logical volume management, though it’s still needed on Itanium-based HP-UX systems.
• A RAID array (a redundant array of inexpensive/independent disks) combines multiple storage devices into one virtualized device. Depending on how you set up the array, this configuration can increase performance (by reading or writing disks in parallel), increase reliability (by duplicating or parity-checking data across multiple disks), or both. RAID can be implemented by the operating system or by various types of hardware.
As the name suggests, RAID is typically conceived of as an aggregation of bare drives, but modern implementations let you use as a component of a RAID array anything that acts like a disk.
• Volume groups and logical volumes are associated with logical volume managers (LVMs). These systems aggregate physical devices to form pools of storage called volume groups. The administrator can then subdivide this pool into logical volumes in much the same way that disks of yore were divided into partitions. For example, a 750 GB disk and a 250 GB disk could be aggregated into a 1 TB volume group and then split into two 500 GB logical volumes. At least one volume would include data blocks from both hard disks.
Exhibit B layers (arrow = “can be built on”):
  Filesystems, swap areas, database storage
    ↑
  Logical volumes
    ↑
  Partitions / RAID arrays / Volume groups
    ↑
  Storage devices
Storage “pieces”
• Storage device: “disk” - random access, block I/O, represented by a device file
• Partition: fixed-size subsection of a storage device (a vestigial feature, kept for compatibility with storage devices used by Windows)
• RAID array: increase performance, reliability, or both
• Volume groups & logical volumes: related to the logical volume manager (LVM) / aggregation & split
Storage “pieces”
• Filesystem: mediates between
• blocks presented by a partition, RAID, or logical volume
• standard filesystem interface expected by programs
• path: e.g., /var/spool/mail
• File types, file permissions, etc.
• how the content of files is stored
• how the filesystem namespace is represented and searched on disk
• other “filesystems”: swap and database storage
Playground creation
• Easier to find resources: pick a popular distribution (or a distribution your friend uses)
• Virtual machine:
without modifying your current system
• Free options:
• VMware Player (host: Windows & Linux)
• Oracle VirtualBox (host: Windows, Mac, and Linux)
• apt-get install lvm2 (install some missing software)
• sudo command … (issue command as a superuser)
Get started
• Install the hardware (disks) (or a virtual disk)
• Check BIOS / dmesg
• Look for the device file in /dev
• block device: /dev/sda, partition: /dev/sda1
• “parted -l” to show the information of all system disks
6.4 FILE TYPES
Most filesystem implementations define seven types of files. Even when developers add something new and wonderful to the file tree (such as the process information under /proc), it must still be made to look like one of these seven types.
Table 6.2 Standard directories and their contents
Pathname         OS^a  Contents
/bin             All   Core operating system commands^b
/boot            LS    Kernel and files needed to load the kernel
/dev             All   Device entries for disks, printers, pseudo-terminals, etc.
/etc             All   Critical startup and configuration files
/home            All   Default home directories for users
/kernel          S     Kernel components
/lib             All   Libraries, shared libraries, and parts of the C compiler
/media           LS    Mount points for filesystems on removable media
/mnt             LSA   Temporary mount points, mounts for removable media
/opt             All   Optional software packages (not consistently used)
/proc            LSA   Information about all running processes
/root            LS    Home directory of the superuser (often just /)
/sbin            All   Commands needed for minimal system operability^c
/stand           H     Stand-alone utilities, disk formatters, diagnostics, etc.
/tmp             All   Temporary files that may disappear between reboots
/usr             All   Hierarchy of secondary files and commands
/usr/bin         All   Most commands and executable files
/usr/include     All   Header files for compiling C programs
/usr/lib         All   Libraries; also, support files for standard programs
/usr/lib64       L     64-bit libraries on 64-bit Linux distributions
/usr/local       All   Software you write or install; mirrors structure of /usr
/usr/sbin        All   Less essential commands for administration and repair
/usr/share       All   Items that might be common to multiple systems
/usr/share/man   All   On-line manual pages
/usr/src         LSA   Source code for nonlocal software (not widely used)
/usr/tmp         All   More temporary space (preserved between reboots)
/var             All   System-specific data and configuration files
/var/adm         All   Varies: logs, setup records, strange administrative bits
/var/log         LSA   Various system log files
/var/spool       All   Spooling directories for printers, mail, etc.
/var/tmp         All   More temporary space (preserved between reboots)
a. L = Linux, S = Solaris, H = HP-UX, A = AIX.
b. On HP-UX and AIX, /bin is a symbolic link to /usr/bin.
c. The distinguishing characteristic of commands in /sbin is usually that they’re linked with “static” versions of the system libraries and therefore don’t have many dependencies on other parts of the system.
Exhibit C Traditional data disk partitioning scheme (Linux device names)
8.5 ATTACHMENT AND LOW-LEVEL MANAGEMENT OF DRIVES
The way a disk is attached to the system depends on the interface that is used. The rest is all mounting brackets and cabling. Fortunately, SAS and SATA connections are virtually idiot-proof.
For parallel SCSI, double-check that you have terminated both ends of the SCSI bus, that the cable length is less than the maximum appropriate for the SCSI variant you are using, and that the new SCSI target number does not conflict with the controller or another device on the bus.
Even on hot-pluggable interfaces, it’s conservative to shut the system down before making hardware changes. Some older systems such as AIX default to doing device configuration only at boot time, so the fact that the hardware is hot-pluggable may not translate into immediate visibility at the OS level. In the case of SATA interfaces, hot-pluggability is an implementation option. Some host adapters don’t support it.
Installation verification at the hardware level
After you install a new disk, check to make sure that the system acknowledges its existence at the lowest possible level. On a PC this is easy: the BIOS shows you IDE and SATA disks, and most SCSI cards have their own setup screen that you can invoke before the system boots.
On other types of hardware, you may have to let the system boot and check the diagnostic output from the kernel as it probes for devices. For example, one of our test systems showed the following messages for an older SCSI disk attached to a BusLogic SCSI host adapter.
scsi0 : BusLogic BT-948
scsi : 1 host.
Vendor: SEAGATE  Model: ST446452W  Rev: 0001
Type: Direct-Access  ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 3, lun 0
scsi0: Target 3: Queue Depth 28, Asynchronous
SCSI device sda: hdwr sector=512 bytes. Sectors=91923356 [44884 MB] [44.9 GB]
Exhibit C layers (each disk begins with a label):
  Physical layer:   Hard disk 1 (/dev/sda)   Hard disk 2 (/dev/sdb)
  Partition layer:  /dev/sda1, /dev/sda2     /dev/sdb1
  Filesystem layer: /home, /opt              /spare
Partition & logical volume
• Why? Easier to back up / confine damage
• Tips:
• Have a backup root device and check that it works
• Put /tmp on a separate filesystem (no backup / size limit)
• Separate /var: logs in /var can easily fill up /
• Split swap across multiple physical disks / add more swap when adding memory
Partition: other information
• When is it used nowadays?
• share a disk with Windows
• specify location on the disk (outer cylinder is faster by 30%!)
• create partitions of identical size (for RAID)
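The outer-cylinder advantage exists because the platter spins at a constant rate while outer tracks hold more sectors, so more bytes pass under the head per revolution; the radii below are hypothetical:

```python
# With zoned bit recording, linear bit density is roughly constant across the
# platter, so sequential throughput scales with track radius.

def throughput_ratio(outer_radius_mm, inner_radius_mm):
    return outer_radius_mm / inner_radius_mm

# e.g., a 3.5" platter recording roughly between 20 mm and 46 mm radii:
ratio = throughput_ratio(46, 20)   # ~2.3x outermost vs. innermost track
```

The 30% figure on the slide is a more conservative comparison (e.g., outer zone versus the middle of the disk); outermost-to-innermost ratios can approach 2x or more.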
Partition: other information
• MBR (Windows-style) partition table
• primary and extended partitions
• OS is installed in a primary partition
• one partition is marked as “active”, and the boot loader looks for that partition
• does not support disks > 2 TB
• max # of primary partitions: 4
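The 2 TB ceiling follows directly from MBR's 32-bit sector addresses combined with 512-byte sectors:

```python
# MBR stores partition start and length as 32-bit LBA sector counts.

max_sectors = 2**32
sector_size = 512                         # bytes
max_bytes   = max_sectors * sector_size   # 2 TiB
max_tib     = max_bytes / 2**40           # 2.0
```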
Partition: other information
• GPT: GUID partition table
• supports disks > 2 TB
• Windows Vista and later versions support GPT disks for data, but need EFI firmware (newer computers) to boot
• Tools: gparted (GUI), parted (command-line), fdisk (does not support GPT)
Logical volume management
• Volume groups (VG): storage devices put into groups
• Logical volumes (LV): assign blocks in VG to LV
• Then LV, as a block device, is used by filesystem
• Powerful features:
• Move LV among physical devices
• Grow and shrink LV on the fly
• Snapshot
• Replace on-line drives
• Mirroring / striping
Typical sequence
• sudo pvcreate /dev/sdb
(define /dev/sdb to be used)
(you can also use /dev/sdb2, for example, to use just partition 2 on sdb)
• sudo vgcreate hsinmu /dev/sdb
(put /dev/sdb into a new VG called hsinmu)
• sudo lvcreate -L 8G -n test_lv hsinmu
(create an 8 GB LV in hsinmu called test_lv)
• sudo mkfs -t ext4 /dev/hsinmu/test_lv (format the new LV as an ext4 filesystem)
• sudo mkdir /mnt/test_lv
• sudo mount /dev/hsinmu/test_lv /mnt/test_lv (mount the new filesystem)
• df -h /mnt/test_lv (show information about a mount point)
Additional reading
• Not covered today:
• How to do volume snapshots (create a copy-on-write duplicate of an LV): lvcreate -L 8G -s -n snap hsinmu/test_lv
• Resize the filesystem (lvresize, lvextend)
Filesystem
• Popular filesystems on Linux:
• ext2/3/4 (journaling since ext3, better support for SSD in ext4)
• Btrfs (started at Oracle; a B-tree based filesystem aimed at servers, adopting some ReiserFS ideas)
• ReiserFS / XFS / ZFS
Journaling filesystem
Filesystem: other info.
• /etc/fstab: shows how and where to mount a filesystem at boot time
• mount/umount /dev/XXX <dir> (mount/unmount a filesystem) (man mount to see -o options, e.g., read-only)
• mkfs -t <format> /dev/XXX (create a filesystem on /dev/XXX, i.e., format)
• fsck /dev/XXX (check and fix errors on the filesystem on /dev/XXX)
In-class HW
• For each of the following RAID levels, explain (1) its purpose and (2) its working principle, (3) give an example, and (4) state the maximum number of disk failures it can tolerate:
• RAID 0
• RAID 1
• RAID 5
• RAID 10
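As a hint for the RAID 5 item: its parity block is the bitwise XOR of the data blocks in a stripe, so any one lost block can be rebuilt from the survivors. A minimal sketch (the stripe layout is simplified for illustration):

```python
from functools import reduce

def parity(blocks):
    """XOR corresponding bytes of equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data disks
p = parity(data)                     # parity block stored on a fourth disk

# The disk holding data[1] fails: rebuild its block from the others plus parity.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```

Because XOR is associative and commutative, it does not matter which single block is missing; the same operation recovers it.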