2 System Model and Problem Formulation
Figure 3: The organization of a striping system: the relationship among banks, zones, segments, blocks, and pages. (Diagram labels: Zone, Segment, Bank 0 through Bank 7.)
To manage data on banks, a logical organization is adopted, as shown in Figure 3. Let a bank be partitioned into equal-size segments. A segment consists of multiple flash-memory blocks. Let a zone refer to the collection of all the segments that have the same physical offset in their banks. A stripe block consists of multiple logical blocks (i.e., sectors) of the emulated block device. All the stripe blocks are sequentially interleaved over the banks, and reads and writes to the striping system are performed in units of stripe blocks. A stripe block is no smaller than a flash-memory page, because a partial write to a flash-memory page is prohibited [3, 4]. In each segment, a fraction of the flash-memory blocks, referred to as spare blocks, is reserved for garbage collection and bad-block retirement. Address translation and garbage collection [1] are conducted internally in each segment.
The striping system has two operation modes: the on-line phase and the off-line phase. The striping system accepts requests in the on-line phase. Whenever the system is idle or very lightly loaded, for example around midnight, it enters the off-line phase for self-reorganization. In the off-line phase, redundancy encoding and placement are conducted.
2.2 Localities of Reads and Writes
This section presents some motivating observations on realistic workloads of flash-memory storage systems. Because our striping system mainly targets mass-storage devices, such as SSDs, that replace hard drives in mobile computers, the disk I/O workload of a typical mobile computer is analyzed. An arbitrary one-day fragment extracted from the disk traces collected in [6] is considered. Let a preliminary system geometry be as follows: there are 8 memory banks, the stripe-block size is 8 KB, and the zone size is 256 MB.
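As a rough illustration of this geometry, the following Python sketch maps a stripe-block index to a zone, a bank, and a position within the segment. It assumes that stripe blocks are interleaved round-robin over the banks and that zones are filled in address order; the text states the interleaving but not the exact address arithmetic, so `locate` and `banks_touched` are hypothetical helpers rather than the system's actual translation logic.

```python
# Preliminary geometry taken from the text.
NUM_BANKS = 8
STRIPE_BLOCK_BYTES = 8 * 1024
ZONE_BYTES = 256 * 1024 * 1024
SECTOR_BYTES = 512

# A zone holds one segment per bank (assumption: equal split).
SEGMENT_BYTES = ZONE_BYTES // NUM_BANKS
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // STRIPE_BLOCK_BYTES


def locate(stripe_block):
    """Map a stripe-block index to (zone, bank, slot within the segment)."""
    bank = stripe_block % NUM_BANKS      # assumed round-robin interleaving
    row = stripe_block // NUM_BANKS      # position of the block inside its bank
    zone, slot = divmod(row, BLOCKS_PER_SEGMENT)
    return zone, bank, slot


def banks_touched(first_sector, num_sectors):
    """Return the sorted list of banks that a request of sectors involves."""
    first = first_sector * SECTOR_BYTES // STRIPE_BLOCK_BYTES
    last = ((first_sector + num_sectors) * SECTOR_BYTES - 1) // STRIPE_BLOCK_BYTES
    return sorted({locate(b)[1] for b in range(first, last + 1)})
```

Under these assumptions, an aligned 4 KB request stays on one bank, while an aligned 64 KB request (128 sectors) touches all eight banks.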
Figure 4(a) shows that some particular banks receive many more writes than others. This is due to the temporal locality of writes: writes to some particular data arrive frequently. Among the writes, the majority are small, as reported in [6]. Because small writes do not span many banks, the banks in which hot data reside become popular. The popular banks delay the completion of large requests, as previously shown in Figure 2. The phenomenon is aggravated if garbage collection is involved. On the other hand, Figure 4(a) shows that reads utilize every bank evenly. This is because the I/O cache of the host system largely weakens the temporal locality of reads.
Figure 4: Analyzing the characteristics of reads and writes in a one-day fragment of the one-month disk traces.
Instead, we found strong spatial locality among reads. Figure 4(b) shows the amount of data read from each zone: only a small number of zones receive many reads. In other words, data adjacent to the requested data are likely to be accessed in the near future. Most of the reads are found to be sequential and bulk, and a bulk read spans many banks.
The temporal locality of writes and the spatial locality of reads give rise to a potential performance issue in our striping architecture: the completion of bulk reads is largely delayed by small writes. In the following sections, we discuss how erasure codes help to deal with this problem.
3 An Erasure-Code-Based Striping Scheme
3.1 Redundancy Allocation and Placement
Redundancy allocation concerns which data should be given redundancy, and redundancy placement concerns where that redundancy is stored. This section presents policies that address both issues.
For ease of presentation, some terms are defined first. Let Zi and Bj stand for the i-th zone and the j-th bank, respectively. Let Si,j refer to the segment of zone Zi on bank Bj. A large request is striped into multiple sub-requests of stripe blocks. Let a stripe block storing original data be referred to as a data block. A stripe block storing data encoded from data blocks is referred to as a code block (footnote 2). We propose not to mix code blocks with data blocks in the same segment. Let zones of data blocks be referred to as data zones. A small number of extra zones, referred to as code zones, are added to the striping system for the storage of code blocks. Because the extra space cost of code zones should not be large, how the code blocks are allocated is a technical question.
For writes, a bank has a deep request queue if any of its segments has recently received many writes. The bank cannot respond to newly arriving requests until all of its outstanding requests complete. Not all the banks suffer from congestion, because hot data are small. Reads, in contrast, involve multiple banks because most of them are bulk and sequential. However, no matter which zone a large read goes to, a bank congested by small writes delays its completion.
Footnote 2: To avoid ambiguity, flash-memory blocks are always referred to explicitly as flash-memory blocks.
Figure 5: Associating a popular data zone with a code zone. Note that only a small number of popular data zones need to be associated with code zones.
Suppose that we have n data zones and m code zones. During the on-line phase, the number of reads that each data zone receives is accumulated. Once the system enters the off-line phase, we propose to associate the m most popular data zones with the code zones. The relation of code zones to data zones is one-to-one. As shown in Figure 5(a), suppose that data zone Z2 is found to be the most popular when the system enters the off-line phase. Code zone Z7 is then allocated to data zone Z2. Data in data zone Z2 are dispersed over banks for the flexibility of load balancing, and the dispersed information is stored in code zone Z7. An XOR-based coding scheme is proposed, as follows: Let B be the number of banks, and let G ∈ Z+ be an integer. For i = 0, ..., B−1, code segment S7,i is the XOR of data segments S2,(i+G)%B and S2,(i+1+G)%B, where the second index wraps around within its group of G banks (e.g., S7,3 = S2,7 ⊕ S2,4 for B = 8 and G = 4). As the example in Figure 5(a) shows, segments in Z7 are coded as follows:
S7,0 = S2,4 ⊕ S2,5        S7,4 = S2,0 ⊕ S2,1
S7,1 = S2,5 ⊕ S2,6        S7,5 = S2,1 ⊕ S2,2
S7,2 = S2,6 ⊕ S2,7        S7,6 = S2,2 ⊕ S2,3
S7,3 = S2,7 ⊕ S2,4        S7,7 = S2,3 ⊕ S2,0
The coding scheme regularly disperses each piece of data over two other banks, so that its information can be found in three banks. The parameter G refers to the group size, which is discussed in later sections. If segments {S2,0, S2,1, S2,2, S2,3} are involved in a read request and bank B2 happens to be busy, there are many alternatives to fulfill the request, for example {S2,0, S2,1, S7,6, S2,3}, {S2,0, S2,1, S7,5, S2,3}, or even {S7,4, S7,5, S7,6, S2,3}. The design of the coding scheme is further explained in the Appendix.
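The placement in the eight equations for Z7 can be reproduced with a short Python sketch. The helper `code_sources` is a hypothetical name, and its index rule (offset by G, with the successor wrapping inside a group of G banks) is inferred from the listed example for B = 8 and G = 4 rather than taken from a general definition in the text. Random byte strings stand in for the segment contents.

```python
import os

B = 8  # number of banks
G = 4  # group size


def code_sources(j, num_banks=B, group=G):
    """Indices of the two data segments XOR'ed into code segment j.

    Assumption: the successor index wraps around inside its group of
    `group` banks, which matches the eight equations listed for Z7.
    """
    src = (j + group) % num_banks
    base = (src // group) * group
    nxt = base + (src - base + 1) % group
    return src, nxt


def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(a, b))


# Stand-ins for data segments S2,0 .. S2,7 and code segments S7,0 .. S7,7.
data = [os.urandom(16) for _ in range(B)]
code = [xor(data[a], data[b]) for a, b in map(code_sources, range(B))]

# Bank B2 is busy: S2,2 can be rebuilt from S7,6 and S2,3, or from S7,5 and S2,1.
via_s76 = xor(code[6], data[3])
via_s75 = xor(code[5], data[1])

# The read set {S7,4, S7,5, S7,6, S2,3} decodes by chaining backwards from S2,3.
d2 = xor(code[6], data[3])
d1 = xor(code[5], d2)
d0 = xor(code[4], d1)
```

In this layout, code segment S7,j resides on bank Bj while both of its source segments reside on banks of the other group, so a read can avoid a busy bank by fetching a code segment from a different bank.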
3.2 Request Scheduling
This section presents a scheduling algorithm that makes use of code blocks so as to improve the response of large reads.
Request scheduling needs to consider not only load balancing but also the correctness of the schedule. For example, in Figure 2, neither {a, b, a ⊕ b} nor {a, b, a ⊕ b, c} is a correct schedule. The former fails to fulfill the request, and the latter introduces extra traffic. As to load balancing, in realistic workloads, localities change from time to time. Occasionally arriving writes to hot data may transiently congest a bank. Furthermore, it is hard to schedule garbage collection in advance because it is triggered on demand. As a result, dramatic changes to the loads of banks happen from time to time. Scheduling sub-requests on lightly loaded banks upon the arrival of a request may therefore not be a good choice.
Figure 6: (a) Sub-requests of all the data blocks and code blocks are scheduled. (b) Unneeded sub-requests are removed. "GC" refers to garbage-collection activities.
We propose a lazy scheduling algorithm to deal with these problems. Upon the arrival of a read, we do not decide which banks should be used to service it. Instead, sub-requests of all the data blocks and the related code blocks are scheduled. On the completion of a sub-request, the queues are examined to see whether any other sub-request can be removed. A sub-request can be removed if 1) the entire request is fulfilled, or 2) the sub-request is redundant. The two cases are in fact similar. Figure 6(a) shows an example of the arrival of a large read R(a, b, c, d). To service the request, sub-requests of data blocks a, b, c, and d and of code blocks a⊕b, b⊕c, c⊕d, and d⊕a are all scheduled. Figure 6(b) shows that, after a short period of time, the sub-requests of a, d, and c⊕d complete. Because c = (c⊕d)⊕d, block c can be decoded, and d⊕a carries no new information once a and d are known; the sub-requests of c and d⊕a are therefore removed. Sub-requests of b, b⊕c, and a⊕b remain, and the read is fulfilled as soon as any of them completes. After the fulfilment, all the remaining sub-requests can be removed. Without the lazy scheduling algorithm, sub-requests of a, d, a⊕b, and c⊕d might be scheduled on the arrival of R(a, b, c, d). Because garbage collection is triggered at bank B4, that schedule turns out to be a bad decision.
The decode procedure and the removal of sub-requests are closely related to each other. A graph-based decode procedure is adopted. Before sub-requests of all the data blocks and code blocks are scheduled, a Tanner graph [20] is constructed. Figure 5(b) shows the graph over blocks in zones Z2 and Z7. The graph is bipartite. Data blocks and code blocks reside on the left-hand side as "messages". The right-hand side nodes are "constraints", and a constraint connects to a code block and all the data blocks that are XOR'ed into the code block. As Figure 5(b) shows, a constraint connects to A, B, and A⊕B. Whenever a sub-request of a data block or code block completes, the corresponding left-hand node and all the edges connected to the node are removed from the graph. After the removal, if any constraint is connected to only one edge, then the constraint is "resolved" and the only message it connects to can be decoded without being retrieved. In this case, the constraint and all its edges are removed, and the sub-request of the decoded message can be removed from the queues of the striping system. For example, as shown in Figure 5(b), if B and A⊕B are retrieved and their edges are removed, then the topmost constraint connects to message A only. Message A can be decoded since B⊕(A⊕B)=A. As there is no need to retrieve A, the sub-request of A is removed from the queues. The decode procedure terminates when all constraints are resolved.
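The bookkeeping of this procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `peel` is a hypothetical helper, each constraint is represented as the set of messages it connects to, and "removing edges" is modelled by subtracting the known messages from that set. Only the graph bookkeeping is shown; the XOR arithmetic on block contents is omitted. The data are the blocks of the read R(a, b, c, d) discussed for Figure 6.

```python
def peel(constraints, completed):
    """Return every message that is known once `completed` has been retrieved.

    A constraint with exactly one unknown message is resolved: that
    message can be decoded (or is redundant) and need not be read.
    """
    known = set(completed)
    progress = True
    while progress:
        progress = False
        for members in constraints:
            unknown = members - known
            if len(unknown) == 1:
                known |= unknown
                progress = True
    return known


# One constraint per code block: the code block plus the data blocks XOR'ed into it.
constraints = [
    {"a", "b", "a^b"},
    {"b", "c", "b^c"},
    {"c", "d", "c^d"},
    {"d", "a", "d^a"},
]
all_messages = set().union(*constraints)
requested = {"a", "b", "c", "d"}

# Sub-requests of a, d, and c^d have completed.
completed = {"a", "d", "c^d"}
known = peel(constraints, completed)
removable = known - completed        # sub-requests that can leave the queues
outstanding = all_messages - known   # sub-requests that must stay queued
fulfilled = requested <= known

# Once any outstanding sub-request completes (here b), the read is fulfilled.
fulfilled_after_b = requested <= peel(constraints, completed | {"b"})
```

With these inputs the sub-requests of c and d⊕a become removable and those of b, a⊕b, and b⊕c stay queued, which agrees with the example of Figure 6(b).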
3.3 A Read-ahead Policy
The proposed coding scheme provides large requests with better flexibility in load balancing. This section presents a read-ahead policy. Conditional read ahead would help to convert a sequence of small reads into large requests.
For example, consider data blocks {x, y, z} and their code blocks {x ⊕ y, y ⊕ z, z ⊕ x}. Suppose that a read of {x, y} arrives. There are three different ways to handle the read: to read {x, y}, {x, x ⊕ y}, or {y, x ⊕ y}. Later a read of {z} arrives. There is only one choice, to read {z}. If {x, y} are kept in a buffer, however, then reading any of {z}, {y ⊕ z}, or {z ⊕ x} fulfills the request. There are thus either 3*1=3 ways (without buffering) or 3*3=9 ways (with buffering) to service the two requests. In contrast, if a single read of {x, y, z} arrives, then there are 16 ways to service the read (e.g., {x, y, z}, {x, y, y ⊕ z}, {x, z, x ⊕ y}, {x, z, y ⊕ z}, {x, x ⊕ z, y ⊕ z}, etc.). As the example shows, fragmented sequential reads largely limit the flexibility of request scheduling.
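The counts in this example can be checked by brute force. The following Python sketch is an illustrative check and not part of the proposed scheme: each block is encoded as a bitmask over (x, y, z), and a candidate read set is accepted if every wanted block lies in the XOR span of the blocks read plus any buffered blocks. The names `BLOCKS`, `span`, and `ways` are hypothetical.

```python
from itertools import combinations

# Bit 0 stands for x, bit 1 for y, bit 2 for z.
BLOCKS = {
    "x": 0b001, "y": 0b010, "z": 0b100,
    "x^y": 0b011, "y^z": 0b110, "z^x": 0b101,
}


def span(vectors):
    """All values obtainable by XOR-ing together a subset of `vectors`."""
    reachable = {0}
    for v in vectors:
        reachable |= {r ^ v for r in reachable}
    return reachable


def ways(wanted, size, buffered=()):
    """Count the read sets of `size` blocks that decode every wanted block."""
    count = 0
    for subset in combinations(BLOCKS, size):
        have = [BLOCKS[name] for name in subset]
        have += [BLOCKS[name] for name in buffered]
        reachable = span(have)
        if all(BLOCKS[w] in reachable for w in wanted):
            count += 1
    return count


two_small_unbuffered = ways(["x", "y"], 2) * ways(["z"], 1)
two_small_buffered = ways(["x", "y"], 2) * ways(["z"], 1, buffered=["x", "y"])
one_large = ways(["x", "y", "z"], 3)
```

Of the 20 possible three-block read sets, the four that fail are the linearly dependent ones, such as {x, y, x ⊕ y} and {x ⊕ y, y ⊕ z, z ⊕ x}.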
By analyzing the gathered disk traces, we found two potential problems. First, as shown in Figure 7(a), some sequential access patterns are severely fragmented into many small reads. Second, as shown in Figure 7(b), even when the reads are large enough, they are not aligned on zone/group boundaries.
Since both cases are sequential accesses, a read-ahead mechanism can be adopted to convert the fragmented reads into bulk and aligned reads. The intention is to improve the flexibility of load balancing.
A read-ahead policy is proposed. The basic idea behind the policy is to capture spatial locality: if many data blocks have been requested sequentially, then it is highly likely that the adjacent data blocks will be accessed in the near future. The policy can be realized by using a counter and a threshold value. The counter accumulates if the newly arriving read continues the last requested data block; otherwise the counter is reset. If its value is greater than the threshold, then the following requests are serviced in units of groups. The threshold value is currently 128 sectors, which is the largest read size found in the gathered disk traces. For example, if the group size is 4 and a data block is requested from segment S3,0, then four adjacent data blocks are read from segments S3,0, S3,1, S3,2, and S3,3. The extra data are kept in a read-ahead buffer to service subsequent reads. If the stripe-block size is 8 KB, then the read-ahead buffer is only 8*4=32 KB.
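The counter-and-threshold mechanism can be sketched in Python as follows. The sketch makes several assumptions that the text leaves open: the counter accumulates sectors, it is reset to zero by a non-sequential read, and groups are aligned to multiples of the group size. `SequentialDetector` and `group_banks` are hypothetical names.

```python
STRIPE_BLOCK_KB = 8   # stripe-block size used in the example
GROUP_SIZE = 4        # group size G used in the example


class SequentialDetector:
    """Tracks one sequential stream with a counter and a threshold."""

    def __init__(self, threshold_sectors=128):
        self.threshold = threshold_sectors
        self.next_sector = None   # sector that would continue the stream
        self.run_length = 0       # sectors read sequentially so far (assumed unit)

    def observe(self, start, length):
        """Record a read; return True if it should be serviced in whole groups."""
        if start == self.next_sector:
            # The decision uses the count accumulated by the earlier reads.
            read_ahead = self.run_length > self.threshold
            self.run_length += length
        else:
            read_ahead = False
            self.run_length = 0   # the stream was broken: reset the counter
        self.next_sector = start + length
        return read_ahead


def group_banks(bank, group_size=GROUP_SIZE):
    """Banks of the group containing `bank` (groups assumed aligned to G)."""
    base = (bank // group_size) * group_size
    return list(range(base, base + group_size))


# A small threshold keeps the demonstration short.
detector = SequentialDetector(threshold_sectors=16)
decisions = [detector.observe(start, 8) for start in (0, 8, 16, 24, 32)]
after_random_read = detector.observe(1000, 8)
buffer_kb = GROUP_SIZE * STRIPE_BLOCK_KB
```

In this run only the fifth read triggers group-wise servicing, and a read to an unrelated address resets the detector. A request that falls on bank 0 with G = 4 would be expanded to banks 0 through 3, which needs a 32 KB buffer for 8 KB stripe blocks.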
Figure 7: Two cases of a sequential access pattern: (a) the pattern is fragmented into many small reads that do not span all the banks, and (b) reads span all the banks but are not aligned on zone boundaries. A box stands for one segment.
The workload of the host system may have multiple access patterns mixed together because the host system is multiprogrammed. To individually extract sequential access patterns from workloads, a thread table is adopted. In our current design, the thread table has twenty entries. This number should be no less than the number of concurrent threads that issue requests to the striping system. Each entry keeps a counter, the address of the last requested data block, and a read-ahead buffer. Upon the arrival of a read, all entries of the table are checked to see whether the read continues the last data block of any entry. If so, the counter is increased and read ahead is performed whenever necessary. Otherwise the oldest entry is replaced. However, this replacement policy is vulnerable, because random reads may flush the entire thread table. To fix this, a sliding window is added. The sliding window keeps the addresses of the twenty most recently received reads. On the arrival of a read, it is checked against the window. If the read is found to be sequential, then it proceeds to the thread table for read ahead and table-entry replacement. Otherwise, the window slides and the read enters the window.
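One possible reading of the thread table and the sliding window is sketched below in Python. Several details are assumptions rather than statements of the text: the table is consulted before the window, the window stores the sector that would continue each recent read, "oldest" is taken to mean least recently updated, and the per-entry read-ahead buffer is omitted. `ReadAheadFilter` is a hypothetical name.

```python
from collections import OrderedDict, deque


class ReadAheadFilter:
    """Thread table guarded by a sliding window of recent reads."""

    def __init__(self, table_size=20, window_size=20):
        # Window: for each recent unmatched read, the sector that would continue it.
        self.window = deque(maxlen=window_size)
        # Table: next expected sector -> sequential run length, oldest entry first.
        self.table = OrderedDict()
        self.table_size = table_size

    def observe(self, start, length):
        """Record a read and return the run length of its stream (0 if random)."""
        end = start + length
        if start in self.table:
            # The read continues a tracked stream: extend its counter.
            run = self.table.pop(start) + length
            self.table[end] = run
            return run
        if start in self.window:
            # Sequential with respect to a recent read: admit it to the table.
            self.window.remove(start)
            if len(self.table) >= self.table_size:
                self.table.popitem(last=False)   # replace the oldest entry
            self.table[end] = length
            return length
        # Random so far: only the window changes, the table is untouched.
        self.window.append(end)
        return 0


flt = ReadAheadFilter(table_size=2, window_size=4)
first = flt.observe(0, 8)       # enters the window only
second = flt.observe(8, 8)      # sequential: admitted to the table
third = flt.observe(16, 8)      # continues the tracked stream
noise = [flt.observe(s, 8) for s in (1000, 2000, 3000, 4000, 5000, 6000)]
resumed = flt.observe(24, 8)    # the stream survives the random reads
```

The returned run length would then be compared against the read-ahead threshold. In this run the six random reads only cycle through the window, so the sequential stream is still tracked when it resumes.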
4 Experimental Results
4.1 Experimental Setup and Performance Metrics
The usefulness of the proposed striping scheme is verified by a series of trace-driven simulations. A simulator is built for performance verification. Figure 8(a) includes the default striping-system configuration. The timing characteristics of NAND flash are extracted from the data sheets of real NAND flash [3]. Figure 8(b) shows that the simulated striping system comprises an FTL implementation and the proposed striping scheme. The following assumptions are made: A request is fulfilled as soon as the requested data can be reconstructed. The XOR operations are much faster than I/O operations, so the computational overheads are negligible. The throughput of a single NAND-flash bank does not saturate the data path to the host system. Overheads of the activities conducted in the off-line phase are not accounted for.
The experimental workload is the disk I/O traces collected from the daily use of a real-life UMPC (Ultra-Mobile PC), an ASUS R2H. The UMPC is equipped with a Celeron-M ULV processor, 768 MB of RAM, and an 18 GB disk. The operating system is Windows XP, and the file system is NTFS. The applications run on the UMPC are those commonly used by many people, such as web browsers, email clients, movie players, FTP clients, office suites, and games. The duration of trace collection is one month. The gathered traces are replayed on the striping system.
Figure 8: The experimental system configuration.
The striping system to which the proposed striping scheme is applied is referred to as a dynamic striping system. A dynamic striping system that never enters the off-line phase is referred to as a static striping system. In other words, the static striping system does not benefit from our striping scheme. Under the same system organization, the performance of a dynamic striping system is compared against that of a static striping system. The default system configuration is shown in Figure 8(a). For evaluation, the default configuration is varied in terms of the number of code zones, the stripe-block size, and the zone size. Because the proposed striping scheme mainly aims at improving response, two metrics are adopted. The read response ratio, RR ratio, is defined by:
RR ratio = (average read response time in the dynamic striping system) / (average read response time in the static striping system).
The write response ratio, WR ratio, is defined for writes accordingly. For both ratios, the smaller the better.
4.2 Numerical Results
4.2.1 Number of Code Zones
The first questions about the proposed striping scheme are how much it improves response and how much redundancy is needed. The latter matters because redundancy translates into extra hardware cost. In the default system configuration, the total flash-memory capacity is 20 GB and the zone size is 256 MB. Therefore the system has 80 data zones. Different numbers of code zones are added to the default system configuration for evaluation. The results are shown in Figure 9.
The results show that read response benefits largely from the addition of code zones, as expected. The improvement quickly saturates at around 8 code zones. In this case, the extra space cost is 8/80=10%, and the RR ratio is 69%. In other words, the average read response time is reduced by about 31%, which is a substantial improvement. As to the WR ratios, the proposed striping scheme does not affect them much. A minor improvement of write response is gained because reads are not scheduled to the banks that have already been congested by writes. However, the interpretation of the WR ratios should be that the proposed striping scheme introduces no negative performance impact on the handling of writes.
Figure 9: Response ratios with different numbers of code zones. The smaller the ratios are, the better. (Plot: x-axis, number of code zones, 0 to 30; y-axis, response ratios, 50% to 100%; series: RR ratios and WR ratios.)
Figure 10: Response ratios with different stripe-block sizes. (Plot: x-axis, stripe-block sizes in pages, 1 to 32; y-axis, response ratios, 40% to 100%; series: RR ratios and WR ratios.)
4.2.2 Stripe-Block Sizes
The choice of the stripe-block size is a tradeoff between performance and RAM-space requirements. The smaller the stripe-block size is, the more RAM space is required for address translation in the FTL, because a larger RAM-resident table is needed for fine-grained address translation. However, the benefit of using small stripe blocks is better parallelism. For example, a small write of 8 512-byte sectors (4 KB) spans two stripe blocks, and hence can be parallelized over 2 banks, if the stripe block is 1 page large (i.e., 2 KB). The same write involves only one bank if the stripe-block size is 4 pages. Different stripe-block sizes are applied to the default system configuration for evaluation.
The results are shown in Figure 10. In general, the larger the stripe-block size is, the larger the improvement of read response is. There are several reasons:
First, with large stripe blocks, data must be accessed in units of whole stripe blocks even if only a small piece of data is requested. Reads and writes to small data are thus enlarged. Because small requests involve only a small number