4. Algorithm Design and Implementation
4.1 The Architecture of an SSD controller
The Freescale M68KIT912UF32 development kit [28, 26], shown in Figure 4, is chosen as our target platform. The evaluation board carries an MC9S12UF32 SoC. As shown in Figure 5, the SoC comprises a 16-bit M68HC12 MCU core, 3.5 KB of RAM, 32 KB of NVRAM (i.e., NOR flash), a USB 2.0 interface, flash-memory host controllers, and an integrated queue controller (i.e., IQUEUE) with a 1.5 KB QRAM buffer. The SoC targets products such as USB thumb drives, card readers, and solid-state disks.
The MC9S12UF32 (referred to as the SSD controller in the rest of this paper) accepts various types of flash-memory cards, including Secure Digital (SD), Multi Media Card (MMC), Smart Media (SM), and Memory Stick (MS). It can also access ATA-based storage devices such as Compact Flash (CF) and ATA hard drives. We choose SM cards because, other than SM and MS cards, all of these flash-memory cards hide the physical geometry of NAND flash from the outside (i.e., another controller sits inside the card). An SM card is actually bare NAND flash, and its blocks and pages are accessible to the SSD controller. The SSD controller interacts with the host via USB 2.0 and supports control, interrupt, isochronous, and bulk endpoints. The SSD interacts with the host computer as if it were a hard drive.
Figure 5: The block diagram of MC9S12UF32.
The flash-memory controllers and other components are controlled by the MCU by means of their I/O registers. The registers are mapped into the MCU memory space. The memory mapping is shown in Figure 6. The firmware stored in the NVRAM is mapped to two 16 KB regions of the MCU memory space. Between addresses 1200h and 2000h, 1084 bytes of RAM are reserved as runtime memory, and two 1 KB regions are set aside for flash-memory management.
To achieve high throughput, all data transfers between flash memory and the USB interface are carried out by a DMA module, i.e., the IQUEUE, which has eight DMA channels. A 1.5 KB buffer between 2000h and 2600h is reserved for DMA operations, as Figure 6 shows. The design aims to exploit potential parallelism: for example, while data are read from NAND flash into QRAM, it is possible to transfer data to the USB interface concurrently. None of these high-speed data transfers needs intervention from the MCU.
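The benefit of overlapping the two transfers can be illustrated with a small timing model. This is only a sketch: the per-page times below are hypothetical placeholders in arbitrary units, not measurements of the MC9S12UF32, and the model assumes that QRAM holds at least two pages so that one page can be filled while another is drained.

```python
# Hypothetical two-stage pipeline model: NAND flash -> QRAM -> USB.
# All timing values are illustrative assumptions, not measured figures.

QRAM_START, QRAM_END = 0x2000, 0x2600   # DMA buffer range given in the text
PAGE_BYTES = 512

# Number of 512-byte pages that fit in the 1.5 KB QRAM buffer.
QRAM_PAGE_SLOTS = (QRAM_END - QRAM_START) // PAGE_BYTES


def serial_time(pages, t_flash, t_usb):
    """Each page is read from flash and then sent over USB, with no overlap."""
    return pages * (t_flash + t_usb)


def pipelined_time(pages, t_flash, t_usb):
    """The flash read of page i+1 overlaps the USB transfer of page i."""
    if pages == 0:
        return 0
    # Fill the pipeline, run at the pace of the slower stage, then drain it.
    return t_flash + (pages - 1) * max(t_flash, t_usb) + t_usb


if __name__ == "__main__":
    pages = 32                     # one 16 KB block of 512-byte pages
    print("QRAM page slots:", QRAM_PAGE_SLOTS)
    print("serial   :", serial_time(pages, t_flash=30, t_usb=20))
    print("pipelined:", pipelined_time(pages, t_flash=30, t_usb=20))
```

Under these assumed numbers the overlapped transfer is bounded by the slower of the two stages rather than by their sum.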
Figure 6: Memory map of MC9S12UF32.
Figure 7: The handling of block-device commands to an SSD.
4.1.2 The Firmware
The firmware has two major tasks: block-device emulation and block-device command processing. A Flash-memory Translation Layer (FTL) is implemented for block-device emulation. It involves address translation, garbage collection, and wear leveling. The firmware accepts incoming block-device commands, translates disk geometry into flash-memory physical locations, handles garbage collection and wear leveling, and responds to the host upon the completion of data transfer.
Because the SSD is a USB device, it complies with the USB mass-storage class specification. As shown in Figure 7, to read or write a sector of the SSD, the host first composes a block-device command (e.g., SCSI read 28h or SCSI write 2Ah) as a SCSI command descriptor block (CDB). The CDB is then encapsulated in a USB command block wrapper (CBW), which carries USB-specific information such as the data-transfer direction (IN or OUT). The CBW is in turn wrapped in a USB Request Block (URB), which is sent to the USB driver of the host. The SSD receives the CBW via the USB PHY and extracts the CDB from it. The FTL then takes over the CDB and carries out the flash-memory operations. Upon completion, the firmware composes a command status wrapper (CSW) as the response. The CSW is sent to the host via the USB PHY, and the host USB driver extracts the necessary information from it.
Figure 8: The handling of a write to sector 35203.
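The wrapping of a CDB into a CBW, and the CSW sent in reply, can be sketched as follows. The field layout follows the USB mass-storage Bulk-Only Transport and the SCSI READ(10) command; the function names and the tag value are hypothetical and are not taken from the firmware described here.

```python
import struct

CBW_SIGNATURE = 0x43425355   # bytes "USBC" when stored little-endian
CSW_SIGNATURE = 0x53425355   # bytes "USBS" when stored little-endian
DIRECTION_IN = 0x80          # bit 7 of bmCBWFlags: device-to-host transfer


def build_read10_cdb(lba, sectors):
    """SCSI READ(10): opcode 28h, big-endian LBA and transfer length."""
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, sectors, 0)


def build_cbw(tag, data_len, direction_in, cdb, lun=0):
    """Wrap a CDB in a 31-byte command block wrapper (little-endian fields)."""
    flags = DIRECTION_IN if direction_in else 0x00
    # "16s" pads the CDB with zero bytes up to the fixed 16-byte field.
    return struct.pack("<IIIBBB16s", CBW_SIGNATURE, tag, data_len,
                       flags, lun, len(cdb), cdb)


def parse_cbw(raw):
    """Device side: validate the wrapper and extract the CDB."""
    sig, tag, data_len, flags, lun, cb_len, cb = struct.unpack("<IIIBBB16s", raw)
    if sig != CBW_SIGNATURE:
        raise ValueError("not a CBW")
    return {"tag": tag, "data_len": data_len,
            "direction_in": bool(flags & DIRECTION_IN),
            "lun": lun, "cdb": cb[:cb_len]}


def build_csw(tag, residue, status):
    """13-byte command status wrapper; status 0 means the command passed."""
    return struct.pack("<IIIB", CSW_SIGNATURE, tag, residue, status)


if __name__ == "__main__":
    cdb = build_read10_cdb(lba=35203, sectors=1)
    cbw = build_cbw(tag=1, data_len=512, direction_in=True, cdb=cdb)
    request = parse_cbw(cbw)
    print(len(cbw), hex(request["cdb"][0]))
    print(build_csw(request["tag"], residue=0, status=0).hex())
```

The tag copied from the CBW into the CSW is what lets the host match a status to the command that produced it.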
The FTL needs to handle address translation, garbage collection, and wear leveling. Suppose that a 128 MB NAND flash is used, with a page size of 512 bytes and a block size of 16 KB. The flash memory is logically divided into 8 segments, each of which has 1024 16-KB blocks. The SSD exposes itself as a block device of 8*1000*16K bytes. The firmware adopts a two-level mapping mechanism. At the first level, the eight 1000*16K-byte regions of the block device are statically mapped to the eight segments. At the second level, the 1000*16K bytes of each region are mapped to the 1024 blocks of its segment. Let the mapping unit size be 16 KB, so in each segment 1000 logical units are mapped to 1024 physical units, leaving 1024-1000=24 spare blocks for garbage collection and bad-block management. For the second-level mapping, each of the eight segments needs a translation table to translate a logical-unit address to the corresponding physical-unit address. A table has 1000 entries, and each entry refers to one of the 1024 physical units. To save RAM space, only two translation tables are cached in RAM (addresses 0800h to 0fd0h and 1000h to 17d0h), as shown in Figure 6.
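The geometry above can be checked with a few lines of arithmetic. The constant names are ours, and the 2-byte table entry is an assumption inferred from the 0800h to 0fd0h address range (2000 bytes for 1000 entries), not a figure stated by the authors.

```python
# Geometry of the 128 MB example; names are illustrative.
FLASH_BYTES = 128 * 1024 * 1024
PAGE_BYTES = 512
BLOCK_BYTES = 16 * 1024
SEGMENTS = 8
LOGICAL_UNITS_PER_SEGMENT = 1000

physical_blocks = FLASH_BYTES // BLOCK_BYTES                  # 8192
blocks_per_segment = physical_blocks // SEGMENTS              # 1024
spare_blocks_per_segment = blocks_per_segment - LOGICAL_UNITS_PER_SEGMENT  # 24
pages_per_block = BLOCK_BYTES // PAGE_BYTES                   # 32

# Capacity visible to the host: 8 * 1000 * 16K bytes.
exposed_bytes = SEGMENTS * LOGICAL_UNITS_PER_SEGMENT * BLOCK_BYTES
exposed_sectors = exposed_bytes // PAGE_BYTES

# Each entry must address one of 1024 physical units, i.e. at least 10 bits.
min_bits_per_entry = (blocks_per_segment - 1).bit_length()

# Assumption: entries are stored as 2-byte words.
ENTRY_BYTES = 2
table_bytes = LOGICAL_UNITS_PER_SEGMENT * ENTRY_BYTES

if __name__ == "__main__":
    print("blocks per segment :", blocks_per_segment)
    print("spare blocks       :", spare_blocks_per_segment)
    print("exposed capacity   :", exposed_bytes, "bytes")
    print("table size         :", table_bytes, "bytes =", hex(table_bytes))
```

Caching all eight tables at this entry size would take 16,000 bytes, far more than the 3.5 KB of RAM, which is consistent with the decision to cache only two.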
Because a mapping unit is 16 KB, which is no smaller than the block size, any write smaller than 16 KB rewrites an entire mapping unit, as no partial update is allowed. The garbage-collection policy becomes trivial in this case. Let us consider the example shown in Figure 8. Suppose that a write to sector 35203 is received (the sector size is 512 bytes). First, the corresponding segment is identified by
(35203 * 512) / (1000 * 16 * 1024) = 1
where / denotes integer division, so the data fall in segment 1. Let T1[] be the translation table of segment 1. The data reside in block address
T1[((35203 * 512) % (1000 * 16 * 1024)) / (16 * 1024)] = T1[100] = 300
of segment 1. Inside block 300, the data are in page
(((35203 * 512) % (1000 * 16 * 1024)) % (16 * 1024)) / 512 = 3
Let spare block 50 be used. Along with the new data, all the old data of block 300 except that in page 3 are copied to block 50, and then T1[100] is assigned 50. Block 300 is erased and becomes a spare block.
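The translation and the block-level update in this example can be sketched as follows. The sketch reads the example as the formulas give it: sector 35203 falls in logical unit 100 of segment 1, and T1[100] holds physical block 300. The function names and the in-memory model of blocks (a dictionary of page lists) are hypothetical.

```python
SECTOR_BYTES = 512
UNIT_BYTES = 16 * 1024                       # mapping unit = one block
UNITS_PER_SEGMENT = 1000
SEGMENT_BYTES = UNITS_PER_SEGMENT * UNIT_BYTES
PAGES_PER_BLOCK = UNIT_BYTES // SECTOR_BYTES


def locate(sector):
    """Return (segment, logical unit, page) for a 512-byte sector number."""
    offset = sector * SECTOR_BYTES
    segment, in_segment = divmod(offset, SEGMENT_BYTES)
    unit, in_unit = divmod(in_segment, UNIT_BYTES)
    return segment, unit, in_unit // SECTOR_BYTES


def write_sector(table, blocks, spare, sector, data):
    """Rewrite a whole mapping unit into a spare block and remap it.

    table  : logical unit -> physical block, for the sector's segment
    blocks : physical block -> list of page contents (None = erased)
    spare  : physical block chosen to receive the new copy
    Returns the old physical block, which is erased and becomes a spare.
    """
    _, unit, page = locate(sector)
    old = table[unit]
    new_pages = list(blocks[old])            # copy all old pages ...
    new_pages[page] = data                   # ... except the one being written
    blocks[spare] = new_pages
    table[unit] = spare                      # remap the logical unit
    blocks[old] = [None] * PAGES_PER_BLOCK   # erase the old block
    return old


if __name__ == "__main__":
    t1 = {100: 300}
    blocks = {300: ["old%d" % p for p in range(PAGES_PER_BLOCK)],
              50: [None] * PAGES_PER_BLOCK}
    print(locate(35203))
    freed = write_sector(t1, blocks, spare=50, sector=35203, data="new")
    print(t1[100], freed, blocks[50][3])
```

Because every write relocates a full unit, the old block never holds a mix of valid and invalid pages, which is why no separate garbage-collection pass is needed in this scheme.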
To conduct wear leveling over the blocks in a segment, each write allocates a block from the 24 spare blocks in a FIFO fashion. There is no wear leveling across blocks of different segments. In later sections we shall show that this approach is quite ineffective.
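The FIFO allocation of spare blocks can be sketched with a queue. The class name, the block numbering, and the erase counter are all hypothetical; the sketch models a single segment in which one logical unit is rewritten repeatedly.

```python
from collections import deque


class SparePool:
    """FIFO pool of spare blocks for one segment."""

    def __init__(self, spare_blocks):
        self._queue = deque(spare_blocks)

    def allocate(self):
        # The block that has waited longest in the pool is used first.
        return self._queue.popleft()

    def release(self, block):
        # A freshly erased block joins the tail of the queue.
        self._queue.append(block)


def rewrite_unit(pool, table, erase_counts, unit):
    """Move a logical unit to the next spare block and recycle the old one."""
    old_block = table[unit]
    new_block = pool.allocate()
    table[unit] = new_block
    erase_counts[old_block] = erase_counts.get(old_block, 0) + 1
    pool.release(old_block)
    return new_block


# Hypothetical segment: block 0 holds logical unit 0, blocks 1000..1023 are spare.
pool = SparePool(range(1000, 1024))
table = {0: 0}
erase_counts = {}
for _ in range(100):
    rewrite_unit(pool, table, erase_counts, unit=0)

if __name__ == "__main__":
    print(len(erase_counts), "blocks erased,",
          "counts:", sorted(set(erase_counts.values())))
```

In this sketch the erasures are spread evenly over the rewritten unit's block and the 24 spares, while blocks that hold units which are never rewritten do not enter the rotation at all.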