4.3 NVM Duet
4.5.2 Emerging NVM Technologies
In addition to PCM, NVM Duet is applicable to main memory built from other emerging NVM technologies, including memristors and STT-RAM [77, 81]. Certain modifications are required because these technologies differ in their data storage and data retention mechanisms, as discussed below.
NVM Duet is applicable to memristor-based main memory with moderate modifications because memristors share many key characteristics with PCM. Memristors use metal oxides as their active material to store data. By electrically moving the boundary between the doped and undoped regions of the metal oxide, a memristor cell can exhibit different resistances to represent different data. Duet Scheduler can be applied directly to memristor-based main memory because it depends little on device-level properties. Dual Retention is also applicable because memristors likewise allow run-time tradeoffs between write speed and retention capability [55]. Smart Refresh is applicable as well because, as with PCM, the resistance of a memristor cell gradually changes over time after the cell is written, so the remaining retention capability of the cell can be inferred from its resistance. One required modification is that the retention capability and refresh interval for working memory must be recalculated (e.g., based on the analytic models presented in Section 4.4.2).
Compared with memristor-based main memory, STT-RAM-based main memory requires somewhat more modification before it can adopt NVM Duet. STT-RAM uses magnetic tunnel junctions (MTJs) as its storage elements. An MTJ combines two ferromagnetic layers, a free layer and a fixed layer. The magnetization direction of the free layer can be electrically flipped relative to that of the fixed layer to represent one bit of information.
Duet Scheduler is directly applicable to STT-RAM-based main memory, whereas Dual Retention is not, because STT-RAM relies on MTJ sizing to exploit the tradeoff between write speed and retention capability: a smaller MTJ volume exhibits shorter write latency but a higher probability of being randomly flipped (i.e., lower expected retention capability). Without run-time retention adjustability, Dual Retention requires physically partitioning the STT-RAM space into working memory and persistent store. Interestingly, Smart Refresh is applicable to STT-RAM. The retention capability of an MTJ corresponds to the period from the time the MTJ is known to be correct (e.g., right after it is written) until the time it is considered unreliable, i.e., the risk that a random flip has occurred during the period is unacceptably high. Therefore, as long as an STT-RAM cell is known to be correct during the read stage of a refresh operation, the cell can skip the write stage of that refresh operation and still remain reliable until the next refresh operation.
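The refresh rule described above for STT-RAM can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this chapter: the `Cell` class, the `known_correct` predicate (a stand-in for whatever check, such as ECC, establishes that a cell is correct), and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Cell:
    """Hypothetical model of one STT-RAM cell (or ECC-protected word)."""
    stored: int      # value currently held by the MTJ
    intended: int    # value that was originally written
    writes: int = 0  # number of write operations issued to the cell


def smart_refresh(cells: Iterable[Cell],
                  known_correct: Callable[[Cell], bool]) -> int:
    """Refresh every cell and return how many write stages were skipped.

    Read stage: check whether the cell is known to be correct.
    Write stage: issued only when that check fails.
    """
    skipped = 0
    for cell in cells:
        if known_correct(cell):
            # The retention period restarts from this verified read,
            # so no rewrite is needed before the next refresh.
            skipped += 1
        else:
            cell.stored = cell.intended  # restore the data
            cell.writes += 1
    return skipped


# Assumption: correctness is modeled by comparing against the intended value;
# real hardware would need a mechanism such as an ECC check instead.
cells = [Cell(stored=1, intended=1),
         Cell(stored=0, intended=1),
         Cell(stored=0, intended=0)]
skipped = smart_refresh(cells, lambda c: c.stored == c.intended)
```

In this toy run, only the second cell has flipped, so it is the only one that receives a write stage; the other two cells are verified by the read and left untouched.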
4.6 Related Work
Several previous studies have proposed using PCM as working memory [82, 131, 172]. Limited write endurance and relatively poor write performance are the two major concerns for PCM-based working memory. To address the write-endurance issue, differential write [82, 131, 172] eliminates unnecessary writes to PCM. Wear leveling [129, 130, 138, 172] spreads writes across the entire PCM space to avoid concentrated memory wear. Error correction schemes [16, 45, 63, 126, 137, 139, 165] keep the PCM functional even if certain cells wear out. To address PCM’s write-performance issue, write cancellation and write pausing [128] allow critical reads to abort or pause ongoing writes. Write truncation [70] exploits the error-correcting code’s ability to curtail a write operation before all of the bits are written correctly. In [127, 167], PCM asymmetries are exploited to improve write performance. In [35, 57, 69], bit flipping and power budgeting are used to maximize PCM’s write parallelism, given a limited chip power supply.
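To make two of the techniques above concrete, the following Python sketch counts how many cells are programmed under differential write and under a bit-flipping scheme in the spirit of Flip-N-Write [35]. It is a simplified illustration, not the published algorithm: the function names, the 8-bit default word width, and the cost model (one unit per programmed cell, including the flip bit) are assumptions.

```python
def differential_write_cost(old: int, new: int) -> int:
    """Number of cells programmed when only differing bits are written."""
    return bin(old ^ new).count("1")


def flip_n_write(old: int, old_flip: int, new: int, width: int = 8):
    """Return (stored_word, flip_bit, cells_programmed) for one update.

    The word is stored either as-is (flip_bit = 0) or inverted (flip_bit = 1),
    whichever programs fewer cells, counting the flip bit itself.
    """
    mask = (1 << width) - 1
    plain_cost = differential_write_cost(old, new) + int(old_flip != 0)
    inverted = ~new & mask
    inverted_cost = differential_write_cost(old, inverted) + int(old_flip != 1)
    if inverted_cost < plain_cost:
        return inverted, 1, inverted_cost
    return new, 0, plain_cost


def decode(stored: int, flip_bit: int, width: int = 8) -> int:
    """Recover the logical value from the stored word and its flip bit."""
    mask = (1 << width) - 1
    return (~stored & mask) if flip_bit else stored


# Overwriting 0b00000000 with 0b11111110 programs 7 cells with a plain
# differential write, but only 2 (one data cell plus the flip bit) when the
# inverted word 0b00000001 is stored instead.
stored, flip, cost = flip_n_write(old=0b00000000, old_flip=0, new=0b11111110)
```

Fewer programmed cells per write means both less wear and a smaller power draw per write, which is what allows more writes to proceed in parallel under a fixed chip power budget.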
PCM is also considered to be a promising solution for building high-performance persistent store. Because of the differences between PCM and traditional block-based storage in terms of access speed and access granularities, significant efforts in innovative programming models and system architecture design are ongoing [26, 28, 39, 40, 107, 108, 153–155, 157]. For example, BPFS [40] incorporates a new file system and hardware architecture. NV-heaps [39] and Mnemosyne [155] propose languages and libraries that enable in-memory persistence. In [107, 108], memory allocation, memory protection, and cache architecture tailored to NVM-based persistent store are proposed.
Unification of working memory and persistent store in NVM was envisioned in [17, 29, 132]. Based on a unified architecture, [116] integrates the space management of working memory and persistent store. This approach maximizes the available working memory and reduces page swapping. Memorage [72] performs global wear leveling between working memory and persistent store to maximize the achievable lifetime.
Prior studies have presented Retention Relaxation for NVM to optimize the write performance or write endurance of NVM systems. In particular, these studies target STT-RAM caches [71, 114, 146, 148] and NAND flash SSDs [25, 96, 97, 105, 117], and adopt refreshing. For PCM, selected patent applications [56, 100, 121] also describe refreshing the PCM but do not focus on trading off retention capability and write performance in the manner emphasized by Retention Relaxation. Several papers [14, 15, 140, 164] also analyze PCM’s retention capability and refreshing overhead, but they assume a pure working memory architecture rather than the unified structure considered in this work, and their analyses are based on PCM with a uniform target-band allocation. The work in [140] proposes lowering the PCM’s information density in return for enlarging the target bands and improving the write speed. Shen’s master’s thesis [141] studies Retention Relaxation for improving PCM’s write speed, which is closely related to the Dual-Retention PCM portion presented in this chapter. Concurrent with this work, Approximate Storage [135] relaxes the error rate guarantee for data that are error-tolerant. Similar to Dual-Retention PCM, Approximate Storage manipulates PCM’s target bands for improved write speed.
4.7 Summary
In this chapter, we present a novel unified working memory and persistent store architecture, NVM Duet. We observe that naively making NVM serve as both working memory and persistent store is suboptimal because of the intrinsic differences between the two use cases. The problem arises because, without knowledge of the use case of an individual request, a unified memory architecture must serve all requests under the strictest design constraints in terms of consistency and durability.
NVM Duet adopts a cross-layer approach to provide the required consistency and durability guarantees for persistent store while relaxing these constraints when accesses to NVM are for working memory, thereby improving system performance. Specifically, a new HW/SW interface allows the OS to convey the use case of data declared by the programmer (working memory or persistent store) to the memory controller. The memory controller is then enhanced with a new scheduler, Duet Scheduler, which seeks to fully exploit the parallelism present in the address stream while respecting the write order to guarantee the consistency of persistent store. Furthermore, a new PCM architecture, Dual-Retention PCM, is proposed to provide two retention levels and relax the durability constraint for working memory. To make Dual-Retention PCM appealing for practical adoption, NVM Duet employs Smart Refresh, which exploits properties of PCM devices to eliminate unnecessary refresh overhead. Experiments based on full-system simulations and workload mixes of persistent and conventional workloads demonstrate that our NVM Duet design achieves a speedup of up to 1.68× (1.32× on average) over the baseline.
Chapter 5
Conclusions and Future Directions
With the ever-increasing demand to support emerging applications such as cloud computing, social networking, and big data analytics, it has never been more important to design and build low-power, high-performance servers. To achieve this goal, flash-based storage, flash-based storage caches, and PCM-based main memory are key enabling techniques. Both industry and academia are devoting increasing effort to investigating the design and optimization needed to adopt these techniques in servers.
Over the course of this dissertation, I identified new insights and opened up new design and optimization space for NVM in servers. Specifically, unlike many prior flash-related studies that have focused on file systems, I/O schedulers, and FTLs, I considered the increasingly significant device-level characteristics of flash and proposed co-designing the hardware and software layers. Moreover, most prior PCM-related studies have focused on architecting PCM as either pure working memory or pure persistent store, whereas I focused on revisiting and redesigning the hardware and software layers so that PCM can serve both roles efficiently.
In Chapter 2, I presented Retention Relaxation, the first work on optimizing flash storage by relaxing flash’s data retention capability. I developed a flash model to evaluate the benefits of reducing flash’s multi-year retention capability. I also demonstrated that in real systems, write requests usually require retention of only days or less. I designed flash storage that handles host writes with shortened retention while handling background writes as usual, to optimize the write speed and ECC cost-performance. I also presented corresponding retention tracking schemes to guarantee that no data loss happens due to a shortage of retention capability. Simulation results demonstrated that the proposed design achieves a 1.8–5.7× write response time speedup. I also showed that for future flash storage, Retention Relaxation can achieve a superior cost-performance point for the ECC architectures.
In Chapter 3, I presented DuraCache to tackle the lifetime issue of flash caches in datacenters. DuraCache exploits the fact that flash caches are write-through caches. Therefore, uncorrectable errors in flash caches can be handled like cache misses, which bring in correct data from HDD arrays. In addition, DuraCache gradually allocates more ECC parity to data as flash reaches wearout thresholds. These strategies allow flash to continue operating with slightly increased miss rates and gradually reduced available capacity. I conducted empirical experiments characterizing the correlation between flash’s BER and write cycles and demonstrated that DuraCache enables flash caches made of MLC flash to achieve a service life of 4.1 years assuming a TPC-C workload.
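The idea of handling an uncorrectable error like a cache miss can be sketched in a few lines of Python. This is only an illustration of that one idea under simplifying assumptions, not DuraCache itself: the class and exception names, the dictionary standing in for the HDD array, and the injected `flash_read` function are all hypothetical, and the gradual ECC strengthening is not modeled.

```python
class UncorrectableError(Exception):
    """Raised by the hypothetical flash layer when ECC cannot recover a block."""


class WriteThroughFlashCache:
    def __init__(self, backing_store: dict, flash_read):
        self.backing = backing_store  # stands in for the HDD array
        self.flash = {}               # block address -> cached data
        self.flash_read = flash_read  # hypothetical device read; may raise
        self.misses = 0

    def write(self, addr, data):
        # Write-through: the backing store always holds a correct copy.
        self.backing[addr] = data
        self.flash[addr] = data

    def read(self, addr):
        if addr in self.flash:
            try:
                return self.flash_read(self.flash, addr)
            except UncorrectableError:
                pass                  # fall through: treat as a miss
        self.misses += 1
        data = self.backing[addr]
        self.flash[addr] = data       # refill the cache with correct data
        return data


def flaky_read(flash, addr):
    """Toy device model: block 7 always returns an uncorrectable error."""
    if addr == 7:
        raise UncorrectableError(addr)
    return flash[addr]


cache = WriteThroughFlashCache({}, flaky_read)
cache.write(3, "a")
cache.write(7, "b")
good = cache.read(3)       # served from flash
recovered = cache.read(7)  # flash fails, data comes from the backing store
```

Because every write also reaches the backing store, the failed read costs one extra miss rather than any data loss.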
In Chapter 4, I presented the insight that naively making PCM main memory simultaneously serve as working memory and persistent store is suboptimal. Persistent store demands consistency and durability guarantees, thereby imposing additional design constraints on the memory system. Consistency is achieved at the expense of serializing multiple write requests. Durability requires memory cells to guarantee non-volatility and thus reduces the write speed. I presented NVM Duet, a novel PCM main memory architecture, which provides the required consistency and durability guarantees for persistent store while relaxing these constraints when accesses to NVM are for working memory. Simulation results demonstrated that NVM Duet achieves a speedup of up to 1.68× (1.32× on average) compared with the baseline design.
There are several extensions that I would like to pursue in the future:
1. One future research direction is to analyze the proposed architectures using real systems and real applications. I plan to use DRAM-based FPGA platforms to emulate servers adopting PCM main memory with NVM Duet. I also plan to prototype flash-based storage that adopts DuraCache and Retention Relaxation. Based on real systems, I can investigate the design space and optimization opportunities of executing emerging applications on NVM-enabled servers. I also expect to gain more architectural insights into adopting NVM in servers.
2. The proposed architectures can be extended to different NVM technologies. Although NVM Duet is designed and evaluated using PCM as the target memory technology, the concept is applicable to other NVM technologies such as memristors and STT-RAM [77, 81]. Similarly, the ideas of DuraCache and Retention Relaxation are potentially applicable to improving the lifetime and performance of PCM-based storage [26]. Additional modeling techniques and optimization methodologies are anticipated to be developed for investigating these extended directions.
3. The proposed architectures can be combined with one another. For example, a future server may be equipped with both PCM main memory and flash caches, and it will be natural to adopt both the NVM Duet and DuraCache architectures in one system. Therefore, evaluating and optimizing the interplay between NVM Duet and DuraCache for this type of combination are interesting future research topics.
Bibliography
[1] Association for computing machinery (ACM) digital library. http://dl.acm.org/.
[2] Hadoop. Available at http://hadoop.apache.org/.
[3] Hammerora. Available at http://hammerora.sourceforge.net/.
[4] SNIA non volatile memory (NVM) programming technical work group (TWG). Available at http://snia.org/forums/sssi/nvmp/.
[5] Western Digital settles hard-drive capacity lawsuit. FOX News, June 2006.
[6] JESD218A: Solid-state drive (SSD) requirements and endurance test method, Feb. 2011.
[7] NAND Flash support table, July 2011. Available at http://www.linux-mtd.infradead.org/nand-data/nanddata.html.
[8] Considerations for choosing SLC versus MLC flash, REV A01. Technical Report 300-013-740, EMC Corp., 2012.
[9] Process integration, devices, and structures (PIDS). Technical report, ITRS, 2012.
[10] A. R. Abdurrab, T. Xie, and W. Wang. DLOOP: A flash translation layer exploiting plane-level parallelism. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2013.
[11] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design tradeoffs for SSD performance. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC), 2008.
[12] D. G. Andersen and S. Swanson. Rethinking flash in the data center. IEEE Micro, 30:52–54, July 2010.
[13] F. Arai, T. Maruyama, and R. Shirota. Extended data retention process technology for highly reliable Flash EEPROMs of 10^6 to 10^7 W/E cycles. In Proceedings of the IEEE International Reliability Physics Symposium (IRPS), 1998.
[14] M. Awasthi, M. Shevgoor, K. Sudan, R. Balasubramonian, B. Rajendran, and V. Srinivasan. Handling PCM resistance drift with device, circuit, architecture, and system solutions. In Proceedings of the Non-Volatile Memories Workshop (NVMW), 2011.
[15] M. Awasthi, M. Shevgoor, K. Sudan, B. Rajendran, R. Balasubramonian, and V. Srinivasan. Efficient scrub mechanisms for error-prone emerging memories. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[16] R. Azevedo, J. D. Davis, K. Strauss, P. Gopalan, M. Manasse, and S. Yekhanin. Zombie memory: Extending memory lifetime by reviving dead blocks. In Proceedings of the ACM International Symposium on Computer Architecture (ISCA), 2013.
[17] K. Bailey, L. Ceze, S. D. Gribble, and H. M. Levy. Operating system implications of fast, cheap, non-volatile memory. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2011.
[18] F. Bedeschi, R. Fackenthal, C. Resta, E. M. Donze, M. Jagasivamani, E. C. Buda, F. Pellizzer, D. W. Chow, A. Cabrini, G. M. A. Calvi, R. Faravelli, A. Fantini, G. Torelli, D. Mills, R. Gastaldi, and G. Casagrande. A multi-level-cell bipolar-selected phase-change memory. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2008.
[19] F. Bedeschi, R. Fackenthal, C. Resta, E. M. Donze, M. Jagasivamani, E. C. Buda, F. Pellizzer, D. W. Chow, A. Cabrini, G. M. A. Calvi, R. Faravelli, A. Fantini, G. Torelli, D. Mills, R. Gastaldi, and G. Casagrande. A bipolar-selected phase change memory featuring multi-level cell storage. IEEE J. Solid-St. Circ. (JSSC), 44(1):217–227, January 2009.
[20] F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC), 2005.
[21] A. Birrell, M. Isard, C. Thacker, and T. Wobber. A design for high-performance flash disks. SIGOPS Oper. Syst. Rev., 41(2):88–93, April 2007.
[22] R. C. Bose and D. K. Ray-Chaudhuri. On a class of error correcting binary group codes. Information and Control, 3(1):68–79, March 1960.
[23] J. Brewer and M. Gill. Nonvolatile Memory Technologies with Emphasis on Flash: A Comprehensive Guide to Understanding and Using Flash Memory Devices. Wiley-IEEE Press, 2008.
[24] J. S. Bucy, J. Schindler, S. W. Schlosser, G. R. Ganger, and Contributors. The DiskSim simulation environment version 4.0 reference manual (CMU-PDL-08-101). Technical report, Parallel Data Laboratory, 2008.
[25] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. S. Unsal, and K. Mai. Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime. In Proceedings of the IEEE International Conference on Computer Design (ICCD), 2012.
[26] A. M. Caulfield, A. De, J. Coburn, T. I. Mollov, R. K. Gupta, and S. Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010.
[27] A. M. Caulfield, L. M. Grupp, and S. Swanson. Gordon: Using flash memory to build fast, power-efficient clusters for data-intensive applications. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009.
[28] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson. Providing safe, user space access to fast, solid state disks. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[29] J. Chang, P. Ranganathan, T. Mudge, D. Roberts, M. A. Shah, and K. T. Lim. A limits study of benefits from nanostore-based future data-centric system architectures. In Proceedings of the ACM Computing Frontiers Conference (CF), 2012.
[30] Y.-H. Chang, J.-W. Hsieh, and T.-W. Kuo. Improving flash wear-leveling by proactively moving static data. IEEE Trans. Comput. (TC), 59(1):53–65, January 2010.
[31] Y.-H. Chang, P.-L. Wu, T.-W. Kuo, and S.-H. Hung. An adaptive file-system-oriented FTL mechanism for flash-memory storage systems. ACM Trans. Embed. Comput. Syst. (TECS), 11(1):9:1–9:19, April 2012.
[32] F. Chen, T. Luo, and X. Zhang. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2011.
[33] T.-H. Chen, Y.-Y. Hsiao, Y.-T. Hsing, and C.-W. Wu. An adaptive-rate error correction scheme for NAND flash memory. In Proceedings of the IEEE VLSI Test Symposium (VTS), 2009.
[34] M.-L. Chiang, P. C. H. Lee, and R.-C. Chang. Using data clustering to improve cleaning performance for flash memory. Softw. Pract. Exper., 29:267–290, March 1999.
[35] S. Cho and H. Lee. Flip-N-Write: A simple deterministic technique to improve PRAM write performance, energy and endurance. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009.
[36] S.-W. Choi, G.-P. Kim, and J.-K. Kim. An LDPC decoder architecture for multi-rate QC-LDPC codes. In Proceedings of the IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2011.
[37] Y. Choi, I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M. G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y.-J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y.-T. Lee, J. Yoo, and G. Jeong. A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2012.
[38] G. F. Close, U. Frey, J. Morrish, R. Jordan, S. Lewis, T. Maffitt, M. Breitwisch, C. Hagleitner, C. Lam, and E. Eleftheriou. A 512Mb phase-change memory (PCM) in 90nm CMOS achieving 2b/cell. In Proceedings of the IEEE International Symposium on VLSI Circuits (VLSIC), 2011.
[39] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. NV-heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
[40] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2009.
[41] G. Dong, N. Xie, and T. Zhang. On the use of soft-decision error-correction codes in NAND flash memory. IEEE Trans. Circuits Syst. Regul. Pap., 58(2):429–439, February 2011.
[42] EMC Corp. EMC VFCache server flash cache for superior performance, intelligence, and protection of mission-critical data. EMC VFCache datasheet H9581.1, 2012.
[43] EMC Corp. White paper: Introduction to EMC VFCache, 2012. H10502.2.
[44] EMC Corp. White paper: Introduction to EMC XtremSW cache, 2013. H11946.1.
[45] J. Fan, S. Jiang, J. Shu, Y. Zhang, and W. Zhen. Aegis: Partitioning data block for efficient recovery of stuck-at-faults in phase change memory. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013.
[46] D. Floyer. Flash pricing trends disrupt storage. Technical report, May 2010.
[47] Fusion-io Inc. Fusion-io ioCache, 2014. Available at http://www.fusionio.com/products/iocache/.
[48] Fusion-io Inc. Fusion-io ioDrive Duo, 2014. Available at http://www.fusionio.com/products/iodrive-duo/.
[49] R. G. Gallager. Low-density parity-check codes. IRE Trans. Inf. Theory, 8(1):21–28, January 1962.
[50] M. Ghosh and H.-H. S. Lee. Smart refresh: An enhanced memory controller design for reducing energy in conventional and 3D die-stacked DRAMs. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.
[51] J. Gray. Tape is dead. Disk is tape. Flash is disk. RAM locality is king. Presented at the CIDR Gong Show, January 2007.
[52] L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. H. Siegel, and J. K. Wolf. Characterizing flash memory: Anomalies, observations, and applications. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009.
[53] A. Gupta, Y. Kim, and B. Urgaonkar. DFTL: A flash translation layer employing