RELATED WORK - Tolerating Memory Latency Through Push Prefetching for Pointer-Intensive Applications

Early data prefetching research focused on array-based applications with regular access patterns. Hardware prefetching detects array access strides from the address history at run time [Baer and Chen 1991; Fu and Patel 1992; Jouppi 1990]. Software prefetching [Callahan et al. 1991; Klaiber and Levy 1991; Mowry et al. 1992; Porterfield 1989] exploits compile-time information to insert prefetch instructions in a program. Correlation-based prefetching [Alexander and Kedem 1996; Joseph and Grunwald 1997] also relies on the address history to predict future references, but it can capture complex access patterns. Its prediction accuracy depends on the size of the prediction table and on the stability of the access patterns.
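The contrast between stride detection and correlation-based prediction can be illustrated with a small software model. The sketch below is only an illustration under simplifying assumptions: the names `detect_stride` and `MarkovPrefetcher`, the first-order table, and the `max_entries` limit are hypothetical and do not reproduce any of the cited hardware designs.

```python
from collections import Counter


def detect_stride(addresses):
    """Return the stride if the last three addresses are equally spaced, else None."""
    if len(addresses) < 3:
        return None
    d1 = addresses[-2] - addresses[-3]
    d2 = addresses[-1] - addresses[-2]
    return d2 if d1 == d2 and d2 != 0 else None


class MarkovPrefetcher:
    """First-order correlation table: address -> counts of the addresses that followed it."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries  # models the finite prediction table
        self.table = {}
        self.prev = None

    def observe(self, addr):
        """Record that `addr` followed the previously observed address."""
        if self.prev is not None:
            known = self.prev in self.table
            if known or len(self.table) < self.max_entries:
                self.table.setdefault(self.prev, Counter())[addr] += 1
        self.prev = addr

    def predict(self, addr):
        """Return the most frequent successor of `addr`, or None if it was never seen."""
        followers = self.table.get(addr)
        if not followers:
            return None
        return followers.most_common(1)[0][0]
```

In this model, a repeating but irregular address sequence defeats the stride detector while the correlation table predicts it, provided the pattern repeats and the table is large enough to hold it.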

The Spaid scheme proposed by Lipasti et al. [1995] is a compiler-based pointer prefetching mechanism. It inserts prefetch instructions for the objects pointed to by pointer arguments at call sites. Luk and Mowry [1996] propose three compiler-based prefetching algorithms: greedy, history-pointer, and data-linearization prefetching. Chilimbi et al. [1999a, 1999b] reorganize data layouts to improve cache performance for irregular applications. Zhang and Torrellas [1995] use object information to guide prefetching for irregular applications in shared-memory multiprocessors. Mehrotra and Harrison [1996] extend stride detection schemes to capture both linear and recurrent access patterns. Roth et al. [1998] and Roth and Sohi [1999] propose a dynamic scheme to capture linked data structure (LDS) traversal kernels and a jump-pointer prefetching framework to overcome the pointer-chasing problem. Ideally, the jump-pointer scheme can tolerate any amount of latency as long as the jump-pointer interval is set appropriately.
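The role of the jump-pointer interval can be sketched in software. The following is a minimal model, not the implementation of Roth and Sohi: the `Node` layout, the function names, and the rule of thumb `interval = ceil(latency / work per node)` are assumptions made for illustration, and a prefetch is modeled simply as marking a node as covered.

```python
import math
from collections import deque


class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.jump = None  # jump pointer, filled in by install_jump_pointers


def build_list(values):
    """Build a singly linked list and return its head."""
    head = tail = None
    for v in values:
        n = Node(v)
        if head is None:
            head = tail = n
        else:
            tail.next = n
            tail = n
    return head


def install_jump_pointers(head, interval):
    """One traversal pass: node i receives a pointer to node i + interval."""
    window = deque()
    node = head
    while node is not None:
        window.append(node)
        if len(window) > interval:
            window.popleft().jump = node
        node = node.next


def jump_interval(memory_latency, work_per_node):
    """Assumed rule of thumb: look far enough ahead to cover one memory access."""
    return max(1, math.ceil(memory_latency / work_per_node))


def traverse(head):
    """Walk the list, 'prefetching' through jump pointers; return (covered, total)."""
    prefetched = set()
    covered = total = 0
    node = head
    while node is not None:
        total += 1
        if id(node) in prefetched:
            covered += 1
        if node.jump is not None:
            prefetched.add(id(node.jump))
        node = node.next
    return covered, total
```

In this model the first `interval` nodes of a traversal are never covered, and the installation pass itself is an extra traversal, which is one source of the run-time overhead of the scheme.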

However, determining a suitable interval for each application is not a trivial task. Roth et al. do not provide a mechanism to adapt the jump-pointer interval on a per-application basis. Moreover, jump-pointer prefetching has several limitations. First, the jump-pointer scheme incurs nontrivial run-time overhead; it can therefore hurt performance for applications whose traversal patterns are not suitable for jump-pointer prefetching (e.g., dynamically changing data structures). Even though the push architecture does not work well for highly dynamic structures either, it does not degrade performance. Second, the performance of jump-pointer prefetching is affected by the number of traversals performed on a data set, since it requires one pass to install jump pointers. For applications that traverse data sets only once, jump pointers have to be installed during structure construction. If the construction order is different from the traversal order, jump-pointer prefetching is not able to generate correct future addresses. The push architecture does not have this limitation. Third, since jump-pointer prefetching relies on earlier traversals to install jump pointers, it does not work well if traversal orders change frequently (e.g., tree searching). In contrast, the push architecture executes traversal kernels to generate future addresses; it can therefore still provide correct prefetches even when traversal orders are not fixed. Karlsson et al. [1999] present a prefetch array approach, which aggressively prefetches all possible nodes a few iterations ahead. The downside of this approach is that it can issue many unnecessary prefetches.
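The cost of prefetching every possible successor can be made concrete with a small model of a binary search tree lookup. This is a simplified sketch, not the design of Karlsson et al.: the names `TreeNode`, `build_balanced`, `descendants_at`, and `search_with_prefetch_array`, and the idea of counting a prefetch as useful when the search later visits that node, are assumptions made for illustration.

```python
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def build_balanced(keys):
    """Build a balanced binary search tree from a sorted list of keys."""
    if not keys:
        return None
    mid = len(keys) // 2
    return TreeNode(keys[mid],
                    build_balanced(keys[:mid]),
                    build_balanced(keys[mid + 1:]))


def descendants_at(node, depth):
    """Return all nodes exactly `depth` levels below `node`."""
    if node is None:
        return []
    if depth == 0:
        return [node]
    return (descendants_at(node.left, depth - 1)
            + descendants_at(node.right, depth - 1))


def search_with_prefetch_array(root, key, depth):
    """Search for `key`, prefetching every node `depth` levels ahead at each step.

    Returns (prefetches issued, prefetches that hit a node the search visited).
    """
    prefetched = set()
    issued = useful = 0
    node = root
    while node is not None:
        if node.key in prefetched:
            useful += 1
        targets = descendants_at(node, depth)
        issued += len(targets)
        prefetched.update(t.key for t in targets)
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return issued, useful
```

For a full tree holding keys 1 to 15 and a search for key 1, this model issues 6 prefetches of which 3 are useful when looking one level ahead, and 8 prefetches of which 2 are useful when looking two levels ahead; the wasted fraction grows with the look-ahead depth because only one node per level lies on the search path.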

Several studies [Kang et al. 1999; Oskin et al. 1998; Patterson et al. 1997] also combine processing power and memory on the same chip. The main function of these DRAM processors is to perform computation. Impulse [Carter et al. 1999] provides configurable physical address remapping in the memory controller to improve bus and cache utilization. The memory controller is also capable of prefetching data, but it prefetches only the next cache line, and data are not pushed up the memory hierarchy as proposed in this work. Concurrent with this study, Hughes [2002] evaluates memory-side prefetching in multiprocessor systems. His scheme does not provide solutions for two important design issues of the push model: the interaction protocol among prefetch engines (PFEs) at different memory modules, and a mechanism to synchronize CPU and PFE execution.

Solihin et al. [2002] also propose a memory-side prefetching scheme. They adopt the push model proposed in this work, but they employ correlation prefetching and target general irregular applications rather than the linked data structures targeted by this work. Even though they solve many problems of conventional correlation prefetching, their performance is still limited if repeated access patterns are absent. In contrast, our scheme does not rely on the past address history for future address prediction.

Some work also proposes using a separate processor for memory access. Structured memory access [Pleszkun and Davidson 1983] and decoupled access/execute [Smith 1982b] architectures try to overlap demand memory requests with computation. VanderWiel and Lilja [1999] propose a separate processor that prefetches for regular applications. Recently, several studies [Annavaram et al. 2001; Collins et al. 2001; Dundas and Mudge 1997; Luk 2001; Roth and Sohi 2001; Sundaramoorthy et al. 2000; Zilles and Sohi 2001] suggest using pre-execution to improve cache performance for irregular applications. The idea of pre-execution is to execute a sequence of instructions (a speculative slice) early and speculatively to hide memory latency. These schemes either employ a separate processor at the L1 level to execute a speculative slice or simply invoke a helper thread if the CPU is a multithreading processor. This is essentially the pull model that we evaluate, and it can be limited by conventional pull-based data movement.

6. CONCLUSIONS

In this paper, we propose a cooperative hardware/software prefetching framework for linked data structures: the push architecture. The push architecture

scheme.

Our simulation results show that the push architecture reduces overall execution time by 13% to 23% for applications with very tight loops, for which the traditional pull model is not able to run ahead of the CPU to give a significant performance improvement. Tree traversals benefit most from adding a small data cache in the L2/memory PFEs (e.g., treeadd). We have also shown that the proposed throttle mechanism successfully adjusts the prefetch distance to avoid early prefetches. For applications with enough computation between node accesses, the push architecture achieves performance comparable to a perfect memory system. Simulations also show that the 2 PFE architecture, which attaches PFEs only to the L1 and main memory levels, performs comparably to the 3 PFE architecture, which attaches a PFE to each level of the memory hierarchy. We also compare the push architecture against jump-pointer prefetching, the state-of-the-art pull-based prefetching mechanism, and find that the push architecture provides average performance comparable to that of jump-pointer prefetching.

REFERENCES

ALEXANDER, T. AND KEDEM, G. 1996. Distributed predictive cache design for high performance memory system. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture. 254–263.

ANNAVARAM, M. M., PATEL, J. M., AND DAVIDSON, E. S. 2001. Data prefetching by dependence graph precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 52–61.

BAER, J.-L. AND CHEN, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of the 1991 Conference on Supercomputing. 176–186.

BURGER, D. C., AUSTIN, T. M., AND BENNETT, S. 1996. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Tech. Rep. 1308, Computer Sciences Department, University of Wisconsin–Madison. July.

CALLAHAN, D., KENNEDY, K., AND PORTERFIELD, A. 1991. Software prefetching. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). 40–52.

CARTER, J., HSIEH, W., STOLLER, L., SWANSON, M., AND ZHANG, L. 1999. Impulse: Building a smarter memory controller. In Proceedings of 5th Symposium High-Performance Computer Architecture. 70–79.

CHILIMBI, T., LARUS, J., AND HILL, M. 1998. Improving Pointer-Based Codes through Cache-Conscious Data Placement. Tech. Rep. CSL-TR-98-1365, University of Wisconsin, Madison. March.

CHILIMBI, T. M., DAVIDSON, B., AND LARUS, J. R. 1999a. Cache-conscious structure definition. In Proceedings of the SIGPLAN ’99 Conference on Programming Language Design and Implementation. 13–24.

CHILIMBI, T. M., HILL, M. D., AND LARUS, J. R. 1999b. Cache-conscious structure layout. In Proceedings of the SIGPLAN ’99 Conference on Programming Language Design and Implementation. 1–12.

COLLINS, J. D., WANG, H., TULLSEN, D. M., CHRISTOPHER, H. J., LEE, Y. F., LAVERY, D., AND SHEN, J. P. 2001. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 14–25.

DUNDAS, J. AND MUDGE, T. 1997. Improving data cache performance by pre-executing instructions under a cache miss. In Proceedings of the ACM International Conference on Supercomputing. 176–186.

FU, J. W. C. AND PATEL, J. H. 1992. Stride directed prefetching in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture. 102–110.

GWENNAP, L. 1998. Alpha 21364 to ease memory bottleneck. Microprocessor Report.

HUGHES, C. J. 2002. Prefetching linked data structures in systems with merged dram-logic. M.S. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign.

JOSEPH, D. AND GRUNWALD, D. 1997. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 252–263.

JOUPPI, N. P. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture. 364–373.

KANG, Y., HUANG, W., YOO, S.-M., KEEN, D., GE, Z., LAM, V., PATTNAIK, P., AND TORRELLAS, J. 1999. Flexram: Toward an advanced intelligent memory system. In Proceedings of the 1999 International Conference on Computer Design. 192–201.

KARLSSON, M., DAHLGREN, F., AND STENSTROM, P. 1999. A prefetching technique for irregular accesses to linked data structures. In Proceedings of 6th Symposium High-Performance Computer Architecture. 206–217.

KESSLER, R. E. 1999. The alpha 21264 microprocessor. IEEE Micro. 34–36.

KLAIBER, A. C. AND LEVY, H. M. 1991. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture. 43–53.

KOLB, C. The rayshade user’s guide. In http://graphics.stanford.edu/- cek/-rayshade.

KROFT, D. 1981. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual International Symposium on Computer Architecture. 81–87.

LEBECK, A. AND WOOD, D. 1994. Cache profiling and the spec benchmarks: A case study. IEEE Computer. 15–26.

LEIBSON, S. 2000. Xscale (strongarm-2) muscles. Microprocessor Report.

LIPASTI, M. H., SCHMIDT, W. J., KUNKEL, S. R., AND ROEDIGER, R. R. 1995. Spaid: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture. 252–263.

LUK, C.-K. 2001. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 40–51.

LUK, C.-K. AND MOWRY, T. C. 1996. Compiler based prefetching for recursive data structure. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII). 222–233.

MEHROTRA, S. AND HARRISON, L. 1996. Examination of a memory access classification scheme for pointer-intensive and numeric program. In Proceedings of the 10th International Conference on Supercomputing. 133–139.

MOWRY, T. C., LAM, M. S., AND GUPTA, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating System. 62–73.

OSKIN, M., CHONG, F. T., AND SHERWOOD, T. 1998. Active pages: a computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture. 192–203.

PATTERSON, D., ANDERSON, T., CARDWELL, N., FROMM, R., KEETON, K., KOZYRAKIS, C., THOMAS, R., AND YELICK, K. 1997. A case for intelligent ram. IEEE Micro. 34–44.

PLESZKUN, A. R. AND DAVIDSON, E. S. 1983. Structured memory access architecture. In Proceedings of International Conference on Parallel Processing. 461–471.

Proceedings of the International Society for Optical Engineers. 241–248.

SMITH, J. E. 1982b. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual International Symposium on Computer Architecture. 112–119.

SOHI, G. 1990. Instruction issue logic for high performance, interruptable, multiple functional unit, pipelined computers. IEEE Trans. Comput. 39, 3 (March), 349–359.

SOLIHIN, Y., LEE, J., AND TORRELLAS, J. 2002. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th Annual International Symposium on Computer Architecture. 171–182.

SUNDARAMOORTHY, K., PURSER, Z., AND ROTENBERG, E. 2000. Slipstream processors: Improving both performance and fault tolerance. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX). 257–268.

VANDERWIEL, S. AND LILJA, D. J. 1999. A Compiler-Assisted Data Prefetch Controller. Tech. Rep. ARCTiC-99-05, Department of Electrical and Computer Engineering, University of Minnesota. May.

YANG, C. AND LEBECK, A. R. 2000. Push vs. pull: Data movement for linked data structures. In Proceedings of the ACM International Conference on Supercomputing. 176–186.

ZHANG, Z. AND TORRELLAS, J. 1995. Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. 188–200.

ZILLES, C. B. AND SOHI, G. 2001. Execution-based prediction using speculative slices. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 2–13.

Received July 2003; revised May 2004 and October 2004; accepted October 2004
