Multithreading and Simultaneous Multithreading

The concept of multithreading dates back to one of the earliest transistorized computers, the TX-2. TX-2 is also famous for being the computer on which Ivan Sutherland created Sketchpad, the first computer graphics system. TX-2 was built at MIT’s Lincoln Laboratory and became operational in 1959. It used multiple threads to support fast context switching to handle I/O functions. Clark [1957] described the basic architecture, and Forgie [1957] described the I/O architecture.

Multithreading was also used in the CDC 6600, where a fine-grained multithreading scheme with interleaved scheduling among threads was used as the architecture of the I/O processors. The HEP processor, a pipelined multiprocessor designed by Denelcor and shipped in 1982, used fine-grained multithreading to hide the pipeline latency as well as to hide the latency to a large memory shared among all the processors. Because the HEP had no cache, this hiding of memory latency was critical. Burton Smith, one of the primary architects, described the HEP architecture in a 1978 paper, and Jordan [1983] published a performance evaluation. The TERA processor extends the multithreading ideas and is described by Alverson et al. in a 1992 paper. The Niagara multithreading approach is similar to those of the HEP and TERA systems, although Niagara employs caches, reducing the need for thread-based latency hiding.

In the late 1980s and early 1990s, researchers explored the concept of coarse-grained multithreading (also called block multithreading) as a way to tolerate latency, especially in multiprocessor environments. The SPARCLE processor in the Alewife system used such a scheme, switching threads whenever a high-latency exceptional event, such as a long cache miss, occurred. Agarwal et al. described SPARCLE in a 1993 paper. The IBM Pulsar processor uses similar ideas.

By the early 1990s, several research groups had arrived at two key insights. First, they realized that fine-grained multithreading was needed to get the maximum performance benefit, since in a coarse-grained approach, the overhead of thread switching and thread start-up (e.g., filling the pipeline from the new thread) negated much of the performance advantage (see Laudon, Gupta, and Horowitz [1994]). Second, several groups realized that to effectively use large numbers of functional units would require both ILP and thread-level parallelism (TLP). These insights led to several architectures that used combinations of multithreading and multiple issue. Wolfe and Shen [1991] described an architecture called XIMD that statically interleaves threads scheduled for a VLIW processor. Hirata et al. [1992] described a proposed processor for media use that combines a static superscalar pipeline with support for multithreading; they reported speedups from combining both forms of parallelism. Keckler and Dally [1992] combined static scheduling of ILP and dynamic scheduling of threads for a processor with multiple functional units. The question of how to balance the allocation of functional units between ILP and TLP and how to schedule the two forms of parallelism remained open.

When it became clear in the mid-1990s that dynamically scheduled superscalars would be delivered shortly, several research groups proposed using the dynamic scheduling capability to mix instructions from several threads on the fly. Yamamoto et al. [1994] appear to have published the first such proposal, though the simulation results for their multithreaded superscalar architecture use simplistic assumptions. This work was quickly followed by Tullsen, Eggers, and Levy [1995], who provided the first realistic simulation assessment and coined the term simultaneous multithreading. Subsequent work by the same group together with industrial coauthors addressed many of the open questions about SMT. For example, Tullsen et al. [1996] addressed questions about the challenges of scheduling ILP versus TLP. Lo et al. [1997] provided an extensive discussion of the SMT concept and an evaluation of its performance potential, and Lo et al. [1998] evaluated database performance on an SMT processor. Tuck and Tullsen [2003] reviewed the performance of SMT on the Pentium 4.

The IBM Power4 introduced multithreading (see Tendler et al. [2002]), while the Power5 used simultaneous multithreading. Mathis et al. [2005] explored the performance of SMT in the Power5, while Sinharoy et al. [2005] described the system architecture.

References

Agarwal, A., J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D’Souza, and M. Parkin [1993]. “Sparcle: An evolutionary processor design for large-scale multiprocessors,” IEEE Micro 13 (June), 48–61.

Agerwala, T., and J. Cocke [1987]. High Performance Reduced Instruction Set Processors, Tech. Rep. RC12434, IBM Thomas Watson Research Center, Yorktown Heights, N.Y.

Alverson, G., R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. Smith [1992]. “Exploiting heterogeneous parallelism on a multithreaded multiprocessor,” Proc. ACM/IEEE Conf. on Supercomputing, November 16–20, 1992, Minneapolis, Minn., 188–197.

Anderson, D. W., F. J. Sparacio, and R. M. Tomasulo [1967]. “The IBM 360 Model 91: Processor philosophy and instruction handling,” IBM J. Research and Development 11:1 (January), 8–24.

Austin, T. M., and G. Sohi [1992]. “Dynamic dependency analysis of ordinary programs,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 342–351.

Babbay, F., and A. Mendelson [1998]. “Using value prediction to increase the power of speculative execution hardware,” ACM Trans. on Computer Systems 16:3 (August), 234–270.

Bakoglu, H. B., G. F. Grohoski, L. E. Thatcher, J. A. Kaeli, C. R. Moore, D. P. Tattle, W. E. Male, W. R. Hardell, D. A. Hicks, M. Nguyen Phu, R. K. Montoye, W. T. Glover, and S. Dhawan [1989]. “IBM second-generation RISC processor organization,” Proc. IEEE Int’l. Conf. on Computer Design, October, Rye Brook, N.Y., 138–142.

Ball, T., and J. Larus [1993]. “Branch prediction for free,” Proc. ACM SIGPLAN’93 Conference on Programming Language Design and Implementation (PLDI), June 23–25, 1993, Albuquerque, N.M., 300–313.

Bhandarkar, D., and D. W. Clark [1991]. “Performance from architecture: Comparing a RISC and a CISC with similar hardware organizations,” Proc. Fourth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Palo Alto, Calif., 310–319.

Bhandarkar, D., and J. Ding [1997]. “Performance characterization of the Pentium Pro processor,” Proc. Third Int’l. Symposium on High Performance Computer Architecture, February 1–5, 1997, San Antonio, Tex., 288–297.

Bloch, E. [1959]. “The engineering design of the Stretch computer,” Proc. Eastern Joint Computer Conf., December 1–3, 1959, Boston, Mass., 48–59.

Bucholtz, W. [1962]. Planning a Computer System: Project Stretch, McGraw-Hill, New York.

Calder, B., D. Grunwald, M. Jones, D. Lindsay, J. Martin, M. Mozer, and B. Zorn [1997]. “Evidence-based static branch prediction using machine learning,” ACM Trans. Program. Lang. Syst. 19:1, 188–222.

Calder, B., G. Reinman, and D. M. Tullsen [1999]. “Selective value prediction,” Proc. 26th Annual Int’l. Symposium on Computer Architecture (ISCA), May 2–4, 1999, Atlanta, Ga.

Chang, P. P., S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu [1991]. “IMPACT: An architectural framework for multiple-instruction-issue processors,” Proc. 18th Annual Int’l. Symposium on Computer Architecture (ISCA), May 27–30, 1991, Toronto, Canada, 266–275.

Charlesworth, A. E. [1981]. “An approach to scientific array processing: The architecture design of the AP-120B/FPS-164 family,” Computer 14:9 (September), 18–27.

Chen, T. C. [1980]. “Overlap and parallel processing,” in Introduction to Computer Architecture, H. Stone, ed., Science Research Associates, Chicago, 427–486.

Chrysos, G. Z., and J. S. Emer [1998]. “Memory dependence prediction using store sets,” Proc. 25th Annual Int’l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 142–153.

Clark, D. W. [1987]. “Pipelining and performance in the VAX 8800 processor,” Proc. Second Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 173–177.

Clark, W. A. [1957]. “The Lincoln TX-2 computer development,” Proc. Western Joint Computer Conference, February 26–28, 1957, Los Angeles, 143–145.

Colwell, R. P., and R. Steck [1995]. “A 0.6 µm BiCMOS processor with dynamic execution,” Proc. of IEEE Int’l. Symposium on Solid State Circuits (ISSCC), February 15–17, 1995, San Francisco, 176–177.

Colwell, R. P., R. P. Nix, J. J. O’Donnell, D. B. Papworth, and P. K. Rodman [1987]. “A VLIW architecture for a trace scheduling compiler,” Proc. Second Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 180–192.

Cvetanovic, Z., and R. E. Kessler [2000]. “Performance analysis of the Alpha 21264-based Compaq ES40 system,” 27th Annual Int’l. Symposium on Computer Architecture (ISCA), June 10–14, 2000, Vancouver, Canada, 192–202.

Davidson, E. S. [1971]. “The design and control of pipelined function generators,” Proc. IEEE Conf. on Systems, Networks, and Computers, January 19–21, 1971, Oaxtepec, Mexico, 19–21.

Davidson, E. S., A. T. Thomas, L. E. Shar, and J. H. Patel [1975]. “Effective control for pipelined processors,” Proc. IEEE COMPCON, February 25–27, 1975, San Francisco, 181–184.

Dehnert, J. C., P. Y.-T. Hsu, and J. P. Bratt [1989]. “Overlapped loop support on the Cydra 5,” Proc. Third Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, Mass., 26–39.

Diep, T. A., C. Nelson, and J. P. Shen [1995]. “Performance evaluation of the PowerPC 620 microarchitecture,” Proc. 22nd Annual Int’l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy.

Ditzel, D. R., and H. R. McLellan [1987]. “Branch folding in the CRISP microprocessor: Reducing the branch delay to zero,” Proc. 14th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–5, 1987, Pittsburgh, Penn., 2–7.

Douglas, J. [2005]. “Intel 8xx series and Paxville Xeon-MP Microprocessors,” paper presented at Hot Chips 17, August 14–16, 2005, Stanford University, Palo Alto, Calif.

Eden, A., and T. Mudge [1998]. “The YAGS branch prediction scheme,” Proc. of the 31st Annual ACM/IEEE Int’l. Symposium on Microarchitecture, November 30–December 2, 1998, Dallas, Tex., 69–80.

Edmondson, J. H., P. I. Rubinfield, R. Preston, and V. Rajagopalan [1995]. “Superscalar instruction execution in the 21164 Alpha microprocessor,” IEEE Micro 15:2, 33–43.

Ellis, J. R. [1986]. Bulldog: A Compiler for VLIW Architectures, MIT Press, Cambridge, Mass.

Emer, J. S., and D. W. Clark [1984]. “A characterization of processor performance in the VAX-11/780,” Proc. 11th Annual Int’l. Symposium on Computer Architecture (ISCA), June 5–7, 1984, Ann Arbor, Mich., 301–310.

Evers, M., S. J. Patel, R. S. Chappell, and Y. N. Patt [1998]. “An analysis of correlation and predictability: What makes two-level branch predictors work,” Proc. 25th Annual Int’l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 52–61.

Fisher, J. A. [1981]. “Trace scheduling: A technique for global microcode compaction,” IEEE Trans. on Computers 30:7 (July), 478–490.

Fisher, J. A. [1983]. “Very long instruction word architectures and ELI-512,” 10th Annual Int’l. Symposium on Computer Architecture (ISCA), June 5–7, 1982, Stockholm, Sweden, 140–150.

Fisher, J. A., and S. M. Freudenberger [1992]. “Predicting conditional branches from previous runs of a program,” Proc. Fifth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 85–95.

Fisher, J. A., and B. R. Rau [1993]. Journal of Supercomputing, January (special issue).

Fisher, J. A., J. R. Ellis, J. C. Ruttenberg, and A. Nicolau [1984]. “Parallel processing: A smart compiler and a dumb processor,” Proc. SIGPLAN Conf. on Compiler Construction, June 17–22, 1984, Montreal, Canada, 11–16.

Forgie, J. W. [1957]. “The Lincoln TX-2 input-output system,” Proc. Western Joint Computer Conference, February 26–28, 1957, Los Angeles, 156–160.

Foster, C. C., and E. M. Riseman [1972]. “Percolation of code to enhance parallel dispatching and execution,” IEEE Trans. on Computers C-21:12 (December), 1411–1415.

Gallagher, D. M., W. Y. Chen, S. A. Mahlke, J. C. Gyllenhaal, and W. W. Hwu [1994]. “Dynamic memory disambiguation using the memory conflict buffer,” Proc. Sixth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 4–7, San Jose, Calif., 183–193.

González, J., and A. González [1998]. “Limits of instruction level parallelism with data speculation,” Proc. Vector and Parallel Processing (VECPAR) Conf., June 21–23, 1998, Porto, Portugal, 585–598.

Heinrich, J. [1993]. MIPS R4000 User’s Manual, Prentice Hall, Englewood Cliffs, N.J.

Hinton, G., D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel [2001]. “The microarchitecture of the Pentium 4 processor,” Intel Technology Journal, February.

Hirata, H., K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa [1992]. “An elementary processor architecture with simultaneous instruction issuing from multiple threads,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 136–145.

Hopkins, M. [2000]. “A critical look at IA-64: Massive resources, massive ILP, but can it deliver?” Microprocessor Report, February.

Hsu, P. [1994]. “Designing the TFP microprocessor,” IEEE Micro 18:2 (April), 23–33.

Huck, J. et al. [2000]. “Introducing the IA-64 Architecture,” IEEE Micro 20:5 (September–October), 12–23.

Hwu, W.-M., and Y. Patt [1986]. “HPSm, a high performance restricted data flow architecture having minimum functionality,” 13th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–5, 1986, Tokyo, 297–307.

Hwu, W. W., S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. O. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery [1993]. “The superblock: An effective technique for VLIW and superscalar compilation,” J. Supercomputing 7:1, 2 (March), 229–248.

IBM. [1990]. “The IBM RISC System/6000 processor” (collection of papers), IBM J. Research and Development 34:1 (January).

Jimenez, D. A., and C. Lin [2002]. “Neural methods for dynamic branch prediction,” ACM Trans. Computer Sys 20:4 (November), 369–397.

Johnson, M. [1990]. Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, N.J.

Jordan, H. F. [1983]. “Performance measurements on HEP—a pipelined MIMD computer,” Proc. 10th Annual Int’l. Symposium on Computer Architecture (ISCA), June 5–7, 1982, Stockholm, Sweden, 207–212.

Jouppi, N. P., and D. W. Wall [1989]. “Available instruction-level parallelism for superscalar and superpipelined processors,” Proc. Third Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, 272–282.

Kaeli, D. R., and P. G. Emma [1991]. “Branch history table prediction of moving target branches due to subroutine returns,” Proc. 18th Annual Int’l. Symposium on Computer Architecture (ISCA), May 27–30, 1991, Toronto, Canada, 34–42.

Keckler, S. W., and W. J. Dally [1992]. “Processor coupling: Integrating compile time and runtime scheduling for parallelism,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 202–213.

Keller, R. M. [1975]. “Look-ahead processors,” ACM Computing Surveys 7:4 (December), 177–195.

Keltcher, C. N., K. J. McGrath, A. Ahmed, and P. Conway [2003]. “The AMD Opteron processor for multiprocessor servers,” IEEE Micro 23:2 (March–April), 66–76.

Kessler, R. [1999]. “The Alpha 21264 microprocessor,” IEEE Micro 19:2 (March/April), 24–36.

Killian, E. [1991]. “MIPS R4000 technical overview–64 bits/100 MHz or bust,” Hot Chips III Symposium Record, August 26–27, 1991, Stanford University, Palo Alto, Calif., 1.6–1.19.

Kogge, P. M. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York.

Kumar, A. [1997]. “The HP PA-8000 RISC CPU,” IEEE Micro 17:2 (March/April).

Kunkel, S. R., and J. E. Smith [1986]. “Optimal pipelining in supercomputers,” Proc. 13th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–5, 1986, Tokyo, 404–414.

Lam, M. [1988]. “Software pipelining: An effective scheduling technique for VLIW processors,” SIGPLAN Conf. on Programming Language Design and Implementation, June 22–24, 1988, Atlanta, Ga., 318–328.

Lam, M. S., and R. P. Wilson [1992]. “Limits of control flow on parallelism,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 46–57.

Laudon, J., A. Gupta, and M. Horowitz [1994]. “Interleaving: A multithreading technique targeting multiprocessors and workstations,” Proc. Sixth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 4–7, San Jose, Calif., 308–318.

Lauterbach, G., and T. Horel [1999]. “UltraSPARC-III: Designing third generation 64-bit performance,” IEEE Micro 19:3 (May/June).

Lipasti, M. H., and J. P. Shen [1996]. “Exceeding the dataflow limit via value prediction,” Proc. 29th Int’l. Symposium on Microarchitecture, December 2–4, 1996, Paris, France.

Lipasti, M. H., C. B. Wilkerson, and J. P. Shen [1996]. “Value locality and load value prediction,” Proc. Seventh Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1–5, 1996, Cambridge, Mass., 138–147.

Lo, J., L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh [1998]. “An analysis of database workload performance on simultaneous multithreaded processors,” Proc. 25th Annual Int’l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 39–50.

Lo, J., S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen [1997]. “Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading,” ACM Trans. on Computer Systems 15:2 (August), 322–354.

Mahlke, S. A., W. Y. Chen, W.-M. Hwu, B. R. Rau, and M. S. Schlansker [1992]. “Sentinel scheduling for VLIW and superscalar processors,” Proc. Fifth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 238–247.

Mahlke, S. A., R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu [1995]. “A comparison of full and partial predicated execution support for ILP processors,” Proc. 22nd Annual Int’l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy, 138–149.

Mathis, H. M., A. E. Mercias, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel [2005]. “Characterization of the multithreading (SMT) efficiency in Power5,” IBM J. of Research and Development, 49:4/5 (July/September), 555–564.

McCormick, J., and A. Knies [2002]. “A brief analysis of the SPEC CPU2000 benchmarks on the Intel Itanium 2 processor,” paper presented at Hot Chips 14, August 18–20, 2002, Stanford University, Palo Alto, Calif.

McFarling, S. [1993]. Combining Branch Predictors, WRL Technical Note TN-36, Digital Western Research Laboratory, Palo Alto, Calif.

McFarling, S., and J. Hennessy [1986]. “Reducing the cost of branches,” Proc. 13th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–5, 1986, Tokyo, 396–403.

McNairy, C., and D. Soltis [2003]. “Itanium 2 processor microarchitecture,” IEEE Micro 23:2 (March–April), 44–55.

Moshovos, A., and G. S. Sohi [1997]. “Streamlining inter-operation memory communication via data dependence prediction,” Proc. 30th Annual Int’l. Symposium on Microarchitecture, December 1–3, Research Triangle Park, N.C., 235–245.

Moshovos, A., S. Breach, T. N. Vijaykumar, and G. S. Sohi [1997]. “Dynamic speculation and synchronization of data dependences,” Proc. 24th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo.

Nicolau, A., and J. A. Fisher [1984]. “Measuring the parallelism available for very long instruction word architectures,” IEEE Trans. on Computers C-33:11 (November), 968–976.

Pan, S.-T., K. So, and J. T. Rameh [1992]. “Improving the accuracy of dynamic branch prediction using branch correlation,” Proc. Fifth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 76–84.

Postiff, M. A., D. A. Greene, G. S. Tyson, and T. N. Mudge [1999]. “The limits of instruction level parallelism in SPEC95 applications,” Computer Architecture News 27:1 (March), 31–40.

Ramamoorthy, C. V., and H. F. Li [1977]. “Pipeline architecture,” ACM Computing Surveys 9:1 (March), 61–102.

Rau, B. R. [1994]. “Iterative modulo scheduling: An algorithm for software pipelining loops,” Proc. 27th Annual Int’l. Symposium on Microarchitecture, November 30–December 2, 1994, San Jose, Calif., 63–74.

Rau, B. R., C. D. Glaeser, and R. L. Picard [1982]. “Efficient code generation for horizontal architectures: Compiler techniques and architectural support,” Proc. Ninth Annual Int’l. Symposium on Computer Architecture (ISCA), April 26–29, 1982, Austin, Tex., 131–139.

Rau, B. R., D. W. L. Yen, W. Yen, and R. A. Towle [1989]. “The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs,” IEEE Computers 22:1 (January), 12–34.

Riseman, E. M., and C. C. Foster [1972]. “Percolation of code to enhance parallel dispatching and execution,” IEEE Trans. on Computers C-21:12 (December), 1411–1415.

Rymarczyk, J. [1982]. “Coding guidelines for pipelined processors,” Proc. Symposium Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 1–3, 1982, Palo Alto, Calif., 12–19.

Sharangpani, H., and K. Arora [2000]. “Itanium Processor Microarchitecture,” IEEE Micro, 20:5 (September–October), 24–43.

Sinharoy, B., R. N. Koala, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner [2005]. “POWER5 system microarchitecture,” IBM J. of Research and Development, 49:4–5, 505–521.

Sites, R. [1979]. Instruction Ordering for the CRAY-1 Computer, Tech. Rep. 78-CS-023, Dept. of Computer Science, University of California, San Diego.

Skadron, K., P. S. Ahuja, M. Martonosi, and D. W. Clark [1999]. “Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques,” IEEE Trans. on Computers, 48:11 (November).

Smith, A., and J. Lee [1984]. “Branch prediction strategies and branch-target buffer design,” Computer 17:1 (January), 6–22.

Smith, B. J. [1978]. “A pipelined, shared resource MIMD computer,” Proc. Int’l. Conf. on Parallel Processing (ICPP), August, Bellaire, Mich., 6–8.

Smith, J. E. [1981]. “A study of branch prediction strategies,” Proc. Eighth Annual Int’l. Symposium on Computer Architecture (ISCA), May 12–14, 1981, Minneapolis, Minn., 135–148.

Smith, J. E. [1984]. “Decoupled access/execute computer architectures,” ACM Trans. on Computer Systems 2:4 (November), 289–308.

Smith, J. E. [1989]. “Dynamic instruction scheduling and the Astronautics ZS-1,” Computer 22:7 (July), 21–35.

Smith, J. E., and A. R. Pleszkun [1988]. “Implementing precise interrupts in pipelined processors,” IEEE Trans. on Computers 37:5 (May), 562–573. (This paper is based on an earlier paper that appeared in Proc. 12th Annual Int’l. Symposium on Computer Architecture (ISCA), June 17–19, 1985, Boston, Mass.)

Smith, J. E., G. E. Dermer, B. D. Vanderwarn, S. D. Klinger, C. M. Rozewski, D. L. Fowler, K. R. Scidmore, and J. P. Laudon [1987]. “The ZS-1 central processor,” Proc. Second Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 199–204.

Smith, M. D., M. Horowitz, and M. S. Lam [1992]. “Efficient superscalar performance through boosting,” Proc. Fifth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 248–259.

Smith, M. D., M. Johnson, and M. A. Horowitz [1989]. “Limits on multiple instruction issue,” Proc. Third Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, 290–302.

Sodani, A., and G. Sohi [1997]. “Dynamic instruction reuse,” Proc. 24th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo.

Sohi, G. S. [1990]. “Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers,” IEEE Trans. on Computers 39:3 (March), 349–359.

Sohi, G. S., and S. Vajapeyam [1989]. “Tradeoffs in instruction format design for horizontal architectures,” Proc. Third Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, 15–25.

Sussenguth, E. [1999]. “IBM’s ACS-1 Machine,” IEEE Computer 22:11 (November).

Tendler, J. M., J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy [2002]. “Power4 system microarchitecture,” IBM J. of Research and Development, 46:1, 5–26.

Thorlin, J. F. [1967]. “Code generation for PIE (parallel instruction execution) computers,” Proc. Spring Joint Computer Conf., April 18–20, 1967, Atlantic City, N.J., 27.

Thornton, J. E. [1964]. “Parallel operation in the Control Data 6600,” Proc. AFIPS Fall Joint Computer Conf., Part II, October 27–29, 1964, San Francisco, 26, 33–40.

Thornton, J. E. [1970]. Design of a Computer, the Control Data 6600, Scott, Foresman, Glenview, Ill.

Tjaden, G. S., and M. J. Flynn [1970]. “Detection and parallel execution of independent instructions,” IEEE Trans. on Computers C-19:10 (October), 889–895.

Tomasulo, R. M. [1967]. “An efficient algorithm for exploiting multiple arithmetic units,” IBM J. Research and Development 11:1 (January), 25–33.

Tuck, N., and D. Tullsen [2003]. “Initial observations of the simultaneous multithreading Pentium 4 processor,” Proc. 12th Int. Conf. on Parallel Architectures and Compilation Techniques (PACT’03), September 27–October 1, New Orleans, La., 26–34.

Tullsen, D. M., S. J. Eggers, and H. M. Levy [1995]. “Simultaneous multithreading: Maximizing on-chip parallelism,” Proc. 22nd Annual Int’l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy, 392–403.

Tullsen, D. M., S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm [1996]. “Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor,” Proc. 23rd Annual Int’l. Symposium on Computer Architecture (ISCA), May 22–24, 1996, Philadelphia, Penn., 191–202.

Wall, D. W. [1991]. “Limits of instruction-level parallelism,” Proc. Fourth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Palo Alto, Calif., 248–259.

Wall, D. W. [1993]. Limits of Instruction-Level Parallelism, Research Rep. 93/6, Western Research Laboratory, Digital Equipment Corp., Palo Alto, Calif.

Weiss, S., and J. E. Smith [1984]. “Instruction issue logic for pipelined supercomputers,” Proc. 11th Annual Int’l. Symposium on Computer Architecture (ISCA), June 5–7, 1984, Ann Arbor, Mich., 110–118.

Weiss, S., and J. E. Smith [1987]. “A study of scalar compilation techniques for pipelined supercomputers,” Proc. Second Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 105–109.

Wilson, R. P., and M. S. Lam [1995]. “Efficient context-sensitive pointer analysis for C programs,” Proc. ACM SIGPLAN’95 Conf. on Programming Language Design and Implementation, June 18–21, 1995, La Jolla, Calif., 1–12.

Wolfe, A., and J. P. Shen [1991]. “A variable instruction stream extension to the VLIW architecture,” Proc. Fourth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Palo Alto, Calif.,
