A Novel Dimensionally-Decomposed Router for On-Chip Communication in 3D Architectures ∗


Jongman Kim†, Chrysostomos Nicopoulos†, Dongkook Park†, Reetuparna Das†, Yuan Xie†, N. Vijaykrishnan†, Mazin S. Yousif‡, Chita R. Das†

†Dept. of CSE, The Pennsylvania State University, University Park, PA 16802
{jmkim,nicopoul,dpark,rdas,yuanxie,vijay,das}@cse.psu.edu

‡Corporate Technology Group, Intel Corp., Hillsboro, OR 97124
mazin.s.yousif@intel.com

ABSTRACT

Much like multi-storey buildings in densely packed metropolises, three-dimensional (3D) chip structures are envisioned as a viable solution to skyrocketing transistor densities and burgeoning die sizes in multi-core architectures. Partitioning a larger die into smaller segments and then stacking them in a 3D fashion can significantly reduce latency and energy consumption. Such benefits emanate from the notion that inter-wafer distances are negligible compared to intra-wafer distances. This attribute substantially reduces global wiring length in 3D chips. The work in this paper integrates the increasingly popular idea of packet-based Networks-on-Chip (NoC) into a 3D setting. While NoCs have been studied extensively in the 2D realm, the microarchitectural ramifications of moving into the third dimension have yet to be fully explored. This paper presents a detailed exploration of inter-strata communication architectures in 3D NoCs. Three design options are investigated: a simple bus-based inter-wafer connection, a hop-by-hop standard 3D design, and a full 3D crossbar implementation. In this context, we propose a novel partially-connected 3D crossbar structure, called the 3D Dimensionally-Decomposed (DimDe) Router, which provides a good tradeoff between circuit complexity and performance benefits. Simulation results using (a) a stand-alone cycle-accurate 3D NoC simulator running synthetic workloads, and (b) a hybrid 3D NoC/cache simulation environment running real commercial and scientific benchmarks, indicate that the proposed DimDe design provides latency and throughput improvements of over 20% on average over the other 3D architectures, while remaining within 5% of the full 3D crossbar performance. Furthermore, based on synthesized hardware implementations in 90 nm technology, the DimDe architecture outperforms all other designs, including the full 3D crossbar, by an average of 26% in terms of the Energy-Delay Product (EDP).

∗This research was supported in part by NSF grants, EIA-0202007, CCF-0429631, CNS-0509251, CAREER 0093085, and a grant from DARPA/MARCO GSRC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ISCA’07, June 9–13, 2007, San Diego, California, USA.

Categories and Subject Descriptors

B.4 [Input/Output and Data Communications]: Interconnections (Subsystems); B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids.

General Terms

Design, Performance.

Keywords

Network-on-Chip (NoC), 3D Integration, 3D Architecture.

1. INTRODUCTION

Interconnects play a dominant role in shaping the power and performance profiles of processors designed using deep submicron technologies. The trend towards integrating multiple cores onto the same chip across a spectrum of devices, from high-end server chips to embedded cores, is further accentuating the importance of on-chip interconnect design. The lack of a good interconnect fabric can be envisioned to result in problems similar to traffic chaos in a large city without a proper roadway infrastructure. Technology scaling effects aggravate interconnect problems [1], especially those of global wires. While gate delays have steadily decreased, the increased resistance of the wires in newer technologies has increased global wire delays [25]. Consequently, wire delays have become quite significant, requiring multiple clock cycles for traversal across the edges of a microprocessor, and requiring architectural innovation such as Non-Uniform Cache Access (NUCA) architectures. Furthermore, signal integrity and reliability concerns such as inter-wire crosstalk and electromigration effects motivate the need for a structured design approach to the interconnect problem. The Network-on-Chip (NoC) paradigm has been proposed as a scalable and structured approach for interconnect design [16, 10]. The design of 2D on-chip interconnects has been examined from various aspects, such as performance, power and reliability [30, 36, 24, 31, 42, 35], and some commercial offerings already deploy such networks [2, 3]. The advent of three-dimensional (3D) stacked technologies provides a new horizon for on-chip interconnect design.

3D chip technology promises to reduce interconnect delays by stacking multiple layers on top of each other, and by providing shorter vertical connections [44]. 3D technology has matured and demystified some of the concerns on thermal viability and reliability of inter-wafer vias. In addition, it promises to enable integration of heterogeneous technologies on the same chip, such as having layers of memory stacked on top of processor cores, and is even attractive for placing analog and digital components on the same chip, as this avoids common substrate noise problems. Interconnect architecture design across the layers in a 3D architecture requires careful attention for the components on different layers to communicate effectively. Furthermore, there is a need for an integrated approach to interconnect design in the 2D planes and the vertical direction. Currently, there exists no systematic effort at exploring the interconnect architecture for 3D chips. Recently, researchers have started examining some tradeoffs, such as the influence of bandwidth variation of inter-layer interconnects between processor and memory subsystems [28], and combining vertical interconnects with an NoC fabric for chip multiprocessor caches [32].

In this work, we investigate various architectural options for 3D NoC design. Interconnect design in 3D chips imposes new constraints and opportunities compared to that of 2D NoC design. There is an inherent asymmetry in the delays in a 3D architecture between the fast vertical interconnects and the horizontal interconnects that connect neighboring cores, due to differences in wire lengths (a few tens of µm in the vertical direction as compared to a few thousand µm in the horizontal direction). Consequently, extending a traditional NoC fabric to the third dimension by simply adding routers at each layer (called the Symmetric NoC in this work, due to the symmetry of routing in all directions) is not a good option, as router latencies may dominate the fast vertical interconnect. Hence, we explore two alternate options: a 3D NoC-bus hybrid structure and a true 3D router fabric for the vertical, inter-strata interconnect. A key challenge with 3D NoC routers is limiting the arbitration complexity due to the large path diversity resulting from the additional interconnects in the third dimension.

Figure 1: Face-to-Back (F2B) Bonding and the Assumed Vertical Via Layout in this Paper
[Figure labels: silicon substrate (active layer) and metal layers of Layer X and Layer X+1; vertical interconnect (via) and via pad; 4×4 µm via pads are assumed in this paper.]

Vertical interconnects also impose a larger area overhead than corresponding horizontal wires due to the requirement for bonding pads, and can compete with device area as the inter-strata vias punch through the wafer when Face-to-Back (F2B) bonding (see Figure 1) is used. Therefore, the desired number of vertical interconnects used in the 3D router architecture needs to be investigated.

In exploring these tradeoffs in a 3D router design, we developed a new 3D NoC router architecture that we call the 3D Dimensionally-Decomposed (DimDe) Router. The name is a direct corollary of the fact that communication flow through the DimDe router is classified according to the three axes in Euclidean space: X (corresponding to East-West intra-layer traffic), Y (corresponding to North-South intra-layer traffic), and Z (corresponding to inter-layer traffic in the vertical dimension). The idea of decomposing traffic in two dimensions in a 2D environment was introduced in [38, 27] and revisited more recently in [30]. While our proposed DimDe router was inspired by the work in [30], our contribution goes well beyond the introduction of a new traffic dimension. The DimDe router fuses the crossbars of all the routers in the same vertical “column” (i.e. same X, Y coordinate but different Z coordinate) into a unified entity which allows coordinated concurrent communication across different layers through the same crossbar. This design amounts to a true physical 3D crossbar (unlike the mere stacking of 2D routers in multiple wafer layers). It is important to note that 3D topologies have long been in existence in the macro-network field (e.g. k-ary n-cube), but these rely on 2D routers connected in such a way as to form a logical 3D topology. However, 3D chip integration is now enabling the creation of a true physical 3D topology, where the router is itself a three-dimensional entity.

DimDe exhibits the following characteristics that make it a desirable interconnect structure for 3D designs:

(1) DimDe supports a true 3D crossbar structure which spans all the active layers of the chip. Irrespective of the number of layers used in the implementation, the 3D crossbar allows a single-hop connection between any two layers, treating all strata as part of a single router structure.

(2) The DimDe design-space provides options for varying the number of vertical connections from one to four to emulate anything between a segmented bus and a full crossbar. Through design space exploration, DimDe was selected to support two vertical interconnects to strike a balance between the path diversity and high bandwidth offered by a full 3D crossbar and the simplicity of a bus. Most importantly, DimDe’s partially-connected crossbar achieves performance levels similar to those of a full 3D crossbar, with substantially reduced area and power overhead and orders of magnitude lower control logic complexity.

(3) DimDe supports segmented vertical (i.e. inter-strata) links in the partially-connected crossbar to enable concurrent communication between the different layers of the 3D chip. This simultaneous data transfer in the vertical dimension significantly increases the vertical bandwidth of the chip as compared to a 3D NoC-bus hybrid structure.

(4) The DimDe design employs a hierarchical arbitration scheme for inter-strata transfers that reduces area and delay complexity, while still efficiently enabling simultaneous data transfers. The first stage arbitrates between all requests for vertical communication from within a single layer and the second stage accommodates as many simultaneous requests from the winners of the first stage arbitration.

(5) Similar to the Row-Column (RoCo) Decoupled Router of [30], DimDe completely separates East-West and North-South intra-layer traffic through a pre-sorting operation at the input. However, inter-layer traffic cannot be completely isolated in its own module. A true 3D crossbar requires inter-layer traffic to merge with intra-layer traffic in a seamless fashion; this would allow incoming packets from different layers to continue traversal in the destination layer. DimDe facilitates this tight integration by augmenting the Row (East-West) and Column (North-South) modules with a Vertical Module which fuses with the other two. The Vertical Module then extends to all other layers and unifies them in a single operational entity. The Vertical Module assumes the double role of “gluing” all the layers together and blending inter- and intra-layer traffic through unidirectional connections to the Row and Column modules of all layers. It will be demonstrated that this approach dramatically reduces the 3D crossbar complexity, while still allowing concurrent communication between different layers through the switch.
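Characteristics (3) and (4) can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the arbitration logic of the DimDe router itself: the function names, the "first request wins" rule in stage one, and the greedy grant order in stage two are all hypothetical stand-ins. It models each vertical link as a chain of segments (segment i joins layer i and layer i+1) and grants a transfer only if every segment it needs is still free on some link.

```python
from typing import Dict, List, Set, Tuple

Request = Tuple[int, int]  # (source_layer, destination_layer); hypothetical encoding


def stage_one(requests_by_layer: Dict[int, List[Request]]) -> List[Request]:
    """Pick one vertical request per layer (first in the list; a stand-in for a real arbiter)."""
    return [reqs[0] for _, reqs in sorted(requests_by_layer.items()) if reqs]


def segments(req: Request) -> Set[int]:
    """Segments a transfer occupies; segment i joins layer i and layer i+1."""
    lo, hi = sorted(req)
    return set(range(lo, hi))


def stage_two(winners: List[Request], num_links: int = 2) -> List[Tuple[Request, int]]:
    """Greedily grant stage-one winners whose segments are free on some vertical link."""
    busy: List[Set[int]] = [set() for _ in range(num_links)]
    grants = []
    for req in winners:
        need = segments(req)
        for link, used in enumerate(busy):
            if not (need & used):      # no shared segment, so the transfers can coexist
                used |= need           # mark these segments as taken on this link
                grants.append((req, link))
                break
    return grants
```

With two links, the requests `(0, 1)` and `(1, 3)` share one link because their segments do not overlap, which is the kind of concurrency a non-segmented bus cannot offer.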

We compare our proposed 3D router design to four different interconnect architectures: a 2D NoC, a 3D Symmetric NoC, a 3D NoC-Bus Hybrid, and a Full 3D Crossbar¹ implementation. To provide as comprehensive an evaluation as possible, we employed a two-pronged simulation environment: (a) a stand-alone, cycle-accurate NoC simulator running synthetic workloads, and (b) a hybrid NoC/cache simulator running a variety of commercial and scientific workloads within the context of a shared, multi-bank NUCA L2 cache in an 8-CPU Chip Multi-Processor (CMP) scenario. This double-faceted evaluation process ensures exposure to several traffic patterns, including request/reply memory traffic.

¹Our interpretation of a “full” 3D crossbar is presented in Section 3.4 and subsequently formalized in Section 5.3.

The proposed DimDe design consistently provides the lowest latency for different traffic patterns and it saturates at higher workloads compared to other considered architectures. Our synthetic workload results show that, for high traffic loads, the recently proposed 3D NoC-Bus Hybrid Architecture [32] exhibits the worst latency and throughput for all traffic patterns (even worse than the 2D topology), as the bus saturates first with higher workload. In terms of throughput behavior, the DimDe architecture provides 18% average improvement over the other designs, while remaining within around 3% of the Full 3D Crossbar’s throughput. The real workload results indicate that DimDe provides an average improvement of 27% over the 3D Symmetric and 3D NoC-Bus Hybrid designs, and remains within 4% of the Full 3D Crossbar’s performance.

However, with the Energy-Delay Product (EDP) as the metric, DimDe significantly outperforms all other designs, including the Full 3D Crossbar, by 26% on average. Hence, when accounting for both performance and power consumption, the DimDe design is superior to all other 3D router architectures analyzed in this paper. To the best of our knowledge, this is the first systematic exploration and analysis of 3D interconnect architectures and their ramifications on overall system performance.

The rest of this paper is organized as follows. The next section discusses related work. Section 3 provides details of the different 3D interconnect architectures. Section 4 delves into the proposed DimDe architecture. Section 5 presents experimental results, and the conclusions are drawn in Section 6.

2. RELATED WORK

The work related to this paper is summarized in three sub-sections: Networks-on-Chip, 3D Technology, and 3D Architectures.

2.1 Networks-on-Chip

The design of efficient on-chip router architectures has been the main focus of many researchers in the past few years. Specifically, micro-architectural optimizations aimed at reducing the pipeline depth have been developed in [39, 30, 36]. The use of speculative switch allocation led to a 3-stage router design in [39]. By using look-ahead routing [22], whereby the routing decision for the current node is performed in the previous node, the pipeline can be reduced to two stages [30]. Recently, a single-stage router has been proposed which utilizes extensive pre-computation techniques [36]. In our design, a 2-stage pipeline serves as the base architecture, without loss of generality.

In addition to these pipeline-reducing techniques, several researchers have proposed optimizations of the functional modules. For example, [29] proposed a hierarchical crossbar switch that logically and hierarchically separates the control logic to increase performance and reduce area consumption. Also, [30] proposed a decomposed crossbar to reduce contention and thereby achieve energy-efficient architectures. The RoCo work in [30] introduced the idea of decoupling the NoC router operation into two functionally independent modules, each with its own compact 2×2 crossbar. Incoming packets are sorted into path sets in the two separate modules. Our approach in this work builds upon this philosophy of dimensional decomposition, as described in Section 4. The work in [21] uses the NoC router to aid cache coherence in CMPs. By keeping track of cache accesses within each router, the on-chip network becomes an integral part of the cache coherence protocol.

2.2 3D Integration Technology

Three-dimensional integration technology [19] is an attractive option for overcoming the barriers in interconnect scaling, offering an opportunity to continue the CMOS performance trend. In a three-dimensional (3D) chip, multiple device layers are stacked together. Various 3D integration vertical interconnect technologies have been explored, including wire bonded, microbump, contactless (capacitive or inductive), and through-via vertical interconnect [19]. Through-via interconnection has the potential to offer the greatest vertical interconnect density and therefore is the most promising one among these vertical interconnect technologies.

There are two different approaches to implementing through-via 3D integration. The first involves a sequential device process, in which the front-end processing (to build the device layer) is repeated on a single wafer to build multiple active device layers, before the interconnects among devices are built. The second approach processes each active device layer separately, using conventional fabrication techniques, and then stacks these multiple device layers together using wafer-bonding technology. The latter approach requires minimal changes to the manufacturing steps and is more promising; therefore, it is adopted in our proposed architecture. Wafers can be bonded Face-to-Face (F2F) or Face-to-Back (F2B). The through-wafer via in F2F wafer-bonding does not go through the thick buried silicon layer and can be fabricated with smaller via sizes. However, for 3D Integrated Circuits (IC) with more than two active layers, F2B stacking provides better scalability, and, therefore, is adopted in our architecture.

Thermal considerations have been a significant concern for 3D integration [13]. However, various techniques have been developed to address thermal issues in 3D architectures such as physical design optimization through intelligent placement [23], increasing thermal conductivity of the stack through insertion of thermal vias [13], and use of novel cooling structures [18]. Further, a recent work demonstrated that the areal power density is the more important design constraint in placement of the processing cores in a 3D chip, as compared to their location in the 3D stack [26].

Consequently, thermal concern can be managed as long as components with high power density are not stacked on top of each other. Architectures that stack memory on top of processor cores, or those that rely on low-power processor cores have been demonstrated to not pose severe thermal problems [11]. In spite of all these advances, one can anticipate some increase in temperature as compared to a 2D design, and also a temperature gradient across layers. Increased temperatures increase wire resistances, and consequently the interconnect delays. To capture this effect, we study the impact of temperature variations on the 3D interconnect delay to assess the effect on performance.

2.3 3D Architectures

Modern System-on-Chip (SoC) designs, such as CMPs, can benefit from 3D integration as well. For example, by placing processing memory, such as DRAM and/or L2 caches, on top of the processing core in different layers, the bandwidth between them can be significantly increased and the critical path can be shortened [33].

In this context, [32] proposed a 3D Network-in-Memory architecture and explored the challenges of managing 3D CMPs together with L2 cache design-space issues. They also proposed the use of an NoC-Bus Hybrid structure for the 3D interconnect. In this paper, we use this structure as one of the comparison points and demonstrate that our proposed architecture is superior. In [28], a CMP design with stacked memory layers is proposed. The authors show that the L2 cache can be removed due to the availability of wide low-latency inter-layer buses between the processing core layer and DRAM layers, and the area saved from this can be recycled for additional cores. Also, [40] has proposed a multi-bank uniform on-chip cache structure using 3D integration. The notion of adding specialized system analysis hardware on separate active layers stacked vertically on the processor die using 3D IC technology is explored in [37]. The modular snap-on introspective layer collects system statistics and acts like a hardware system monitor.

3. THREE-DIMENSIONAL NETWORK-ON-CHIP ARCHITECTURES

This section delves into the exploration of possible architectural frameworks for a three-dimensional NoC network. A typical 2D NoC consists of a number of Processing Elements (PE) arranged in a grid-like mesh structure, much like a Manhattan grid. The PEs are interconnected through an underlying packet-based network fabric.

Each PE interfaces to a network router through a Network Interface Controller (NIC). Each router is, in turn, connected to four adjacent routers, one in each cardinal direction.

Expanding this two-dimensional paradigm into the third dimension poses interesting design challenges. Given that on-chip networks are severely constrained in terms of area and power resources, while at the same time they are expected to provide ultra-low latency, the key issue is to identify a reasonable tradeoff between these contradictory design threads. Our task in this section is precisely this: to explore the extension of a baseline 2D NoC implementation into the third dimension, while considering the aforementioned constraints.

3.1 The Baseline 2D NoC Architecture

Figure 2: A Generic NoC Router Architecture

A generic NoC router architecture is illustrated in Figure 2. The router has P input and P output channels/ports. As previously mentioned, P = 5 in a typical 2D NoC router (four cardinal directions plus the local PE), giving rise to a 5×5 crossbar. The Routing Computation unit (RC) operates on the header flit (a flit is the smallest unit of flow control; one packet is composed of a number of flits) of an incoming packet and, based on the packet’s destination, dictates the appropriate output Physical Channel/port (PC) and/or valid Virtual Channels (VC) within the selected output PC. The routing can be deterministic or adaptive.

The Virtual channel Allocation unit (VA) arbitrates between all packets competing for access to the same output VCs and chooses a winner. The Switch Allocation unit (SA) arbitrates between all VCs requesting access to the crossbar. The winning flits can then traverse the crossbar and move on to their respective output links.

Without loss of generality, all implementations in this work employ two-stage routers.
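As an illustration of what a deterministic Routing Computation might look like, the sketch below uses dimension-order (XY) routing, a widely used deterministic scheme for 2D meshes. The text above does not state which routing algorithm the routers in this work employ, so this choice is an assumption, as are the function name and the coordinate convention (East is +x, North is +y).

```python
def xy_route(current, destination):
    """Return the output port chosen by dimension-order (XY) routing in a 2D mesh.

    Assumed convention (not from the paper): East is +x, North is +y.
    """
    (cx, cy), (dx, dy) = current, destination
    if dx != cx:                      # resolve the X offset first
        return "East" if dx > cx else "West"
    if dy != cy:                      # then resolve the Y offset
        return "North" if dy > cy else "South"
    return "PE"                       # arrived: eject to the local processing element
```

The five possible return values correspond to the P = 5 ports of the baseline router.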

3.2 A 3D Symmetric NoC Architecture

The natural and simplest extension to the baseline NoC router to facilitate a 3D layout is simply adding two additional physical ports to each router: one for Up and one for Down, along with the associated buffers, arbiters (VC arbiters and Switch Arbiters), and crossbar extension. We call this architecture a 3D Symmetric NoC, since both intra- and inter-layer movement bear identical characteristics: hop-by-hop traversal, as illustrated in Figure 3(a). For example, moving from the bottom layer of a 4-layer chip to the top layer requires 3 network hops.

This architecture, while simple to implement, has two major inherent drawbacks: (1) It wastes the beneficial attribute of a negligible inter-wafer distance (around 50 µm per layer) in 3D chips, as shown in Figure 1. Since traveling in the vertical dimension is multi-hop, it takes as much time as moving within each layer. Of course, the average number of hops between a source and a destination does decrease as a result of folding a 2D design into multiple stacked layers, but inter-layer and intra-layer hops are indistinguishable. Furthermore, each flit must undergo buffering and arbitration at every hop, adding to the overall delay in moving up/down the layers. (2) The addition of two extra ports necessitates a larger 7×7 crossbar, as shown in Figure 3(b). Crossbars scale upward very inefficiently, as illustrated in Table 1. This table includes the area and power budgets of all crossbar types investigated in this paper, based on synthesized implementations in 90 nm technology. Details of the design and synthesis methodology are given in Section 5.2. Clearly, a 7×7 crossbar incurs significant area and power overhead over all other architectures. Therefore, the 3D Symmetric NoC implementation is a somewhat naive extension to the baseline 2D network.
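The hop-count argument can be checked with a few lines of Python. This is a minimal sketch assuming minimal (shortest-path) routing in a mesh; the function names are hypothetical. The second function models the alternative in which any layer-to-layer move costs a single hop, regardless of how many layers are crossed.

```python
def hops_symmetric(src, dst):
    """Hop count when vertical moves are hop-by-hop (3D Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(src, dst))


def hops_single_vertical(src, dst):
    """Hop count when any layer-to-layer move costs one hop, whatever the distance."""
    (sx, sy, sz), (dx, dy, dz) = src, dst
    return abs(sx - dx) + abs(sy - dy) + (1 if sz != dz else 0)
```

For the bottom-to-top example in a 4-layer chip, `hops_symmetric((0, 0, 0), (0, 0, 3))` gives the 3 hops quoted above, whereas the single-hop model gives 1.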

Figure 3: A 3D Symmetric NoC Network. (a) Overall View; (b) Crossbar Configuration (7×7 crossbar with East, West, North, South, Up, Down, and PE ports)

Crossbar Type                             Area            Power (50% switching activity at 500 MHz)
4×2 Crossbar (for 3D DimDe)               3039.32 µm²     1.63 mW
5×5 Crossbar (Conventional 2D Router)     8523.65 µm²     4.21 mW
6×6 Crossbar (3D NoC-Bus Hybrid)          11579.10 µm²    5.06 mW
7×7 Crossbar (3D Symmetric NoC Router)    17289.22 µm²    9.41 mW

Table 1: Area and Power Comparisons of the Crossbar Switches Assessed in this Work
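To make the scaling in Table 1 easier to read, the following sketch expresses each crossbar's area and power as a multiple of the conventional 5×5 crossbar. The numbers are copied from Table 1; the dictionary layout and function name are illustrative choices only.

```python
# Values copied from Table 1 (90 nm synthesis, 50% switching activity at 500 MHz).
CROSSBARS = {
    "4x2": {"area_um2": 3039.32, "power_mw": 1.63},
    "5x5": {"area_um2": 8523.65, "power_mw": 4.21},
    "6x6": {"area_um2": 11579.10, "power_mw": 5.06},
    "7x7": {"area_um2": 17289.22, "power_mw": 9.41},
}


def relative_to(baseline="5x5"):
    """Return each crossbar's (area, power) as a multiple of the baseline's, rounded to 2 places."""
    base = CROSSBARS[baseline]
    return {
        name: (round(c["area_um2"] / base["area_um2"], 2),
               round(c["power_mw"] / base["power_mw"], 2))
        for name, c in CROSSBARS.items()
    }
```

By this measure the 7×7 crossbar costs roughly 2.03 times the area and 2.24 times the power of the 5×5 crossbar, while the 4×2 crossbar needs about 0.36 times the area and 0.39 times the power.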

3.3 The 3D NoC-Bus Hybrid Architecture

The previous sub-section argues that multi-hop communication in the vertical (inter-layer) dimension is not desirable. Given the very small inter-strata distance, single-hop communication is, in fact, feasible. This realization opens the door to a very popular shared-medium interconnect, the bus. The NoC router can be hybridized with a bus link in the vertical dimension to create a 3D NoC-Bus Hybrid structure, as shown in Figure 4(a). This approach was first introduced in [32], where it was used in a 3D NUCA L2 Cache for CMPs. This hybrid system provides both performance and area benefits. Instead of an unwieldy 7×7 crossbar, it requires a 6×6 crossbar (Figure 4(b)), since the bus adds a single additional port to the generic 2D 5×5 crossbar. The additional link forms the interface between the NoC domain and the bus (vertical) domain.

The bus link has its own dedicated queue, which is controlled by a central arbiter. Flits from different layers wishing to move up/down should arbitrate for access to the shared medium.

Figure 5 illustrates the side view of the vertical via structure. This schematic depicts the usefulness of the large via pads between the different layers; they are deliberately oversized to cope with misalignment issues during the fabrication process. Consequently, it is the large vias which ultimately limit vertical via density in 3D chips.

Figure 4: A 3D NoC-Bus Hybrid Architecture. (a) Overall View; (b) Crossbar Configuration (6×6 crossbar with East, West, North, South, Up/Down, and PE ports)

Figure 5: Side View of the Inter-Layer Via Structure in a 3D NoC-Bus Hybrid Structure
[Figure labels: non-segmented inter-layer links across Layer X-1, Layer X, and Layer X+1; the large via pad fixes misalignment issues.]

Despite the marked benefits over the 3D Symmetric NoC router of Section 3.2, the bus approach also suffers from a major drawback: it does not allow concurrent communication in the third dimension. Since the bus is a shared medium, it can only be used by a single flit at any given time. This severely increases contention and blocking probability under high network load, as will be demonstrated in Section 5. Therefore, while single-hop vertical communication does improve performance in terms of overall latency, inter-layer bandwidth suffers.

3.4 A True 3D NoC Router

Moving beyond the previous options, we can envision a true 3D crossbar implementation, which enables seamless integration of the vertical links in the overall router operation. Figure 6 illustrates such a 3D crossbar layout. It should be noted at this point that the traditional definition of a crossbar, in the context of a 2D physical layout, is a switch in which each input is connected to each output through a single connection point. However, extending this definition to a physical 3D structure would imply a switch of enormous complexity and size (given the increased numbers of input- and output-port pairs associated with the various layers). Therefore, in this paper, we chose a simpler structure which can accommodate the interconnection of an input to an output port through more than one connection point. While such a configuration can be viewed as a multi-stage switching network, we still call this structure a crossbar for the sake of simplicity.

The vertical links are now embedded in the crossbar and extend to all layers. This implies the use of a 5×5 crossbar, since no additional physical channels need to be dedicated for inter-layer communication. As shown in Table 1, a 5×5 crossbar is significantly smaller and less power-hungry than the 6×6 crossbar of the 3D NoC-Bus Hybrid and the 7×7 crossbar of the 3D Symmetric NoC. Interconnection between the various links in a 3D crossbar would have to be provided by dedicated connection boxes at each layer. These connecting points can facilitate linkage between vertical and horizontal channels, allowing flexible flit traversal within the 3D crossbar. The internal configuration of such a Connection Box (CB) is shown in Figure 7(a). The horizontal pass transistor is dotted, because it is not needed in our proposed 3D crossbar implementation, which is presented in Section 4. The vertical link segmentation also affects the via layout, as illustrated in Figure 7(b). While this layout is more complex than that shown in Figure 5, the area between the offset vertical vias can still be utilized by other circuitry, as shown by the dotted ellipse in Figure 7(b).

Hence, the 2D crossbars of all layers are physically fused into one single three-dimensional crossbar. Multiple internal paths are present, and a traveling flit goes through a number of switching points and links between the input and output ports. Moreover, flits re-entering another layer do not go through an intermediate buffer; instead, they directly connect to the output port of the destination layer. For example, a flit can move from the western input port of layer 2 to the northern output port of layer 4 in a single hop.

Figure 6: NoC Routers with True 3D Crossbars
[Figure labels: segmented links (1 hop across all layers) joined by Connection Boxes; 5×5 crossbar with East, West, North, South, and PE inputs and outputs; only 4 vertical links are shown for clarity, out of up to 25 for a 5×5 crossbar.]

Figure 7: Side View of the Inter-Layer Via Structure in a 3D Crossbar. (a) Internal Details of a Connection Box (CB); (b) Inter-layer Via Layout
[Figure labels: pass transistors connecting up to Layer X+1 and down to Layer X-1; segmented inter-layer links; the area between offset interconnects can still be used.]

It will be shown in Section 4 that adding a 128-bit vertical link, along with its associated control signals, consumes only about 0.01 mm² of silicon real estate.

Figure 8: A 3D 3×3×3 Crossbar in Conceptual Form

Figure 9: Different NoC Router Switching Mechanisms. (a) Conventional 2D NoC Router Overview; (b) The 2D Row-Column (RoCo) Decoupled Router; (c) The Proposed 3D DimDe Router Architecture

However, despite this encouraging result, there is another side to the coin, which paints a rather bleak picture. Adding a large number of vertical links in a 3D crossbar to increase NoC connectivity results in increased path diversity. This translates into multiple possible paths between source and destination pairs. While this increased diversity may initially look like a positive attribute, it actually leads to a dramatic increase in the complexity of the central arbiter, which coordinates inter-layer communication in the 3D crossbar. The arbiter now needs to decide between a multitude of possible interconnections, and requires an excessive number of control signals to enable all these interconnections. Even if the arbiter functionality can be distributed to multiple smaller arbiters, the coordination between these arbiters becomes complex and time-consuming. Alternatively, if dynamism is sacrificed in favor of static path assignments, the exploration space is still daunting in deciding how to efficiently assign those paths to each source-destination pair.

Furthermore, a full 3D crossbar implies 25 (i.e. 5×5) Connection Boxes (see Figure 7(a)) per layer. A four-layer design would, therefore, require 100 CBs! Given that each CB consists of 6 transistors, the whole crossbar structure would need 600 control signals for the pass transistors alone! Such control and wiring complexity would most certainly dominate the whole operation of the NoC router. Pre-programming static control sequences for all possible input-output combinations would result in an oversized table/index; searching through such a table would incur significant delays, as well as area and power overhead. The vast number of possible connections hinders the otherwise streamlined functionality of the switch. Note that the prevailing tendency in NoC router design is to minimize operational complexity in order to facilitate very short pipeline lengths and very high frequency. A full crossbar with its overwhelming control and coordination complexity poses a stark contrast to this frugal and highly efficient design methodology. Moreover, our experimental results will show that the redundancy offered by the full connectivity is rarely utilized by real-world workloads, and is, in fact, design overkill.
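The control-signal count quoted above follows from simple multiplication, which the short sketch below reproduces. The function name and its parameters are illustrative and not part of the original work; the sketch assumes one control signal per pass transistor, as the text does.

```python
def full_crossbar_control_signals(ports=5, layers=4, transistors_per_cb=6):
    """Count Connection Boxes (CBs) and pass-transistor control signals
    for a full 3D crossbar, assuming one CB per crossing point of a
    ports x ports crossbar on every layer (illustrative helper)."""
    cbs_per_layer = ports * ports            # one CB per crossbar crossing point
    total_cbs = cbs_per_layer * layers       # replicated on every layer
    control_signals = total_cbs * transistors_per_cb  # one enable per transistor
    return cbs_per_layer, total_cbs, control_signals


if __name__ == "__main__":
    per_layer, total, signals = full_crossbar_control_signals()
    print(per_layer, total, signals)
```

With the default arguments (a 5×5 crossbar, four layers, six transistors per CB), the helper returns the 25, 100, and 600 figures cited in the paragraph.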

To understand the magnitude of the path diversity issue in a true 3D crossbar (as shown in Figure 8 for a 3×3×3 example), one can picture the 3D crossbar itself as a 3D Mesh network. For the 3D 3×3×3 crossbar of Figure 8, the number of minimal paths, k, between points A and B is given in [17] as

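$$k = \binom{\Delta x + \Delta y + \Delta z}{\Delta x}\binom{\Delta y + \Delta z}{\Delta y} = \frac{(\Delta x + \Delta y + \Delta z)!}{\Delta x!\,\Delta y!\,\Delta z!} \qquad (1)$$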

where Δx, Δy and Δz are the numbers of hops separating A and B in the X, Y, and Z dimensions, respectively. In our example, Δx = Δy = Δz = 2. Thus, the number of minimal paths between A and B is 90. For a 3D 4×4×4 crossbar, this number explodes to 1680. If non-minimal paths are also considered, then path diversity is practically unbounded [17].
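As a quick numerical check of Equation (1), the following sketch evaluates both the binomial-product form and the factorial form for the two crossbar sizes discussed. The function name is illustrative; `math.comb` requires Python 3.8 or later.

```python
from math import comb, factorial


def minimal_paths(dx, dy, dz):
    """Number of minimal paths between two points separated by
    dx, dy, dz hops in a 3D mesh, per Equation (1)."""
    # Binomial-product form: choose the X steps, then the Y steps.
    k = comb(dx + dy + dz, dx) * comb(dy + dz, dy)
    # Factorial (multinomial) form; both forms must agree.
    k_factorial = factorial(dx + dy + dz) // (
        factorial(dx) * factorial(dy) * factorial(dz)
    )
    assert k == k_factorial
    return k


if __name__ == "__main__":
    print(minimal_paths(2, 2, 2))  # opposite corners of a 3x3x3 crossbar
    print(minimal_paths(3, 3, 3))  # opposite corners of a 4x4x4 crossbar
```

The two calls give 90 and 1680, matching the values stated in the text.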

Hence, given the tight latency and area constraints in NoC routers, vertical (inter-layer) arbitration should be kept as simple as possible. This can be achieved by using a limited number of inter-layer links. The question is then: how many links are enough? Our experiments in Section 5 demonstrate that anything beyond two links per 3D crossbar yields diminishing returns in terms of performance.

3.5 A Partially-Connected 3D NoC Router Architecture

The scalability problem in vertical link arbitration highlighted in the previous sub-section dictates the use of a partially-connected 3D crossbar, i.e. a crossbar with a limited number of vertical links. The arbitration complexity can be further mitigated through the use of hierarchical arbiters. Two types of arbiters should be employed: intra-layer arbiters, which handle local requests from a single layer, and one global arbiter per vertical link to handle requests from all layers. This decoupling of arbitration policies can help parallelize tasks; while flits arbitrate locally in each layer, vertical arbitration decides on inter-layer traversal. These design directives were the fundamental drivers in our quest for a suitable 3D NoC implementation. As such, they form the cornerstones of our proposed architecture, which is described in detail in the following section.

4. THE PROPOSED 3D DIMENSIONALLY-DECOMPOSED (DIMDE) NOC ROUTER ARCHITECTURE

The heart of a typical two-dimensional NoC router is a monolithic, 5×5 crossbar, as depicted abstractly in Figure 9(a). The five inputs/outputs correspond to the four cardinal directions and the connection from the local PE. The realization that the crossbar is a major contributor to the latency and area budgets of a router has fueled extensive research in optimized switch designs. Through the use of a preliminary switching process, known as Guided Flit Queuing [30], incoming traffic may be decomposed into two independent streams: (a) East-West traffic (i.e. packet movement in the X dimension), and (b) North-South traffic (i.e. packet movement in the Y dimension). This segregation of traffic flow allows the use of two smaller 2×2 crossbars and the isolation of the two flows in two independent router sub-modules, as shown conceptually in Figure 9(b). The resulting two compact modules are more area- and power-efficient, and provide better performance than the conventional monolithic approach.

Following this logic of traffic decomposition in orthogonal dimensions, we propose in this work the addition of a third information flow in the Z dimension (i.e. inter-layer communication). An additional module is now required to handle all traffic in the third dimension; this component is aptly called the Vertical Module. On the input side, packets are decomposed into the three dimensions (X, Y, and Z), and forwarded to the appropriate module. However, as previously mentioned, simply adding a third independent module cannot lead to a true 3D crossbar, because inter-layer traffic must be able to merge with intra-layer traffic upon arrival at the destination chip layer. A totally decoupled Vertical Module would force all packets arriving at a particular layer and wishing to continue traversal within that layer to be re-buffered and re-arbitrate for access to the Row/Column modules.

Hence, the Vertical Module must somehow fuse the Row and Column modules to allow movement of packets from the Vertical Module to the Row and Column Modules. An abstract view of the proposed 3D DimDe implementation is illustrated in Figure 9(c). The diagram clearly shows the Vertical Module linking with the Row and Column Modules. Also notice that the communication link is one-way, i.e. from the Vertical Module to the Row/Column Modules. There is no need for the Row/Column Modules to communicate with the Vertical Module, since intra-layer traffic wishing to change layer is pre-directed to the Vertical Module at the input of the router.

The streamlined nature of a dimensionally decomposed router lends itself perfectly to a 3D crossbar implementation. The simplicity and compactness of the smaller, distinct modules can be utilized to create a crossbar structure which extends into the third dimension without incurring prohibitive area and latency overhead.

The high-level architectural overview of our proposed 3D DimDe router is shown in Figure 10. As illustrated in the figure, the gateway to different layers is facilitated by the inclusion of the third, Vertical Module. The 3D DimDe router uses vertical links which are segmented at the different device layers through the use of compact Connection Boxes (CB). Figure 7(a) shows a side view cross-section of such a CB. Each box consists of 5 pass transistors which can connect the vertical (inter-layer) links to the horizontal (intra-layer) links. The dotted transistor is not needed in DimDe, because the design was architected in such a way as to avoid the case where intra-layer communication needs to pass through a CB. The CB structure allows simultaneous transmission in two directions, e.g. a flit coming from layer X+1 and connecting to the left link of layer X, and a flit coming from layer X-1 and connecting to the right link of layer X (see Figure 7(a)).

The inclusion of pass transistors in the data path adds delay and degrades the signal strength due to the associated voltage drop. However, this design decision is justified by the fact that inter-layer distances are negligible. To investigate the effectiveness and integrity of this connection scheme, we laid out the physical design of the CB and simulated it in HSpice using the Predictive Technology Model (PTM) [4] at 70 nm technology and 1 V power supply. The latency results for 2, 3 and 4-layer distances are shown in Table 2. Evidently, even with a four-layer design (i.e. traversing four cascaded pass transistors), the delay is only 36.12 ps; this is a mere 1.8% of the 2 ns clock period (500 MHz) of the NoC router. In fact, the addition of repeaters will increase latency, because with such small wire lengths (around 50 µm per layer), the overall propagation delay is dominated by the gate delays and not the wiring delay. This effect is corroborated by the increased delay of 105.14 ps when using a single repeater, in Table 2.

Figure 10: Overview of the 3D DimDe NoC Architecture

Inter-Layer Link Length     Number of Repeaters     Delay
50 µm (Layer 1 to 2)        0                       7.86 ps
100 µm (Layer 1 to 3)       0                       19.05 ps
150 µm (Layer 1 to 4)       0                       36.12 ps
150 µm (Layer 1 to 4)       1 (layer 3)             105.14 ps

Table 2: The Inter-Layer Distance on Propagation Delay

To indicate the fact that each vertical link in the proposed architecture is composed of a number of wires, we hereafter refer to these links as bundles. The presence of a segmented wire bundle dictates the use of one central arbiter for each vertical bundle, which is assigned the task of controlling all traffic along the vertical link. If arbitration is carried out at a local level alone, then the benefit of concurrent communication along a single vertical bundle cannot be realized; each layer would simply be unaware of the connection requests of the other layers. Hence, a coordinating entity is required to monitor all requests for vertical transfer from all the layers and make an informed decision, which will favor simultaneous data transfer whenever possible. Concurrent communication increases the vertical bandwidth of the 3D chip. Given the resource-constrained nature of NoCs, however, the size and operational complexity of the central arbiter should be handled judiciously. The goal is not to create an overly elaborate mechanism which provides the best possible matches over several clock cycles. Our objective was to obtain reasonably intelligent matches within a single clock cycle.

To achieve this objective, we divided the arbitration for the vertical link into two stages, as shown at the top of Figure 11(a). The first stage is performed locally, within each layer. This stage arbitrates over all flits in a single layer which request a transfer to a different layer. Once a local winner is chosen, the local arbiter notifies the second stage of arbitration, which is performed globally. This global stage takes in all winning requests from each layer and decides on how the segmented link will be configured to accommodate the inter-layer transfer(s). The arbiter was designed in such a way as to realize the scenarios which are suitable for concurrent communication.

Figure 11(b) illustrates all possible requests to the global arbiter of a particular vertical bundle, assuming a 4-layer chip configuration using the deterministic XYZ routing. The designations L1, L2, and L3 indicate the different segments of the vertical bundle; L1 is the link between layers 1 and 2, L2 is the link between layers 2 and 3, and so on. As an example, let us assume that a flit in layer 1, which wants to go to layer 2, has won the local arbitration of layer 1; global request signal 1 (see Figure 11(b)) is asserted. Similarly, a flit in layer 2 wants to go to layer 3; global request signal 5 is asserted. Finally, a flit in layer 3 wants to go to layer 4; global request signal 9 is asserted. The global arbiter is designed to recognize that the global request combination 1, 5, 9 (black boxes in Figure 11(b)) results in full concurrent communication between all participating layers. It will, therefore, grant all requests simultaneously. All combinations which favor simultaneous, non-overlapping communication are programmed into the global arbiter. If needed, these configurations can be given higher priority in the selection process.
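The idea of granting non-overlapping requests on a segmented bundle can be illustrated with a minimal software sketch. This is not the hardware arbiter described in this work: it assumes that a transfer from layer a to layer b occupies every segment between the two layers, that two transfers may proceed concurrently only if their segment sets are disjoint, and that requests are considered greedily in the order given. The signal numbering in `request_id` (by source layer, then by ascending destination layer) is likewise an assumption, chosen because it reproduces the 1, 5, 9 example above. All names are hypothetical.

```python
NUM_LAYERS = 4


def request_id(src, dst, layers=NUM_LAYERS):
    """Global request signal number for a src -> dst transfer, under the
    assumed numbering (source layer first, then ascending destination)."""
    dests = [layer for layer in range(1, layers + 1) if layer != src]
    return (src - 1) * (layers - 1) + dests.index(dst) + 1


def segments(src, dst):
    """Segments occupied by a transfer; segment k joins layers k and k+1."""
    lo, hi = sorted((src, dst))
    return set(range(lo, hi))


def grant(requests):
    """Greedily grant the requests whose segment sets do not overlap
    with those of already-granted requests (a simplification)."""
    busy, granted = set(), []
    for src, dst in requests:
        needed = segments(src, dst)
        if not (needed & busy):
            busy |= needed
            granted.append((src, dst))
    return granted


if __name__ == "__main__":
    reqs = [(1, 2), (2, 3), (3, 4)]
    print([request_id(s, d) for s, d in reqs])  # signals 1, 5, 9
    print(grant(reqs))                          # all three are granted
    print(grant([(1, 3), (2, 3)]))              # second one overlaps on L2
```

In this model, the combination 1, 5, 9 uses segments L1, L2, and L3 exactly once each, so all three transfers are granted together, whereas a request from layer 1 to layer 3 blocks a simultaneous request from layer 2 to layer 3.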

The arbiter can be placed on any layer, since the vertical distance to be traveled by the inter-layer control signals is negligible. The aforementioned two arbitration stages suffice only if deterministic XYZ routing is used. In this case, a flit traveling in the vertical (i.e. Z) dimension will be ejected to the local PE upon arrival at the destination layer’s router. If, however, a different routing algorithm is used, which allows flits coming from different layers to continue their traversal in the destination layer, then an additional local arbitration stage is required to handle conflicts between flits arriving from different layers and flits residing in the destination layer. The third arbitration stage, illustrated at the bottom of Figure 11(a), will take care of such Inter-Intra Layer (IIL) conflicts. The use of non-XYZ algorithms also complicates the request signals sent across different layers. It is no longer enough to merely indicate the destination layer; the output port designation on the destination layer also needs to be sent. IIL conflicts highlight the complexity involved in coordinating flit traversal in a 3D network environment.

An example of the use of a non-XYZ routing algorithm is presented in Figure 12, which tracks the path of a flit traveling from Layer X to the eastern output of Layer X+1. In this case, the flit changes layer and continues traversal in a different layer.
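For contrast with the non-XYZ example, deterministic XYZ (dimension-order) routing can be sketched as follows: a flit first resolves its X offset, then its Y offset, and only then its Z offset, so a flit moving vertically is always ejected at the destination layer. The coordinate convention and the port names (EAST for increasing X, NORTH for increasing Y, UP for increasing Z) are assumptions made for illustration.

```python
def xyz_next_port(cur, dst):
    """Output port requested at router `cur` for a flit headed to `dst`
    under dimension-order XYZ routing. Coordinates are (x, y, z) tuples;
    the direction names are an assumed convention."""
    cx, cy, cz = cur
    dx, dy, dz = dst
    if cx != dx:                      # resolve the X dimension first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                      # then the Y dimension
        return "NORTH" if dy > cy else "SOUTH"
    if cz != dz:                      # finally the Z (inter-layer) dimension
        return "UP" if dz > cz else "DOWN"
    return "EJECT"                    # already at the destination router


if __name__ == "__main__":
    print(xyz_next_port((0, 0, 0), (1, 2, 3)))
    print(xyz_next_port((1, 2, 0), (1, 2, 3)))
    print(xyz_next_port((1, 2, 3), (1, 2, 3)))
```

Because Z is resolved last, the only action left after a vertical transfer is ejection, which is why two arbitration stages suffice under XYZ routing.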

Figure 11: Vertical (Inter-Layer) Link Arbitration Details. (a) Vertical (Inter-Layer) Link Arbitration Stages: with XYZ routing, Stage 1 is local arbitration (pick 1 winner among all flits requesting change-of-layer) and Stage 2 is global arbitration (pick 1 winner among all Stage 1 winners); with other routing algorithms, a third, local arbitration stage resolves Inter-Intra Layer (IIL) conflicts between change-of-layer and local flits. (b) 3D Global Arbitration Request Scenarios: global request signals (1) to (12), listed by input layer, output layer, and the link segments L1, L2, L3.

Each vertical bundle in DimDe consists of a number of data wires (128 bits in this work), and a number of control wires to/from a central arbiter which coordinates flit movement in the vertical dimension. These control signals include: (a) Request signals from all layers to the central arbiter indicating the requested destination layer (and possibly output port, depending on the routing algorithm used), and the corresponding acknowledgement signals from the arbiter. (b) Enable signals from the arbiter to the pass transistors of the Connection Boxes of each layer spanned by the wire bundle.

The total number of wires, w, in a vertical bundle is given by

$$w = \begin{cases} b + 2(n-1)^2 + 5(n-1), & \text{if the XYZ algorithm is used} \\ b + 2(n-1)^2 + 6(n-1) + 5(n-1), & \text{otherwise} \end{cases}$$

where b = number of data bits/wires; 2(n − 1)² = number of request/acknowledgement signals to/from the central arbiter, assuming an n-layer chip; 6(n − 1) = number of additional signals sent to/from the arbiter for output port designation (3-bit designation for the four possible output ports and the ejection port) when a non-XYZ routing algorithm is employed; and 5(n − 1) = number of enable signals for the pass transistors of the CB of each layer.
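The expression for w can be evaluated directly, as in the sketch below, which is useful for exploring other layer counts or data widths. The function and parameter names are illustrative only.

```python
def bundle_wires(n, b, xyz_routing=True):
    """Number of wires in one vertical bundle for an n-layer chip with
    b data wires, following the expression for w given above."""
    req_ack = 2 * (n - 1) ** 2        # request/acknowledgement signals
    enables = 5 * (n - 1)             # pass-transistor enable signals
    w = b + req_ack + enables
    if not xyz_routing:
        w += 6 * (n - 1)              # output-port designation signals
    return w


if __name__ == "__main__":
    print(bundle_wires(n=4, b=128))                     # XYZ routing
    print(bundle_wires(n=4, b=128, xyz_routing=False))  # non-XYZ routing
```

For a 4-layer chip with 128 data bits, the sketch yields 161 wires under XYZ routing and 179 wires otherwise; the control overhead is small relative to the data wires in both cases.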

Assuming a 4-layer configuration (n = 4), XYZ routing, and 128 data bits (i.e. b = 128), the number of wires in a vertical bundle, w, is 161. Based on the square-like layout of Figure 1, the area consumed by the bundle is around 10,000 µm² = 0.01 mm². This amounts to a vertical via density of around 1.5 million individual wires per cm². This result illustrates the fact that increasing the number of vertical vias is, in fact, feasible in terms of area consumption by the wires themselves. However, as explained in Section 3.4, adding extra vertical bundles in the 3D crossbar is prohibitive in terms of arbitration complexity; the area, power and latency increases incurred by a highly-complex arbitration scheme negate any advantages provided by the increased number of inter-layer bundles. Furthermore, it will be demonstrated later on that increasing the number of inter-layer bundles yields rapidly diminishing returns in terms of performance gain under both synthetic and real workloads.

A detailed view of the proposed 3D DimDe architecture is shown in Figure 13. DimDe employs Guided Flit Queuing [30] to guide incoming flits to an appropriate Path Set (PS). Guided Flit Queuing is a preliminary switching operation at the input of the router which utilizes the look-ahead routing information present in incoming header flits. This information denotes the requested output path; thus, incoming traffic can be decomposed into the X, Y, and Z dimensions. The Vertical Module adds two extra path sets to the 2D implementation. One path set is used by incoming flits from the East-West (intra-layer) dimension, and the other for flits from the North-South dimension. Just like Guided Flit Queuing, the Early Ejection Mechanism [30] uses the look-ahead routing information to identify packets which need to be ejected to the local PE. This enables such flits to bypass the destination router and be directly ejected to the NIC.

The Vertical Module consists of two bidirectional vertical bundles, one for each of the two path sets. Note that the number of vertical bundles can be varied from four to one. Each vertical link has one input connection and three output connections on each layer. The input connection comes from the associated path set’s MUX (see dark box in the middle of Figure 13). The three output connections are as follows: (1) One connection to the Row Module crossbar for flits which arrive from other layers and need to continue traversal in the East-West dimension of the current layer. (2) One connection to the Column Module crossbar for flits which need to continue in the North-South dimension of the current layer. (3) One connection for ejection to the Network Interface Controller (NIC) of the local PE. This configuration implies that the Row and Column Module crossbars need to grow in size from 2×2 in the 2D case to 4×2 in DimDe to accommodate the two additional connections from the two vertical links. Despite this increase in size, two 4×2 crossbars are still substantially smaller than a single monolithic 6×6 or 7×7 crossbar, as illustrated in Table 1. Once again, it is precisely for this reason that we chose to use this architecture in our 3D NoC implementation.
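A rough feel for the size difference can be obtained by counting crosspoints (inputs × outputs). This is only a first-order proxy introduced here for illustration; it is not the area or power model behind Table 1, and the helper name is hypothetical.

```python
def crosspoints(inputs, outputs):
    """Crosspoint count of an inputs x outputs crossbar (first-order proxy)."""
    return inputs * outputs


if __name__ == "__main__":
    dimde = 2 * crosspoints(4, 2)     # Row Module + Column Module crossbars
    bus_hybrid = crosspoints(6, 6)    # 3D NoC-Bus Hybrid
    symmetric = crosspoints(7, 7)     # 3D Symmetric NoC
    print(dimde, bus_hybrid, symmetric)
```

Under this proxy, the two 4×2 crossbars together contain 16 crosspoints, compared with 36 for a 6×6 crossbar and 49 for a 7×7 crossbar.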

The Vertical Module of the proposed DimDe router uses two Path Sets to group the available Virtual Channels. As shown in Figure 14, the DimDe router requires 5 VCs for correct functionality under a deterministic, deadlock-free algorithm: one VC for injection from each of the four incoming directions, and one for injection from the local PE. The sixth VC can be used as a drain channel for deadlock recovery under adaptive routing algorithms. Moreover, depending on the algorithm used, additional VCs can be added to the two Vertical Module path sets to ensure deadlock freedom. These drain VCs need to operate on deadlock-free algorithms to guarantee deadlock breakup [20]. In this work, we concentrated on deterministic XYZ and ZXY algorithms as a proof of concept of the proposed architecture. Since these algorithms are inherently deadlock-free, the sixth VC buffer was used as an additional injection VC from the local PE.

Figure 12: An Example of a Non-XYZ Routing Algorithm (a flit going up from Layer X to the East output of Layer X+1)

Figure 13: Architectural Detail of the Proposed 3D DimDe NoC Router

As previously explained in Section 2.2, thermal issues are of utmost importance in 3D chips. Stacking several active layers with minimal distance in-between favors the creation of hotspots. From a 3D NoC perspective, it was important to investigate the effect of high temperature on the propagation delay of the signals on the vertical (inter-layer) interconnects. To that end, the propagation delay between the layers was modeled as an RC ladder (Figure 15(b)) to accurately capture the distributed resistance, capacitance, and temperature variations along the inter-strata vias. The resistance of metals is affected by temperature, and it was modeled using equations from [12]. Assuming a T_Layer1 temperature of 85°C and a fixed linear temperature gradient between each layer, the propagation delay of these vias was simulated in HSpice with the required temperature annotations. Even in the worst case of a 10°C temperature increase per layer for 8 layers, the total propagation delay from the lowest to the highest layer was only 0.11 ps and was, therefore, considered inconsequential for our work. The results of the thermal analyses are summarized in Figure 15(a).

Figure 14: Virtual Channel Assignments in the Vertical Module of DimDe

Figure 15: Thermal Effects on Inter-Layer Propagation Delay. (a) Inter-Layer Propagation Delay vs. Temperature (x-axis: temperature difference between each layer, 0 to 10°C; y-axis: propagation delay from lowest to highest layer, 0 to 0.12 ps; curves for 2, 4, 6, and 8 layers); (b) Modeling of Temperature Effect on Propagation Delay

5. PERFORMANCE EVALUATION

In this section, we present a simulation-based performance evaluation of our architecture, a generic 2D router architecture, a 3D Symmetric NoC design, the 3D NoC-Bus Hybrid architecture, and the Full 3D Crossbar implementation, in terms of network latency, throughput and power consumption under various traffic patterns. We first describe our experimental methodology and then present the experimental results.

5.1 Simulation Platform

A double-faceted evaluation environment was implemented in order to conduct a detailed evaluation of the router architectures analyzed in this paper: (a) A cycle-accurate stand-alone 3D NoC simulator was developed, which accurately models the routers, the interconnection links and vertical pillars, as well as all the architectural features of the various NoC architectures under investigation. The simulator was built by augmenting an existing 2D NoC simulator and models each individual component within the router architecture, allowing for detailed analysis of component utilizations and flit flow through the network. The activity factor of each component is used for analyzing power consumption within the network. In addition to the network-specific parameters, our simulator accepts hardware parameters such as power consumption (dynamic and leakage) for each component and overall clock frequency. This leg of the simulation process examines the behavior of all the architectures under synthetic workloads.

(b) To provide a more diversified simulation environment, we also implemented a detailed trace-driven cycle-accurate hybrid NoC/cache simulator for CMP architectures. The memory hierarchy implemented is governed by a two-level directory cache coherence protocol. Each core has a private write-back L1 cache (split L1 I and D cache, 64 KB, 2-way, 3-cycle access). The L2 cache is shared among all cores and split into banks (32 banks, 512 KB each for a total of 16 MB, 6-cycle bank access). An underlying NoC model connects the L2 banks. The L1/L2 block size is 64 B. Our coherence model includes a MESI-based protocol with distributed directories, with each L2 bank maintaining its own local directory. The simulated memory hierarchy mimics SNUCA [9]. The sets are statically placed in the banks depending on the low order bits of the address tags. The network timing model simulates all kinds of messages: invalidates, requests, replies, write-backs, and acknowledgements. The interconnect model is the same as (a) above. The off-chip memory is a 4 GB DRAM with a 260-cycle access time.
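Static bank placement of this kind can be sketched as a simple bit-selection function. The exact bits used by the simulator are not specified here, so the sketch below makes an explicit assumption: the 64 B block offset is dropped and the next five low-order bits select one of the 32 banks. The constant and function names are hypothetical.

```python
BLOCK_SIZE = 64    # bytes per L1/L2 block (from the configuration above)
NUM_BANKS = 32     # number of shared L2 banks (from the configuration above)


def bank_of(address):
    """Static bank index for a physical address, assuming (for illustration)
    that the low-order bits just above the block offset select the bank."""
    block_address = address // BLOCK_SIZE   # drop the 6 block-offset bits
    return block_address % NUM_BANKS        # low-order 5 bits pick the bank


if __name__ == "__main__":
    for addr in (0, 63, 64, 64 * 31, 64 * 32):
        print(hex(addr), "-> bank", bank_of(addr))
```

With this mapping, consecutive blocks are interleaved across consecutive banks, and the mapping never changes at run time, which is the defining property of static placement.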

Detailed instruction traces of four commercial server workloads were used: (1) TPC-C [8], a database benchmark for online transaction processing (OLTP), (2) SAP [5], a sales and distribution benchmark, and (3) SJBB [7] and (4) SJAS [6], two Java-based server benchmarks. The traces, collected from multiprocessor server configurations at Intel Corporation, were then run through our NoC/cache hybrid simulator to measure network statistics. Additionally, a second set of memory traces was generated by executing programs from SPLASH [43], a suite of parallel scientific benchmarks, on the Simics full system simulator [34]. Specifically, barnes, ocean, water-nsquared (wns), water-spatial (wsp), lu, and radiosity (rad) were used. The baseline configuration is the Solaris 9 operating system running on eight UltraSPARC III cores. Benchmarks execute 16 parallel threads. Again, the number of banks for the L2 shared cache is 32. Thus, 32 nodes are present in the NoC network, 8 of which are also CPU nodes.

5.2 Energy Model

The proposed components of the 3D router architectures, and a generic two-stage 5-port router architecture were implemented in structural Register-Transfer Level (RTL) Verilog and then synthe-