Hardware-Software Co-Design of Resource Constrained Systems on a Chip


Nattawut Thepayasuwan and Alex Doboli
Department of Electrical and Computer Engineering
State University of New York at Stony Brook, Stony Brook, NY, 11794-2350
Email: {nattawut, adoboli}@ece.sunysb.edu

ABSTRACT

This paper presents a hardware-software co-design methodology for resource constrained SoC fabricated in a deep submicron process. The novelty of the methodology lies in considering critical hardware and layout aspects during system-level design for latency optimization. The effect of interconnect parasitics and delays is considered when characterizing bus speed and data communication times. The methodology permits coarse and medium grained resource sharing across tasks, which speeds up execution through better use of hardware. The paper presents experiments for the proposed co-design methodology, including a JPEG SoC.

Keywords: hardware/software co-design, bus architectures, trade-offs, optimization, layout awareness

1. INTRODUCTION

Systems-on-Chip (SoC) are single-chip implementations of embedded systems. They include multiple IP cores connected through complex data, address and control buses.

The variety of IP cores is large. It is foreseen that the number of SoC cores will steadily increase over the next 4-7 years. Eﬀectively designing SoC necessitates the development of new design automation tools at various levels of abstraction, including system, logic and layout level [2] [6].

This paper presents a hardware-software co-design methodology for SoC that addresses (a) the dependency of task communication speed and time on layout attributes like interconnect parasitics, and (b) the possibility of sharing coarse grained hardware resources (like processors and memories) and medium grained hardware resources (e.g., multipliers and decoders) to improve system latency without increasing cost.

The methodology incorporates an original algorithm for bus architecture synthesis. All hardware resources are assumed to be given in this methodology. The co-design process performs combined task and communication partitioning and scheduling, and finds feasible requirements for the minimum bus speed of each data communication link. The bus architecture of an SoC is found as part of the process. The paper offers experiments, including bus architecture synthesis for a JPEG SoC design. The co-design methodology improves the practicality of system-level design for VDSM technologies. It is also useful for implementing resource constrained SoC needed for multimedia applications.

The co-design methodology includes three main parts: (1) combined partitioning and scheduling, followed by (2) bus architecture synthesis, and (3) re-scheduling of tasks, operations, and communications for the best found bus architecture. The first step is an exploration process based on the simulated annealing algorithm (SA). We propose Performance Models (PM), a graph-based description that captures the relationships between performance (i.e., latency and communication speed flexibility), graph characteristics (like data and control dependencies), and design decisions (such as binding and scheduling). PM are general and flexible, and can be easily extended for new design activities without requiring cumbersome validation. The first step ends with the creation of a Core Graph structure (CG) that expresses the data volume and timing for each communication link. The second step uses CG to synthesize and route bus architectures for an SoC. IP cores are placed using a hierarchical cluster growth algorithm, which places highly communicating cores close to each other.

A variety of hardware/software co-design methodologies have been proposed for optimizing cost, speed, and power consumption [4]. Depending on the targeted applications, co-design approaches can be classified into three groups: for data dominated systems [5], for control intensive systems [1], and for applications with substantial data processing and a reduced amount of control [9]. More recently, Sgroi et al. [10] suggested a communication centric approach motivated by the increasing importance of communication attributes for SoC architecture design. Bus design is critical for SoC.

Early work on bus and communication synthesis [3] [8] [12] focuses on multiprocessor embedded systems on a printed board. Research addresses interface design [3] [8], performance evaluation [7], and mapping and scheduling [12]. This work does not tackle the hardware and layout aspects of SoC communication sub-systems.

This paper proposes a new hardware-software co-design approach that integrates system design with bus architecture synthesis and routing, and considers hardware sharing more aggressively. The paper is organized as follows. Section 2 discusses the proposed system representation. Section 3 introduces the co-design approach. Bus architecture synthesis is presented in Section 4. Experimental results are given in Section 5. Finally, conclusions are offered in Section 6.

2. SYSTEM REPRESENTATION FOR CO-DESIGN

An embedded system is expressed for co-design as the quadruple <HDCG, Resources, Floorplan, PM>. HDCG describes system functionality as a hierarchical data and control graph. Resources is the set of IP cores used in the implementation. Floorplan is the set of all possible floorplans for the IP cores in set Resources. PM is a graph-based representation that denotes performance attributes, like latency and communication speed flexibility.

[Figure 1: Hierarchical Data and Control Dependency Graph. (a) HDCG with cluster nodes and operation nodes; (b) structure of a communication CN: data packets 1 to n separated by optional handshaking.]

A Hierarchical Data and Control Dependency Graph (HDCG) oﬀers a dual perspective on system functionality: a task- level description (for co-design) and an operation-level representation (for exploring hardware sharing across tasks).

Figure 1 presents an HDCG example. HDCG nodes are of three types:

• Cluster nodes (CN) represent loops, if-then-else constructs, functions, and tasks. CN are mapped to the software domain, and are executed on coarse grained cores, like general purpose processors (GPP). Each CN is a polar sub-graph built from operation nodes. Figure 1(a) shows the detailed operation structure of CN 3.

• Operation nodes (ON) denote atomic data processing such as addition, multiplication, etc. These are mapped to small/medium grained IP cores, like multipliers and arithmetic and logic units (ALU). During co-synthesis, ONs are employed for analyzing the effect of hardware resource sharing across tasks.

• Communication cluster nodes (CCN): Data communications are modeled using CN with a special structure (see Figure 1(b)). CCNs are shown as black bubbles in Figure 1(a). A CCN includes an alternating sequence of ON nodes, which represent transmissions of data packets of a fixed size, and synchronization nodes. The optional synchronization nodes allow packets from different communication links to be interleaved on the same bus. This facilitates the suspension of an ongoing communication in favor of a higher priority data transmission. Optional synchronization points represent the time overhead for synchronizing the cores. If successive packets pertain to the same communication link, then the optional synchronization points have zero time length.
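The CCN timing rule described above can be illustrated with a minimal sketch. It assumes that every packet has a known transmission time and that a synchronization point costs a fixed overhead only when consecutive packets belong to different links; the names `Packet`, `bus_occupancy`, and `sync_time` are hypothetical and not part of the methodology.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    link: str    # communication link the packet belongs to (illustrative label)
    time: float  # transmission time of one fixed-size packet


def bus_occupancy(packets, sync_time):
    """Total bus time for a packet sequence mapped to one bus.

    A synchronization point sits between consecutive packets. It has zero
    length when both packets belong to the same link and costs sync_time
    when the bus switches between links.
    """
    total = 0.0
    previous_link = None
    for p in packets:
        if previous_link is not None and p.link != previous_link:
            total += sync_time
        total += p.time
        previous_link = p.link
    return total


same_link = [Packet("A", 2.0), Packet("A", 2.0), Packet("A", 2.0)]
interleaved = [Packet("A", 2.0), Packet("B", 2.0), Packet("A", 2.0)]
print(bus_occupancy(same_link, 0.5))    # no synchronization overhead
print(bus_occupancy(interleaved, 0.5))  # two link switches add overhead
```

Interleaving lets a higher priority packet preempt an ongoing transfer, at the price of the synchronization overhead paid at every link switch.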

Performance Model (PM) representation is a graph-based description that relates system performance attributes to HDCG characteristics, and to the design decisions executed during co-design. Figure 2(a) shows an HDCG, and Figure 2(b) depicts the corresponding PM for latency, assuming that ON 2, ON 5, and ON 3 are executed in this order on the shared resource.

[Figure 2: Performance Model for latency. (a) HDCG; (b) PM for latency, same processor core.]

PM include the following elements:

• Starting node 0 indicates that all observed performance attributes (latency in our case) are set to value 0.

• The constant part describes the semantics of performance attributes with respect to the invariant HDCG characteristics (like data and control dependencies). It is represented in the figure as nodes and solid edges. Max and addition nodes are used in Figure 2(b) to express ON (CN) start and end times. Max nodes express that an ON (CN) cannot start earlier than the moment when all its predecessors are finished (thus, the starting time has to be at least the maximum of their end times). Outputs of max nodes indicate the starting time of their corresponding ON (CN). Addition nodes express that the end time $T_i^{e}$ of ON $op_i$ is the sum of its start time and its execution time $T_i^{ex}$.

• The variable part presents the relationship between performance attributes and design decisions taken during co-design, like partitioning and scheduling. For instance, the execution order ON 2, ON 5, ON 3 is represented in the ﬁgure as dashed arcs between the addition nodes that calculate the end times for ON (CN), and max nodes characterizing the starting times.

Other ON scheduling orders are easily captured in the PM by accordingly changing the orientation of the corresponding arcs.

• Performance attribute values (like latency in the figure) for a certain co-design solution result from numerically evaluating its PM.
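The evaluation of a latency PM can be sketched as follows. This is a minimal illustration, not the authors' implementation: data dependencies play the role of the constant part, resource orderings play the role of the variable part (the dashed arcs), and each node's end time is the maximum of its predecessors' end times plus its execution time. The function name, the example graph, and the execution times are assumptions made for the illustration.

```python
def evaluate_latency(exec_time, data_deps, resource_order):
    """Evaluate a latency PM.

    exec_time:      dict node -> execution time
    data_deps:      list of (pred, succ) edges, the constant part
    resource_order: list of (earlier, later) pairs for nodes sharing a
                    resource, the variable part
    """
    preds = {n: [] for n in exec_time}
    for a, b in list(data_deps) + list(resource_order):
        preds[b].append(a)

    end = {}

    def end_time(n, visiting=()):
        if n in visiting:
            raise ValueError("cycle in PM: infeasible ordering")
        if n not in end:
            # max node: start when all predecessors have finished
            start = max((end_time(p, visiting + (n,)) for p in preds[n]),
                        default=0.0)
            # addition node: end time = start time + execution time
            end[n] = start + exec_time[n]
        return end[n]

    return max(end_time(n) for n in exec_time)


# Hypothetical five-node graph; nodes 2, 5, 3 share one resource in that order.
exec_time = {1: 3, 2: 2, 3: 4, 4: 1, 5: 2}
data_deps = [(1, 4), (2, 5)]
print(evaluate_latency(exec_time, data_deps, [(2, 5), (5, 3)]))
print(evaluate_latency(exec_time, data_deps, []))
```

Changing the scheduling order only changes the `resource_order` list, which mirrors the reorientation of dashed arcs in the PM. The cycle check corresponds to rejecting infeasible orderings.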

PM are the principal data representation for the exploration loop of the hardware/software co-design process. Rules for PM handling can be set up to avoid exploration of infeasible or dominated solution points. For example, the rules for calculating communication speed flexibility (CSF) avoid generating co-design solutions that are difficult to realize. This helps the exploration process, as it eliminates additional steps of verifying the feasibility of a design.

Modeling of Communication Speed Flexibility. The speed of communication cluster nodes (CCN) cannot be accurately estimated at the system level. This is because the bus speed depends on the bus length, and thus on the placement of IP cores and on the bus architecture and routing.

Defining the bus architecture includes finding the number of different buses present in a design, and the set of cores sharing a bus. This information is not available during task partitioning and scheduling. The co-design methodology in Figure 6 identifies communication speed requirements for each data link, while relying on a system-level modeling of bus architecture synthesis. Communication speed requirements are feasible if the needed bus speeds can be achieved in the presence of delays caused by the RLC parasitics of bus routings.

[Figure 3: Communication speed flexibility. (a) CN 1 and CN 3 connected through CCN 2; (b) HDCG with nodes CCN 2_min and CSF 2; (c) PM including CSF, with values T1_ex, T2_min, T2_csf, T3_ex and end times T1_e, T2_e, T3_e.]

For each data link, the communication speed flexibility (CSF) indicates the amount of delay that can be tolerated on that link without violating the required system latency. CSF values are found as part of the co-design methodology, and become constraints for the bus architecture synthesis step discussed in Section 4. Figure 3 shows the PM modeling for finding the CSF values. CN 1 and CN 3 in Figure 3(a) are allocated to different processing cores, and are connected through CCN 2. The execution time of CCN 2 is unknown, as the bus architecture is synthesized in a subsequent step.

Figure 3(b) shows the HDCG description for finding communication speed requirements. For each CCN, two nodes are introduced. Node CCN_min describes the minimum latency for data communications. Minimum latency depends on the amount of communicated data, and on the maximum speed achievable for a given fabrication process and minimum interconnect length. This describes the lower bound for the communication time between two tasks. Node CSF models the unknown communication speed flexibility, for which a numerical value is found during the co-design process. Figure 3(c) presents the PM including CSF. The PM was built using the rule for modeling data dependencies.
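As a worked reading of Figure 3(c), the PM can be written out term by term. The sketch assumes that each max node has a single predecessor in this small example (CN 1 feeds CCN 2, which feeds CN 3) and that CN 1 starts at time $T_1$. The latency bound $L_{req}$ is a symbol introduced only for this illustration.

```latex
\begin{align*}
T_1^{e} &= T_1 + T_1^{ex} \\
T_2^{e} &= T_1^{e} + T_2^{min} + T_2^{csf} \\
T_3^{e} &= T_2^{e} + T_3^{ex} \\
T_3^{e} \le L_{req} \;&\Longrightarrow\; T_2^{csf} \le L_{req} - T_1 - T_1^{ex} - T_2^{min} - T_3^{ex}
\end{align*}
```

Under these assumptions, the last line gives the largest delay that CCN 2 can absorb beyond its minimum communication time without violating the latency bound.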

The feasibility of a set of CSF values depends on the capability of the bus synthesis algorithm to meet the constraints formulated by the CSF set. Section 4 explains that the quality of a bus architecture depends on (1) the speed of individual buses, (2) the amount of time overlapping between communications mapped to the same link, (3) the complexity of the bus architecture expressed as the number of buses in the architecture, and (4) the amount of core connections not required in a bus architecture.

At the system level, criteria 2-4 are modeled by the likelihood of two communication channels sharing the same bus. Two communication links are likely to share a bus if the following conditions are met: (i) the same bus speed requirements (expressed at the system level through the corresponding CSF values) are acceptable for both links, (ii) there is no (or very little) time overlap between the communication schedules of the links, (iii) sharing the same bus adds little unnecessary core connectivity, and (iv) the estimated bus length does not conflict with the required CSF values. Criteria (ii)-(iv) can be estimated at the system level, when mapping two CCN to the same bus. We present next the system-level modeling of bus speed to address criteria (1) and (i).

Floorplan Trees (FT), a tree structure, are used to model the IP core floorplanning at the system level. Figure 4(a) presents a set of six IP cores and the data communication between them. This representation is called Core Graph, and Section 5 offers more details on it. Communication Load (CL) expresses the amount of data exchanged between cores, and labels each edge in the graph. To favor tightly interacting IP cores, the placement algorithm places close to each other those IP cores that exchange large amounts of data. The hierarchical cluster growth placement algorithm (HCGP) proceeds in a bottom-up fashion, and creates clusters of cores depending on the CL of edges between cores.

[Figure 4: Modeling of CSF. (a) Core Graph of IP cores 1 to 6 with communication loads on the edges; (b) Floorplan Tree with cluster nodes 1 to 5 on levels 1 to 3; (c) CSF PM built from max and addition nodes, with CSF(1,4), CSF(3,5), CSF(2,6), CSF(1,5), CSF(1,3), CSF(5,6), CSF(1,2) and variables D1 to D4.]

    input: BT - Floorplan tree
    output: CSF PM
    for all leaf nodes i in BT do
        generate a PM input node, and label it as CSFi;
    end for
    for all levels j in BT, starting from level 1 do
        for all nodes p in BT on level j do
            identify all communications (m,n), such that node p is the first parent in BT for cores m and n;
            create max and addition nodes and variable D_mn and label the output as CSF(m,n);
            for all existing CSF(l,k), k!=n or l!=m do
                insert an edge from CSF(l,k) to the max node for CSF(m,n);
            end for
        end for
    end for

Figure 5: Algorithm for building CSF PM

Figure 4(b) shows the FT modeling. Cores 1 and 4, cores 3 and 5, and cores 2 and 6 are heavily communicating. Nodes 1, 2, and 3 represent their clustering. Resulting clusters are interconnected by edges describing the amount of data communications between all cores in the cluster. The clustering process continues by considering nodes 1, 2, and 3, and so on, until the root node is reached (node 5 in the ﬁgure).
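The bottom-up clustering can be sketched with a greedy simplification: repeatedly merge the two clusters with the largest total communication load between them. This is an assumption-laden illustration and not the HCGP algorithm itself, which also accounts for placement geometry. The pairing of load values with edges below is hypothetical, except that links (1,4), (3,5), and (2,6) are taken as the heavy ones, as stated in the text.

```python
from itertools import combinations


def cluster_growth(cores, load):
    """Greedy bottom-up clustering driven by communication load.

    cores: iterable of core ids
    load:  dict mapping frozenset({a, b}) -> communication load
    Returns the floorplan tree as nested tuples; leaves are core ids.
    """
    # each cluster is (subtree, set of member cores)
    clusters = [(c, frozenset([c])) for c in cores]

    def traffic(x, y):
        # total load on all edges crossing between the two clusters
        return sum(load.get(frozenset((a, b)), 0) for a in x[1] for b in y[1])

    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2), key=lambda pair: traffic(*pair))
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(((x[0], y[0]), x[1] | y[1]))
    return clusters[0][0]


# Hypothetical communication loads for the six-core example.
edges = {(1, 4): 80, (3, 5): 80, (2, 6): 80,
         (1, 5): 40, (1, 3): 20, (5, 6): 50, (1, 2): 20}
load = {frozenset(k): v for k, v in edges.items()}
tree = cluster_growth([1, 2, 3, 4, 5, 6], load)
print(tree)
```

With these loads the heavy pairs are clustered first, then the clusters holding cores (1,4) and (3,5) are merged, and the cluster holding (2,6) joins at the root.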

[Figure 6: Hardware-software co-design methodology. Inputs: HDCG, latency constraint, available silicon area, and the set of available IP cores. Step 1: partitioning (binding) and scheduling, with Performance Model generation, PM evaluation, and updates of the PM, Core Graph, and Floorplan Tree. Step 2: bus architecture synthesis, with hierarchical cluster growth placement, generation of primary bus structures (PBS), the bus architecture synthesis table, the Select-eliminate method, bus routing, and bus speed estimation through parasitic extraction. Step 3: task and communication re-scheduling for the best found bus architecture.]

The bus speed values searched during co-design must accommodate the bus speed constraints imposed by the IP core placement. Otherwise, bus speed requirements are unreasonable (though some of them might be possible to implement). For example, it is unreasonable to request a high communication speed for cores placed far apart. Hence, CSF values fixed for CCNs must meet the constraints expressed above. A naive solution would assign random values to CSF, and then check whether these values meet the constraints imposed by FT. In practice, this solution does not offer good results, as most of the analyzed CSF values would violate the constraints. Instead, PM for CSF implicitly incorporate all speed constraints due to the floorplanning model. For example, Figure 4(c) shows the corresponding CSF PM. CSF values for the leaf nodes (CSF(1,4), CSF(3,5), and CSF(2,6)) are input values to the PM. According to the floorplanning, the speed for communications (1,5) and (1,3) has to be slower than the slowest of the communications (1,4) and (3,5). The max nodes and the addition nodes in the PM formulate these constraints. Values D1 and D2 express the amount of time by which the two communications are slower. Similarly, communication (5,6) must be slower than communications (2,6) and (3,5). Finally, communication (1,2) must be slower than communications (1,5) and (1,3). Figure 5 shows the algorithm for building a CSF PM that meets the constraints fixed by FT.
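The constraints above can be evaluated with a short sketch in which every derived CSF value is the maximum of its predecessor CSF values plus a non-negative slack D. The function name, the numeric values, and the assignment of D3 and D4 to links (5,6) and (1,2) are assumptions for illustration; only the predecessor relations are taken from the text.

```python
def csf_values(leaf_csf, derived):
    """Evaluate a CSF PM.

    leaf_csf: dict link -> CSF input value (links clustered at level 1)
    derived:  list of (link, predecessor links, D) in bottom-up order;
              each derived CSF is max(predecessor CSFs) + D, with D >= 0
    """
    csf = dict(leaf_csf)
    for link, preds, d in derived:
        if d < 0:
            raise ValueError("D must be non-negative")
        # max node followed by addition node
        csf[link] = max(csf[p] for p in preds) + d
    return csf


# Illustrative input values (arbitrary time units).
leaf = {(1, 4): 2.0, (3, 5): 3.0, (2, 6): 1.0}
derived = [
    ((1, 5), [(1, 4), (3, 5)], 1.0),  # D1
    ((1, 3), [(1, 4), (3, 5)], 0.5),  # D2
    ((5, 6), [(2, 6), (3, 5)], 2.0),  # D3 (assumed label)
    ((1, 2), [(1, 5), (1, 3)], 1.0),  # D4 (assumed label)
]
csf = csf_values(leaf, derived)
print(csf)
```

Because the exploration only picks the leaf inputs and the D values, every generated CSF set satisfies the floorplan ordering by construction, so no separate feasibility check against FT is needed.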

3. CO-DESIGN METHODOLOGY

Figure 6 presents the proposed hardware-software co-design methodology. The co-design flow partitions HDCG nodes to cores, decides the scheduling order of nodes, synthesizes the bus architecture, and maps and schedules data communications on buses. The goal is to minimize the overall system latency. The method includes three steps.

The first step partitions CN nodes to GPP cores, binds ON nodes to FU cores, schedules CN, CCN and ON, and finds minimum speed requirements for CCN. First, Performance Models (PM) are generated for an HDCG using the rules in Section 2. Next, the simulated annealing (SA) exploration loop simultaneously conducts partitioning and scheduling. For each CN (ON), attributes $R_i$ (the hardware resource that executes the node), $T_i^{ex}$ (the execution time on the resource), and $T_i$ (the starting time of node execution) are unknowns for co-design. CN partitioning to GPPs and ON binding to FUs are modeled by unknowns $R_i$ and $T_i^{ex}$. CN, CCN, and ON scheduling is described by unknowns $T_i$. Possible values for unknowns $R_i$ and $T_i$ are searched during exploration. Latency is computed by numerically instantiating all node characteristics $R_i$, $T_i$ and $T_i^{ex}$, and then evaluating their PMs. The SA algorithm examines the quality of numerous partitioning (binding) and scheduling solutions by numerically evaluating PMs for latency and communication speed flexibility (CSF).

Partitioning (binding) and scheduling steps are executed with different probabilities. The reason is that multiple valid schedules are possible for each resource partitioning (binding) decision. A small probability $p_1$ is used to select a partitioning step that moves a cluster from a GPP core to another GPP core or to hardware. A probability $p_2$ ($p_2 > p_1$) binds an ON to another FU core. The reason for $p_2$ being greater than $p_1$ is that multiple hardware designs are possible for each partitioning of clusters to FU cores. Finally, a probability $1 - (p_1 + p_2)$ decides a scheduling action. This strategy emulates a hierarchical exploration process because for each new partition (binding) there are $\frac{1 - (p_1 + p_2)}{p_1 + p_2}$ analyzed schedules. For example, if $p_1 = 0.01$ and $p_2 = 0.1$, then on average 8 schedules are examined for each partition (binding). If the execution order of a node pair is modified, then the algorithm also verifies that the new ordering is feasible. This means that no cycles can occur in the updated PM.
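The move selection can be sketched in a few lines. The function names and the move labels are hypothetical; the probabilities are the example values from the text.

```python
import random


def pick_move(p1, p2, rng=random):
    """Choose the next exploration move with the probabilities from the text."""
    r = rng.random()
    if r < p1:
        return "partition"  # move a cluster between GPP cores or to hardware
    if r < p1 + p2:
        return "bind"       # bind an ON to another FU core
    return "schedule"       # change the scheduling order


def schedules_per_partition(p1, p2):
    """Average number of scheduling moves per partitioning (binding) move."""
    return (1 - (p1 + p2)) / (p1 + p2)


ratio = schedules_per_partition(0.01, 0.1)
print(round(ratio, 2))  # 0.89 / 0.11, i.e. about 8 schedules per partition
```

The ratio follows from 0.89 / 0.11, which is roughly 8.09 and agrees with the figure of 8 schedules quoted above.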

The cost function for SA is

$$\mathrm{Cost} = \alpha \times \mathrm{Latency} + \beta \times \prod_{CCN_i} \frac{1}{D_i} + \gamma \times \#\,\mathrm{buses} + \delta \times \mathrm{unnecessary\ connectivity},$$

where $\alpha$, $\beta$, $\gamma$ and $\delta$ are weight coefficients. $D_i$ is the difference in latencies between two successive communications.

The cost function models system latency, communication speed flexibility, bus complexity, and unnecessary bus connectivity. Communication speed flexibility forces $D_i$ values for CCN to be maximized. Larger $D_i$ values denote more feasible bus speed constraints. Bus complexity is described by the number of buses in an architecture. The number of buses is estimated by the number of links having similar $D_i$ values and reduced time overlapping of their data communications. Estimating the number of buses, and thus the cores which share a bus, also permits finding the amount of unnecessary connectivity in an architecture.
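A minimal sketch of this cost function follows. It reads the flexibility term as a product of 1/D_i over all CCNs, which is one interpretation of the formula, and it uses unit weights and hypothetical inputs; none of the numbers come from the experiments.

```python
import math


def sa_cost(latency, d_values, n_buses, unnecessary,
            alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Weighted cost of a co-design solution.

    d_values: D_i for every CCN; larger values shrink the flexibility
              term, so maximizing D_i lowers the cost
    """
    flexibility = math.prod(1.0 / d for d in d_values)
    return (alpha * latency + beta * flexibility
            + gamma * n_buses + delta * unnecessary)


tight = sa_cost(100.0, [2.0, 4.0], n_buses=3, unnecessary=1)
relaxed = sa_cost(100.0, [4.0, 8.0], n_buses=3, unnecessary=1)
print(tight, relaxed)
```

With everything else equal, the solution with larger D values gets the lower cost, which matches the stated intent that flexibility is to be maximized.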

The second step is bus architecture synthesis. The co-synthesis flow continues by updating the Core Graph description based on information on task partitioning and scheduling. Then, the detailed floorplan for the IP cores in the design is found using the hierarchical cluster growth placement algorithm. Core placement is needed to accurately estimate bus lengths, and to find the correct rates at which data can be communicated on buses. The introductory section explained that DSM effects are critical for characterizing the speed possible for a link. Core placement is communication driven, so that two heavily communicating cores are placed close to each other, the aspect ratio of their rectangular bounding box is close to one, and the total area of the box is minimized. The set of possible primary bus structures (PBS) is also created from the core graph. PBS are the building blocks for creating bus architectures. Then, a bus architecture (BA) synthesis table is produced to characterize the satisfaction of connectivity requirements by individual PBS structures. The actual bus architecture synthesis algorithm (called the Select-eliminate method) is based on simulated annealing. Using BA synthesis tables, the method builds bus architectures, which are PBS sets that meet all the connectivity requirements in the core graph. Topological attributes are evaluated for each bus architecture, i.e., the number of PBSs in an architecture, bus utilization, communication conflicts, and maximum data losses. The total bus length is estimated using the actual core placement to account for DSM effects.

The best found bus architecture is characterized for speed under RLC effects. Using SA and PM, the third step binds CCN to buses and re-schedules CN, CCN and ON for the best found bus architecture and the CN (ON) partitioning identified at Step 1.


    i = 1;
    L = number of communication link elements;
    do
        select a PBS p to satisfy communication link i;
        eliminate all other PBS which cover link i;
        for all links k satisfied by PBS p do
            eliminate all other PBS which cover link k;
        end for;
        i = i + 1;
    until i > L

[BA synthesis table: connectivity requirements l12, l13, l14, l24 (rows) against primary bus structures B12, B13, B14, B24, B123, B134, B124, B1234 (columns), with the elimination lines marked.]

Figure 7: Select-eliminate algorithm

4. BUS ARCHITECTURE SYNTHESIS

4.1 Bus Architecture Synthesis Algorithm

We consider only non-redundant, non-hierarchical (NRNH) bus architectures. This is motivated by our goal of designing resource constrained SoC with minimal architectures (thus minimal bus architecture) and software support. The modeling for bus architecture synthesis (including CG, PBS and BA synthesis tables) is extensively described in [11]. We propose the select-eliminate (SE) algorithm to generate NRNH bus architectures based on the satisfaction of the core connectivity requirements. The SE algorithm is presented in Figure 7. To illustrate the algorithm, we use the simple BA synthesis table in Figure 7. For example, to satisfy the l12 connectivity, one of the four PBS {B12, B123, B124, B1234} has to be chosen. Suppose PBS B123 is chosen. The rest of the candidates must then be eliminated, so that there is no redundancy in the final structure. The horizontal dashed line S1 represents eliminated structures. Once a structure is eliminated, its whole column is automatically voided. Vertical dashed lines e11, e12, e13, e14 show the eliminated columns. PBS B123 satisfies only the l12 and l13 connectivity. Therefore, another horizontal line S2 is created, with vertical lines e21 and e22. Connectivity l14 is considered next. There is one candidate left, namely PBS B14. Once PBS B14 is chosen, only one candidate, PBS B24, is left to satisfy the l24 connectivity. No more horizontal or vertical lines are created, because no structures remain in the table. The generated NRNH BA is composed of PBS B123, PBS B14, and PBS B24. Circled structures in Figure 7 show the final BA.
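The walk-through above can be reproduced with a small sketch. It assumes that a PBS named B123 connects cores 1, 2, and 3 and therefore covers every link between two of those cores; the `covers` rule, the `choose` callback, and the function names are hypothetical conveniences, not the implementation described in [11].

```python
def covers(pbs, link):
    """Assumed rule: 'B123' covers link (a, b) when both cores are on the bus.
    Core ids are single characters in this toy encoding."""
    cores = set(pbs[1:])
    return link[0] in cores and link[1] in cores


def select_eliminate(links, candidates, choose=lambda link, options: options[0]):
    """Build a non-redundant bus architecture as a list of PBS names."""
    remaining = list(candidates)
    selected = []
    for link in links:
        if any(covers(p, link) for p in selected):
            continue  # link already satisfied by an earlier selection
        options = [p for p in remaining if covers(p, link)]
        if not options:
            raise ValueError(f"no PBS left for link {link}")
        p = choose(link, options)
        selected.append(p)
        # eliminate every other PBS that covers a link satisfied by p
        satisfied = [k for k in links if covers(p, k)]
        remaining = [q for q in remaining
                     if q == p or not any(covers(q, k) for k in satisfied)]
    return selected


links = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "4")]
pbs = ["B12", "B13", "B14", "B24", "B123", "B134", "B124", "B1234"]


def prefer_b123(link, options):
    return "B123" if "B123" in options else options[0]


print(select_eliminate(links, pbs, prefer_b123))
print(select_eliminate(links, pbs))
```

Choosing B123 for l12 leaves only B14 and B24 for the remaining links, as in the example. The default choice instead yields four point-to-point buses, which shows how the selection at each row steers the resulting architecture.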

The size of a synthesis table grows with the number of cores and the number of inter-core communications. If the number of cores and of interconnects between them is small, the SE algorithm considers all possible coverings of the CG links using the available PBS structures. However, if a system consists of more than 20 tightly interconnected cores, the exhaustive SE algorithm becomes infeasible.

To allow the SE algorithm to explore the PBS candidate space efficiently, we employed a simulated annealing algorithm to search different candidate PBSs while satisfying connectivity requirements. The algorithm randomly chooses a PBS from each requirement row, and combines them into a bus architecture. The total cost function that guides simulated annealing is given by the formula

$$\mathrm{Total\ cost} = w_l L_t + w_n N_b + w_c C_c + w_{ml} M_l - w_u C_u,$$

where $L_t$ is the total bus length, $N_b$ is the number of buses, $C_c$ is the communication conflict, $M_l$ is the maximum loss, and $C_u$ is the total bus utilization. $w_l$, $w_n$, $w_c$, $w_{ml}$, $w_u$ are weight factors. Maximum loss reflects the maximum data loss in a BA if there is a conflict in a particular PBS. The objective of the algorithm is to minimize this cost function.

[Figure 8: JPEG task graph for co-synthesis with hardware sharing across tasks. The figure shows the JPEG task graph with tasks 1 to 17 and the RTL structure of tasks 3, 8 and 13.]

Table 1: Example 2: experimental results

Example           Time without HW sharing   Time with HW sharing   Improvement (%)
2GPP + 1A + 1M    184640                    182720                 1.03
2GPP + 2A + 2M    157120                    125120                 20.3
2GPP + 3A + 3M    137920                    125120                 9.28
3GPP + 1A + 1M    182720                    182720                 0
3GPP + 2A + 2M    157120                    125120                 20.3
3GPP + 3A + 3M    137920                    99520                  27.84

5. EXPERIMENTAL RESULTS

Experiments were set up to study the effectiveness of the proposed hardware/software co-design algorithms. Because it is difficult to relate our co-design objectives to other work, the experiments addressed successive aspects, starting from aspects that are more similar to existing methods, and continuing with those that are specific to this work. The co-synthesis method was implemented in about 3,000 lines of C code. The bus synthesis algorithm is about 1,500 lines of C code. Experiments were run on a SUN Sparc 80 workstation.

In the first experiment, Figure 8 presents an example inspired by the JPEG algorithm. The task graph included 17 tasks. The RTL structure of tasks 3, 8, and 13 is shown in the right part of the figure. Six experiments were conducted. Each experiment employs a different number of general purpose processors (GPPs) for the software part, and a different number of modules (adders (A) and multipliers (M)) for the hardware part. Column 1 in Table 1 shows the hardware resources for each example.

Columns 2, 3, and 4 present the latencies offered by co-design without and with hardware sharing across tasks, and the corresponding latency improvements. Note that latency improvement can be as high as 27.84% (row 6 in the table), if a high task and operation concurrency can be secured for the system. For this case, concurrency of operations of different hardware tasks results in significant latency improvements. In one case (row 4 in the table) latency improvement was 0%. The reason is that only 1 adder and 1 multiplier were used for the hardware part. The ability of co-design to perform resource sharing across hardware tasks could not be used in this case.
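The improvement column of Table 1 can be recomputed from the two latency columns as a quick arithmetic check. The latencies below are copied from the table; the helper name is hypothetical.

```python
# (latency without HW sharing, latency with HW sharing, reported improvement %)
table1 = {
    "2GPP + 1A + 1M": (184640, 182720, 1.03),
    "2GPP + 2A + 2M": (157120, 125120, 20.3),
    "2GPP + 3A + 3M": (137920, 125120, 9.28),
    "3GPP + 1A + 1M": (182720, 182720, 0.0),
    "3GPP + 2A + 2M": (157120, 125120, 20.3),
    "3GPP + 3A + 3M": (137920, 99520, 27.84),
}


def improvement(without, with_sharing):
    """Relative latency reduction in percent."""
    return 100.0 * (without - with_sharing) / without


for name, (without, with_sharing, reported) in table1.items():
    computed = improvement(without, with_sharing)
    print(f"{name}: computed {computed:.2f}%, reported {reported}%")
```

The recomputed values agree with the reported ones to within 0.1 percentage points; the small differences look like truncation rather than rounding.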

BA synthesis was used to automatically produce optimized bus architectures for the SoC of the JPEG image compression encoder. The task graph for the JPEG encoder included three identical and parallel sequences of tasks. Each sequence processes a different color of an image (RGB), and includes five consecutive tasks: preprocessing, FDCT, quantization, zig-zag, and RLE & Huffman coding. The hardware-software co-design methodology in Figure 6 was performed on the JPEG task graph. For each sequence, tasks preprocessing, quantization, zig-zag, and RLE & Huffman coding were partitioned into software, and task FDCT into hardware. The architecture included three processor cores (a distinct core for each parallel sequence), an ASIC for the FDCT tasks, and memory modules for data communication. Each processor has its own local memory. Processors and ASIC communicate through shared memory. To decrease memory access times, the first architecture considered interleaved memory blocks M4-M9. Figure 9 shows the resulting Core Graph. The considered processing technology was 0.18 µm TSMC. Microprocessor cores were about 5 × 5 mm², memory cores about 25% of the area of processor cores, and the ASIC about 30% of the processor core area.

Figure 9(c) shows bus architecture synthesis results for the top CG in Figure 9(a). The hierarchical cluster growth algorithm generated the core placement shown in Figure 9(b).

The synthesis goal was to generate a fast architecture. Bus architecture complexity was not a major concern, because the number of IP cores is reasonably high. Thus, the goal of BA synthesis was to minimize communication conflicts ($w_c = 1.0$) and the total bus length ($w_l = 1.0$), while largely disregarding the number of buses and redundant structures in a BA ($w_n = w_r = 0.1$). After bus architecture generation, each of the buses was routed, and the resulting delays are indicated in the figure. Note that the µP1-MEM1 bus delay is larger than that of ASIC-MEM8, but the bus ASIC-MEM8 carries a lower communication load. Although counterintuitive, this result is correct because the bus length depends on the distances between cores and on the core perimeters. In this case, the core size dominated the total bus length and speed.
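The effect of this weighting on the total cost from Section 4.1 can be illustrated with a short sketch. Only the weights for bus length, conflicts, and number of buses are taken from the text; the weights for maximum loss and utilization, and all metric values of the two candidate architectures, are hypothetical.

```python
def ba_cost(total_length, n_buses, conflicts, max_loss, utilization, w):
    """Total cost of a bus architecture; utilization enters with a minus sign."""
    return (w["l"] * total_length + w["n"] * n_buses + w["c"] * conflicts
            + w["ml"] * max_loss - w["u"] * utilization)


# w_l, w_c, w_n follow the text; w_ml and w_u are assumed to be 1.0.
weights = {"l": 1.0, "n": 0.1, "c": 1.0, "ml": 1.0, "u": 1.0}

# Two hypothetical candidates: few shared buses vs. more, less contended buses.
few_buses = ba_cost(30.0, 3, 4.0, 2.0, 1.5, weights)
many_buses = ba_cost(32.0, 6, 0.5, 0.5, 1.0, weights)
print(few_buses, many_buses)
```

Because the number of buses carries a weight of only 0.1, the candidate with more buses but fewer conflicts gets the lower cost, which is the behavior expected when speed matters more than architectural complexity.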

Note that the best BA is not perfectly regular, even though the CG is regular. Processor P1 and memory modules M4 and M5 are linked through a shared bus, similar to processor P2 and memory blocks M6 and M7. This happens because the placements of these blocks are similar. However, processor P3 and memories M8 and M9 are linked through a different structure, which improves the speed of the bus for the specific placement of these blocks. This shows that optimized BA depend not only on architectural level elements (like the amount of data exchanged between cores), but also on layout aspects. BA synthesis took less than 5 minutes on a SUN Blade 100 workstation. This indicates that the pruning method of the BA synthesis algorithm allows quick exploration of the very large solution spaces that result for SoC with many cores.

6. CONCLUSION

The proposed co-design methodology improves the practicality of system-level design for DSM technologies. The bus synthesis algorithm creates customized bus architectures in a short time depending on the data communication needs of the application, and the required performance. Layout information is important in deciding the bus architecture topology. Experiments showed that it is impractical to postulate a unique bus architecture as the best, as there is little reuse among bus architectures optimized for different constraints. Experiments also indicated that PM are more effective than well known synthesis metrics, like task priorities.

[Figure 9: Experiments on JPEG SoC. (a) Core Graph for JPEG SoC; (b) Placement Layout for JPEG SoC; (c) Bus Architecture for JPEG SoC, with routed bus delays Td of 4.028 ns, 8.028 ns, 10.927 ns, and 23.429 ns.]

7. REFERENCES

[1] F. Balarin, L. Lavagno, P. Murthy, A. Sangiovanni-Vincentelli, “Scheduling for Embedded Real-Time Systems”, IEEE Design & Test of Computers, Jan-March 1998, pp. 71-82.

[2] J. Darringer, R. Bergamaschi, S. Battacharyya, D. Brand, A. Herkersdorf, J. Morell, I. Nair, P. Sagmeister, Y. Shin, “Early Analysis Tools for System-on-a-Chip Design”, IBM Journal of Research & Development, Vol. 46, No. 6, Nov. 2002, pp. 691-707.

[3] J. M. Daveau, G. F. Marchioro, T. Ben Ismail, A. A. Jerraya, “Protocol Selection and Interface Generation for HW-SW Codesign”, IEEE Trans. on VLSI Systems, Vol. 5, No. 1, pp. 136-144, March 1997.

[4] R. Ernst, “Codesign of Embedded Systems: Status and Trends”, IEEE Design & Test of Computers, April-June 1998, pp. 45-54.

[5] R. Gupta, “Co-Synthesis of Hardware and Software for Digital Embedded Systems”, Kluwer, 1995.

[6] K. Keutzer et al., “System Level Design: Orthogonalization of Concerns and Platform-Based Design”, IEEE Trans. on CADICS, Vol. 19, No. 12, December 2000.

[7] K. Lahiri, A. Raghunathan, S. Dey, “System-Level Performance Analysis for Designing On-Chip Communication Architectures”, IEEE Trans. on CADICS, Vol. 20, No. 6, June 2001, pp. 768-783.

[8] R. Ortega, G. Boriello, “Communication Synthesis for Distributed Embedded Systems”, Proc. of the International Conference on Computer-Aided Design, 1998, pp. 437-444.

[9] K. Strehl, L. Thiele, D. Ziegenbein, R. Ernst, J. Teich, “Scheduling Hardware/Software Systems Using Symbolic Techniques”, Proc. of the International Workshop on Hardware/Software Co-Design, 1999, pp. 173-177.

[10] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, A. Sangiovanni-Vincentelli, “Addressing the System-on-a-Chip Woes Through Communication-Based Design”, Proceedings of the Design Automation Conference, 2001, pp. 667-672.

[11] N. Thepayasuwan, A. Doboli, “Bus Architecture Synthesis for Hardware-Software Co-Design for Deep Submicron Systems on Chip”, Proc. of IEEE International Conference on Computer Design, 2003.

[12] T. Y. Yen, W. Wolf, “Hardware-Software Co-synthesis of Distributed Embedded Systems”, Kluwer, 1997.