Architecture and operating system design of the M2 Database Machine

(1)

IEEE TENCON ‘93 / Beiinn

Architecture

and

of

the

M 2

Operating System Design

Database Machine

*

Yen-Jen Oyang, David Jinsung Sheu,

Chih-Yuan Cheng, and Cheng-Zen

Yang

Department of Computer Science

and Information Engineering

National Taiwan University

Taipei, Taiwan,

China

Abstract

This paper discusses the architecture and operating sys- tem design of the M2 Database Machine. The primary design goal of the M 2 is to provide a cost-effective solution to large database processing. The M z architecture is essentially a bus-baged two-level hierarchical multiprocessor with a shared-memory multiprocessor structure at the lower level and a shared-disk structure at the higher level. The main distinction of the

M 2

architecture is the high-bandwidth message-passing bus that forms the backbone of the second level, the higher level, of the hierarchy. The high bandwidth of the message-passing bus extends the scalability of the

M 2

architecture to more than one hundred CPUs. Another distinction of the

M a

database machine is its system software. The M 2 incorporates a directory-base coherence protocol in its distributed file system. The directory-based coherence protocol allows efficient file sharing at the page or file block level.

1 Introduction

Parallel database machines are generally classified into three groups: shared memory, shared disk, and shared nothing [l]. ,Though shared-nothing machines attract most attention in today’s research community [1,2], all

three will eventually own a significant share in red- world database processing. I t can be further expected that, aa hardware technology continues to advance, the shared-disk machine will claim a larger share in 1 he future. The main reaaon is that we now have been able to build shared-disk machines almost as cost-effectively as shared-memory machines and with a scalability of more than one hundred CPUs. Two factors contribute to this development. The first factor is the introducing of gi-

gabyte message-passing buses such as Futurebus+ [3,4]. * T b research w m sponsored by National kience Council

under pant NSC 8Z-MOB-E00Z-G31

The second factor is the advances of

VLSI

and packaging technologies that make it poesible t o implement a shared-memory multiprocessor module in a printed circuit board. The introducing of gigabyte message-passing buses extends the scalability of a shared-disk machine to a t least an order of magnitude higher than that of a shared-memory machine. Meanwhile, the high inte- gration capacity made available by advanced VLSI and packaging technologies lowers the building costs of a shared-disk machine to the level of a shared-memory machine. Motivated by these observations, we started the

M a

database machine project.

In the following part of this paper, section 2 discusses the architecture design of the

M a .

Section 3 elaborates the operating system development effort. Section 4 presents a prototype

M 2

machine we have been building since Spring of 1991.

2 Architecture and Design Deci-

sions

2.1 Overview

of

the

M Z

Architecture

Figure 1 depicts the hardware block diagram of the M’.

The M2 architecture comprises two levels of multiprw cessing hierarchy. At the first level of the hierarchy, multiple CPUs, each with a private cache, and a shared cluster memory are placed on a printed-circuit board and connected through an on-board snooping bus to form a CPU cluster. In the M 2 , the shared cluster memory in a CPU cluster serves as the main memory to the CPUs in the cluster. Therefore, structure-wise, there is no

dif-

ference between a

M Z

CPU cluster and a conventional shared-memory shared-bus multiprocessor.

The second level of the

M 2

hierarchy is made up of multiple CPU clusters connected through a backplane message-passing bus. At this level of hierarchy, memory is distributed in both physical and logical senses.

(2)

Tliat is, the memory of a cluster is accessible only to the cluster itself and is not accessible to other clusters. Communication between clusters is carried out through passing messages.

In the M z , also connected to the backplane message- passing bus are 1/0 controllers. The 1 / 0 controlters and the CPU clusters operate under a client-server model. Some I/O controllers, e g. disk controllers, are associ- ated with a large memory which serves as the disk/file cache. In such cases, the memory in the CPU cluster and the memory associated with the 1/0 controller constitute a two-level disk/file cache. Data consistence between the memories in the CPU clusters and 1/0 controllers is maintained through executing a directory-based coherence protocol.

2.2

Architectural Features and Design

Considerations

This subsection elaborates the main features and design considerations of the M 2 architecture. The M 2 is a two- level hierarchical multiprocessor. If compared with other hierarchical multiprocessors [5,6], the M 2 is distinctive in its memory organization a t the second level of the hierarchy. In the M 2 , a physically distributed with no remote access memory organization is employed. This design is aimed at avoiding severe page-swapping-induced inter- CPU interference. The page-swapping-induced inter- CPU interference is an inheritance of memory sharing among CPUs. For a group of CPUs that share memory, regardless of if the memory is physically distributed or not, each CPU must be notified with the page-swapping events occurring in the shared memory so that the CPU would write back the blocks that are cached in its private caches from the page being swapped out and update its TLB (Translation Lookaside Buffer) contents accord- ingly. Since the page-swapping-induced inter-CPU interference grows linearly with the number of CPUs that share memory, it is inappropriate to employ the shared- memory approach beyond a certain extent. Therefore, it

w a s determined that a physicaily distributed with no re- mote access memory scheme should be employed a t the second level of the

M 2

hierarchy.

Nevertheless, the presence of the page-swapping- induced inter-CPU interference does not imply that a shared-memory design should not be used in any case. The shared-memory approach is still favorable up to

certain extent due to its hardware simplicity and cost- effectiveness. This is the reason why VLSI chip sets that implement shared-memory shared-bus multiproces- sors have become popular lately. Aiming t o take ad- vantage of this development, we decided to employ the shared-memory shared-bus structure at the first level of the M2 hierarchy.

O.ne important observation on the structure of the M2 is that it is basically the same as a group of multiprocessors connected through a local area network. However,

the M z is superior in system scalability since a backplane

bus offers much higher communication bandwidth than a local areanetwork. For example, a25Bbit Futurebus+ [3,4] can transfer up to 3.2 gigabytes, equivalent to 25.6 gigabits, of data per second. On the other hand, a

FDDI

network, as of today, can transfer 100 megabits ?f data per second and may be upgraded to 200 megabits per second in the near future, which is still two orders of magnitude lower than the bandwidth of the Futurebus+.

T h e last note on the

M 2

architecture is that there is a natural match between the M 2 architecture and the architecture of the Mach operating system [7]. In

the hlach, threads within a task are sharing-resource light-weight parallel entities. In the M 2 , the CPU cluster, with multiple CPUs and a shared cluster memory, provides a good execution platform for multiplethread tasks. At the higher level, Mach tasks, the heavy-weight parallel entities, can be dispatched t o M2 CPU clusters for parallel execution.

3 Operating System Develop-

ment

for

the

M 2

T h e M’ will run the Mach operating system (71 with a r e striction on process scheduling. On the

M 2 ,

Mach tasks are dispatched to CPU clusters in their entirety. In other words, the threads within a task are dispatched only to the CPUs of the cluster that the task is dispatched to and will not spread to other CPU clusters. One crucial iasue in the system software design for the

M Z

is how t o main- tain d a t a consistence among the cached disk/file blocks in different CPU clusters. This issue is essentially the coherence problem in distributed file systems (81. However, we decided not to adopt the mechanism implemented in existing distributed file systems because the backplane messagepassing bus in

M 2 ,

with its high bandwidth, can stand higher degree of data sharing among distributed disk/file caches than the locd area network in a d i e tributed system can. For this reason, we derived a new protocol from the directory-based cache coherence pro- tocols proposed for large-scale multiprocessors [9]. The protocol used in the

M’

offers data sharing a t the page or file block level and incorporates a distributed directory [9] with the ownership [lo] and wrikdelayed mech- anisms. In this protocol, each cached copy of a page is in one of the following 6 states:

1. W V (Invalid): The contents of this copy is invalid. 2. UNO (Unowned): The contents of this copy is valid but the CPU cluster is not the owner of this page.

3.

OED

(Owned exclusively and duty): The CPU c l u s ter owns a dirty copy of this page and no other CPU clusters has a duplicated copy of this page. 4. O E C (Owned exclusively and clean): T h e CPU

cluster owns a clean copy of this page and no other

(3)

CPUCluster M

CPUClUaBI 1

Backplane Messsge-Passing Bus

4

I

b

Figure 1: Hardware architecture

CPU clusters has a duplicated copy of this page 5 . OND (Owned nonexclusively and dirty): The CPU

cluster owns a dirty copy of this page and some other CPU clusters have a duplicated copy of this page.

6. ONC (Owned nonexclusively and clean): The CPU cluster owns a clean copy of this page and some

other CPU clusters have a duplicated copy of this page.

The transition between the states is a response to the action taken by the CPU cluster. The possible actions that a CPU cluster may take when operating are listed in the following.

Read: When a CPU cluster intends to read the contents of a page that it does not has a copy, it sends a Read request to the I/O device that stores the page. If the 1/0 device determines that no CPU cluster owns the page at this moment, then it sends a copy of the page to the requesting CPU cluster and the requesting CPU cluster enters the OEC state. Otherwise, the I/O device will notify the CPU cluster that owns the page to send a copy to the requesting CPU cluster. In the later case, the requesting CPU cluster will enter the UNO state.

of the

M Z

database machine

Read for Ownership: When a CPU cluster intends to update the contents of a page that it does not has a copy, it sends a Read for Ownership request to the 1/0 device that stores the page. The 1/0 device then notifies the CPU clusters that contain a cached copy of the page to invalidate their copies. If the page is currently owned by a CPU cluster, then the owner needs t o send a copy of the page to the requesting CPU cluster as well as write back t o the

I/O device. If no CPU cluster owns the page, then the I/O device will presents the requesting CPU cluster with a copy.

On

completion of all these actions, the requesting CPU cluster enters the OED

state.

Write for Invalidation: When a CPU cluster intends to update the contents of a page that it has a copy in UNO, ONC, or OND state, it sends a Write for Invalidation request to the 1/0 device that stores the page. The 1/0 device then notifies the CPU clusters that also contain a cached copy of the page t o invalidate their copies. If the page is currently owned by a CPU cluster, then the owner needs to

write back to the I/O device. On completion of all these actiops, the requesting CPU cluster enters the OED state.

(4)

Write: When a CPU cluster intends to update the contents of a page it owns exclusively, it needs to change the state of the copy to the OED state if the copy is originally in the OEC state.

the 14th Annual International Symposium on Com- puter Architecture, 1987.

[6]

D.

Cheriton, 11. A. Gooaen, and

P.

D. Boyle, “Paradigm: A Highly Scalable Shared-Memory Multicomputer Architecture”, IEEE Computer,

4 Development of

a

Prototype

Feb., 1991.

M2

We have been building a prototype

M z

machine since the Spring of 1991. In the prototype machine, the Multibus I1 [ 111 is employed as the backplane message-paasing bus and each CPU cluster comprises 2 Sparc CPUs along with a 64-megabyte cluster memory. The CPUs and the cluster memory are actually placed on two separate boards. the CPU board and the memory board. These

[7]

A.

Tevanian Jr., “Architecture-Independent V i t u d Memory Management for Parallel and Distributed Environments: The Mach Approach”, Ph.D. The- sis

,

Dept. of Computer Science, Carnegie-Mellon University, 1987.

[8] E. Levy and A Silberschatz, “Distributed File Sys- tems: Concepts and Examples”, ACM Computing Survey, Vol. 22, No. 4, December, 1990.

two boards are connected through a loeal bus separate

f r o d the Multibus 11. A s of today, we have completed the hardware design of the CPU cluster and moved to operating system development.

[9] D. Chaiken, C. Fields, K. Kurihara, and A. Agar- “Directory-Based Cache Coherence in Large- Scale Multiprocessors”, IEEE Computer, Vol. 23, No. 6, June, 1990.

5 Conclusion

[lo]

R. II.

_{and R.}Katz, _G._{Sheldon, “Implementing a Cache Consis-}S. J. Eggem, D. A. Wood, C. L. Perkins, tency Protocol”, Proc. of the 12th Annual Interna- tional Symposium on Computer Architecture, 1985. [ l l ] Intel Corporation, Multibus I1 Bus Architecture In this paper, we elaborated the architecture and operat-

ing system design df the M Z database machine. The primary design goal of the

M Z

is to provide a cost-effective solution to large database processing. The Ma machine

is well characterized by the following two features: Specification Handbook, Intel Corporation, 1984. 1.

A

cost-effective hierarchical multiprocessor architec-

ture that is scalable up to more than one hundred CPUs.

2. A directory-based coherence protocol for the distributed file system that maintains disk cache cc- herence a t the page or disk/file block level.

References

M. Abdelguerfi and A.

K.

Sood, “Database M b chines: Trends and Opportunities”, IEEE Micro, Vol. 11, No. 6 , December, 1991.

D. J. DeWitt and J. Gray, ”Parallel Database Sys- tems: The Future of Database Processing or a Pass- ing Fad?”, SIGMOD Record, Vol. 19, No. 4, Decem- ber, 1990.

IEEE, IEEE Standard Backplane Bus Spedce- tion for Multiprocessor Architectures: Futurebus+, IEEE Standard 896.2, 1991.

IEEE, IEEE Standard Backplane Bun Specification for Multiprocessor Architectures: fiturebus, IEEE Standard 896.1, 1987.

A. W. Wilson, “Hierarchical Cache/Bus Architec- ture for Shared Memory Multiprocessors“, Proc. of