
Chung Hua University
Master's Thesis

基於 SRB 核心技術的分散式檔案系統
(A Distributed File System Based on SRB Core Technology)

SRBFS: A User-Level Grid File System for Linux

Department: Master Program, Department of Computer Science and Information Engineering
Student ID / Name: E09002057 Chin-Chen Chu (朱金城)
Advisor: Dr. Ching-Hsien Hsu (許慶賢)

July 2007


SRBFS: A User-Level Grid File System for Linux

By

Chin-Chen Chu

Advisor: Prof. Ching-Hsien Hsu

Department of Computer Science and Information Engineering

Chung-Hua University, Hsinchu, Taiwan

July 2007


Abstract

Data grids are growing and diversifying rapidly, and object-based file systems are playing an increasingly important role in data-intensive computing. A high-performance file system is normally a key requirement for large cluster installations, where hundreds or even thousands of nodes frequently need to manage large volumes of data.

The distributed Storage Resource Broker (SRB) server architecture is designed to facilitate reliable file sharing and high-performance distributed and parallel data computing in a Grid across administrative domains by providing a global virtual file system.

In this thesis, we implement SRBFS, a user-level file system built on the FUSE framework that relies on an underlying file system for access to the disk. SRBFS is an attempt to implement a global virtual file system that supports the complete set of standard POSIX APIs while retaining the parallel and distributed data computing features of the distributed SRB server architecture. This thesis describes the design and implementation of SRBFS and evaluates its performance using several benchmarks. We also investigate the possibilities of a distributed model for the SRB server, which is based on SDSC's SRB client-server middleware. We present a comparison of the wide-area file transfer performance of SRBFS and SEMPLAR, an SRB-enabled MPI-IO library for access to remote storage. On the NCHC (National Center for High-Performance Computing) cluster, the ROMIO Perf benchmark attained an aggregate write bandwidth of 225 MB/s with 12 clients. The benchmark results are encouraging and show that the distributed SRB server provides applications with scalable, high-bandwidth I/O across wide area networks.

Keywords: Grid, object-based file system, SRB.


中文摘要 (Chinese Abstract)

Data Grids have been growing rapidly and becoming increasingly diverse, and in data-intensive computing, object-based file systems are playing an ever more important role. When building large clusters, a high-performance file system is a key requirement: once the number of nodes reaches the hundreds or thousands, large volumes of data must be managed very frequently. For the Grid computing environment, we design a global virtual file system based on a distributed Storage Resource Broker (SRB) architecture, providing reliable file sharing, high-performance data distribution, and parallel data transfer. SRBFS is an attempt to build a global virtual file system that supports the complete set of POSIX API calls while, through the distributed SRBFS server architecture, retaining the characteristics of parallel data transfer and distributed data computing.

This work designs and develops a distributed file system based on the core technology of the SRB middleware. We build a user-level virtual distributed file system using the FUSE middleware package, leveraging its file library and data transfer kernel module so that, within a hybrid file system framework, applications can transparently access files without restriction.

This thesis describes the design and implementation of the SRBFS file system and analyzes its performance using several benchmarking tools. It also evaluates the data transfer performance and feasibility of the distributed SRB storage node architecture.

On the cluster at the National Center for High-Performance Computing, we performed parallel data transfers across a real wide-area network and compared the file transfer performance of SRBFS and SEMPLAR. With twelve client machines transferring data in parallel, the ROMIO Perf benchmark measured an aggregate transfer bandwidth of up to 225 MB/s. These measurements demonstrate that data transfer over a real wide-area network is practical and that the distributed SRB configuration is feasible, scalable, and high-performance.

Keywords: Grid, object-based file system, Storage Resource Broker


Acknowledgements

I would like to thank my research advisor, Professor Ching-Hsien Hsu, for being a consistent source of support and encouragement. I would also like to thank NCHC researcher Barz Hsu for his invaluable help in developing SRBFS, his constant support, and his insightful comments.

Finally, I would like to thank the NCHC sites that provided access to the SRB infrastructure and the servers used for this work, as well as the anonymous reviewers who provided helpful and substantive comments.


Table of Contents

Abstract
中文摘要 (Chinese Abstract)
Table of Contents
List of Figures
List of Tables

Chapter 1. Introduction
1.1. Motivation
1.2. Objective
1.3. Thesis Organization

Chapter 2. Background and Related Works
2.1. GfarmFS
2.2. LegionFS
2.3. CodaFS
2.4. Google File System
2.5. AliEn File System
2.6. Swarmfs

Chapter 3. Motivation
3.1. Purpose and Approach
3.2. SRB and Linux
3.2.1. What is SRB - Storage Resource Broker
3.2.2. What is Linux
3.2.3. Linux Virtual File System

Chapter 4. SRBFS Design
4.1. Kernel Space or User Space
4.2. SRBFS Architecture
4.3. DBMS Issue

Chapter 5. Performance Evaluation
5.1. Experimental Environment
5.2. Bonnie
5.2.1. Bonnie Results
5.3. IOZONE Benchmark
5.3.1. IOZONE Results
5.4. ROMIO Perf Benchmark
5.4.1. ROMIO Perf Results

Chapter 6. Conclusions and Future Work

References
Appendix: Source Code


List of Figures

Figure 1: Grid File System Architecture
Figure 2: SRB use of heterogeneous storage
Figure 3: File system function call order
Figure 4: Major parts of SRBFS client
Figure 5: The ADIO Architecture
Figure 6: Experimental Diagram of SRB Cluster
Figure 7: The Experiment Topology
Figure 8: IDE, FC Disk Bonnie Results
Figure 9: SRBFS Bonnie Read Performance
Figure 10: IOZONE Read Performance
Figure 11: IOZONE Write Performance
Figure 12: Write performance comparison with varying buffer size
Figure 13: Read performance comparison with varying buffer cache size
Figure 14: SEMPLAR, SRBFS and NFS Perf Write Performance
Figure 15: SEMPLAR, SRBFS and NFS Perf Read Performance
Figure 16: Distributed SRB Server Perf Write Performance
Figure 17: Distributed SRB Server Perf Read Performance


List of Tables

Table 1: Mapping between SRBFS, MPI-IO, ADIO and SRB API
Table 2: IDE, FC Disk Bonnie Results
Table 3: SRBFS Bonnie Performance


Chapter 1. Introduction

1.1. Motivation

Data grids[1] are growing and diversifying rapidly. The DataGrid project aims to set up a computational and data-intensive Grid of resources for the analysis of data coming from scientific experiments. New, inexpensive storage technology is making terabyte- and petabyte-scale scientific data stores feasible; such stores need to be accessed both physically close to the place of data origin and by clients around the world.

SRB[2] stands for Storage Resource Broker, which provides a uniform interface for users to obtain data from a heterogeneous and distributed collection of data repositories. Efficient execution of network-bound, data-intensive distributed applications on computational grids can be challenging. SRB contains a database-like system for tracking data and replicating it between different servers. It is useful where large data sets are distributed across multiple systems but must be accessible transparently from any server through SRB clients. This does not directly apply to settings such as Bio-Mirror, where exact copies of all data are distributed to all servers and users download from only one server.

SDSC's Storage Resource Broker is middleware that provides data-intensive applications with a uniform API for accessing heterogeneous, distributed storage systems, including file systems, databases, and hierarchical and archival storage systems. SRB gives users the capability to access and aggregate massive quantities of data scattered across wide area networks.

Object-based file systems[3] are playing an increasingly important role in data-intensive computing. An object-based file system is composed of metadata and physical file entities; by extending the metadata, a file entity can carry much more meaning than just a single file. An earlier approach to building an object-based file system was to use a database and categorized directory entries controlled by a set of programs or system scripts. This approach is complicated and messy: programs with many interfaces to different storage systems are hard to maintain, and problems arise as the storage scales up. One of the challenges in high-performance computing is to provide users with reliable, remote data access in a distributed, heterogeneous environment. The increasing popularity of high-speed wide area networks and centralized data repositories leads to the possibility of direct high-speed access to remote data sets from within a parallel application.

The most commonly used approach to move data between machines is the remote storage access model. Tools such as SDSC’s SRB provide a way for applications to transparently access data stored in a variety of storage servers. These tools present to the application an interface that is very close to the standard Unix I/O API (i.e. with primitives such as open, read, write, close), which means they are simple to use and only minimal modifications are required to existing applications.

The approach is greatly simplified by having the collection own the data, using a logical name space to manage state information about each digital entity, and using a storage repository abstraction for the operations performed upon storage systems, an information repository abstraction for the operations used to manage a collection in a database, and an access abstraction to make it easy to support additional APIs.

Logical name spaces are used for digital entities (data virtualization), users (distinguished names managed by the collection), and resources (logical resource names to support operations on sets of resources). An object storage environment is provided, in which access to files is based on Unix file operations rather than disk/block operations.


1.2. Objective

In our work we focus on the design and implementation of SRBFS[4]. The SRBFS server is based on the SDSC SRB middleware. SRB is client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets. In conjunction with the Metadata Catalog (MCAT), it provides a way to access data sets and resources based on their attributes and/or logical names rather than their physical names or locations. SRBFS is a user-level file system built on the FUSE framework that relies on an underlying file system for access to the disk.

In this thesis we investigate the possibilities of a distributed model for the SRB server. Our contribution is to evaluate the performance of the distributed model[5] and to present a comparison of the wide-area file transfer performance of SRBFS and SEMPLAR[6].

1.3. Thesis Organization

The rest of the thesis is organized as follows: Chapter 2 summarizes related research and introduces several existing grid file systems. Chapter 3 presents our motivation and approach. Chapter 4 describes the SRBFS architecture and gives an overview of SRBFS and SEMPLAR. Chapter 5 presents our experimental setup and discusses the benchmark results. Finally, conclusions and future work are given in Chapter 6.



Chapter 2. Background and Related Works

Traditionally, data is shared among machines in a network using distributed and parallel file systems. File systems like AFS (Andrew File System)[7], NFS (Network File System)[8] and DFS (Distributed File System)[9] provide mechanisms to access remote data through POSIX interfaces. Traditional distributed file system technologies, like NFS, require specific mount points in a logical directory hierarchy.

These file systems are usually not scalable over a wide area network. They also do not have the concept of virtual organizations with heterogeneous policies.

Figure 1. Grid File System Architecture (applications and browsers access a POSIX-like Grid File System Service, backed by a virtual directory service that manages a hierarchical logical name space with ACLs and metadata over distributed data sources and data services)

Currently, various middleware systems provide file-system-style functionality for accessing data on the grid. This Grid file system architecture (see Figure 1) enables efficient, dependable, and transparent file sharing. The Grid File System Working Group (GFS-WG) of the Global Grid Forum[10] provides the specification of the Architecture of Grid File System Services. It specifies the hierarchical structure that facilitates federation and sharing of virtualized data from file systems in the grid environment by providing a virtual namespace that allows association of access control mechanisms and metadata with the underlying physical data sources.

There is a rich set of tools available in this category, and they are the closest to providing file system services. Middleware such as Globus[11], Grid Datafarm[12], and Legion[13] provide various POSIX primitives for accessing files in a data grid.

Scientific data analysis requires a complete abstraction of the logical hierarchy from the physical files; this is perhaps a key difference between traditional distributed file systems and Data Grid middleware.


Other researchers have worked on grid middleware services such as SRB.

Nallipogu et al.[14] proposed a mechanism to increase the data transfer throughput by pipelining the various data transfer stages, such as disk access and network transfer.

Bell et al.[15] optimized the SRB protocol by overlapping network communication and disk access.

2.1. GfarmFS

The Grid Datafarm architecture[12] is designed for global petascale data-intensive computing. It provides a global parallel file system with online petascale storage, scalable disk I/O bandwidth, and scalable parallel processing performance. The global parallel file system consists of the local disks of cluster nodes in a grid of clusters.

The Grid Datafarm (Gfarm) provides a Grid file system that federates multiple local file systems in a Grid across administrative domains. The Grid file system provides virtualized hierarchical namespaces for files, that is, a virtualized file system directory tree with virtual access control, flexible capability management, and other metadata associations. Users and Grid applications access the Grid file system via POSIX file I/O APIs and Gfarm-native file I/O APIs, with extensions for new file view semantics.

Each file in the virtual file system directory tree is mapped by a replica catalog to one or more physical file locations. File replicas bring fault tolerance as well as access contention avoidance. This Grid file system architecture enables efficient, dependable, and transparent file sharing in the Grid. The Gfarm project has also worked to standardize the Grid file system together with other projects in the Grid File System Working Group of the Global Grid Forum.

The Grid Datafarm architecture assumes that every cluster node has a local file system that will be federated by a Grid file system. It also supports worldwide parallel and distributed execution for high-performance data processing, introducing the Gfarm file (or superfile), a new process scheduling scheme called file-affinity scheduling based on file locations, and new parallel file access semantics. A group of files dispersed across several cluster nodes can be managed as a single regular file called a Gfarm file. File-affinity scheduling allocates a set of nodes based on the replica locations of the member files of a specified Gfarm file. A new file view, called the local file view, enables parallel access to the member files of a Gfarm file. File-affinity scheduling and the new file view enable the "owner computes" strategy, or "move the computation to the data" approach, for parallel and distributed data analysis of the member files of a Gfarm file in a single system image.

File replication of a Gfarm file means replicating each of its member files. Because each member file can be replicated without any dependency on the others, replication can proceed in parallel and independently. When member files are dispersed across different cluster nodes, file replication of a Gfarm file amounts to parallel, direct, third-party file replication from multiple cluster nodes to multiple other cluster nodes. Due to the direct local disk and network access on each cluster node, it is possible to attain scalable disk I/O performance for high-bandwidth file replication.

The Gfarm middleware is a reference implementation of the Grid Datafarm architecture, which is available at the Grid Datafarm web site.

GfarmFS is a module developed on the LUFS (Linux Userland File System)[17] framework. It enables users to mount the Gfarm system into their file system hierarchy, so that they can use Gfarm as if it were a local file system, which provides an easy way to use Gfarm.

2.2. LegionFS

Legion is an object-based grid operating system charged with reconciling a collection of heterogeneous resources, dispersed across a wide area, into a single virtual system image. Legion provides resource management, scheduling, and other system-level tasks, as does any operating system; however, it does so on a much wider scale. Built from the ground up, Legion addresses such issues as scalability, programming ease, fault tolerance, security, and site autonomy.

Legion is a middleware architecture that provides the illusion of a single virtual machine and the security to cope with its untrusted, distributed realization. From its inception, Legion was designed to deal with tens of thousands of hosts and millions of objects – a capability lacking in other object-based distributed systems.

LegionFS[18] has been designed and implemented to meet the security, fault tolerance, and performance goals of wide-area environments.

The peer-to-peer architecture eliminates performance bottlenecks by eliminating contention over a single component, yielding a scalable file system. Flexible security mechanisms are part of every file system component in LegionFS, which allows for highly-configurable security policies. The three-level, location-transparent naming system allows for file replication and migration as needed.

LegionFS works with the Legion object-to-object protocol as well as lnfsd, a user-level daemon designed to exploit UNIX file system calls and provide an interface between NFS and LegionFS.

2.3. CodaFS

Coda[19] is a distributed file system with many interesting features; for example, it supports disconnected and mobile operation.

The system consists of volumes. Each volume is provided by a group of replicated servers called a Volume Storage Group. Replication is ensured by the clients, which read from one server but write to all of them.

The clients also initiate recovery from the effects of temporary disconnection of some servers by comparing versions and starting resolution on the servers.

The clients use a write-back cache for the contents of the volumes, stored in a dedicated partition on their local disks. This cache is used by the kernel and controlled by a cache manager, which synchronizes it with the servers. Coda supports disconnected operation[20] by using the local cache even when the client is disconnected from the servers. The local changes are reintegrated with the servers when the client connects again.

Sharing semantics in Coda is session semantics: the changes made by a node are propagated to the server when the file is closed. It is necessary to detect conflicts and to provide the user with facilities to repair them.

Coda does not fulfill our goals completely because it does not allow the volumes to be accessed without caching. It also is not designed for the case when the client and server are on the same node. The conflict representation is not very good and resolution of conflicts requires a special tool.

2.4. Google File System

The Google File System[21] was designed by Google to meet its needs. It is designed to be high-performance for large files that are usually read or appended sequentially in large streams. It expects that components often fail. However, it does not implement a standard API such as POSIX.

Files are split into chunks of 64 MB, and these chunks are stored on the local file systems of chunk servers. The chunks are replicated on multiple chunk servers.

Metadata, such as the name space, the mapping from files to chunks, and chunk locations, are managed by a replicated master server. The master periodically checks whether the chunk servers are online. The master also manages replication: it makes placement decisions, creates new chunks, and performs other management tasks.

When a client wants to read a file, it sends the name and offset to the master, and the master replies with a chunk identifier and a list of chunk servers that the client should contact. The client then interacts with a chunk server that is closest to it.

When the client wants to write to the file, it asks the master which chunk server holds the lease (such a chunk server is called the primary) and which chunk servers hold the other replicas. The client pushes the data to all chunk servers and then sends the write request to the primary. The primary writes the data to the chunk and forwards the write request to all other replicas, which then write the data too. After all replicas have acknowledged the write, the primary replies to the client.

The Google File System expects that nodes are very loosely coupled, so it was not designed to support disconnected operation; a disconnected node cannot access any part of the system. Its design is very different from what would be required to fulfill our goals.

2.5. AliEn File System

AliEn File System (Alienfs)[22] integrates the AliEn file catalogue as a new file system type into the Linux kernel using LUFS, a hybrid user space file system framework. LUFS uses a special kernel interface level called VFS (Virtual File System) to communicate via a generalized file system interface to the AliEn file system daemon. The AliEn framework is used for authentication, catalogue browsing, file registration and read/write transfer operations. A C++ API implements the generic file system operations. The goal of AliEnFS is to allow users easy interactive access to a world-wide distributed virtual file system using familiar shell commands.

2.6. Swarmfs

Swarm[23] is a wide area peer file service that supports aggressive replication and composable consistency behind a file-system-like interface. Swarm builds failure-resilient dynamic replica hierarchies to manage a large number of replicas across network links of variable quality. It can manage replica consistency per file, per replica, or per session, at the granularity of an entire file or of individual pages.

Swarm exports a traditional session-oriented file interface to its client applications via a Swarm client library linked into each application process. The interface allows applications to create and destroy files, open a file session with specified consistency options, read and write file blocks, and close a session. A session is Swarm's unit of concurrency control and isolation. A Swarm server also exports a native file system interface to Swarm files via the local operating system's VFS layer. The wrapper provides a hierarchical file name space by implementing directories within Swarm files.

SwarmFS[24] is a wide area peer-to-peer file system with a decentralized but uniform file name space. It provides ubiquitous file storage to mobile users via autonomously managed file servers.


Chapter 3. Motivation

3.1. Purpose and Approach

In a computational grid, a data-intensive application will require high-bandwidth data access all the way to the remote data repository. The most commonly used approach to move data between machines is the remote storage access model. Tools such as SRB provide a way for applications to transparently access data stored in a variety of storage servers. These tools present to the application an interface that is very close to the standard Unix I/O API (i.e. with primitives such as open, read, write, close), which means they are simple to use and only minimal modifications are required to existing applications.

We use clusters (See Figure 2) as the reference architecture for our study because of their increasing role as a platform not only for high performance computation but also as storage servers. We choose SRB for our study because of its frequent use in connection with data-intensive computing applications.

Put simply, SRB presents the storage nodes as one big disk array, even when those nodes are deployed across the network.


Figure 2. SRB use of heterogeneous storage

In SRB, a resource is any physical storage resource, and the physical resources behind an SRB server can be of different kinds, reflecting the file system types that SRB supports. File system modules built on LUFS or FUSE[25] can also be used; for example, FTPfs can bind an FTP server to a system directory so that it appears as a file system. SRB can register such storage as additional physical resources and add them to logical resources; in total, SRB can combine more than 30 kinds of resources into a single logical resource.

A logical resource can be accessed through any SRB server, just like a big disk array spread across the network. Policies can be applied so that files are replicated as soon as they are put into the system; any request is automatically redirected to the correct physical resource, and the user does not need to care where the data resides. Either inQ or the Scommands can be used for storage resource selection, so a user can access any SRB node and still choose the same logical resource in which to put a file.

We implemented SRBFS, a user-level Grid file system. Operations may not be as fast through SRBFS as they are through SRB directly. However, SRBFS operations should be fast enough that users are willing to trade off some SRB performance against the familiarity, convenience, ease of use, and immediate integration with legacy systems that standard file system commands provide.

3.2. SRB and Linux

3.2.1. What is SRB - Storage Resource Broker

The SRB is used to implement data grids (data sharing), digital libraries (data publication), and persistent archives (data preservation). We are now working on integration with knowledge generation systems. The goal is to provide infrastructure that makes it possible to automate all interactions with data, including discovery, access, manipulation, publication, sharing, and preservation.

The approach is based upon the organization of the digital entities into a collection, and the management of the collection.

The approach is greatly simplified by having the collection own the data, using a logical name space to manage state information about each digital entity, and using a storage repository abstraction for the operations performed upon storage systems, an information repository abstraction for the operations used to manage a collection in a database, and an access abstraction to make it easy to support additional APIs.

Logical name spaces are used for digital entities (data virtualization), users (distinguished names managed by the collection), and resources (logical resource names to support operations on sets of resources). An object storage environment is provided, in which access to files is based on Unix file operations rather than disk/block operations.

3.2.2. What is Linux

Linux can be used as a monolithic kernel or as a modular kernel. It used to be a monolithic kernel only, which meant that all device drivers and file system drivers were placed in the kernel file. In newer Linux kernels, many drivers can be built as kernel modules, which are then loaded into the kernel when they are needed. When a module is loaded into the system, it is part of the kernel and works as if it had been in the kernel all the time, with the difference that it can be unloaded when it is no longer used.

The advantage of using kernel modules is that it is possible to develop device drivers and file system drivers without having to recompile the kernel and reboot to test it. It also makes it possible for companies to create their own kernel extensions.

This can very well be commercial software, where only the compiled module is given to the user.

The SRBFS client will be built as kernel modules and will therefore require the Linux kernel to have module support. All major Linux distributions today provide Linux kernels with support for modules, so this is not a problem.
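For reference, the canonical skeleton of a loadable module as described above looks roughly like the following. This is a minimal, illustrative sketch, not the actual SRBFS module; it only logs a message when it is loaded and unloaded:

    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");

    /* Called when the module is loaded into the running kernel. */
    static int __init example_init(void)
    {
        printk(KERN_INFO "example: module loaded\n");
        return 0;
    }

    /* Called when the module is unloaded; afterwards its code is no longer part of the kernel. */
    static void __exit example_exit(void)
    {
        printk(KERN_INFO "example: module unloaded\n");
    }

    module_init(example_init);
    module_exit(example_exit);

Such a module is compiled against the kernel headers and inserted and removed at run time with insmod and rmmod, without recompiling or rebooting the kernel.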

The source code of this work is based on Linux code released under the GNU General Public License. The license is free and allows one to change the source code for one's own use; rather than redistributing a modified source tree, one distributes patches against the original source code, and this is how this work is made available. All source code is written in the C programming language for Unix-like operating systems and environments. In some cases, assembler is used for platform-specific details.

3.2.3. Linux Virtual File System

The foundation of file systems in Linux is the inode. Inodes are what is saved on the disk; they contain information about directories and about all data stored in the files of the file system. File systems that do not have inodes, such as FAT (File Allocation Table, used for example by Windows and DOS), convert what is read from the disk into inodes in memory, so that Linux treats them like any other file system.

All file system drivers in Linux are placed as a layer underneath the Linux Virtual File System (Linux VFS). Linux VFS takes care of things that are common for all file systems and then calls the specific functions for the current file system.

File systems in Linux use the Virtual File System or VFS interface. Other Unix-based operating systems, such as Solaris and BSD, also have VFS interfaces.

The VFS is the layer in the kernel that all file system calls pass through. The abstraction provided by the VFS gives the kernel a consistent view of different file systems despite the often striking differences in their structure and operation. When file systems are mounted in the kernel, they are controlled by the VFS through structures of function pointers, similar to virtual function tables in C++. Each file system has its own function pointer tables. Linux's VFS interface defines tables for directory operations, address space operations, inode operations, and most file operations.

This internal interface of the VFS allows file systems to be installed without recompiling or reinstalling the kernel, since the tables can be set up while the kernel is running. The code describing a file system can then be contained in a loadable module, which will be described shortly. It is important to understand that these file operations are distinct from the system calls that share their names. User-level applications make system calls, such as open(), which eventually result in the VFS invoking the corresponding file operation. An open() system call will cause the open file operation to be invoked as the system call moves through the kernel, but the two are distinct from each other. Briefly stated, system calls exist at the user level and file operations exist in the kernel.

The Linux virtual file system consists of four major parts:

1) The super block structure and operations, which are used when a file system is mounted into the system.

2) The inode structure and operations, which take care of creating, opening, closing and other operations on inodes.

3) The file structure and operations, which deal with reading, writing and other operations on individual files in the file system.

4) The dentry (directory entry) structure and operations, which implement the directory cache in the Linux VFS. It keeps recently used directory entries in memory to improve system performance. This part is usually kept outside an implementation of a new file system.

Figure 3. File system function call order

Figure 3 shows the order in which the different parts of a file system call occur.

When the user program performs an open call, the open operation in the Linux VFS is called and executes those parts which are common to all file systems. This step is in some cases all that is needed, and the broken arrow shows that some file systems skip the last step. Other file systems need a specific open operation, here called our_open.

If this operation exists, Linux VFS calls it, and waits for it to finish. Linux VFS then takes care of the return value from the specific operation and returns this value to the user program.
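To make this dispatch pattern concrete, the following is a small user-space C sketch of the function-pointer-table idea; the names vfs_open and our_open follow Figure 3, and the program is only an illustration of the mechanism, not kernel code:

    #include <stdio.h>
    #include <stddef.h>

    /* A tiny imitation of the VFS: each file system supplies a table of function pointers. */
    struct file_ops {
        int (*open)(const char *path);   /* file-system-specific open; may be NULL */
    };

    /* Generic handling executed by the "VFS" for every file system. */
    static int vfs_open(const struct file_ops *ops, const char *path)
    {
        printf("VFS: common open handling for %s\n", path);
        if (ops->open == NULL)           /* some file systems skip the specific step */
            return 0;
        return ops->open(path);          /* call the file-system-specific operation */
    }

    /* File-system-specific operation, called our_open as in Figure 3. */
    static int our_open(const char *path)
    {
        printf("our_open: specific handling for %s\n", path);
        return 0;
    }

    int main(void)
    {
        struct file_ops our_fs   = { .open = our_open };
        struct file_ops plain_fs = { .open = NULL };

        vfs_open(&our_fs,   "/mnt/srbfs/file");  /* goes through our_open */
        vfs_open(&plain_fs, "/mnt/other/file");  /* generic handling only */
        return 0;
    }

The second call illustrates the broken arrow in Figure 3: when a file system supplies no specific operation, the common VFS handling is all that runs, and its return value is passed straight back to the caller.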


Chapter 4. SRBFS Design

4.1. Kernel Space or User Space

When writing a distributed file system, one of the important decisions is whether it should be completely located in the kernel space, or whether the majority of code should be in a user space daemon and the kernel should only contain a simple layer to redirect the VFS calls to the daemon and the replies back to VFS.

When the file system is completely in the kernel, it is faster because there is no additional indirection between the kernel and the file system. The file system also may use features of the kernel that can't be used from the user space and thus implement certain functions more effectively. On the other hand, programming and especially debugging in kernel space is more complicated.

Implementing the file system as a user-space daemon also has several advantages and disadvantages. The daemon is easier to write and debug. It can also be easily restarted or upgraded, and porting the daemon to another UNIX-like operating system is not difficult either. Moreover, the nodes that do not want to access the file system but only provide volumes to other nodes do not have to change their kernel. It also should not be much slower than the kernel solution, because the performance overhead of the indirection is small.

Obviously, a kernel module would undoubtedly be much faster because the user-level overhead would be avoided. Fortunately, there are two projects that resolve these issues and provide a simpler way to develop a file system without working directly at the kernel level: LUFS and FUSE. Since LUFS is no longer being actively developed, FUSE is the logical choice. Moreover, the developer is not limited to C or C++ when using FUSE; GmailFS is one of the better-known file systems that use the Python programming language in conjunction with FUSE.

4.2. SRBFS Architecture

FUSE[25] consists of a kernel module, a user-space daemon, utilities, and userland file systems implemented in shared libraries. The shared libraries are linked when the file system is mounted. If it were possible to run a network server in such a shared library, the main disadvantage would be the duplication of in-memory data structures and of network communication. The server in the shared library would be running only when the file system is mounted, so it would be necessary to run a separate daemon to process the requests from the other nodes. The other option would be to simply redirect the requests to the daemon via another communication link; however, that would be a superfluous indirection and would not be much simpler than implementing a kernel module.

The most powerful and efficient solution is to implement a kernel module. It does not have the disadvantages of the previous approaches and makes it possible to implement downcalls easily. It is also not hard to implement, so it was chosen. A character device is used to send packets between the kernel and the daemon; the communication protocol is the same as the protocol between nodes.

In the process of implementing an SRB client in a Linux system there are two major parts to port:

1) The connection between the Linux VFS and FUSE.

2) The communication between the FUSE and SRB client daemons.


Figure 4. Major parts of the SRBFS client (the SRBFS daemon, with libfuse and the SRB initialization code, runs in user space alongside the user process; the Linux VFS, the FUSE module and local file systems such as Ext2 sit in kernel space; on the server side are the SRB Master, SRB Agent, SRB server and metadata server)

As shown in Figure 4, the full procedure for communicating with the server side, including all intermediate steps, is implemented in the SRBFS daemons.

The complete Linux SRBFS client consists of three kernel-side components:

1) The FUSE core module, which implements the connection to the Linux VFS.

2) The SRBFS client module, which contains the file system initialization code.

3) The character device, used to exchange requests between the kernel and the SRBFS client daemon, which in turn communicates with the SRB server daemons.

SRBFS is implemented as a user-space server that is called by the FUSE kernel module. As such, the file and directory abstractions of SRBFS may be accessed independently of any kernel file system implementation through libraries that encapsulate SRB communication primitives. The implementation supports all basic file system I/O operations (open, read, write, close, etc.) so that legacy applications have direct access to SRB data. Any additional SRB-specific functionality may still be implemented, provided it does not conflict with the basic file system operations required by legacy applications.
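As an illustration of how these callbacks fit together, the following is a minimal sketch of registering the SRBFS operations from Table 1 with the high-level FUSE 2.x library. The backend_* helpers are placeholders standing in for the SRB client calls (srbObjOpen, srbObjRead, srbObjWrite, srbObjClose); they are not the actual SRB client API, and a real file system would also implement readdir and the remaining operations:

    /* Minimal illustrative sketch; build against FUSE 2.x, e.g.:
     *   gcc srbfs_sketch.c -o srbfs_sketch `pkg-config fuse --cflags --libs`
     */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Placeholder back end: in SRBFS these would wrap srbObjOpen/Read/Write/Close. */
    static int backend_open(const char *path) { (void)path; return 0; }
    static int backend_close(const char *path) { (void)path; return 0; }
    static int backend_read(const char *path, char *buf, size_t size, off_t off)
    { (void)path; (void)off; memset(buf, 0, size); return (int)size; }
    static int backend_write(const char *path, const char *buf, size_t size, off_t off)
    { (void)path; (void)buf; (void)off; return (int)size; }

    static int srb_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; }
        else                        { st->st_mode = S_IFREG | 0644; st->st_nlink = 1; }
        return 0;
    }

    static int srb_open(const char *path, struct fuse_file_info *fi)
    { (void)fi; return backend_open(path); }

    static int srb_read(const char *path, char *buf, size_t size, off_t offset,
                        struct fuse_file_info *fi)
    { (void)fi; return backend_read(path, buf, size, offset); }

    static int srb_write(const char *path, const char *buf, size_t size, off_t offset,
                         struct fuse_file_info *fi)
    { (void)fi; return backend_write(path, buf, size, offset); }

    static int srb_release(const char *path, struct fuse_file_info *fi)
    { (void)fi; return backend_close(path); }

    /* The operation table plays the same role as the VFS function-pointer tables. */
    static struct fuse_operations srbfs_ops = {
        .getattr = srb_getattr,
        .open    = srb_open,
        .read    = srb_read,
        .write   = srb_write,
        .release = srb_release,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &srbfs_ops, NULL);
    }

Mounting the resulting binary on a directory (for example ./srbfs_sketch /mnt/srb) causes the callbacks above to run whenever the kernel VFS forwards an operation through the FUSE character device.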

By re-exporting the mount point through an NFS server, SRBFS also supports existing applications remotely and easily allows data to be shared between several computers. This implementation provides legacy applications with seamless access to the SRB server.

However, SRBFS exposes the POSIX API, which is not well suited to the I/O needs of parallel scientific applications. To meet this need, the MPI Forum defined an interface for parallel I/O as part of the MPI-2 standard[26]. This interface is commonly referred to as MPI-IO. MPI-IO has many features specifically needed by parallel applications, including support for noncontiguous accesses and collective I/O. It is widely available, as essentially all MPI implementations, both vendor-supported and freely available, support it.

Thakur et al. defined an Abstract Device Interface for I/O (ADIO)[27], which is used to implement parallel I/O APIs. A parallel I/O API can be implemented portably across diverse filesystems by implementing it over ADIO; the ADIO interface is then implemented for each specific filesystem. This provides a portable implementation of the parallel I/O API while exploiting the specific high-performance features of individual filesystems. Figure 5 shows how MPI-IO can be implemented on multiple filesystems by using the ADIO framework.

Figure 5. The ADIO Architecture

SEMPLAR was developed at Ohio State University. It integrates the SRB remote filesystem with ROMIO[28], the MPI-IO implementation from Argonne National Laboratory. ROMIO uses the ADIO framework to implement parallel I/O APIs: a parallel I/O API can be implemented portably on multiple filesystems by implementing it on top of ADIO, and the ADIO implementation is then optimized for each filesystem. This framework enables the library developer to exploit the specific high-performance features of individual filesystems while providing a portable implementation of the parallel I/O API.

SEMPLAR provides a high-performance implementation of ADIO for the SRB filesystem. The ADIO implementation connects to the remote SRB server over TCP/IP to perform parallel I/O. Each cluster node performing I/O on a file stored in the SRB repository opens an individual TCP connection to the SRB server. The SRB ADIO implementation provides support for collective and non-collective I/O. Table 1 provides a mapping between some of the basic SRBFS, MPI-IO, ADIO and SRB API calls.

Table 1. Mapping between SRBFS, MPI-IO, ADIO and SRB API

SRBFS         MPI-IO API        ADIO API           SRB API
srb_open      MPI_File_open     ADIO_Open          srbObjOpen
srb_read      MPI_File_read     ADIO_ReadContig    srbObjRead
srb_write     MPI_File_write    ADIO_WriteContig   srbObjWrite
srb_release   MPI_File_close    ADIO_Close         srbObjClose
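To make the mapping in Table 1 concrete, a minimal MPI-IO fragment using the calls from the MPI-IO column could look as follows. The file name is illustrative and each rank writes its own file to keep the sketch simple; when the SRB ADIO driver is used, these calls would be carried out by the srbObj* functions in the rightmost column:

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        MPI_File fh;
        MPI_Status status;
        char buf[1024];
        char name[64];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 'x', sizeof(buf));
        snprintf(name, sizeof(name), "testfile.%d", rank);   /* one file per rank */

        /* MPI_File_open  -> ADIO_Open        -> srbObjOpen  (per Table 1) */
        MPI_File_open(MPI_COMM_SELF, name,
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* MPI_File_write -> ADIO_WriteContig -> srbObjWrite */
        MPI_File_write(fh, buf, (int)sizeof(buf), MPI_CHAR, &status);

        /* MPI_File_read  -> ADIO_ReadContig  -> srbObjRead  */
        MPI_File_seek(fh, 0, MPI_SEEK_SET);
        MPI_File_read(fh, buf, (int)sizeof(buf), MPI_CHAR, &status);

        /* MPI_File_close -> ADIO_Close       -> srbObjClose */
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }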

We investigate the possibilities of a distributed model for the SRB server and present a comparison of the wide-area file transfer performance of SRBFS and SEMPLAR, an SRB-enabled MPI-IO library for access to remote storage.


4.3. DBMS issue

SRB performance is still bound by the database: the more datasets are ingested, the slower SRB becomes. SDSC officially recommends Oracle as the MCAT database. PostgreSQL[29] is recommended as a "get your feet wet" choice, but a solution using PGCluster[30] has been worked out with acceptable performance.

PostgreSQL supports clustering in combination with PGCluster, which has the following features:

* PGCluster is a modified version of PostgreSQL and is integrated in the package. Nothing differs from PostgreSQL, including the startup procedure and the original configuration files, except for three additional configuration files.

* Synchronous replication.

* Multi-master architecture; there can be more than one load balancer and replication server.

* Automatic detection of master DB failure and promotion of one of the cluster DBs to master.

* A cluster DB can be recovered manually from the master DB once the failed cluster DB restarts.

Note that the architecture contains several roles, all equally important. As introduced earlier, the bottleneck is the database, so a clustered DBMS is needed; replication between slave databases and a load-balancing mechanism should also be covered by the DBMS. The MCAT server can run on the same host as the load balancer (the so-called "front-end host"), which reduces communication but increases the risk when that host fails, so a fail-over mechanism is also needed. A subproject of the Linux-HA[31] project called Heartbeat is used as the fail-over rescue for the primary front-end host.

The fail-over host runs in hot standby during normal operation; once the front-end host fails, the fail-over host takes over its IP address and starts the service right away. The whole process is started automatically and finishes within 30 seconds, with no manual intervention needed. All sessions connected to the failed front-end host are lost, but new sessions connect to the active fail-over host normally.

Figure 6. Experimental Diagram of SRB Cluster

Every storage node should have an SRB server installed and configured with the MCAT server hostname. In a Linux-based environment, service control scripts are very useful for controlling the DBMS, SRB and MCAT services.

A static library is also compiled when building the SRB server, and the header files can be found in the include directory. SRB provides several APIs, including C/C++, Java, Perl and Python. The advantage of using SRB on heterogeneous storage is its file redundancy and transparency. Metadata handling in this model remains a problem, since the network bottleneck slows the database down. Two approaches to mitigating this are database objects in SRB and metadata caches on local SRB nodes (see Figure 6).


Chapter 5. Performance Evaluation

5.1. Experimental Environment

The primary factor affecting SRB performance is the database; the secondary one is the network environment. NCHC (National Center for High-Performance Computing)[32] provides a high-speed network environment. Using HP ProLiant DL380 servers with Gigabit Ethernet as test machines, SRB shows very good performance in parallel file transfer. We ran several benchmarks to measure file system performance.

These tests were all run on four test systems:

1) MetaServer: Intel Xeon 3.3 GHz with 512 MB of RAM using an internal IDE hard drive. OS is Linux 2.6.9.

2) Storage Server: Dual Intel Xeon 3.06 GHz processors with 1024 MB of RAM using an external EMC Fibre Channel RAID5 disk array[33].

3) Storage Server: Dual Intel Xeon 3.06 GHz processors with 1024 MB of RAM using an internal IDE hard drive.

4) Client Node: Dual Intel Xeon 3.06 GHz processors with 2048 MB of RAM.

Except for the MetaServer, all of the tests were run under Linux kernel 2.4.21. To compare the different disk systems under the same system load, to identify the performance hit taken solely by the FUSE layer, and to isolate the performance loss due to SRBFS, we consider the following file systems:

1) Ext2 file system (IDE disk).

2) Ext2 file system (Fibre Channel disk array).

3) Ext2 file system (Fibre Channel disk) + srbfs: SRBFS running on top of ext2

We also performed all experiments in a real-world environment. The primary factor that affects distributed SRB performance is the network environment. NCHC provides a high-speed network environment, TWAREN (Taiwan Advanced Research and Education Network)[34]. The test environment consisted of a 10 Gbps link between Hsin-Chu and Tai-Nan with an RTT of 3.27 ms. Figure 7 shows the topology of our experiments.

1) Hsin-Chu SRB servers: production SRB servers (version 3.4.0) on srb1.nchc.org.tw and srb2.nchc.org.tw.

2) MCAT server: Intel Xeon 3.3 GHz with 512 MB of RAM using an internal IDE hard drive (Linux 2.6.9). The MCAT database is Oracle 10g.

3) NCHC PC cluster: a distributed-memory system consisting of 8 IBM xSeries 350 cluster nodes with a total of 16 Intel Xeon 2.0 GHz processors. Each node has two 2.0 GHz processors, 2 GB of memory, a 100Base-T Ethernet interface and one Gigabit Ethernet interface. The nodes run the Red Hat Enterprise Linux operating system.

4) Tai-Nan SRB servers: beta SRB servers (version 3.4.0) on srb3.nchc.org.tw and srb4.nchc.org.tw, each with dual Intel Xeon 3.06 GHz processors and 1024 MB of RAM using an internal IDE hard drive.

Figure 7. The Experiment Topology


A variety of performance tests were run on SRBFS to evaluate its performance.

These tests include Bonnie[35], which performs various operations on a very large file, and the IOZONE benchmark[36], which evaluates the file system's performance with varying I/O request block sizes for very large files and is used to tune the buffer size parameters for optimal throughput. In order to validate the distributed model, we implemented a prototype of the distributed SRB server and used the ROMIO Perf[28] benchmark to evaluate it.

5.2. Bonnie

The Bonnie benchmark tests I/O speed on a big file. It writes data to the file using character based I/O, rewrites the contents of the whole file, writes data using block based I/O, reads the file using character I/O and block I/O.

Bonnie runs five different workloads that show the performance difference between reads versus writes and block versus character I/O. One workload sequentially reads the entire file a character a time; another writes the file a character at a time. Other workloads exercise block-sized sequential reads, writes, or reads followed by writes (rewrite). For each workload, Bonnie reports throughput, measured in KB per second.

Therefore, Bonnie is actually a disk benchmark. Bonnie focuses on disk I/O but does not vary other aspects of the workload, such as the number of concurrent I/Os.

The program takes arguments for the size of the file it will use and the path where it will be stored. Bonnie then creates the large file and performs many sequential operations on it, including writing character by character, writing block by block, rewriting, reading character by character, reading block by block, and seeking. For this test, Bonnie was run with a file size of 1 GB.


5.2.1. Bonnie Results

The results of the Bonnie benchmark are presented in the following figure and table.

Figure 8. IDE, FC Disk Bonnie Results

Table 2. IDE, FC Disk Bonnie Results

FS Type              CharWrite (KB/s)   BlockWrite (KB/s)   Rewrite (KB/s)   CharRead (KB/s)   BlockRead (KB/s)
Ext2 FS (IDE Disk)   22307              27105               14848            16434             80761
Ext2 FS (FC Disk)    31298              117456              44910            28397             102623
SRBFS (FC Disk)      25360              49945               30210            25403             49426


Figure 8 shows the performance of the different Bonnie phases on the three file systems. For each phase and machine, we compare the ext2 file system with an IDE disk, the ext2 file system with a Fibre Channel disk array, and SRBFS running on top of ext2 with a Fibre Channel disk array. The performance metric is KB/s as measured by Bonnie. The point is to measure how much performance is lost by using SRBFS and how this loss breaks down into FUSE overheads.

It is important to point out that in some cases layering ext2 below FUSE actually increases its performance. We expect that this is due to buffering effects, as there is now an additional process that can buffer. The largest impact on block read and block write speed is on machines with a fast Fibre Channel controller, due to caching effects.

SRBFS can also be used behind an NFS file server, so the user has the look and feel of a standard NFS file system. In the following table, Ext2FS and SRBFS denote measurements taken on the local file system; NFSv3 denotes measurements over NFS protocol 3 using the UDP protocol.

Table 3. SRBFS Bonnie Performance

FS Type   CharWrite (KB/s)   BlockWrite (KB/s)   Rewrite (KB/s)   CharRead (KB/s)   BlockRead (KB/s)
Ext2FS    37825              70965               108806           24578             2017137
SRBFS     0                  0                   0                25034             1682826


Figure 9. SRBFS Bonnie Performance

Figure 9 shows that the results are very good for the read tests. On the other hand, write performance is poor for both character-oriented and block-oriented I/O. This is probably due to the fact that SRBFS does not use a caching mechanism.

5.3. IOZONE Benchmark

IOZONE is a filesystem benchmark tool. It generates and measures a variety of file operations and is useful for a broad filesystem analysis of a computer platform. The benchmark tests the filesystem's performance with varying I/O request block sizes for very large files.

These tests were executed on client nodes for both the local file system and the network file system in our test environment. Additional IOZONE tests were performed on the storage and metadata servers in order to determine the baseline performance of our supporting storage hardware. The hardware performance analysis was done by writing a file larger than the amount of RAM available on the system in order to avoid any caching effects. The client tests wrote 8 GB files.

5.3.1. IOZONE Results

The results of the IOZONE benchmark are presented in the following figures. Figures 10 and 11 show the baseline performance of the individual storage system components running the Ext2 file system locally, which is indicated in each figure for comparison with SRBFS. The HP DL380 storage server, with the attached EMC CX600 RAID5 disk array, exhibited an average read bandwidth across all I/O request sizes of 82.4 MB/sec and a write bandwidth of 111.5 MB/sec. The metadata server exhibited an average read bandwidth across all I/O request sizes of 40 MB/sec and a write bandwidth of 41.2 MB/sec.


Figure 10. IOZONE Write Performance (write bandwidth in MB/sec versus I/O request size from 64 KB to 16384 KB, for EXT2FS on the metadata server, EXT2FS on the HP DL380 storage server, NFS, and SRBFS)

Figure 11. IOZONE Read Performance (read bandwidth in MB/sec versus I/O request size for the same four configurations)

In order to tune the buffer size parameters for optimal throughput, we study the effects of varying the buffer size. We run IOZONE from the NCHC cluster nodes with SRBFS as the client file system. One factor that generally affects SRBFS I/O performance is the buffer size used for transferring data.


SRBFS buffer size tuning is a convenient way of obtaining maximum throughput.

The results of the IOZONE benchmark are presented in the following figures.

The performance analysis was done by writing and reading files of varying sizes smaller than the amount of RAM available on the system. Because of buffer cache effects, for file sizes of 32 MB to 512 MB we can see unrealistically high read and write performance, greater than the network bandwidth.

Figure 12 shows the write performance results when varying the buffer size. Increasing the size of the buffer improves performance, showing the benefit of larger sequential transfers. Figure 13 shows the read performance; the plot reaches a plateau at around 580 MB/sec, and once the file size exceeds 256 MB the performance does not increase further.

Figure 12. Write performance comparison with varying buffer size

Figure 13. Read performance comparison with varying buffer cache size

5.4. ROMIO Perf Benchmark

ROMIO Perf is a sample MPI-IO program included in the ROMIO source code; the Perf benchmark is also provided with the standard Open MPI distribution. It measures the read and write performance of a filesystem. Every process writes a 1 MB chunk at a fixed location determined by its rank and then reads it back later. The chunk size is user-defined, and there is no overlap between chunks. We ran this benchmark with files of 64 MB.
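The access pattern described above can be sketched as follows. This is a simplified illustration of the Perf-style pattern (fixed per-rank offsets, write then read back), not the actual ROMIO source; the file name and the timing output are illustrative:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK (1024 * 1024)   /* 1 MB chunk per process, as in Perf */

    int main(int argc, char *argv[])
    {
        MPI_File fh;
        MPI_Status status;
        int rank;
        char *buf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(CHUNK);

        MPI_File_open(MPI_COMM_WORLD, "perf.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* Each rank writes its 1 MB chunk at a fixed, non-overlapping offset... */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_File_write_at(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK, MPI_BYTE, &status);
        MPI_Barrier(MPI_COMM_WORLD);
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("write phase took %f seconds\n", t1 - t0);

        /* ...and later reads the same chunk back. */
        MPI_File_read_at(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK, MPI_BYTE, &status);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }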

5.4.1. ROMIO Perf Results

We examine the scalability of the SRB server in distributed setups. In the base distributed configuration we use a single server and 12 clients as a testbed and compare SRBFS, SEMPLAR and NFS. Figures 14 and 15 show the SRBFS, SEMPLAR and NFS I/O results for the ROMIO Perf benchmark on a single SRB server.

Figure 14 shows that SEMPLAR write throughput scales as the number of clients increases, reaching a maximum of about 158 MB/s with 8 clients. The SRBFS results show scaling behavior similar to SEMPLAR, and both clearly have much better scalability than NFS. Figure 15 provides the read comparison of SEMPLAR and SRBFS, whose performance also scales well with the number of clients.


Figure 14. SEMPLAR, SRBFS and NFS ROMIO Perf Write Performance

Figure 15. SEMPLAR, SRBFS and NFS Perf Read Performance


Figures 16 and 17 show the results for our multi-node SRB server setup with the ROMIO Perf benchmark. The scaling from 1 to 4 SRB servers is nearly linear, attaining an aggregate write bandwidth of 225 MB/s with 12 clients and 4 SRB servers.

Figure 16. Distributed SRB Server Perf Write Performance

Figure 17. Distributed SRB Server Perf Read Performance


Chapter 6. Conclusions and Future Work

We have described the design and implementation of SRBFS, a comprehensive Grid file system for Linux. SRBFS is implemented as a user-level file system using FUSE. When running on top of the standard Linux Ext2 file system, its overhead is quite low for typical modes of use. SRBFS buffer size tuning is a convenient way of obtaining maximum throughput. We examined the scalability of the SRB server in distributed setups; in the base distributed configuration we used a single server and 12 clients as a testbed and compared SRBFS, SEMPLAR and NFS.

Both SRBFS and SEMPLAR show good performance scaling as the number of clients increases.

This thesis also describes a distributed model for the data grid environment; we believe this model is one of the feasible approaches to improving overall performance for the grid community.

We compared the wide-area file transfer performance of SRBFS and SEMPLAR for access to remote storage. The bandwidth observed in these benchmarks scaled with the number of clients and SRB servers.

On the NCHC cluster, the ROMIO Perf benchmark attained an aggregate write bandwidth of 225 MB/s with 12 clients. The benchmark results are encouraging and show that the distributed SRB server provides applications with scalable, high-bandwidth I/O across wide area networks.

As future work, we are considering several extensions and applications for SRBFS. Methods designed to speed up execution, such as metadata caching, still need to be implemented.


References

[1] The DataGrid project, http://eu-datagrid.web.cern.ch

[2] Chaitanya Baru, Reagan Moore, Arcot Rajasekar, and Michael Wan. The SDSC Storage Resource Broker. In Proceedings of the IBM Centers for Advanced Studies Conference, IBM, 1998.

[3] A. Azagury, V. Dreizin, M. Factor, E. Henis, D. Naor, N. Rinetzky, O. Rodeh, J. Satran, A. Tavory, and L. Yerushalmi. Towards an object store. In The 20th IEEE Symposium on Mass Storage Systems, pages 165–, 2003.

[4] Chin-Chen Chu, Ching-Hsien Hsu and Kai-Wen Lee, "SRBFS: A User-Level Grid File System for Linux," Proceedings of the 2nd Workshop on Grid Technology and Applications (WoGTA'05), pages 31-36, Dec. 2005.

[5] Chin-Chen Chu and Ching-Hsien Hsu, "Performance Evaluation of Distributed SRB Server," Proceedings of the 3rd Workshop on Grid Technology and Applications (WoGTA'06), pages 181-186, Dec. 2006.

[6] N. Ali and M. Lauria, "SEMPLAR: High-Performance Remote Parallel I/O over SRB," 5th IEEE/ACM International Symposium on Cluster Computing and the Grid, Cardiff, UK, May 2005.

[7] John H. Howard. An overview of the Andrew File System. In Proceedings of the USENIX Winter Conference, pages 23–26, Berkeley, CA, USA, Jan. 1988.

[8] B. Callaghan, B. Pawlowski, and P. Staubach. RFC 1813: NFS version 3 protocol specification, Jun. 1995.

[9] An Overview of DFS, http://www.transarc.com/Library/documentation/dce/1.1/dfs_admin_gd_1.html

[10] Grid File System Working Group (GFS-WG), http://phase.hpcc.jp/ggf/gfs-rg/

[11] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128, 1997.

[12] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi. Grid Datafarm architecture for petascale data intensive computing. In Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), pages 102–110, 2002.

[13] Andrew S. Grimshaw, William A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, Jan. 1997.

[14] E. Nallipogu, F. Ozguner, and M. Lauria. Improving the Throughput of Remote Storage Access through Pipelining. In Proceedings of the Third International Workshop on Grid Computing, pages 305–316, 2002.

[15] K. Bell, A. Chien, and M. Lauria. A High-Performance Cluster Storage Server. In Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, 2002.

[16] Osamu Tatebe. Grid Datafarm Architecture and Standardization of Grid File System. ISGC 2004, Taipei, Jul. 2004.

[17] Linux Userland FileSystem, http://lufs.sourceforge.net/lufs

[18] B. White, M. Walker, M. Humphrey, and A. Grimshaw. LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications. In Proceedings of Supercomputing 2001, Denver, CO, Nov. 2001.

[19] M. Satyanarayanan et al. Coda. http://www.coda.cs.cmu.edu, 2004.

[20] J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. In Proceedings of the 13th Symposium on Operating Systems Principles, pages 213–225, Oct. 1991.

[21] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 29-43, 2003.

[22] A. J. Peters, P. Buncic, and P. Saiz. AliEnFS - a Linux File System for the AliEn Grid Services, these proceedings, THAT005.

[23] S. Susarla and J. Carter. DataStations: Ubiquitous transient storage for mobile users. Technical Report UUCS-03-024, University of Utah School of Computer Science, Nov. 2003.

[24] S. Susarla and J. Carter. Flexible Consistency for Wide Area Peer Replication. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS 2005), pages 199–208, Jun. 2005.

[25] Filesystem in Userspace, http://sourceforge.net/projects/avf

[26] MPI Forum. MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html, 1997.

[27] R. Thakur, W. Gropp, and E. Lusk. An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 180–187, 1996.

[28] ROMIO: A High-Performance, Portable MPI-IO Implementation. http://www.mcs.anl.gov/romio

[29] PostgreSQL, http://www.postgresql.org

[30] PGCluster, http://pgcluster.projects.postgresql.org

[31] High-Availability Linux project, http://linux-ha.org

[32] National Center for High-Performance Computing, http://www.nchc.org.tw

[33] EMC CLARiiON Disk Systems data sheet, http://www.emc.com/products/systems/clariion_pdfs/C1075_cx_series_ds.pdf

[34] Taiwan Advanced Research and Education Network, http://www.twaren.net

[35] The Bonnie Benchmark, http://www.textuality.com/bonnie

[36] IOzone Filesystem Benchmark, http://www.iozone.org
