Deep scientific computing requires deep data

W. T. C. Kramer, A. Shoshani, D. A. Agarwal, B. R. Draney, G. Jin, G. F. Butler, and J. A. Hules

Increasingly, scientific advances require the fusion of large amounts of complex data with extraordinary amounts of computational power. The problems of deep science demand deep computing and deep storage resources. In addition to teraflop-range computing engines with their own local storage, facilities must provide large data repositories of the order of 10–100 petabytes, and networking to allow the movement of multi-terabyte files in a timely and secure manner. This paper examines such problems and identifies associated challenges. The paper discusses some of the storage systems and data management methods that are needed for computing facilities to address the challenges and describes some ongoing improvements.

Introduction

Deep scientific computing has evolved to the integration of simulation, theory development, and experimental analysis as equally important components. The integration of these components is facilitating the investigation of heretofore intractable problems in many scientific domains. Often in the past, only two of the components were present:

Computations were used to analyze theoretical ideas and to assist experimentalists with data analysis. Today, however, beyond each component informing the others, the techniques in each domain are being closely interleaved so that science investigations increasingly rely on simulations, observational data analyses, and theoretical hypotheses virtually simultaneously in order to make progress.

High-performance computing is now being integrated directly into some experiments, analyzing data while the experiment is in progress, to allow real-time adaptation and refinement of the experiment and to allow the insertion of human intuition into the process, thus making it very dynamic. When computational models operate in concert with experiments, each can be refined and corrected on the basis of the interplay of the two. The integration of computing with the other investigative methods is improving research productivity and opening new avenues of exploration.

In many cases, investigations have been limited by the computational power and data storage available, and these constraints, rather than the scale of the question being studied, have determined the resolution of a simulation or the complexity of an analysis. As available computational power, memory, and storage capacity increase, investigations can be expanded at a natural scale rather than being constrained by resources. But deep scientific computing can still be constrained by an inadequate capability to cope with massive datasets. In order to handle massive amounts of data, attention must be paid to the management of temporary and long-term storage, at the computing facility and elsewhere, and to networking capabilities to move the data between facilities.

An important aspect of the challenge of deep computing is the fact that today and in the foreseeable future no computational system can hold all needed data using on-line, local disk storage. As discussed later, for many applications, each step of a simulation produces gigabytes (GB) to terabytes (TB) of data. A deep computing system is used by multiple applications and for many time steps, so any delay in being able to move and access the data means under-utilizing the computational resource. Thus, a key subsystem in every facility involved in deep computing is a large data archive or repository that holds hundreds of terabytes to petabytes (PB) of storage. These archives are composed of a hierarchy of storage methods ranging from primary parallel disk storage to secondary robotic tape storage to possibly tertiary shelf-based tape storage. The efficient management of data on such systems is essential to making the computational systems effective.

While large-scale computational and storage systems have been in place for decades, it is only in the past 20 years that networking began to change the way in which computing is performed. Initially, this was done via remote log-in access over connections that were relatively slow compared to the computing power and storage of the time. Since the mid-1990s, networking capabilities have evolved to the point that they have significantly changed the way in which large-scale resources are used. Indeed, the explosion of raw Internet bandwidth is enabling people to envision new paradigms of computing. One such new paradigm is Grid computing with the Open Grid Service Architecture [1]. Flexible access to computing and storage systems is now being implemented as a part of the Grid. This paper does not deal specifically with Grid issues, but concentrates on the underlying functions and methods required to enable distributed systems to reach their full potential.

Network capabilities have seen many fundamental improvements in hardware, such as the change from copper-based networking to optical-fiber-based networking. Although these hardware improvements are expected to continue into the future, the performance of the networking protocols that were designed to operate on significantly lower-speed networks has not grown with the network capacity. End host paths from memory to the network also have often not kept pace with the improvements in the network capabilities. These lags have caused serious limitations in the end-to-end efficiency and utilization of applications running on the network. End-to-end networking technology must now keep pace, or it will not be able to match the exponentially increasing computational power of new systems and the dramatic increases in storage capacity. Middleware associated with the Grid introduces even more demands on the underlying data and network infrastructure. Furthermore, the protection of intellectual and physical assets in a networked environment is critical.

Because of the intense on-demand needs of many applications, a new requirement is emerging—the widespread deployment of high-performance network connections within and across shared networks. This requirement is different from the principles that led to the scalable Internet, which has grown rapidly over the last 10 to 15 years. The use of these network connections by multiple scientific fields will entail new concepts of fairness and new modes of network operation, monitoring, and management. The need for new solutions is heightened by the rapid development of Grids, which implicitly assume that adequate networks capable of quantifiable high performance will be available on demand for priority tasks.

In short, the use and movement of deep data adds another level of complexity to high-performance computing. This paper discusses several of the challenges posed by the need to handle massive amounts of scientific data at the very high end, and describes some possible approaches for doing so. It also examines the interplay among the three elements that make up deep computing: computation, storage, and networking. Unless these three are balanced, high-end computing will be less effective in addressing future needs.

The rest of this paper is organized as follows: The first section discusses examples of the applications that drive deep-data science in order to identify the capabilities and services needed to support them. The second section deals with methods of providing and managing associated large data repositories. The third section discusses networking problems that prevent deep data from flowing efficiently through the network and presents some methods of recognizing and resolving these problems. These sections include the following themes:

● Deep science applications must now integrate simulation with data analysis. In many ways this integration is inhibited by limitations in storing, transferring, and manipulating the data required.

● Very large, scalable, high-performance archives, combining both disk and tape storage, are required to support this deep science. These systems must respond to large amounts of data—both many files and some very large files.

● High-performance shared file systems are critical to large systems. The approach here separates the project into three levels—storage systems, interconnect fabric, and global file systems. All three levels must perform well, as well as scale, in order to provide applications with the performance they need.

● New network protocols are necessary as the data flows are beginning to exceed the capability of yesterday's protocols. A number of elements can be tuned and improved in the interim, but long-term growth requires major adjustments.

● Data management methods are key to being able to organize and find the relevant information in an acceptable time. Six methods are discussed that can be built into the applications and eventually into the underlying storage and networking infrastructure.

● Security approaches are needed that allow openness and service while providing protection for systems. The security methods must understand not just the application levels but also the underlying functions of storage and transfer systems.


● Finally, monitoring and control capabilities are necessary to keep pace with the system improvements. This is key, as the application developers for deep computing must be able to drill through virtualization layers in order to understand how to achieve the needed performance.

Applications that drive deep data science

Ideally, users would like all resources to be virtualized and not to have to deal with storage or transfer components and issues. However, in deep computing, the virtualization scheme breaks down because of the sheer magnitude of the problems and data. Virtualization implementations are typically targeted to more general cases in magnitude and intensity. Hence, for deep computing, the user and application often have to know much more about the implementation and details of the features and functions of a system than they would like. The following examples demonstrate that it is no longer possible to consider an application as limited by computation, networking, or storage. All are needed simultaneously. The main reason for using distributed computing resources is that the data size and/or the computational resource requirements of the task at hand are too large for a single system. This motivates the sharing of distributed components for data, storage, computing, and network resources, but the data storage and computational resources are not necessarily sited together. In this section, we describe several phases of the scientific exploration process that illustrate such requirements in order to identify the capabilities and services needed to support them, emphasizing end-to-end performance from the user's point of view.

Data production phase

Many scientific projects involve large simulations of physical phenomena that are either impossible or too expensive to set up experimentally. For example, it is too expensive to set up combustion or fusion experiments to investigate the potential benefit of a new design or to discover design errors. Instead, detailed simulations, usually involving high granularity of the underlying mesh structures, are used to screen candidate designs. Similarly, simulations are used in climate modeling because it is impossible to recreate climate phenomena accurately in a laboratory. In high-energy physics, simulations are conducted before the actual multi-billion-dollar experiments in order to design the hardware and software systems necessary to process the data from the experiment.

The above examples are typical of simulations that produce multi-terabyte datasets from long-running parallel computations. Providing such simulations with adequate computing resources may involve a single site with a large computing facility or an aggregation of multiple computing resources at multiple sites. There must be disk storage resources large enough to hold the simulation data and fast enough to keep pace with its generation so that the computing resources are used effectively. Furthermore, the data must be moved to deep archives as it is generated in order to free up the disk storage as rapidly as possible.

An example of generating a large volume of data during the simulation phase is a colliding black hole simulation [2] performed at the National Energy Research Scientific Computing Facility (NERSC). The collision of two black holes and the resulting gravitational waves were simulated. Since the gravitational wave signal that can be detected by interferometers in the field is so faint as to be very close to the level of noise in these devices, the simulated wave patterns are important tools for data interpretation. The code used performs a direct evolution of Einstein's general relativity equations, which are a system of coupled nonlinear elliptic hyperbolic equations that contain millions of terms if fully expanded. Consequently, the computational and storage resource requirements just to carry out the most basic simulations are enormous. These simulations had been limited by both the memory and the CPU performance of today's supercomputers.

One of the simulations, depicted in Figure 1, used 1.5 TB of RAM and more than 2 TB of disk storage space per run on the NERSC IBM SP* system. Runs typically consumed 64 of the large-memory nodes of the SP (containing a total of 1,024 processors) for approximately 48 wall-clock hours at a stretch. In the space of three months, these simulations consumed 400,000 CPU hours, simulating one full orbit before coalescence. Not only was this simulation very intensive in memory, on-line I/O, and CPU requirements, but it had extreme networking needs as well. In addition to the challenge of moving so much data off the computational engine to a storage archive, the entire application was designed to be interactively monitored and steered using advanced visualization tools.
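As a rough cross-check of these figures, the following back-of-envelope sketch relates the per-run resources to the quoted three-month total; the 16-processors-per-node figure is inferred from the node and processor counts above, not stated in the text.

```python
# Back-of-envelope check of the black hole simulation figures quoted above.
# Assumption (not from the paper): the 1,024 processors are spread evenly
# across the 64 large-memory nodes, i.e., 16 CPUs per node.

nodes_per_run = 64               # large-memory SP nodes per run
cpus_per_node = 1024 // 64       # inferred: 16 CPUs per node
hours_per_run = 48               # wall-clock hours per run

cpu_hours_per_run = nodes_per_run * cpus_per_node * hours_per_run
total_cpu_hours = 400_000        # consumed over three months

print(f"CPU-hours per run: {cpu_hours_per_run:,}")                           # ~49,000
print(f"Implied number of runs: {total_cpu_hours / cpu_hours_per_run:.1f}")  # ~8
```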

For example, at the 2002 Supercomputing Network Bandwidth Challenge competition (SC2002), this application used almost 17 gigabits per second of data bandwidth for the full application to be visualized in real time across systems at seven sites in four different countries [3]. Now that the concept has been successfully demonstrated, future efforts to expand the time scale of the simulation for a more complete understanding are expected to require 5 TB of RAM, 10 TB of on-line disk storage per run, and more than six million CPU hours.

Another example of an intensive data production phase is in the area of climate modeling. A recent data production run completed the first 1,000-year control simulation of the present climate [5] with the new Community Climate System Model (CCSM2) [6] developed at the National Center for Atmospheric Research (NCAR). This simulation produced a long-term, stable representation of the earth's climate. Few climate models in the world can achieve this combination of accuracy, consistency, and performance; previous simulations contained too much drift to allow a complete, uncorrected simulation of 1,000 years. Computationally, the full CCSM2 code is complex, consisting of five integrated models that are organized to execute concurrently within a single job. The components exchange data at various frequencies appropriate to the large-scale physical processes being simulated through a “flux coupler” component. Each simulated year requires 6 GB of data to feed the next step in the simulation, and many intermediate files are produced. The requirements for this ongoing effort increase by a factor of 2 or more every year.
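To make the growth rate concrete, here is a minimal sketch of what the stated 6 GB per simulated year and factor-of-2 annual growth imply for a 1,000-year control run; the five-year projection horizon is an illustrative assumption.

```python
# Projection of the CCSM2 storage requirement described above. The 6 GB per
# simulated year and the factor-of-2 annual growth come from the text; the
# five-year projection horizon is an illustrative assumption.

gb_per_simulated_year = 6
control_run_years = 1_000
baseline_tb = gb_per_simulated_year * control_run_years / 1_000   # ~6 TB per run

for year in range(6):
    projected_tb = baseline_tb * 2 ** year
    print(f"year {year}: ~{projected_tb:,.0f} TB per 1,000-year control run")
```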

Data post-processing phase

Post-processing involves running application programs to interpret simulated or observed data. While some applications, such as the black hole simulation, use very little input data in the data production phase, the post-processing phase requires access to entire datasets generated by simulation programs or experiments. Post-processing components must be capable of performing the computation at the sites where the data is located or moving the data and the computation to a common site.

Depending on the amount of data to be moved, this phase may be very lengthy. However, in many applications it is possible to overlap the movement of the input data with the computation, if the interpretation programs do not require all of the data at once. The interpretation programs may generate datasets larger than the input datasets.

An example of the post-processing of experimental data is the work of the Nearby Supernova Factory (SNfactory) [7]. Discovering supernovae as soon as possible after they explode requires imaging the night sky repeatedly, returning to the same fields every few nights, and then quickly post-processing the data. The most powerful imager for this purpose is the charge-coupled device (CCD) camera built by the Jet Propulsion Laboratory.

This camera delivers 100 MB of imaging data every 60 seconds, and an upgraded version of the camera will more than double this. The new images are computationally compared to images of the same field using digital image subtraction to find the light of any new supernovae. Because the amount of data is so large (50 GB per night per observatory, or 18.6 TB per year), the image archive even larger, and the computations so extensive, it is critical that the imaging data be transferred to a large computing center (in this case NERSC) as quickly as possible. The refined data is then analyzed and compared to theoretical simulations in order to select candidate stars to watch more closely. The candidate list is then distributed to observatories around the world. This time-centered processing has resulted in a dramatic increase in the rate of detection of Type Ia supernovae, which now averages more than eight per month. This project brought new understanding of the universe and its fate, concluding almost a century of debate—one of the key scientific discoveries in recent times. The project is now contributing to the design of future experiments, such as the Supernova/Acceleration Probe (SNAP), a satellite which is now being developed.
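The core of the detection step is the image subtraction mentioned above. The following is a minimal, illustrative sketch of that idea; a production pipeline would also align the images, match their point-spread functions, and calibrate fluxes, none of which is shown here.

```python
# Minimal sketch of digital image subtraction for supernova candidate
# detection: subtract a reference image of the field from a new image and
# flag unusually bright residual pixels. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(100.0, 5.0, size=(512, 512))     # archival image of the field
new_image = reference + rng.normal(0.0, 5.0, size=(512, 512))
new_image[200, 300] += 500.0                             # hypothetical new point source

difference = new_image - reference
threshold = difference.mean() + 10.0 * difference.std()  # crude detection threshold
candidates = np.argwhere(difference > threshold)
print("candidate pixels (row, col):", candidates.tolist())
```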

Another example of the need for post-processing is the U.S. Department of Energy (DOE) Coupled Climate Model Data Archive, which makes output data from DOE-supported climate models freely available to the climate-modeling community [8]. This is the largest single collection of publicly available climate model output.

Results from the NCAR-coupled general circulation models, PCM (the Parallel Climate Model) and CCSM2, are currently available. The data in these archives has been post-processed from the original output data so that it can be stored and accessed in a database that makes the data more convenient and useful to climate researchers.

The post-processed data includes various summaries of the data as well as an inverted representation of the data, organized as time series per variable. Consequently, the volume of the post-processed data exceeds the volume of the original simulated data.

Current network limitations affect users of this climate data collection in two ways. First, and foremost, individual researchers generally download subsets of this data to their own remote sites for inclusion in their own specialized analysis programs. Hence, slow networks can limit the amount of data analyzed in a practical way.

Second, because of the volume of post-processed data, several days are often necessary to transfer the contents of an entire simulation from NCAR mass storage to the NERSC High Performance Storage System* (HPSS). This is a substantial fraction of the time required to generate the post-processed data. In the future, projects such as the Earth System Grid [9] offer the prospect of supporting efficient distributed access to the collection. In this vision, model data would reside on storage media at the supercomputing center that produced the data. Metadata catalogs and interpretation programs would provide a seamless interface to the database, hiding the distributed nature of the underlying files. For this concept to be practical, however, network speeds must be increased substantially over current rates.

Data extraction and analysis phase

This phase generally involves the exploration of selected subsets of the data in order to gain insights into the data and to reach and present new conclusions about the data.

The storage of data away from the site where the analysis is being done forces the use of a distributed computing model. For example, consider the need to create a sequence of images of the temperature variation over some region of the world for a certain ten-year period.

The simulation may contain data for the entire globe over hundreds of years for 20 to 30 different variables in addition to temperature. The problem here is to extract the subset of the data needed, perhaps from multiple archives, and move it to the visualization site. The main capabilities and services required in this case are computing and disk storage resources. But the amount of space needed is only for the selected subset, typically a small fraction of the original dataset. Applications filter and extract the desired data at the location where the data resides, and move only the filtered data to the client's site. Assembly of the filtered data requires invoking an assembly application program and handing it the filtered subsets of the data.
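The following is a minimal sketch of that filter-at-the-source pattern for the temperature example above; the array shape, index ranges, and variable name are illustrative assumptions, and an in-memory array stands in for the remote archive.

```python
# Sketch of "filter at the source, move only the subset": extract a ten-year,
# regional slice of one variable from a much larger simulated dataset. The
# dataset shape and index ranges below are illustrative, not from the paper.
import numpy as np

# Hypothetical archive: 300 simulated years x 12 months x 90 x 180 grid cells.
temperature = np.zeros((300, 12, 90, 180), dtype=np.float32)

def extract_subset(data, years, lats, lons):
    """Runs at the site holding the data; only the result crosses the network."""
    return data[years[0]:years[1], :, lats[0]:lats[1], lons[0]:lons[1]].copy()

subset = extract_subset(temperature, years=(120, 130), lats=(30, 60), lons=(0, 45))
print(f"full dataset: {temperature.nbytes / 1e9:.2f} GB, "
      f"transferred subset: {subset.nbytes / 1e6:.2f} MB")
```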

A large-scale data analysis effort involving hundreds to thousands of collaborators worldwide is typical in several high-energy and nuclear physics experiments. One example that recently started full production is the STAR detector (Solenoidal Tracker At RHIC) at Brookhaven National Laboratory. The data analysis and simulation studies require the extraction of subsets of the data to be used by 400–500 collaborators from about 50 institutions worldwide. The post-processing phase takes the raw data from the detector and reconstructs particle tracks, momenta, and other data about collisions. In the data extraction and analysis phase, physics results are derived by carrying out statistical analysis of large numbers of particle collisions that must be extracted from archived files.

The STAR detector produces more than 300 TB of data per year, and it is only one of the experiments at the Relativistic Heavy Ion Collider (RHIC). All told, the four experiments at RHIC produce between 1 and 1.5 PB of data per year, and newer experiments will be even more voluminous. In 2007, when the Large Hadron Collider (LHC) goes on line at CERN in Europe, just one of the experiments, ATLAS (A Toroidal LHC ApparatuS), is expected to produce 1.5 PB of raw data per year. ATLAS is the largest collaborative effort ever attempted in the physical sciences, with 2,000 physicists participating from more than 150 universities and laboratories in 34 countries. The direct interaction of so many widely dispersed collaborators is made possible by tools for efficiently accessing, organizing, and automatically managing massive datasets.

Another rapidly growing area of science that will require efficient data extraction and analysis tools is genomics and bioinformatics. While currently relatively small in data requirements compared with some of the other disciplines, bioinformatics has large and highly distributed data needs that are growing at exponential rates. A single assembly of the fish Fugu rubripes [10], done with the JAZZ Genome Assembler [11] created by the DOE Joint Genome Institute, generated 30 GB of data files and used 150 GB of working space—and this species has an unusually small genome for a vertebrate.

The leading sequencing facilities are now able to sequence one or more organisms a day, and the rate of increase with new technology is such that more and more raw sequences are being produced. Research in comparative genomics will require the extraction of datasets from the genomes of many different species. Projections are that within five years, many sites will have 100 TB of genomic data stored in the form of assembled and annotated genomes. If the raw image data were completely saved in digital form, the data requirements could be as much as 1,000 times greater.

Dynamic data discovery process

In the above phases, we assume that all of the input data for a computational job is available prior to execution. However, there is a growing trend toward more adaptive simulations, in which the input data required by an analysis depends on the results of just-executed computations. We can refer to this method of exploration as dynamic data discovery. A good example of this process is data mining, in which the researchers initially may not know exactly what they are looking for, but they want to find and map correlations and see which correlations represent significant trends. Agile data access and management techniques are a necessity for this kind of research, and detailed pre-planning of the data transfers is not always possible.

Another instance of dynamic data discovery is the running of simulations of different resolutions or initial conditions simultaneously with mutual feedback between the simulations so that they can refine each other's results.

For example, typical global climate models today cannot resolve very narrow current systems (including fronts and turbulent eddies) that play a crucial role in the transport of heat and salt in the global ocean, nor can they resolve important sea ice dynamics that occur in regions of complicated topography, such as the Canadian Archipelago. Feeding data from the global model into a higher-resolution regional model, then transferring the regional results back to the global model, could increase the precision and accuracy of climate simulations.

Still another important example of dynamic data discovery is computational steering of a simulation on the basis of analysis or visualization of the current results. With interactive visualization, the client may choose to “stir” the simulation parameters using visualization-based tools, or to zoom in to obtain higher-granularity data for a more limited space. In this case, it may be necessary to change the plan of execution on the basis of observations of partial results. Computational steering requires that a control channel to the executing service be open, that the execution process be interruptible, and that a new or modified plan can be submitted. As in the previous examples, there is no implication that logical subtasks have to be performed in a sequential fashion. On the contrary, all subtasks should be performed in parallel if possible.
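As a concrete illustration of those three requirements (an open control channel, an interruptible execution process, and a modifiable plan), here is a minimal sketch of a steerable simulation loop; the in-memory dictionary standing in for the control channel and all parameter names are assumptions made for illustration.

```python
# Minimal sketch of computational steering: between time steps the simulation
# polls a control channel for new parameters or an interrupt, so a running
# job can be re-steered without being restarted. A dictionary of scheduled
# requests stands in for the real control channel (socket, file, or service).

def run_steerable_simulation(steps, poll_control):
    params = {"resolution": 1, "stop": False}
    for step in range(steps):
        update = poll_control(step)           # check the control channel
        if update:
            params.update(update)             # apply the new or modified plan
        if params["stop"]:
            print(f"interrupted at step {step}: checkpointing and exiting")
            break
        # ... advance the simulation one step at the current resolution ...
        print(f"step {step}: resolution={params['resolution']}")

# Example: a user zooms in after step 2 and stops the run at step 4.
requests = {2: {"resolution": 4}, 4: {"stop": True}}
run_steerable_simulation(steps=10, poll_control=lambda step: requests.get(step, {}))
```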

Table 1 summarizes the current and projected storage requirements [12] for several DOE Office of Science scientific disciplines using NERSC. Each discipline has multiple simulations and analyses going on simultaneously.

The user’s view: End-to-end performance and function

An important goal in building deep computing capabilities that address the needs of scientists is providing responsiveness to the user. From the user's point of view, performance is measured by the time between the initiation of an action and its completion. In the case of a data transfer, this may be the period of time from the point at which the user issues the command to transfer the data to the point at which the data is available for use on the target machine. In order to do scientific processing effectively, the entire data path—including the path through the machine, the storage, the archives, and the networks—must present as little delay and as much bandwidth as possible so that operations complete in a timely manner. In the simple case, data flows from the memory of a source system, through a network interface card, over the local network, through the network interface of the destination system, and into its memory (and perhaps to on-line storage). Even in this simple model, there are a number of potential bottlenecks.

The simple model presented above is rarely the reality. The data path is usually much more complex and involves many more components, including routers, storage systems, archives, and other networks. Consider the example of a user working on a large-scale system located at a different site. The system may support computation and/or experimental analysis for any of the previously discussed phases. The user has a desktop and also a small server system that has relatively modest data and computational capability. The large-scale system generally has the data archived in a mass storage system. Other storage resources may also be assigned temporarily to the job to run the computation. In the simplest case, the computing and storage resources are all in one system, and the internal switch fabric is used. More often, the mid- to long-term storage is provided by another system, connected by an Internet Protocol (IP) or Fibre Channel [13] network. When the application must move the simulation data to the archive, the data is organized into a large number of files whose movement to the archive is reliable and verifiable. After files are moved to the archive, the temporary storage is released automatically (garbage collection) for other uses. File movement from temporary storage to the archive can start as soon as each file is generated, which requires monitoring and progress reporting of file movement. A long-lasting job that may take many hours cannot be expected to be restarted in case of partial failure, so checkpoint and restart capabilities must also be supported. Also, in contrast to the simple case, the data must now flow through routers in the network. A router buffers each packet as it arrives and then sends it to the next router along the path toward the destination. At each step or hop, the packet may be redirected, broken into smaller parts, rejected, delayed, or just lost. The paths traveled by packets are determined by the routing protocols and the current status of the network.

Sometimes different packets from the same data stream take different paths, at different speeds, and with different numbers of hops. It is not possible for the user to determine whether any of this is occurring, but any step along the way may affect and degrade end-to-end performance. Another factor in the achieved performance over the network is the transport protocol: Many protocols include some form of flow and/or congestion control which can limit their sending rate.
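A crude way to make this end-to-end view quantitative is to model the user-visible transfer time as the wire-rate time, degraded by a protocol efficiency factor, plus per-hop forwarding delay. The sketch below does that; the formula and all example numbers are illustrative assumptions rather than measurements from the paper.

```python
# Rough model of user-visible transfer time: size over the effective rate the
# protocol actually achieves, plus per-hop forwarding delay. All inputs are
# illustrative assumptions.

def transfer_time_hours(size_gb, bottleneck_gbps, efficiency, hops, per_hop_ms):
    wire_seconds = size_gb * 8.0 / (bottleneck_gbps * efficiency)  # efficiency < 1
    routing_seconds = hops * per_hop_ms / 1000.0
    return (wire_seconds + routing_seconds) / 3600.0

# A 2 TB result set over a 10-Gb/s path driven at 30% efficiency, 12 hops:
print(f"{transfer_time_hours(2000, 10, 0.30, 12, 2.0):.1f} hours")   # ~1.5 hours
```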

Data repositories for HPC systems

As mentioned above, any high-performance computing (HPC) facility supporting deep science must have multiple subsystems. There are one or more computing platforms, local data storage at the computing platforms, a data repository archive, visualization and other pre- and post-processing servers, local networking, and connections to one or more wide-area networks. Some facilities also have robotic tape storage. Figure 2 shows the logical diagram of one such site—the flagship supercomputing facility of the DOE Office of Science, the NERSC Facility [14], located at the Lawrence Berkeley National Laboratory. The challenge at such facilities is to make efficient use of the available resources while performing the computations in an effective and timely manner. This challenge requires efficient individual storage components and software that can manage the combination of these components effectively. In the remainder of this section, we discuss the design of storage components and the software to manage the data on the storage resources and stage it for computation effectively. Topics include storage systems, unified file systems, and managing large datasets in shared storage resources.

High Performance Storage System

The High Performance Storage System* (HPSS) [15] is one of several systems that serve to provide data storage and archive repositories. While not as common as some commercially oriented systems such as TSM** [16], SAMFS** [17], and VERITAS** [18], HPSS has been shown in the course of ten years of service to be effective, reliable, and highly scalable. It has replaced the Cray Data Migration Facility (DMF) [19] as arguably the most popular storage system used at supercomputing facilities.

Table 1  Storage requirements for selected scientific applications.

Scientific discipline | Near term | Five years | More than five years
Climate | Currently there are several data repositories, each of the order of 20 to 40 TB. | Simulations will produce about 1 TB of data per simulated year. There will be several data repositories, each from 1 to 5 PB. | More detailed and diverse simulations. There will be several data repositories of the order of 10 PB each.
High-energy physics | Between 0.5 and 1.2 PB per experiment per year with five to ten experiments. Need network rates of 1 Gb/s. | 1 PB or more per experiment per year with five to ten experiments. Need network rates of 1,000 Gb/s. | Exabytes (1,000 PB) of data with wide-area networking more than 1,000 Gb/s.
Magnetic fusion | 0.5 to 1 TB per year with networking (for real-time steering and analysis) of 33 Mb/s per experimental site (three sites planned). | 100 TB of data with network rates at 200 Mb/s per experimental site. | Hundreds of TB.
Chemistry | Simulations produce 10–30-TB datasets. | Each 3D simulation will produce 30–100-TB datasets. | Large-scale molecular dynamics and multi-physics and soot simulations produce 0.2 to 1 PB per simulation.
Bioinformatics | 1 TB. | 1 PB. |

HPSS was developed as a collaborative effort involving IBM and six national research laboratories: the DOE (Department of Energy) Lawrence Livermore, Los Alamos, Sandia, Lawrence Berkeley, and Oak Ridge national laboratories, and the NASA (National Aeronautics and Space Administration) Langley Research Center. It has been in production service since 1996 at several sites, and is now used for large-scale data repositories at more than 25 different HPC organizations.

While HPSS has many novel features, it is instructive to look at its design, evolution, and usage, since it is typical of systems that have to meet the requirements of deep computing sites.

HPSS design

HPSS is designed to move, store, and manage large amounts of data reliably between high-performance systems. The system provides very scalable performance that is close to the maximum of the underlying physical components and will track improvements in these components into the future. It must be parallel in all regards in order to achieve the high-performance goals required to move terabytes of data in a reasonable time period. It provides security and reliability and supports a wide range of hardware technology that can be upgraded independently. Thus, it is modular in design and function and treats the network as the primary mechanism for data movement. Rather than designing a system that was tied to the computational resource, the collaboration realized that HPSS had to be designed as a modular system, but also had to stand alone as a system itself. The HPSS architecture follows the IEEE Storage System Reference Model, Version 5 [20]. This model was developed from the experiences of several older archive storage systems such as the MSS system developed at NASA Ames Research Center, the Common File Storage System developed at Los Alamos National Laboratory, and others.

HPSS treats all files as bit streams. The internal format of a file is arbitrary and is defined by the application or originating system. HPSS stores all of the bits associated with a fileset on physical devices that are arranged in a hierarchy according to physical performance characteristics. There can be any number of hierarchies, which usually consist of different-speed disk and tape devices. A storage hierarchy is a strategy for moving data between different storage classes. Storage classes may consist of a single storage technology, such as a single type of tape, or multiple types of media. For example, one class may be very-high-speed RAIDs (Redundant Arrays of Inexpensive Disks) with parallel hardware interfaces, while another class is composed of slower, cheaper disks with more capacity. Most sites have one or more storage classes that use tape as the storage media. Often the tape is automatically managed with robot tape libraries such as those from StorageTek (StorageTek Corporation, Louisville, CO) or IBM.

Figure 2  The NERSC system for deep science computing. The system contains one or more very large computational platforms (in this case a 10-Tflop/s IBM SP), a set of small computational systems (in this case several IA-32 clusters that range up to hundreds of CPUs), and a very large data archive repository (here HPSS, consisting of eight STK robots and 15 TB of Fibre Channel cache disk, with a maximum capacity of almost 9 petabytes). The networking consists of a major local-area network as well as one or more major wide-area connections.

A typical HPSS configuration is shown in Figure 3. Data flows over the network to a cache disk storage device.

Then, following the site policy, one or more copies of the data move to lower storage classes, which presumably consist of cheaper but slower storage devices. Once data moves, it is deleted from the original storage class, making room for new data to move to the higher storage class.

This process is called migration.

It is also possible to segregate types of files into different hierarchies. For example, very large files may be handled by placing them on very fast, very expensive disks, but then migrating them to tape media designed to hold large amounts of data. HPSS provides both serial and parallel access to data stored in its storage classes.
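A toy model may help make the storage-class and migration ideas concrete. The sketch below keeps new files in the fastest class and migrates the oldest files down when that class fills; the class names, capacities, and the oldest-first policy are illustrative assumptions, not HPSS internals.

```python
# Toy model of a two-level storage hierarchy with migration: new files land in
# the disk cache, and when the cache exceeds its capacity the oldest files are
# migrated to the tape class and removed from the cache. Names, capacities,
# and the oldest-first policy are illustrative only.

class StorageClass:
    def __init__(self, name, capacity_gb, next_class=None):
        self.name = name
        self.capacity_gb = capacity_gb
        self.next_class = next_class
        self.files = []                       # (filename, size_gb), oldest first

    def used_gb(self):
        return sum(size for _, size in self.files)

    def store(self, filename, size_gb):
        self.files.append((filename, size_gb))
        while self.used_gb() > self.capacity_gb and self.next_class is not None:
            old_name, old_size = self.files.pop(0)        # migrate the oldest file
            self.next_class.store(old_name, old_size)
            print(f"migrated {old_name}: {self.name} -> {self.next_class.name}")

tape = StorageClass("tape robot", capacity_gb=9_000_000)
disk_cache = StorageClass("disk cache", capacity_gb=100, next_class=tape)
for i in range(6):
    disk_cache.store(f"timestep_{i:03d}.h5", size_gb=30)   # simulation output files
```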

HPSS functions

In order to make HPSS work, a number of functions are implemented by servers (a minimal sketch of the resulting name-to-volume lookup follows this list):

● The mover manages the transfer of bits from one device to another. Devices may be tape drives, disk drives, network interfaces, or any other media. HPSS can support third-party transfers, in which the mover just manages the transfer and the data flows directly from source to destination.

● Devices and data paths are prepared for data movement by a set of core servers. One core server is the name server, which maps a human-readable file name to a system-generated bitfile ID. The name server also manages relationships among files, which may be grouped and joined together into directory hierarchies like a standard file system.

● Once the name of a file is mapped to a file ID, it must be mapped to the class of physical media on which it resides. The storage server does this. The storage server finds the actual physical device (say, a tape in a tape robot) and gets it mounted so that the transfer can begin.

● The physical volume library (PVL) manages all of the physical media in a system. It works with the physical volume repository (PVR) and other software and hardware to locate tapes and cause them to be mounted in the appropriate drives so that the mover can access them. The PVRs issue the tape library commands that manage the actual media in the system.

● The migration manager is responsible for migrating data from one medium to another. This includes moving data from disk to tape to free up disk space, and migrating data from one tape to another (often more dense) tape. Tape migration is done to consolidate tapes as files are deleted and to move data from old to new tape media, thus allowing automatic conversion.

● The storage system manager allows operators and administrators to manage the storage system. It provides a GUI (graphical user interface) which displays the status of servers, tasks, and resources. It passes commands to the system manager to perform operations on HPSS components, including allocation and configuration of resources, and initialization and termination of tasks.

● A location server provides a mechanism for clients to locate servers and gather information about HPSS systems.
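The sketch referenced above traces the chain those servers implement: a human-readable name is mapped to a bitfile ID by the name server, the storage server maps that ID to a storage class and physical volume, and the PVL/PVR can then mount the medium for a mover. The dictionaries and identifiers below are made up for illustration; they are not HPSS data structures or APIs.

```python
# Illustrative lookup chain: file name -> bitfile ID (name server) -> storage
# class and physical volume (storage server). The tables and identifiers are
# invented for this sketch and do not reflect HPSS's actual data structures.

NAME_SERVER = {"/home/alice/run42/output.h5": "bitfile-000123"}
STORAGE_SERVER = {"bitfile-000123": {"storage_class": "tape", "volume": "T04417"}}

def locate(path):
    bitfile_id = NAME_SERVER[path]             # name server: name -> bitfile ID
    placement = STORAGE_SERVER[bitfile_id]     # storage server: ID -> class/volume
    return bitfile_id, placement

bitfile_id, placement = locate("/home/alice/run42/output.h5")
print(f"{bitfile_id} is in class '{placement['storage_class']}' "
      f"on volume {placement['volume']}")
# At this point the PVL would ask a PVR to mount the volume so a mover can read it.
```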

Figure 3  Typical of HPSS configurations for serving high-performance scientific centers, the NERSC HPSS system consists of a complex mesh of hardware and software, tied together with the HPSS software.

HPSS provides multiple ways to interface with the system. The most basic are the File Transfer Protocol (FTP) and a high-performance parallel version (PFTP) created for HPSS. Another interface, HSI (Hierarchical Storage Interface), was developed to support more user-friendly commands and graphical user interface interactions.

HPSS also provides the ability to export HPSS data as network file systems (NFSs), distributed file systems (DFSs), and extended file systems (XFSs). HPSS supports the Data Migration Application Programming Interface (DMAPI) and is currently implementing standard grid interfaces.

Security was a major design goal for HPSS. Its components communicate in an authenticated and, when necessary, encrypted manner. It has auditing abilities beyond standard UNIX** and can enforce features such as access control lists.

The performance of HPSS has been demonstrated to be highly scalable, and the system is highly reliable, even while it handles 20 million or more files and petabytes of data. Transfer rates to and from HPSS run up to 80% of the underlying media rate, including Gigabit Ethernet.

Technologies needed for unified file systems

An emerging issue for many sites involved with deep computing is the waste and inefficiency surrounding the use of on-line disk storage. Currently, each high-performance system must have a large amount of local disk space to meet the needs of applications with large datasets. This file space is limited, and often applications and projects must live within quotas. When users access different machines, they typically make complete copies of the application and data needed. Thus, having separate storage on each machine is inefficient both in storage and in productivity. The main reason why it persists as the norm is that only local storage provides the I/O rates needed by deep applications. While file systems such as NFS and other distributed file systems are convenient, they lack the high performance and scalability required.

Fortunately, the confluence of several technologies is making it possible to address this problem more robustly.

At the base technology level, new disk storage devices and Fibre Channel fabric switches make it possible to attach a single device to multiple systems. Network-attached storage (NAS) and storage-area networks (SANs) provide some fundamental building blocks, although not many operate in a truly cross-vendor manner yet. Finally, shared or cluster file systems provide a system-level interface to the underlying technology. Combinations of the new technologies are beginning to approach the performance rates of locally attached parallel file systems.

File system technologies

Without reliable, scalable, high-performance shared file systems, it will be impossible to deploy the center-wide file systems needed at deep computing sites to support activities such as interactive steering and timely visualization.

There are currently two major approaches for sharing storage between systems: network-attached storage (NAS) and storage area networks (SANs).

Network-attached storage is a general term for storage that is accessible to client systems over general-purpose IP networks from the local storage of network-attached servers. Since data transfers between storage servers and clients are performed over a network, such file systems are limited by network bandwidth, network protocol overhead, the number of copy operations associated with network transfers, and the scalability of each server. The file system performance is often constrained by the bandwidth of the underlying IP network.

Storage area networks provide a high-performance network fabric oriented toward block storage transfer protocols and allow direct physical data transfers between hosts and storage devices, as though the storage devices were local to each host. Currently, SANs are implemented using Fibre Channel (FC) protocol-based high-performance networks employing a switched any-to-any fabric. Emerging alternative SAN protocols, such as iSCSI and SRP (SCSI RDMA Protocol), are enabling the use of alternative fabric technologies, such as Gigabit Ethernet and Infiniband, as SAN fabrics. Regardless of the specific underlying fabric and protocol, SANs allow hosts connected to the fabric to directly access and share the same physical storage devices. This permits high-performance, high-bandwidth, and low-latency access to shared storage. A shared-disk file system will be able to take full advantage of the capabilities provided by the SAN technology.

Sharing file systems among multiple computational and storage systems is a very difficult problem because of the need to maintain file system coherency through the synchronization of file system metadata operations and coordination of data access and modification. Maintaining coherency becomes very challenging when multiple independent systems are accessing the same file system and physical devices.

Shared file systems that directly access data on shared physical devices through a SAN are commonly categorized as being either symmetric or asymmetric. Asymmetric shared file systems allow systems to share and directly access data, but not metadata. In such shared file systems, the metadata is maintained by a centralized server that provides synchronization services for all clients accessing the file system. Symmetric shared file systems share and directly access both data and metadata. Coherency of data and metadata is maintained through global locks which are maintained either by lock servers or through distributed lock management performed directly between participating systems.
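To illustrate the asymmetric model described above, the following sketch has every client obtain block locations and a shared lock from a single metadata server and then read the blocks directly from the shared devices; in a symmetric design the clients would instead update metadata themselves under distributed locks. The classes, block map, and lock scheme are illustrative and do not describe any particular file system.

```python
# Schematic of an asymmetric shared file system: metadata (block locations,
# locks) goes through one central server, while data is read directly from the
# shared SAN devices. Everything here is an illustrative model, not an
# implementation of any real file system.

class MetadataServer:
    def __init__(self):
        self.block_map = {"results.dat": [17, 18, 19]}    # file -> SAN block numbers
        self.shared_locks = {}

    def open_for_read(self, client, filename):
        self.shared_locks.setdefault(filename, set()).add(client)  # grant shared lock
        return self.block_map[filename]

SAN_BLOCKS = {17: b"chunk-A", 18: b"chunk-B", 19: b"chunk-C"}       # shared devices

def client_read(client, filename, metadata_server):
    blocks = metadata_server.open_for_read(client, filename)   # metadata via server
    return b"".join(SAN_BLOCKS[b] for b in blocks)              # data read off the SAN

metadata_server = MetadataServer()
print(client_read("compute-node-07", "results.dat", metadata_server))
```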

Shared file systems are not common, because implementing them is inherently difficult. Implementations of asymmetric unified file systems are the more common of the two types, because centralized metadata servers are easier to implement than distributed metadata management. However, symmetric shared file systems promise better scalability without the bottleneck and single-point-of-failure problems inherent in the asymmetric types.

Storage technologies

A new category of storage systems called utility storage is being introduced by new storage vendors such as 3PARdata** [21], Panasas ActiveScale Storage Cluster** [22], and YottaYotta** [23], who promise to deliver higher levels of scalability, connectivity, and performance.

These storage systems use parallel computing and clustering technology to provide increased connectivity and performance scalability, as well as redundancy for higher reliability.

The storage provided by utility storage can scale from a few dozen to a few hundred terabytes, or even in the petabyte range, with a transfer rate as fast as a few thousand MB/s and the ability to sustain beyond 100,000 I/O operations per second (IOPS).

SAN fabric technologies

Today, most shared file systems either are very slow or use a proprietary interconnect from a single vendor. In the latter case, the file system is limited to the products of that vendor, or more often to a subset of the vendor's products. For a deep computing facility, that is not sufficient. Recently, several viable interconnect fabrics have started to emerge that may allow the introduction of high-performance, heterogeneous storage. While this is a positive sign, it also means that in most cases a site will be dealing with multiple interconnect fabrics that are bridged together. Such environments will require that the fabric bridges operate efficiently and without introducing large latencies in bridged communications.

Fibre Channel technology has recently undergone an upgrade from 1-Gb/s to 2-Gb/s bandwidth, with 4-Gb/s and 10-Gb/s bandwidths on the near-term horizon.

These increases will allow substantially improved storage performance. Fibre Channel is also showing substantially improved interoperability between equipment from different vendors in multiple-vendor SANs.

Using Ethernet as a SAN fabric is now becoming possible because of the iSCSI standard [24]. The iSCSI protocol is a block storage transport protocol: it allows standard SCSI commands and data to be encapsulated in TCP/IP packets and transported over standard IP infrastructure, thus allowing SANs to be deployed on IP networks. This is very attractive, since it allows SAN connectivity at a lower cost than can be achieved with Fibre Channel, although also with lower performance. The iSCSI protocol will allow large numbers of inexpensive systems to be connected to the SAN and use the shared file system through commodity components.

The emerging Infiniband (IB) interconnect [25] shows promise for use in a SAN as a transport for storage traffic.

Infiniband offers performance (both bandwidth and latency) beyond that of either Ethernet or Fibre Channel, with even higher bandwidths planned. The current 4× Infiniband supports 10-Gb/s bandwidth, while 12× Infiniband with 30-Gb/s bandwidth is poised for release.

However, beyond demonstrating the ability to meet the fundamental expectations, advanced storage transfer protocols (e.g., SRP) and methodologies for Infiniband technology have to be developed and proven, as do fabric bridges between Infiniband, Fibre Channel, and Ethernet SANs.

Unified file system architectures

Several promising new unified file system architectures have been designed for high-performance cluster environments, typically with some kind of high-speed interconnect for the messaging traffic. Many of the new architectures perform storage transfers between client nodes and storage nodes over the high-speed interconnect.

Currently there are several major projects at supercomputing centers that are addressing the issues of unified file systems. One is the Global Unified Parallel File System (GUPFS) project at NERSC [26] (Figure 4) and another is the ASCI Pathforward Scalable Global Secure File System (SGSFS) project [27] at Lawrence Livermore National Laboratory.

Storage virtualization has been widely used as a way to provide higher capacity or better performance.

Virtualization can also be implemented at the file system level. File system virtualization allows multiple file systems to be aggregated into one single large virtual file system to deliver higher performance than a single NFS or NAS server could provide. A few examples of federated file systems include the IBM General Parallel File System [28], Lustre [29], the Hewlett-Packard DiFFS [30], the Maximum Throughput InfinARRAY [31], and the Ibrix SAN-based file system [32].

Figure 5 shows a potential architecture for the federated file systems using the high-speed interconnect for the storage traffic.

R&D challenges for unified file systems

The technologies needed for unified file systems at deep computing sites face a number of research and development challenges regarding scalability, performance, and interoperability.

One of the major storage issues is the degree of efficiency with which the storage can handle I/O requests from multiple clients to individual shared devices (each a logical unit number, or LUN). The scalability of single-device access has been a common problem with many storage systems, both in the number of initiators (clients) allowed on a single device and in the performance of shared access. On most storage systems, the maximum number of client initiators allowed on a single LUN has been less than 256; deep computing sites will require storage devices to support thousands of simultaneous accesses. The envisioned GUPFS implementation, for example, is a shared-disk file system with tens of thousands of client systems. That system must provide high performance of individual components and interoperability of components in a highly heterogeneous, multi-vendor, multiple-fabric environment. The ability of the file systems to operate in such a mixed environment is very important to the ultimate success of a useful global unified file system for deep computing.

In addition to simply supporting very large numbers of simultaneous accesses, storage devices must be able to efficiently recognize and manage access patterns. Current storage devices are unable to do so, and simultaneous sequential accesses by even a few tens of clients appear as random access patterns, resulting in inefficient cache management. New scalable cache management strategies will be required.

To adequately support thousands of clients, storage devices will have to be able to deliver tens to hundreds to thousands of GB/s of sustained bandwidth, employing multiple SAN interfaces. To operate effectively in the multiple-SAN/interconnect-fabric environment expected at deep computing sites, the storage devices will have to support multiple types of fabric interfaces simultaneously (e.g., Fibre Channel, Infiniband, and Ethernet interfaces) through building block modules.

In order to support tens of GB/s sustained I/O rates for very large numbers of clients, SAN fabric switches with very large numbers of ports must be fielded in order to minimize the ports needed for the fabric mesh. This in turn requires fabric switches with much higher (tens of GB/s) interswitch link capabilities to facilitate link aggregation to the high-performance storage devices.
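A rough sizing exercise shows why port count and interswitch bandwidth dominate this requirement. The sketch below counts the ports implied by a target sustained rate and client population; every input number is an illustrative assumption.

```python
# Rough fabric sizing: ports needed for a target sustained storage rate plus
# one port per client, and the aggregated interswitch links that rate implies.
# All input values are illustrative assumptions.
import math

def fabric_ports(sustained_gb_s, port_gb_s, clients):
    storage_ports = math.ceil(sustained_gb_s / port_gb_s)   # ports facing storage
    return clients + storage_ports                           # plus one port per client

def interswitch_links(sustained_gb_s, link_gb_s):
    return math.ceil(sustained_gb_s / link_gb_s)             # links to aggregate

# 50 GB/s sustained through 0.5-GB/s (4-Gb/s) ports, 2,000 clients,
# 10-GB/s interswitch links:
print(f"fabric ports: {fabric_ports(50, 0.5, 2000):,}")       # clients dominate
print(f"aggregated interswitch links: {interswitch_links(50, 10)}")
```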

Figure 4  The Global Unified Parallel File System (GUPFS) being developed at NERSC. Currently, each major system in a facility must have large amounts of local disk space in order to achieve the performance levels needed for deep science. In the next four to five years, emerging technology should enable high-performance parallel file systems that will allow a large amount of persistent storage to be shared while maintaining high-performance I/O transfer rates.

Although storage and fabric vendors are beginning to recognize and address these issues, it will take significant research and development to support the needed functionalities.

Managing large datasets in shared storage resources

In the previous sections we introduced archiving technologies and methods of sharing on-line disk space among computing servers. These systems provide the basic components needed to begin to address the application phases mentioned in the introductory sections of this paper, but they are not sufficient. What is still needed is software to manage the movement of data among the storage media and to the computing servers.

Consider an analysis application, such as the analysis of transaction data from a grocery store for the purpose of decision-making. If the amount of data is small, the analysis program loads all of the data into memory and performs the analysis in memory. If the amount of data does not fit into memory, it is brought piecewise from a disk cache. If the level of computation is high enough that it is not slowed down by access from disk, there is no loss in efficiency. Various optimization techniques can be used, such as methods to reorganize the data according to the access patterns, or using indexes to minimize the access time or disk I/O. Another technique is to speed up access from disks by providing file systems that use parallel striping methods, such as the General Parallel File System (GPFS) or the Parallel Virtual File System (PVFS) [33]. Such techniques have been the subject of many studies in the domain of data management research. Our purpose in this section is not to cover the above techniques, but rather to discuss the problems that arise if the amount of data is so large that it does not fit on disk, but must be stored in archival storage. For example, the amount of scientific data generated by simulations or collected from large-scale experiments has reached levels that cannot be stored in the researcher's workstation or even in a local computer center. Access to data is becoming the primary bottleneck in such data-intensive applications. This is a new class of problems, because access from archival tertiary storage is relatively slow, and other optimization methods must be applied. What can be done to manage data stored in tertiary storage systems more efficiently?

Suppose one has 1 TB of data in tertiary storage that one wishes to analyze, using a supercomputer that can analyze the data in parallel. Getting the data to the supercomputer requires staging it from tape to disk. Assuming that a parallel disk system with 100 disks is available and that each disk can read data at 10 MB/s, data can be streamed to the supercomputer at a rate of 1 GB/s. Thus, the data can be read by the supercomputer in 1,000 seconds, which is reasonable for analyzing this quantity of data. However, in order to achieve this rate from tertiary storage, 100 tape drives, each reading at 10 MB/s, would have to be dedicated to this task for 1,000 seconds. This is not a practical solution, especially considering datasets in the petabyte range. Also, with the tapes available today, each tape may hold 50–200 GB, so multiple tapes may have to be loaded, adding latency as a result of mounting and dismounting tapes. In this scenario the robotic tape system is the bottleneck. What techniques can be applied? We now discuss several.
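As a back-of-the-envelope check of these figures, the short Python sketch below recomputes the staging times from the assumptions stated above (a 1-TB dataset and 100 devices at 10 MB/s each); the numbers are illustrative only.

# Back-of-the-envelope staging-time estimate for the example above.
# Assumptions taken from the text: 1 TB dataset, 100 devices, 10 MB/s each.
DATASET_BYTES = 1_000_000_000_000        # 1 TB
DEVICES = 100                            # parallel disks (or tape drives)
RATE_PER_DEVICE = 10_000_000             # 10 MB/s

aggregate_rate = DEVICES * RATE_PER_DEVICE           # 1 GB/s
parallel_read_s = DATASET_BYTES / aggregate_rate     # 1,000 s
single_drive_s = DATASET_BYTES / RATE_PER_DEVICE     # 100,000 s (about 28 hours)

print(f"aggregate rate     : {aggregate_rate / 1e9:.1f} GB/s")
print(f"parallel read time : {parallel_read_s:.0f} s")
print(f"single tape drive  : {single_drive_s / 3600:.1f} h")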

1. Overlap data processing with staging data to disk.

In many applications, it is possible to start processing subsets of the data. For example, one can start analyzing transaction data piecewise, such as transactions for a day at a time. Assuming that data is organized in large files, one can devise software that will stream data to the analysis program and concurrently continue to stage files from tape. Streaming data to the analysis programs is also an effective way of sharing disk cache with many users concurrently. Even if each user needs most of the space of the disk cache, it is not necessary to move all of the data to disk for exclusive use and schedule users to operate sequentially. Instead, each user gets part of the disk cache, and the data is streamed to all of the applications concurrently. To support this streaming model, it is important that a user release files already processed as soon as possible, so that quota space can be reused. This requires systems that support the concept of pinning and releasing files. Pinning a file guarantees that the file will stay in the shared disk cache for a period of time. If the file is released within this period of time, the space can be reclaimed and reused. Otherwise, at the end of that time, the file will be released by the system.

Figure 5
A potential architecture for integrating a large global file system with federated components. (The diagram shows client processes in user space issuing I/O through a parallel I/O library and the virtual file system (VFS) switch to a federated file system client; a message-passing interface carries the traffic over a Gigabit Ethernet switch or high-speed interconnect to federated file system servers with buffer caches on the storage nodes, which reach the storage controller through FC, iSCSI, or InfiniBand switches and device drivers.)
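To make the pin-and-release model described above concrete, the following sketch shows one possible shape for such an interface; the class and method names (PinnedCache, pin, release) are assumptions made for this illustration rather than the API of any particular storage manager.

import time

class PinnedCache:
    """Minimal sketch of a shared disk cache with pin/release semantics.
    A pinned file is guaranteed to stay until it is released or its pin
    lifetime expires, after which the system may reclaim the space."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.pins = {}                      # path -> (size, expiry_time)

    def pin(self, path, size, lifetime_s):
        self._expire()                      # reclaim space from expired pins first
        if self.used + size > self.capacity:
            raise RuntimeError("no space available in shared cache")
        self.pins[path] = (size, time.time() + lifetime_s)
        self.used += size                   # space is now reserved for this file

    def release(self, path):
        size, _ = self.pins.pop(path)       # caller is done; space can be reused
        self.used -= size

    def _expire(self):
        now = time.time()
        for path, (size, expiry) in list(self.pins.items()):
            if expiry < now:                # lifetime elapsed: system releases the file
                del self.pins[path]
                self.used -= size

# Typical client usage: pin a staged file, process it, release it promptly.
# cache.pin("/cache/run42/file001.dat", 2_000_000_000, lifetime_s=3600)
# ... analyze the file ...
# cache.release("/cache/run42/file001.dat")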

2. Observe incremental results on the streaming data with increasing accuracy.

The streaming model discussed above also provides opportunities to end analysis or simulation tasks early, thus saving a tremendous amount of computing and storage resources. Often, the analysis produces statistical results, such as histograms. These can be obtained incrementally as the data streams through the analysis programs. This permits the client to observe the data statistics as they are generated with increasing accuracy.

When a sufficient level of accuracy is achieved, the client can stop the analysis. Given that proper sampling of the data is performed (e.g., random sampling of files from tapes), it is sufficient in most applications to read only 5–10% of the data to achieve the desired accuracy. This idea has been exploited in the database community for online query processing, but it is even more important for the analysis of massive amounts of data streaming out of robotic tape systems. A similar idea can be applied to the processing of large simulations. Some scientific simulations, such as astrophysics or climate modeling runs, may take many days of computation and produce several terabytes of data each. It is important to identify at an early stage whether a simulation is progressing correctly. Again, statistical analysis and visualization techniques can be used to observe the progress of the simulation.
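As one hedged illustration of this incremental, early-stopping style of analysis, the sketch below accumulates a running mean over randomly ordered files and stops once successive estimates stabilize; the stopping rule and the read_values helper are assumptions made for the example, not a prescription from the text.

import random

def streaming_estimate(files, read_values, tolerance=0.001, min_files=20):
    """Process randomly sampled files one at a time, maintaining a running
    mean, and stop early once successive estimates change by less than
    `tolerance` (relative).  `files` is a list of staged file names, and
    `read_values(f)` is a hypothetical helper that yields the numeric
    values contained in one file."""
    random.shuffle(files)                   # random sampling of files from tape
    count, total, previous = 0, 0.0, None
    for i, f in enumerate(files, start=1):
        for v in read_values(f):
            count += 1
            total += v
        if count == 0:
            continue                        # skip empty files
        estimate = total / count
        if previous is not None and i >= min_files:
            if abs(estimate - previous) <= tolerance * abs(previous):
                return estimate, i          # accuracy sufficient; stop staging files
        previous = estimate
    return (total / count if count else None), len(files)   # all files were needed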

3. Share “hot” data that is brought to disk.

Another opportunity for using disk caches effectively to reduce reading data from tape is to share files among users. This is the case when a community of users share disk caches to analyze the same very large dataset. Often, a user will access the same files repeatedly, or different users interested in the same aspect of the data will request files that another user previously staged to disk from tape.

Such files are referred to as hot files. In order to share hot files, one must have software to track file usage and "smart" predictive algorithms to determine which files to evict when space is needed.
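One simple way to realize such tracking is to weight each cached file by its recent reference count and evict the coldest files first; the heuristic below is only an illustrative sketch, not the policy of any specific system described here.

import time

class HotFileCache:
    """Sketch of usage tracking for a shared disk cache: every access is
    recorded, and when space is needed the files with the lowest recent
    usage (a simple hotness score) are evicted first."""

    def __init__(self):
        self.files = {}   # path -> {"size": bytes, "accesses": [timestamps]}

    def record_access(self, path, size):
        entry = self.files.setdefault(path, {"size": size, "accesses": []})
        entry["accesses"].append(time.time())

    def hotness(self, path, window_s=3600):
        cutoff = time.time() - window_s
        return sum(1 for t in self.files[path]["accesses"] if t >= cutoff)

    def evict_candidates(self, bytes_needed):
        """Return the coldest files whose combined size frees enough space."""
        coldest = sorted(self.files, key=self.hotness)
        chosen, freed = [], 0
        for path in coldest:
            if freed >= bytes_needed:
                break
            chosen.append(path)
            freed += self.files[path]["size"]
        return chosen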

4. Minimize tape mounts.

One of the most costly aspects of dealing with robotic tape systems is the time it takes to mount a tape, typically in the range of 20–40 seconds with current technology. Another latency problem is searching for a file on a tape, which can also take 20–40 seconds depending on the tape size and the search speed. Avoiding these delays yields a large performance gain. To achieve this benefit, a scheduling system can be used that has the flexibility to stage files in tape-optimized order across many clients. Such a scheduling system must also ensure that no one job is postponed unfairly. The scheduling system must have information about file locations on tapes (i.e., which tapes, and where on each tape). The HPSS system mentioned above is designed with some minimal optimizations for submitted file requests; however, it does not have a global view of the files needed for an entire job. If entire jobs are provided to a front-end system for many users, that system may be able to perform global optimization ahead of time by grouping file access requests in a tape-optimized order.
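The sketch below illustrates the kind of global reordering such a front-end system could perform: file requests from many jobs are grouped by tape and sorted by position on each tape, so that each tape is mounted once and read in a single forward pass. The request format and the file-location catalog are assumptions made for this illustration and are not features of HPSS.

from collections import defaultdict

def tape_optimized_order(requests, catalog):
    """Reorder file requests so that all files on the same tape are read
    together, in increasing position on the tape.

    requests : iterable of (job_id, filename)
    catalog  : dict filename -> (tape_id, position_on_tape)   # assumed metadata
    Returns a list of (tape_id, [(position, job_id, filename), ...])."""
    by_tape = defaultdict(list)
    for job_id, filename in requests:
        tape_id, position = catalog[filename]
        by_tape[tape_id].append((position, job_id, filename))

    schedule = []
    for tape_id in sorted(by_tape):                          # one mount per tape
        schedule.append((tape_id, sorted(by_tape[tape_id]))) # one forward pass per tape
    return schedule

# Example: requests from two jobs interleaved across two tapes end up grouped
# so that each tape is mounted only once.
# catalog = {"a": ("T1", 10), "b": ("T2", 5), "c": ("T1", 3)}
# tape_optimized_order([("job1", "a"), ("job2", "b"), ("job1", "c")], catalog)
# -> [("T1", [(3, "job1", "c"), (10, "job1", "a")]), ("T2", [(5, "job2", "b")])]

A production scheduler would additionally apply an aging or fairness rule so that jobs whose files happen to sit on unpopular tapes are not postponed indefinitely.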

5. Use indexing techniques for finding desired subsets of the data.

Another important technique for reducing the amount of data that is read from tape systems is having information on the content of the files, so that only the relevant files are accessed for a given analysis. This problem is so severe in some application domains, such as high-energy physics, that subsets of the data are pre-extracted by performing full scans of the very large datasets and selecting the subsets of interest. Many such subsets are generated, which only adds to the replication of the data and the use of more storage. Indexing techniques can eliminate the need to scan the data many times to generate desired subsets. For example, in high-energy physics, the objects stored in files are chunks of data, 1–10 MB each, called events, representing high-energy collisions of particles. It is possible to obtain a set of properties for each event, such as the energy of the event and the types of particles produced by that event. There are billions of such events stored in thousands of files. An index of the event properties can identify the files of the desired events for the analysis, and then only those files will be accessed.
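A minimal sketch of such a property index is shown below: event properties are recorded once, and a query over those properties returns only the files that need to be staged from tape. Production systems in this domain typically use compressed bitmap indexes; the dictionary-based version here is a simplification for illustration, and the property names are assumptions.

from collections import defaultdict

class EventPropertyIndex:
    """Toy index mapping event properties to the files that contain
    matching events, so an analysis stages only the relevant files."""

    def __init__(self):
        self.by_file = defaultdict(list)    # filename -> list of property dicts

    def add_event(self, filename, properties):
        # properties might be, e.g., {"energy_gev": 87.2, "num_muons": 2}
        self.by_file[filename].append(properties)

    def files_matching(self, predicate):
        """Return the set of files containing at least one event for which
        `predicate(properties)` is true; only these need to be read from tape."""
        return {f for f, events in self.by_file.items()
                if any(predicate(e) for e in events)}

# Example query: files containing events with energy above 100 GeV and
# at least two muons.
# index.files_matching(lambda e: e["energy_gev"] > 100 and e["num_muons"] >= 2)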

6. Automate garbage collection.

As mentioned above, one of the biggest problems with shared storage is waste caused by moving files into the storage spaces but never removing them. It is impossible to track which files are really valuable, and ineffective to force users to "clean up." Another approach is to manage shared spaces by letting space be used dynamically. The main concept is that a file is brought to the shared disk cache on a temporary basis with some
