
PFRF: An adaptive data replication algorithm based on star-topology data grids

Ming-Chang Lee a, Fang-Yie Leu b, Ying-ping Chen a

a Department of Computer Science, National Chiao Tung University, Taiwan

b Department of Computer Science, Tunghai University, Taiwan

Corresponding author. E-mail addresses: mingchang1109@gmail.com (M.-C. Lee), leufy@thu.edu.tw (F.-Y. Leu), ypchen@nclab.tw, ypchen@cs.nctu.edu.tw (Y.-p. Chen).

Article info

Article history:

Received 5 November 2010
Received in revised form 5 July 2011
Accepted 5 August 2011
Available online 6 November 2011

Keywords:

Data grid
Data replication
Data access patterns
File popularity
PFRF

Abstract

Recently, data replication algorithms have been widely employed in data grids to replicate frequently accessed data to appropriate sites. The purpose is to shorten file transmission distances and deliver files from nearby sites so as to improve data access performance and reduce bandwidth consumption. Some of these algorithms were designed under the assumption of unlimited storage. However, they might not be practical in real-world data grids since no current system has infinite storage. Others were implemented for limited-storage environments, but none of them considers data access patterns, which reflect changes in users' interests and are important parameters affecting file retrieval efficiency and bandwidth consumption. In this paper, we propose an adaptive data replication algorithm, called the Popular File Replicate First algorithm (PFRF for short), which is developed on a star-topology data grid with limited storage space based on aggregated information on previous file accesses. PFRF periodically calculates file access popularity to track the variation of users' access behaviors, and then replicates popular files to appropriate sites to adapt to the variation. We employ several types of file access behaviors, including Zipf-like, geometric, and uniform distributions, to evaluate PFRF. The simulation results show that PFRF can effectively improve average job turnaround time, bandwidth consumption for data delivery, and data availability as compared with the tested algorithms.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Generally, a data grid, a specific grid system that provides users with a huge amount of storage space, often maintains a high volume of distributed data to serve users. Many recent large-scale scientific systems [1–4] and commercial applications [5], e.g., the Biomedical Informatics Research Network (BIRN) [6], the Large Hadron Collider (LHC) [7], the DataGrid Project (EDG) [8], and physics data grids [9,10], have collected a huge amount of data and performed complex experiments and analyses on the data [11–13].

According to the Pareto principle (also known as the 80/20 rule) [14], a small portion of the files in a data grid accounts for most accesses and transfers. If a file has no replicas distributed over the data grid, accessing it is often inefficient since long-distance data transfer occupies a lot of bandwidth and causes long transmission delays [15]. Hence, how to decrease data access latency, lower bandwidth consumption for data transmission, and improve data availability have been key issues in data grid research [16]. Data replication is a general and simple approach to achieving these goals. It has been widely used in many areas, such as the Internet, peer-to-peer systems, and distributed databases [17–21]. A well-defined data replication method should be able to determine an appropriate time to replicate files, decide which files should be replicated, and store these replicas in appropriate locations [15,16,22–24].

On the other hand, the analysis of data access patterns has been a critical step in designing efficient dynamic data replication schemes [25–27]. Several distributions have been used to model data access patterns, defined as the distribution of access counts on the files of a system, and file popularity, defined as how often a file is accessed by users, i.e., how popular a file is [28,29]. Breslau et al. [28] claimed that a Zipf-like distribution accurately models the distribution of webpage accesses. Cameron et al. [29] showed that the distribution of file accesses in data grids follows a Zipf-like distribution. Ranganathan and Foster [22,30] claimed that the geometric distribution can properly model file access behaviors and the property of temporal/geographical locality.

Further, Ranganathan and Foster [31] derived file popularity by using both Zipf and geometric distributions on a multi-tier data grid with unlimited storage space. Tang et al. [23] also used Zipf-like and geometric distributions to simulate users' file access behaviors on a multi-tier data grid. Chang et al. [32,24] proposed two data replication strategies on a cluster-based data grid with limited storage. However, the strategy they proposed in [32] did not consider the data access pattern. Hence, it might lead to inefficient data access as the users' access pattern changes; the strategy proposed in [24] only replicates the file most frequently accessed in the last time period, consequently resulting in long file transmission delays for those files with similar but low weights.

In this study, we propose an adaptive data replication algorithm, called the Popular File Replicate First algorithm (PFRF for short), which is developed on a star-topology data grid with limited storage space. A star-topology data grid is a simplified tree-topology data grid with a central cluster that connects all other clusters. A link l between two arbitrary clusters goes through the central cluster, and l might comprise several routers and physical links. Directly evaluating the components of l is difficult since too many analytical items might be involved. Hence, this study treats l as a logical link to simplify the original topology as a whole [33,34]. The simplification process is described in Section 3.1. To adapt to the changes of users' interests in files, PFRF aggregates file access information and replicates popular files to suitable clusters/sites.

We simulate several cases in which file popularity follows a Zipf-like, geometric, or uniform distribution, under the assumption that user behaviors vary with the changes of users' interests. The simulation results show that PFRF provides users with a system that has higher data availability, lower data transmission delays, and less bandwidth consumption for data access.

The rest of the paper is organized as follows. Section 2 introduces the background and related work of this study. Section 3 describes the architecture of a star-topology data grid and the details of PFRF. Simulation results are presented and discussed in Section 4. Section 5 concludes this article and addresses our future research.

2. Background and related work

In this section, we describe the architectures of data grids and several existing replication strategies and algorithms.

2.1. Data grid architecture

Data grids can be classified into multi-tier data grids, first proposed by the MONARC project [35], and cluster data grids, initially introduced by Chang et al. [32]. In the multi-tier architecture, a leaf node represents a user or a computational node, and internal nodes are resource sites keeping sharable files. In this architecture, a file held by a site is also held by all its ancestor sites; therefore, the root site holds all files stored in the data grid. When an end user requires a file F which does not exist in his/her site, the user requests F from its immediate ancestor. If the ancestor does not have the file, it in turn requests F from its immediate ancestor. The process repeats until the user obtains the file from a node which holds the file. After that, the file is replicated to all the nodes on the requesting path, following the reverse direction of the requests, as sketched below. File access latency can thus be reduced in a multi-tier data grid, but at a higher storage cost since files are redundantly stored in multiple locations.
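The upward request propagation and reverse-path replication just described can be captured in a few lines. The following Python sketch is ours, not the paper's; the class and method names are illustrative.

```python
class TierNode:
    """A node in a multi-tier data grid; the root holds every file."""

    def __init__(self, parent=None):
        self.parent = parent   # immediate ancestor; None for the root
        self.files = set()     # names of files stored at this node

    def request(self, name):
        """Obtain file `name`, replicating it along the reverse request path."""
        if name not in self.files:
            # Ask the immediate ancestor; the recursion terminates at the
            # root, which by construction holds all files in the grid.
            self.parent.request(name)
            # Replicate on the way back down the requesting path.
            self.files.add(name)
        return name

# A three-tier example: the root holds F; a leaf request pulls F
# into every node on the requesting path.
root = TierNode(); root.files.add("F")
mid = TierNode(root)
leaf = TierNode(mid)
leaf.request("F")   # now "F" is in leaf.files and mid.files as well
```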

A cluster data grid consists of n clusters connected by the Internet [24]. Files are stored in these clusters. Each cluster has a header node (a header for short) responsible for managing site information and exchanging file access information with other cluster headers. A header periodically determines which file should be replicated by computing file weights. After that, the file with the highest weight is replicated to clusters that need the file. Sites in these clusters can then locally and quickly retrieve the file. Compared with a multi-tier data grid, a cluster data grid consumes less storage to hold files.

2.2. Existing data replication algorithms/strategies

Least Frequently Used (LFU) [36] and Most Frequently Used (MFU) [36] are two simple dynamic replication strategies widely used in many areas, such as disk and cache memory duplication. If a storage device has insufficient space to hold a new file, LFU (MFU) is invoked to choose the files that have been the least (most) frequently used as the victims to make room for the new one. In the experiments of this study, MFU and LFU are combined into the MFU/LFU strategy (M/LFU for short), in which MFU is used to choose the most frequently used files and LFU is employed to select victims once the destination cluster has insufficient storage space to save the replicated files.
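As a concrete illustration of how M/LFU combines the two policies, consider the following sketch; the data layout (per-file sizes and access counts kept in dicts) is our own assumption, not part of the original strategy description.

```python
def mfu_candidates(counts, k):
    """MFU side: the k most frequently used files, i.e. replication candidates."""
    return sorted(counts, key=counts.get, reverse=True)[:k]

def lfu_evict(storage, counts, needed, capacity):
    """LFU side: evict least frequently used files until `needed` bytes fit.

    storage: {filename: size_in_bytes} currently held by the cluster
    counts:  {filename: access_count} observed access frequencies
    """
    while storage and sum(storage.values()) + needed > capacity:
        victim = min(storage, key=lambda f: counts.get(f, 0))  # LFU victim
        del storage[victim]
```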

Ranganathan and Foster [22] presented six replication/caching strategies for a multi-tier data grid: No Replication or Caching, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replication, and Fast Spread, together with three types of locality: temporal locality, geographical locality, and spatial locality. Their experimental results showed that Fast Spread and Cascading Replication outperform the other four strategies, with shorter file access latencies. They also found that Fast Spread is better when the data access pattern is random, and Cascading Replication when it exhibits geographical locality.

However, the six strategies cannot avoid the disadvantage of a multi-tier data grid, i.e., a file may be redundantly stored across multiple tiers. In fact, storage space utilization and access latency are a trade-off [32]. Ranganathan and Foster [31] also proposed a suite of job scheduling and data replication algorithms for a multi-tier data grid and evaluated the performance of different combinations of the replication and scheduling strategies. One of the data replication algorithms, called DataRandom (DR for short), replicates a file when the corresponding access frequency exceeds a pre-defined threshold. Although DR is designed for an unlimited storage environment, it can also run in a limited storage environment. DR is therefore included in the experiments of this study.

Tang et al. [23] introduced the Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU) algorithms to reduce the average data access response time in a multi-tier data grid. The basic idea of the two algorithms is to replicate a file to sites close to its requesting clients when the file's access rate is higher than a pre-defined threshold. SBU considers the file access history of each individual site, whereas ABU aggregates the file access history of the whole system: a node sends aggregated historical access records to its upper tier, and the upper tiers do the same until these records reach the root, as sketched below. Due to this aggregation capability, ABU achieves shorter job response times and less bandwidth consumption than SBU.
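ABU's bottom-up aggregation amounts to a post-order merge of per-node access records. A minimal sketch, assuming a simple node structure of our own devising:

```python
class Node:
    def __init__(self, local_records, children=()):
        self.local_records = local_records  # {filename: access_count}
        self.children = list(children)

def aggregate_up(node):
    """ABU-style aggregation: merge a subtree's access records and pass
    them toward the root; calling this on the root yields the grid-wide
    aggregated history."""
    merged = dict(node.local_records)
    for child in node.children:
        for fname, cnt in aggregate_up(child).items():
            merged[fname] = merged.get(fname, 0) + cnt
    return merged

# Two leaves under one root: the root sees the aggregated counts.
root = Node({}, [Node({"F": 3}), Node({"F": 2, "G": 1})])
assert aggregate_up(root) == {"F": 5, "G": 1}
```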

Khanli et al. [37] proposed an algorithm called Predictive Hierarchical Fast Spread (PHFS), an extended version of Fast Spread [22], for multi-tier data grids. PHFS utilizes spatial locality [22,38] to predict the data files required in the future and pre-replicates these files to suitable sites to improve file access performance. Kunszt et al. [39] presented a replica management grid middleware to reduce file access/transfer time. Their experimental results showed that this middleware significantly reduces wide-area transfer times. However, it was developed for multi-tier data grids with unlimited storage space.

Chang et al. [24,32] presented two dynamic replication strategies, Latest Access Largest Weight (LALW) [24] and Hierarchical Replication Strategy (HRS) [32], on cluster-based data grids. LALW utilizes the half-life concept to evaluate file weights. A file with a higher access frequency has a larger weight. Their experimental results show that LALW outperforms the LFU and no-replication strategies [22] in network utilization and efficiency.

However, LALW only replicates the most popular file in each time interval. Hence, the transmission delays of those files with similar but lower weights are still long. HRS, an extended version of BHR (Bandwidth Hierarchy based Replication) [40], aims to reduce expensive cluster-to-cluster replica transmission. Whenever a file is required but cannot be retrieved from the local cluster, HRS replicates the file to a local site from a remote cluster. However, HRS does not consider data access patterns, and thus it might not be able to adapt to changes of user access behaviors and provide efficient data accesses.

Fig. 1. The architectures of a tree-topology data grid and a star-topology data grid: (a) a tree-topology data grid; (b) a star-topology data grid.

Fig. 2. (a) A physical link between site A and site B. (b) A logical link between site A and site B.

3. System framework

The proposed data grid, as shown in Fig. 1, consists of a global replica controller (GRC) and several clusters connected to the GRC through the Internet. As illustrated in Fig. 1(a), each connection comprises several routers and links, forming a tree-topology data grid rooted at the GRC. In other words, a file transmitted between two clusters will go through the routers and links on the two connections being considered. For example, when a site X in cluster i issues a file access request to the GRC for a file F stored in site Y, a member of cluster e, the GRC requests Y to deliver F to X. F then goes through the routers and links between clusters i and e. Analyzing the performance of such a file transmission is not easy since a tree topology involves too many network components and environmental parameters. To simplify the evaluation of file transmission, in this study we reduce the tree-topology data grid to a star topology (see Fig. 1(b)).

3.1. Tree-to-star reduction

As shown in Fig. 2(a), site A connects to site B through link l_1, router r_1, link l_2, router r_2, and link l_3. The corresponding physical path is denoted by l_{A_l1_r1_l2_r2_l3_B}. The bandwidth of the path is

|P| / (T_{l_1} + W_{r_1} + T_{l_2} + W_{r_2} + T_{l_3}),

in which |P| is the size of a packet P delivered through the path, T_{l_i} = |P| / B_{l_i} is the transmission delay of link l_i, and W_{r_j} = 1 / (µ_{r_j} − λ_{r_j}) is the service (queueing) delay of router r_j under the assumption that r_j is an M/M/1 queue, where B_{l_i} is the bandwidth of l_i, i = 1, 2, 3, and λ_{r_j} and µ_{r_j} are respectively the arrival rate and departure rate of r_j, j = 1, 2. l_{A_B} shown in Fig. 2(b) is the logical link of l_{A_l1_r1_l2_r2_l3_B}. Its bandwidth is |P| / T_{A_B}, where T_{A_B} is the transmission delay of P from A to B. Clearly, T_{A_B} = T_{l_1} + W_{r_1} + T_{l_2} + W_{r_2} + T_{l_3}. Therefore, l_{A_l1_r1_l2_r2_l3_B} can be reduced to the logical link l_{A_B}, and the tree-topology data grid shown in Fig. 1(a) can be reduced to the star-topology data grid shown in Fig. 1(b). The resulting star topology has transmission performance equivalent to that of the tree topology.
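Under the M/M/1 assumption above, reducing the physical path to a logical link amounts to summing the link transmission delays and router queueing delays. The sketch below follows the formulas directly; all numeric values are made up for illustration.

```python
def logical_delay(pkt_bits, link_bw, arrivals, departures):
    """T_A_B for a path of links and M/M/1 routers.

    pkt_bits:   packet size |P| in bits
    link_bw:    link bandwidths [B_l1, B_l2, B_l3] in bits/s
    arrivals:   router arrival rates [lambda_r1, lambda_r2]
    departures: router departure rates [mu_r1, mu_r2]
    """
    t_links = sum(pkt_bits / b for b in link_bw)      # T_li = |P| / B_li
    w_routers = sum(1.0 / (mu - lam)                  # W_rj = 1 / (mu_rj - lambda_rj)
                    for lam, mu in zip(arrivals, departures))
    return t_links + w_routers

# Example: three 10 Mb/s links and two routers (made-up rates).
T_AB = logical_delay(12_000, [1e7, 1e7, 1e7], [800.0, 900.0], [1000.0, 1200.0])
logical_bandwidth = 12_000 / T_AB   # logical-link bandwidth |P| / T_A_B
```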

3.2. GRC and LRC

In the proposed architecture, each cluster comprises sites and a local replica controller (LRC). They are connected by a LAN or LANs.

The LRC maintains a local replica table (LRT), which includes filename, file location, access count, file weight, and master-file fields, to record file access information. A master file is an original file that cannot be deleted from the data grid. The file weight is the popularity of the file; its calculation is described later. The file access count shows how frequently the file is accessed by sites within the cluster. All master files are distributed to sites of different clusters.

The GRC, a central server located in the Internet, is responsible for aggregating file access records for all clusters and determining which files should be replicated to which clusters. The GRC maintains a global replica table (GRT) to collect the information recorded in LRTs. When the GRC decides to replicate a file to a cluster, it records the location of the new replica in the GRT so that, some time later, when an LRC requests the location of the file, it can answer the LRC accordingly. Similarly, the cluster holding this new replica records the related information in its LRT.
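The replica tables can be pictured as simple records. The field names below follow the LRT description above; the Python types and the dict standing in for the GRT are our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class LRTEntry:
    """One row of a cluster's local replica table (LRT)."""
    filename: str
    location: str        # site holding the file within the cluster
    access_count: int    # accesses issued by sites of this cluster
    file_weight: float   # popularity weight (see Section 3.4)
    master_file: bool    # master files can never be deleted

# The GRC's global replica table (GRT) aggregates LRT rows per cluster;
# here a dict keyed by (cluster_id, filename) stands in for it.
grt: dict[tuple[str, str], LRTEntry] = {}
grt[("cluster-1", "F")] = LRTEntry("F", "site-X", 5, 0.5, master_file=True)
```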

3.3. Zipf-like distribution and geometric distribution

To achieve better file access performance, we need to keep track of users' file access behaviors so as to predict which files will be accessed frequently in the near future. This prediction is a main task of a data replication algorithm/strategy based on the assumption of temporal locality [22], in which a popular file will be accessed more frequently than unpopular ones [23].

Breslau et al. [28] showed that webpage requests follow a Zipf-like distribution [29,41] derived from Zipf's law [42]. In the Zipf-like distribution, the access probability of the i-th most popular file, denoted by P(f_i), is

P(f_i) = 1 / i^α (1)

where i = 1, 2, . . . , n and α is a factor determining the file access distribution, 0 ≤ α < 1.

Ranganathan and Foster [30,31] adopted the geometric distribution to simulate file popularity, in which the access probability of the i-th most popular file, denoted by P(i), is

P(i) = (1 − p)^(i−1) · p (2)

where i = 1, 2, . . . , n and 0 < p < 1. A larger p indicates that a smaller portion of the files is frequently accessed. As stated above, we assume that our users' access behaviors follow either Zipf-like or geometric distributions with different parameters.
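Both access models are easy to sample. Note that Eq. (1) as written needs a normalizing constant to form a probability distribution over n files; adding it is an implementation detail we assume here, as is the rest of this sketch.

```python
import random

def zipf_like(n, alpha):
    """P(f_i) proportional to 1/i^alpha for the i-th most popular of n files."""
    raw = [1.0 / i ** alpha for i in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]          # normalized probabilities

def geometric(n, p):
    """P(i) = (1 - p)^(i - 1) * p, truncated to the n most popular files."""
    return [(1 - p) ** (i - 1) * p for i in range(1, n + 1)]

# Simulate one file access under each model (file indices start at 0 here).
accessed_zipf = random.choices(range(1000), weights=zipf_like(1000, 0.8))[0]
accessed_geo = random.choices(range(1000), weights=geometric(1000, 0.05))[0]
```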

3.4. Popular file replicate first (PFRF) algorithm

Fig. 3. PFRF data replication algorithm.

The PFRF algorithm, as illustrated in Fig. 3, is performed by the GRC at the end of each round, where a round is a fixed time period T_d in which y jobs, y ≥ 0, are submitted by users from each cluster.

A job might require several files as its input data. The algorithm comprises four phases: file access aggregate phase, file popularity calculation phase, file selection phase, and file replication phase.

1. File access aggregate phase: Between lines 2 and 5 of the algorithm, PFRF aggregates the access count of each file f_i stored in cluster c at round r, denoted by A_c^r(f_i), sorts all the files by A_c^r(f_i) in descending order, and stores the sorted result in a set S. After that, PFRF calculates the total number of files having been accessed by all sites in cluster c at round r, denoted by TNF_c^r, based on the information stored in LRT_c. Note that 1 ≤ i ≤ N_k and 1 ≤ c ≤ N_c, where N_k is the number of files in cluster c in round r, N_c is the number of clusters in the data grid, and r = 1, 2, 3, . . ..

2. File popularity calculation phase: In line 6, PFRF calculates a popularity weight for file f_i, denoted by PW_c^r(f_i):

PW_c^r(f_i) = PW_c^{r−1}(f_i) + A_c^r(f_i) · a, if A_c^r(f_i) > 0;
PW_c^r(f_i) = PW_c^{r−1}(f_i) − b, otherwise; r ≥ 1, c ≥ 1, i ≥ 1, (3)

where a and b are constants and a < b. The reason why a < b is described later. If A_c^r(f_i) > 0, i.e., f_i has been accessed by users in round r, PFRF increases PW_c^{r−1}(f_i) by A_c^r(f_i) · a. Otherwise, it decreases PW_c^{r−1}(f_i) by b. Basically, a higher PW_c^r(f_i) implies that f_i is more popular. We assume that in round 0 all files follow the binomial distribution, i.e., PW_c^0(f_i) = 0.5, which means that the initial access probability of f_i is 0.5. Note that the minimum value of each PW_c^{r−1}(f_i) is 0. From previous access records of f_i, PFRF derives the variation of the popularity of f_i and predicts the popularity of f_i for the next round, where 1 ≤ i ≤ N_k. For instance, if f_3 has been accessed 5 times by cluster 2 in round 1, PW_2^1(f_3) = 0.5 + 5 · a. After the derivation and prediction, PFRF calculates the average popularity of f_i over all clusters, denoted by PW_avg^r(f_i):

PW_avg^r(f_i) = ( Σ_{k=1}^{N_q} PW_k^r(f_i) ) / N_q (4)

where N_q is the total number of clusters holding f_i in the data grid.
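Eqs. (3) and (4) translate directly into an update rule. The container layout below is our own sketch, and the constants a and b are illustrative; weights start at 0.5 and are floored at 0, as the text specifies.

```python
def update_weights(pw_prev, accesses, a, b):
    """Eq. (3): one cluster's popularity weights for round r.

    pw_prev:  {filename: PW_c^{r-1}(f_i)}, all 0.5 in round 0
    accesses: {filename: A_c^r(f_i)} access counts observed in round r
    a, b:     reward/penalty constants with a < b
    """
    pw = {}
    for fname, prev in pw_prev.items():
        cnt = accesses.get(fname, 0)
        pw[fname] = prev + cnt * a if cnt > 0 else max(prev - b, 0.0)
    return pw

def average_weight(cluster_pws, fname):
    """Eq. (4): mean weight of `fname` over the N_q clusters holding it."""
    held = [pw[fname] for pw in cluster_pws if fname in pw]
    return sum(held) / len(held)

# The paper's example: f3 accessed 5 times by cluster 2 in round 1,
# with illustrative constants a = 0.1, b = 0.2 (a < b).
pw = update_weights({"f3": 0.5}, {"f3": 5}, a=0.1, b=0.2)
# pw["f3"] == 0.5 + 5 * 0.1
```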

3. File selection phase: Between lines 7 and 10, PFRF sorts the set S by average popularity weight in decreasing order, calculates N_f, the number of files that might be replicated, and
