Test Database Generation - Research Method

3. Research Method

3.4 Test Database Generation

In order to evaluate the performance of the intelligent information integration system, we must define the test database. The workload consists of a test operation and a test database. The test database identifies what data must be loaded into the data sources, as well as the volume of the test data. Information integration system data sources are disparate and heterogeneous. Information comes from various sources (including structured, semi-structured and unstructured sources) and formats (such as database tables, XML files, PDF files, streaming media, internal documents, and Web pages). For this research, the data sources can be divided into three kinds: relational databases, object-oriented databases, and Web pages. For each data source, we must analyze the actual data and extract statistical data. Data analysis characterizes data in terms of the size of the database, the number of records, the length of records, the types of fields, and the value distributions.

• Determine data values: This research supports a number of data types, including long integers, double-precision floating-point numbers, decimal numbers, money, datetime, and fixed-length and variable-length character strings.

We must conduct extensive studies to characterize each data source with several distribution parameters. Frequency distributions are computed, and standard probability distributions are fitted to the data in order to generate the values of the test data. Data values are created from common distributions such as the exponential, normal, discrete, rotating, Zipfian², or uniform distribution.
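The value-generation step can be sketched as follows. This is only an illustration under stated assumptions, not the generator used in this research: the function names (`make_generator`, `zipf_weights`), the parameter names, and the sample columns (an amount and a city) are all hypothetical, and only some of the distributions listed above are covered.

```python
import random


def zipf_weights(n, s=1.0):
    """Unnormalised Zipfian weights for ranks 1..n: weight(k) = 1 / k**s."""
    return [1.0 / k ** s for k in range(1, n + 1)]


def make_generator(kind, rng, **params):
    """Return a zero-argument function that draws one value from `kind`.

    The parameter names (low, high, mean, stddev, values, weights, s)
    are assumptions made for this sketch.
    """
    if kind == "uniform":
        return lambda: rng.uniform(params["low"], params["high"])
    if kind == "normal":
        return lambda: rng.gauss(params["mean"], params["stddev"])
    if kind == "exponential":
        # expovariate takes the rate, i.e. 1 / mean
        return lambda: rng.expovariate(1.0 / params["mean"])
    if kind == "discrete":
        return lambda: rng.choices(params["values"], weights=params["weights"], k=1)[0]
    if kind == "zipfian":
        values = params["values"]
        weights = zipf_weights(len(values), params.get("s", 1.0))
        return lambda: rng.choices(values, weights=weights, k=1)[0]
    raise ValueError(f"unsupported distribution: {kind}")


rng = random.Random(42)  # fixed seed so the generated data is reproducible
draw_amount = make_generator("exponential", rng, mean=250.0)
draw_city = make_generator(
    "zipfian", rng, values=["Taipei", "Tainan", "Hualien", "Keelung"]
)
sample_rows = [(draw_amount(), draw_city()) for _ in range(1000)]
```

In practice the distribution parameters (the mean, the standard deviation, the Zipfian exponent) would come from the frequency distributions fitted to each real data source rather than being hard-coded.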

• Determine scaling factors: After determining the values of the test data, we must define how much data should be generated, i.e., the database scaling factor.

Generally speaking, the logical size of the test database used for the benchmark is at least equal to the size of the physical memory on the host(s). For this research, we refer to the AS³AP benchmark standard.
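As a rough illustration of the sizing rule above, the minimum number of records can be derived from the host's physical memory and the average record length. The function name and the example figures are assumptions made for this sketch; it does not reproduce the scaling rules of the AS³AP standard itself.

```python
def min_record_count(physical_memory_bytes, avg_record_bytes):
    """Smallest number of records whose total size is >= physical memory."""
    if physical_memory_bytes < 0 or avg_record_bytes <= 0:
        raise ValueError("memory must be >= 0 and record size must be > 0")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-physical_memory_bytes // avg_record_bytes)


# Example with assumed figures: 8 GiB of memory and 100-byte records.
records_needed = min_record_count(8 * 1024 ** 3, 100)
```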

• Open data source: We must determine the test data for the open data source on the Web, but this is problematic. There are more than 10 billion pages on the Web, including HTML files, text documents, PDF files, Microsoft Office documents, and other similar data files. We cannot possibly download every page on the Web, and even obtaining an adequately sized sample is difficult. Even the most comprehensive search engine currently indexes just a small fraction of the entire Web.

As such, it is important to carefully select the so-called “important” pages, so that the fraction of the Web that is visited becomes more meaningful. To select these important pages, we can use several metrics for prioritizing them. For any given web page, its importance can be defined using the following metrics (Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001):

• Interest-driven. The goal is to obtain pages of interest to a particular user or set of users. Important pages are those that match the user's interest. One way to define this notion is through what we call a driving query: for any given query, the importance of a page is defined by the “textual similarity” between the page and the driving query. Assuming that the query represents the user’s interest, this metric shows how relevant the page is. Another interest-driven approach is based on a hierarchy of topics. Interest is defined by a topic, and we attempt to guess the topics of the pages that will be visited by analyzing the link structure that leads to the candidate pages.

• Popularity-driven. Page importance depends on how popular a page is. For instance, one way to define popularity is to use a page’s backlink count. Intuitively, a page that is linked to by many pages is more important than one that is seldom referenced.

• Location-driven. The importance of a page is a function of its location, not its contents. For example, URLs ending with “.com” may be deemed more useful than URLs with other endings, or URLs containing the string “home” may be of more interest than other URLs. Another location metric that is sometimes used considers URLs with fewer slashes more useful than those with more slashes.
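The three kinds of importance metric can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the scoring used by Arasu et al.: textual similarity is taken to be the cosine similarity of raw term-frequency vectors, popularity is the plain backlink count, and the weights in `location_score` are arbitrary values chosen for the example. All function names are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def textual_similarity(page_text, driving_query):
    """Interest-driven: cosine similarity of term-frequency vectors (assumed)."""
    page = Counter(page_text.lower().split())
    query = Counter(driving_query.lower().split())
    dot = sum(page[t] * query[t] for t in page.keys() & query.keys())
    norm = math.sqrt(sum(v * v for v in page.values())) * math.sqrt(
        sum(v * v for v in query.values())
    )
    return dot / norm if norm else 0.0


def backlink_counts(links):
    """Popularity-driven: count distinct pages linking to each target.

    `links` is an iterable of (source_url, target_url) pairs.
    """
    counts = Counter()
    for source, target in set(links):  # ignore duplicate links
        if source != target:           # ignore self-links
            counts[target] += 1
    return counts


def location_score(url):
    """Location-driven: fewer slashes, a ".com" host, or "home" score higher.

    The bonus of 0.5 for each feature is an arbitrary assumption.
    """
    parsed = urlparse(url)
    score = 1.0 / (1 + parsed.path.count("/"))
    if parsed.hostname and parsed.hostname.endswith(".com"):
        score += 0.5
    if "home" in url.lower():
        score += 0.5
    return score
```

A crawler could combine these scores, for example as a weighted sum, to decide which candidate page to visit next; how to weight them is a design choice that the text above leaves open.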

(3) Control Model

The control model defines the environment setup variables to execute the experiment. A common set of parameters including the steady state, the test mode, the test duration, the test sequence and the number of repetitions should be specified as follows.

• Steady State: The benchmark test must be executed in a steady state, in order to return the sustained system performance.

• Test Mode: There are three test modes: cold mode, warm mode, and hot mode. In cold mode, there is no data in the cache, so the system cannot retrieve data from the cache directly. The response time in cold mode is therefore usually longer than in the other two modes. In warm mode, data is left in the cache from the prior query, so the test response time decreases. In hot mode, a query is executed in cold mode first and is then executed several times using the cached data. The average response time is computed.

• Test Duration: The test duration is the set of time intervals over which the benchmark is measured. Each interval must begin after the system has reached a steady state and must be long enough to produce reproducible throughput. Each interval must run without interruption for its entire period.

• Test Sequence: The test sequence is the order in which the queries are executed.

• Number of Repetitions: The number of repetitions is the number of times the execution is repeated.
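The control parameters above can be illustrated with a small timing harness. This is a sketch under stated assumptions, not the harness used in this research: `run_query` and `clear_cache` are hypothetical callbacks supplied by the system under test, and in hot mode the average is taken over the cached executions only (the text does not say whether the initial cold execution is included).

```python
import time
from statistics import mean


def timed(run_query, query):
    """Return the response time of one query execution in seconds."""
    start = time.perf_counter()
    run_query(query)
    return time.perf_counter() - start


def run_benchmark(queries, run_query, clear_cache, mode="hot", repetitions=3):
    """Execute `queries` in order (the test sequence) in the given test mode.

    Returns a dict mapping each query to its response time in seconds.
    """
    results = {}
    for query in queries:
        if mode == "cold":
            clear_cache()  # no data in the cache
            results[query] = timed(run_query, query)
        elif mode == "warm":
            # the cache keeps whatever the prior query left behind
            results[query] = timed(run_query, query)
        elif mode == "hot":
            clear_cache()
            timed(run_query, query)  # cold run that populates the cache
            results[query] = mean(
                timed(run_query, query) for _ in range(repetitions)
            )
        else:
            raise ValueError(f"unknown test mode: {mode}")
    return results
```

Reaching a steady state and choosing the test duration are left outside this sketch, since both depend on the system being measured.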

(4) Performance Metrics

Performance metrics can be divided into two types: speed-specific metrics and relevance-specific metrics. The former consist of response time and throughput. The latter consist of relative recall and precision.

• Response time: Response time refers to the time interval between when a request is made and when the response is received by the requester.

• Throughput: Throughput refers to the number of operations completed by the system per unit time.

• Recall and precision: Recall and precision are two important measures for evaluating information retrieval. However, it is very difficult, if not impossible, to apply these measures directly to the evaluation of Web information retrieval systems, owing to the unique nature of the Web. There is no proper method of calculating the absolute recall of search engines, because it is impossible to know the total number of relevant documents in such huge databases. The relative recall value is defined in (Clarke & Willet, 1997).
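The metrics above can be computed as in the following sketch. Throughput and precision follow their standard definitions. For relative recall, the code assumes a common pooled formulation: the relevant documents retrieved by one system divided by the relevant documents retrieved by all compared systems together. The text defers the exact definition to the cited reference, so this formula and all function names should be read as assumptions.

```python
def throughput(operations_completed, elapsed_seconds):
    """Operations completed per second."""
    return operations_completed / elapsed_seconds


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def relative_recall(system, results_by_system, relevant):
    """Relevant documents found by `system` over the pooled relevant documents.

    `results_by_system` maps each system name to the documents it retrieved.
    The pool is the set of relevant documents retrieved by any system
    (an assumed, pooled definition of relative recall).
    """
    relevant = set(relevant)
    pool = set()
    for docs in results_by_system.values():
        pool |= set(docs) & relevant
    if not pool:
        return 0.0
    return len(set(results_by_system[system]) & relevant) / len(pool)


# Example with made-up document identifiers and relevance judgments.
results = {"A": ["d1", "d2", "d3"], "B": ["d3", "d4"]}
judged_relevant = {"d1", "d3", "d4"}
recall_a = relative_recall("A", results, judged_relevant)
precision_b = precision(results["B"], judged_relevant)
```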
