
Chapter 3 System Architecture

3.4. User Interaction Units

3.4.1. Overview

The user interaction units are responsible for interaction with Smart Crawler users, such as the KMS admin. In other words, these units are the user interface (UI) of Smart Crawler: they collect all configurations and information from the KMS admin, pass this data to the network crawling units discussed earlier, show the results to the KMS admin after the crawling work is finished, and provide other supporting functions.

3.4.2. Web User Interface

Unlike common programs with a console UI, Smart Crawler has a web UI. With console programs, when several programs are running, the manager has to configure each one through its own UI, one by one, and the settings cannot be reused on other machines. We consider this inconvenient, so we propose the web UI. The web UI provides a uniform way of working and passes the configuration to the backend machine on which the crawling unit of Smart Crawler is running. The manager simply connects to the web UI with a browser, picks a target machine, and applies the settings to that machine. Settings entered in the web UI can also be reused by another network crawling unit by specifying a different target machine. The process is summarized in Figure 3-8.

Figure 3-8 the process of using web UI to execute jobs on C, D

3.4.3. Configuration

Although the web UI connects to the target machines automatically, so that managers need not do so themselves, one problem remains: the manager has to fill in everything again each time he wants the program to run. In other words, the web UI does not remember what the manager entered, so he has to enter it all again.

To solve this problem, we use a database as a bridge. The database stores every detail the KMS admin inputs. That is why the network crawling units mentioned earlier have a DatabaseObservable for querying user configurations. Because the database saves the user's configurations, the user, usually the KMS admin, types in the configurations once and can then run Smart Crawler many times without re-entering anything given before.

The network crawling units can also run with the project settings stored in the database without being asked by the user. This means the KMS admin can schedule a specified project to run without starting it by hand.
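The role of the database as a bridge can be illustrated with a minimal sketch. It assumes an SQLite database and a single table holding each project's settings as JSON; the table layout and the function names `open_store`, `save_project`, and `load_project` are hypothetical and are not taken from Smart Crawler itself.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) configuration store and create its table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS project ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " settings TEXT NOT NULL)"
    )
    return conn


def save_project(conn, settings):
    """Store the settings once and return the generated project id."""
    cur = conn.execute(
        "INSERT INTO project (settings) VALUES (?)",
        (json.dumps(settings),),
    )
    conn.commit()
    return cur.lastrowid


def load_project(conn, project_id):
    """Fetch the stored settings again; no re-entry by the user is needed."""
    row = conn.execute(
        "SELECT settings FROM project WHERE id = ?", (project_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown project id: {project_id}")
    return json.loads(row[0])
```

With such a store, a crawling unit or a scheduler would only need the project id to call `load_project` on every run, which is the behavior the text describes.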

3.4.4. Typical Scenario

The whole process then becomes: (1) the KMS admin inputs the settings through the web UI, and these settings are stored in the database under a project id; (2) the KMS admin chooses what to do and where to do it through the web UI, that is, he specifies a project id and a target machine; (3) the web UI passes the project id to the target program; (4) the program queries the detailed settings by the project id and runs what the KMS admin asked for; (5) after the run finishes, the KMS admin can use the project id specified in step 2 to query the result report through the web UI. The sequence of using the web UI is shown in Figure 3-9.
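The five steps can be sketched in a few lines of code. This is only an illustration under stated assumptions: the classes `Database`, `CrawlerProgram`, and `WebUI` are hypothetical stand-ins, the database is an in-memory dictionary, and the crawl itself is replaced by a placeholder report string.

```python
class Database:
    """In-memory stand-in for the shared database."""

    def __init__(self):
        self.settings = {}
        self.reports = {}
        self._next_id = 1

    def store_settings(self, settings):
        project_id = self._next_id
        self._next_id += 1
        self.settings[project_id] = dict(settings)
        return project_id


class CrawlerProgram:
    """Stand-in for a network crawling unit running on one machine."""

    def __init__(self, machine, database):
        self.machine = machine
        self.database = database

    def run(self, project_id):
        # Step 4: query the detailed settings by project id, then run.
        settings = self.database.settings[project_id]
        # Placeholder for the real crawl: only record a report line.
        self.database.reports[project_id] = (
            f"crawled {settings['base_url']} on {self.machine}"
        )


class WebUI:
    """Stand-in for the web UI that the KMS admin uses."""

    def __init__(self, database, programs):
        self.database = database
        self.programs = {p.machine: p for p in programs}

    def save_settings(self, settings):
        # Step 1: settings are stored under a new project id.
        return self.database.store_settings(settings)

    def execute(self, project_id, machine):
        # Steps 2 and 3: choose project and machine, pass only the id.
        self.programs[machine].run(project_id)

    def report(self, project_id):
        # Step 5: query the result report by project id.
        return self.database.reports.get(project_id)
```

Note that only the project id travels from the web UI to the target program; everything else is read from the database, which is why the same project can be sent to machine C or machine D without re-entering its settings.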

Figure 3-9 the sequence of using web UI to control network crawling units

3.4.5. Project Settings

The first step of the working process described above is that the KMS admin inputs the project settings. Before introducing what information must be entered, we explain what a project is. A project is a request by the KMS admin for Smart Crawler to collect information from one data source for one purpose. In brief, a project is a name for one particular set of data collected for the KMS admin. The settings of a project are called project settings.

Project settings include the detailed information of the specified data source, the basic preferences, the advanced preferences, and so on. The detailed information of the data source contains the nickname given by the KMS admin, the protocol type, the actual network hostname, the port number, and a descriptive comment. All of these are required except the comment. The basic preferences are used by every spider unit; in other words, they provide the basic configuration. They contain the username, the password, the base URL, the max depth, the type number, and a descriptive comment. The username and the password are only needed if Smart Crawler needs them to access the data source. The base URL and the max depth are essential: they tell Smart Crawler where the root of the document tree is and how many levels it should visit. The type field identifies a variant of the network protocol. For example, if the protocol HTTP is specified in the data source configuration, the type indicates whether the HTTP session is tracked by cookie or by URL rewriting.

The comment is always optional. We summarize these in Figure 3-10.

Figure 3-10 the sequence of using web UI to control Smart Crawler
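As an illustration of the fields listed above, the data source information and the basic preferences could be modeled as two records with simple validation. The class and attribute names are hypothetical, and two points are assumptions not stated in the text: the type number defaults to 0, and the username and password must be supplied together.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSource:
    nickname: str
    protocol: str
    hostname: str
    port: int
    comment: Optional[str] = None  # the only optional field

    def __post_init__(self):
        for name in ("nickname", "protocol", "hostname"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required")
        if not 0 < self.port < 65536:
            raise ValueError("port must be between 1 and 65535")


@dataclass
class BasicPreference:
    base_url: str                   # root of the document tree
    max_depth: int                  # number of levels to visit
    type_number: int = 0            # assumed default; protocol variant
    username: Optional[str] = None  # only if the source needs a login
    password: Optional[str] = None
    comment: Optional[str] = None

    def __post_init__(self):
        if not self.base_url:
            raise ValueError("base_url is required")
        if self.max_depth < 0:
            raise ValueError("max_depth must not be negative")
        # Assumption: the credentials are treated as a pair.
        if (self.username is None) != (self.password is None):
            raise ValueError("username and password must be given together")
```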

The third part of the user input is the advanced preferences, which include the URL allow rules, the form filling information, and the multi-steps configuration. Once the KMS admin inputs a pattern and a permission for each URL allow rule, Smart Crawler follows these rules to determine which URLs are allowed and which are denied. The KMS admin also provides the settings for forms that appear in the web pages, and Smart Crawler fills these forms in automatically. The multi-steps configuration is discussed later. Figure 3-11 shows these settings.

Figure 3-11 the sequence of using web UI to control Smart Crawler

For a URL allow rule, the URL pattern is a regular expression, and the allow/deny field is true for allow and false for deny. The optional comment can hold a description of the rule. For form filling, the target URL is the action attribute of the form tag, and the parameters are a string of the form:

name1=value1&name2=value2&….
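A short sketch shows how such rules and parameter strings might be evaluated. The function names are hypothetical, and three behaviors are assumptions rather than statements from the text: rules are checked in order with the first match winning, the pattern may match anywhere in the URL, and a URL that matches no rule is denied by default.

```python
import re
from urllib.parse import parse_qsl


def is_allowed(url, rules, default=False):
    """Evaluate (pattern, allow) rules; the first matching rule decides."""
    for pattern, allow in rules:
        if re.search(pattern, url):
            return allow
    return default  # assumed: deny when no rule matches


def parse_form_parameters(parameters):
    """Split a 'name1=value1&name2=value2' string into a dictionary."""
    return dict(parse_qsl(parameters, keep_blank_values=True))
```

For example, with the rules `[(r"/private/", False), (r"^http://forum\.example\.com/", True)]`, a URL under `/private/` is denied even though it also matches the second, more general allow rule.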

3.4.6. Multi-Steps Configuration

Another contribution of Smart Crawler is the Multi-Steps configuration. The basic preferences contain a field called the base URL. The base URL is the starting point that Smart Crawler visits, and everything under it is visited as well. This is the traditional method, used by most spiders today. If we treat the data on the data source as a tree, the base URL is the root node and all collected data form the sub tree rooted at the base URL node. Figure 3-12 illustrates this.

Figure 3-12 the sequence of using web UI to control Smart Crawler

But this is not always what we actually want. For example, suppose we want to collect the articles about a certain topic of the Java programming language in a private forum. We have to log in to the forum first. Next, we choose the discussion board about Java and search for the topic. Finally, the articles we want appear. If Smart Crawler only has the base URL, the base URL must be set to the login page, so everything linked from the login page would be downloaded, not only the articles on that topic. The collected data would be messy and complex for KMS. We therefore propose the Multi-Steps configuration.

The main idea of the Multi-Steps configuration is to simulate the behavior of a human user. The KMS admin inputs not only the base URL but also further URLs as steps. Take the example just discussed. The base URL is the login page, and we set its max depth to 0. This means only that page is downloaded; in other words, Smart Crawler only completes the login operation. Step 1 points to the search page of the discussion board about Java and performs the search. Its max depth is set to n if we want to download n levels. Smart Crawler then performs the search and downloads everything that appears in the result.

There is another case that the traditional way cannot handle: the traditional configuration cannot describe two or more independent sub trees. For example, suppose the root of a remote FTP server contains several directories, such as /intro, /people, /private, and /public. If we want to collect everything contained in the sub directories /intro and /public/doc1, we have to create two projects, because the base URL can only be set to one URL. In other words, the base URL stands for only one sub tree and cannot stand for several independent sub trees. With the Multi-Steps configuration, we can set the base URL to /intro and step 1 to /public/doc1, both with the maximum max depth. Smart Crawler then collects exactly what we want and nothing unexpected. This is illustrated in Figure 3-13.

Figure 3-13 the sequence of using web UI to control Smart Crawler
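The idea of a base URL followed by further steps, each with its own max depth, can be sketched on an in-memory tree. This is a simplified illustration, not Smart Crawler's implementation: the directory tree, the file names in it, and the functions `collect` and `run_steps` are hypothetical, and an unlimited max depth is represented by `None`.

```python
# Hypothetical directory tree: each path maps to its direct children.
FTP_TREE = {
    "/": ["/intro", "/people", "/private", "/public"],
    "/intro": ["/intro/readme.txt"],
    "/public": ["/public/doc1", "/public/doc2"],
    "/public/doc1": ["/public/doc1/a.txt"],
    "/public/doc2": ["/public/doc2/b.txt"],
}


def collect(tree, start, max_depth):
    """Return start and its descendants up to max_depth levels below it.

    max_depth 0 yields only the start node; None means no limit.
    """
    found = []
    stack = [(start, 0)]
    while stack:
        path, depth = stack.pop()
        found.append(path)
        if max_depth is None or depth < max_depth:
            for child in tree.get(path, []):
                stack.append((child, depth + 1))
    return found


def run_steps(tree, steps):
    """Visit the base URL and each later step in order, without duplicates."""
    collected = []
    for start, max_depth in steps:
        for path in collect(tree, start, max_depth):
            if path not in collected:
                collected.append(path)
    return collected
```

Calling `run_steps(FTP_TREE, [("/intro", None), ("/public/doc1", None)])` gathers the two independent sub trees /intro and /public/doc1 in one pass and leaves /people, /private, and /public/doc2 untouched. A step with max depth 0, such as a login page, contributes only its own node.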

We believe the Multi-Steps configuration provides a flexible way to describe the sequence of paths to crawl, although it is more complex than the traditional way. It lets Smart Crawler travel from the document root without downloading unexpected data, and it lets Smart Crawler collect multiple sub trees from the data source in one project. In other words, the Multi-Steps configuration makes configuration easier not only in depth but also in width. For these reasons, we believe it lets us describe what we expect from the data source more precisely and appropriately.

In the document: Automatic Data Extraction and Tracking in Knowledge Management (pages 40-47)