1.1. Preface
With the arrival of the knowledge-based economy, people in both academia and industry have come to recognize the power of knowledge. More and more organizations use a knowledge management system (KMS) to automatically derive value from their own experience and data. Because of the explosive growth of the Internet, the Internet itself can be regarded as a very large digital library, and a KMS can also be used to derive value from it. The knowledge produced in this way increases competitiveness and brings greater benefits, and the number of success stories keeps growing.
1.2. Motivation
The use of KMSs is becoming increasingly popular. Given input data, a KMS can tell the user not only what is new and what is important, but can also integrate data received over time according to rules specified by the user. The process by which a KMS derives value from input data is usually automatic. The KMS administrator (KMS admin) only provides the input data and the rules, and spends time training the KMS to organize the data correctly and as expected.
But how does the user collect the input data for the KMS, especially when a large amount of data is expected?
A better way is for the KMS admin to use a program instead of collecting the data by hand. Such programs are called spiders or crawlers. They crawl the Internet and download data automatically according to specified conditions. The data can then be fed into the KMS, which derives knowledge from it much faster than would be possible if humans crawled and organized the data.
A spider or crawler usually supports only some network protocols, not all of them. The KMS admin therefore has to use several spiders, each for its own protocols. To collect data from various data sources, the admin has to choose the corresponding spiders, be familiar with each one, and manage every crawler in use. We believe it would be better to integrate all of these crawling programs into one.
Another problem is that spiders only download data from the Internet, so the admin cannot know which documents were downloaded unchanged, which were downloaded with changes, and which are new since the last download. This increases the load on both the local and remote networks and servers.
A further problem is how spiders are used. Most spiders provide preferences for download rules, for example a network bandwidth limit, the allowed data formats, the permitted document size, and even a pattern for the uniform resource locator (URL), but these settings rarely describe precisely which data is needed from a data source. The KMS admin just gives a start URL, and the spider crawls every document it discovers from that URL. As a result, the spider also downloads content that is not wanted, e.g. advertisements.
Using spiders for collecting or tracking data also provides another benefit: some preprocessing can be done right after the data is automatically downloaded from the data sources, such as converting the data format or applying a more precise, programmable content filter. This paper does not focus on that topic, however; we leave it as future work.
To solve the problems mentioned above, we propose a spider architecture that integrates multiple protocols, together with a detailed configuration scheme called multi-steps. We also implement this architecture in a program called Smart Crawler.
1.3. Research Objectives
There are three main objectives: a multiple-protocol crawling architecture, tracking of crawled data or information, and a multi-steps user interface for configuration.
Multiple protocol crawling architecture
With the growth of the Internet, more and more network protocols are in use around the world, but each spider supports only some of them. In the worst case, the KMS admin has to use one spider per protocol. In addition, the admin has to be familiar with every crawler program in use and manage all of them, which distracts from managing the KMS itself. Integrating all spiders into one is more convenient and easier for the KMS admin to manage.
To implement Smart Crawler, we choose the File Transfer Protocol (FTP) and the Hypertext Transfer Protocol (HTTP) as the target protocols to integrate. Both protocols are simple and widely used. The former is used to transfer files between computers. The latter is used by web sites and is applied almost everywhere: to publish news, weather forecasts, and personal diaries; to provide platforms for communities, e.g. discussion boards, forums, and Netnews; and even to support e-commerce, such as shopping, stock trading, and other business activities. How to integrate spiders for these two very different network protocols is a significant problem.
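One way to picture such an integration is a single crawler front end that selects a protocol handler from the URL scheme. The following is only a minimal sketch under that assumption; it is not the Smart Crawler architecture, and the names Crawler, HttpFetcher, and FtpFetcher are hypothetical. The handlers are placeholders that return a string instead of touching the network.

```python
from urllib.parse import urlsplit


class HttpFetcher:
    """Hypothetical placeholder; a real handler would issue an HTTP request."""

    def fetch(self, url):
        return f"HTTP fetch of {url}"


class FtpFetcher:
    """Hypothetical placeholder; a real handler would open an FTP session."""

    def fetch(self, url):
        return f"FTP fetch of {url}"


class Crawler:
    """Single entry point that dispatches each URL to a per-protocol handler."""

    def __init__(self):
        self._handlers = {}

    def register(self, scheme, handler):
        # Schemes are case-insensitive, so store them in lower case.
        self._handlers[scheme.lower()] = handler

    def fetch(self, url):
        scheme = urlsplit(url).scheme.lower()
        handler = self._handlers.get(scheme)
        if handler is None:
            raise ValueError(f"no handler registered for scheme {scheme!r}")
        return handler.fetch(url)


crawler = Crawler()
crawler.register("http", HttpFetcher())
crawler.register("ftp", FtpFetcher())
```

With this layout, supporting another protocol means registering one more handler, while the KMS admin still configures and runs a single program.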
Tracking on crawled data or information
More and more individuals and companies use a KMS to derive value for the future. Collecting data with spiders or crawlers is automatic and much faster than crawling by hand, but the program always downloads all of the data and information on the data sources. If the KMS admin does not remove repeated data from the download, the KMS must process it again. This costs time, whether the KMS compares each document with the data it has already processed or derives the value of the document again. If the crawler or spider can identify identical, unchanged data, then the input data is guaranteed to be new or updated since the previous download, and the KMS does not waste computing time and resources. In addition, the KMS admin can analyze the refresh/update rate of each data source from the spider's reports and adjust the crawling frequency to suit that source. This reduces the load on network bandwidth and servers on both the local and remote sides.
We implement not only crawling but also tracking in Smart Crawler. The tracking mechanism lets the program determine how each file has changed since the last crawl and tell the KMS admin which files changed and which did not. The KMS admin or the KMS can then fetch data while skipping repeated and unchanged files. Tracking makes the knowledge management process more efficient and reduces the load on the KMS.
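The idea of tracking can be illustrated by comparing a digest of each document with the digest recorded in the previous crawl. This is a minimal sketch under our own assumptions, not the mechanism implemented in Smart Crawler: the function names digest and classify are hypothetical, and SHA-256 is just one reasonable choice of content fingerprint.

```python
import hashlib


def digest(content: bytes) -> str:
    """Fingerprint of a document's content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).hexdigest()


def classify(previous: dict, current: dict) -> dict:
    """Compare this crawl with the last one.

    previous maps URL -> digest stored after the last crawl.
    current maps URL -> raw content downloaded in this crawl.
    """
    report = {"new": [], "changed": [], "unchanged": []}
    for url, content in current.items():
        if url not in previous:
            report["new"].append(url)
        elif previous[url] != digest(content):
            report["changed"].append(url)
        else:
            report["unchanged"].append(url)
    return report


# Hypothetical example data: "a" is identical, "b" was edited, "c" is new.
previous = {"a": digest(b"1"), "b": digest(b"2")}
current = {"a": b"1", "b": b"3", "c": b"4"}
report = classify(previous, current)
```

Only the URLs listed under "new" and "changed" would need to be passed on to the KMS, and the share of changed documents per crawl gives a rough estimate of a source's update rate.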
Tracking for HTTP session with URL rewriting
The precondition for comparing a document across crawls is that the document can always be accessed by its own URL. Under this condition, we can identify corresponding documents in different downloads and run the tracking process. However, some web sites maintain the HTTP session by URL rewriting. Their documents are accessed through URLs that embed a session ID, and this ID changes from session to session. As a result, the URLs of these documents differ every time the spider crawls them, so we cannot pair the documents up and track them.
In Smart Crawler, we propose an algorithm that solves this problem by finding the corresponding pairs of documents downloaded in different crawls.
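To make the problem concrete, the sketch below shows one simple and commonly used approach: removing known session parameters from a URL so that the same document yields the same key in every crawl. This is not the algorithm proposed in this work. The function name canonical_url and the list of parameter names are assumptions chosen for illustration, and a real site may use other names.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; real sites may use different ones.
SESSION_PARAMS = ("jsessionid", "phpsessid", "sessionid", "sid")

# Matches path parameters such as ";jsessionid=ABC123".
_PATH_SESSION = re.compile(
    r";(?:%s)=[^/;?#]*" % "|".join(SESSION_PARAMS), re.IGNORECASE
)


def canonical_url(url: str) -> str:
    """Return the URL with session identifiers removed from path and query."""
    parts = urlsplit(url)
    path = _PATH_SESSION.sub("", parts.path)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(query), parts.fragment)
    )


# The same document as seen in two different sessions (hypothetical URLs).
first = "http://example.org/forum/view.jsp;jsessionid=ABC123?topic=42&sid=xyz"
second = "http://example.org/forum/view.jsp;jsessionid=ZZZ999?topic=42&sid=abc"
same_document = canonical_url(first) == canonical_url(second)
```

This approach only works when the session parameter names are known in advance, which is one reason a more general pairing algorithm is needed.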
Multi-Steps user interface for configuration
In existing spiders and crawlers, the configuration of the target documents is too simple and places too few restrictions on what is downloaded. The spider is usually given a start URL that describes where it starts to crawl. Although the spider may provide other settings, for example the format, size, and age (time) of documents and a URL pattern, it may still download documents the KMS admin does not want. If we consider the structure of the documents in a data source as a tree, the start URL is just one node in that tree. The spider downloads everything below that node and then applies the other rules.
For example, suppose we would like to download articles on a particular topic from a discussion board that requires membership. We have to log in first and enter a keyword in the search field; the resulting articles are what we really want. With an existing spider, the only option is to give the login URL as the start URL, set up the search form, and then crawl everything. Documents that belong to the login page and the search pages are not needed, but the spider still downloads them.
We therefore propose the multi-steps configuration to solve this problem. The multi-steps configuration is more complex than the conventional one, but it gives the spider more detail and flexibility when crawling.
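The discussion board example can be written down as an ordered list of steps in which only the last step is marked for download. The sketch below is purely illustrative and does not reproduce the Smart Crawler configuration format: the Step class, its fields, the URLs, and the form values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a hypothetical multi-steps crawl configuration."""

    name: str
    url: str
    form: dict = field(default_factory=dict)  # form fields to submit, if any
    download: bool = False  # keep the documents reached in this step?


# Log in, search for a keyword, and keep only the resulting articles.
steps = [
    Step("login", "http://board.example.org/login",
         form={"user": "alice", "password": "secret"}),
    Step("search", "http://board.example.org/search",
         form={"keyword": "crawler"}),
    Step("results", "http://board.example.org/results", download=True),
]


def download_targets(steps):
    """Return the URLs of the steps whose documents should be stored."""
    return [step.url for step in steps if step.download]


targets = download_targets(steps)
```

In this layout the login and search pages are visited only to reach the results, so they are never stored or passed on to the KMS.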
1.4. Organization
In Chapter 2, we introduce the technologies behind KMSs and spiders or crawlers, and discuss related research. In Chapter 3, we describe the architecture of our program and its components. In Chapter 4, we discuss the design issues of each topic and how the problems are solved. In Chapter 5, we present the implementation of our program. In Chapter 6, we evaluate our solutions. In Chapter 7, we give our conclusions and discuss future work.