Chapter 2 Background
2.4. Related work
In this section we introduce some existing products and tools, describing their functions and features, and then compare them with the objectives we propose.
Heritrix
Heritrix [13] is a pure Java 5.0 crawler that has been tested on Linux. It can retrieve HTTP/HTTPS documents recursively and mirror thousands of independent websites and resources in a continuous collection, subject to configurable download limits. It can fill in forms automatically, and it respects the robots exclusion protocol. However, it needs an experienced operator to configure crawls so that they fit within the machine's resources; otherwise Heritrix will exhaust them while it is running. Besides, it provides only a command-line interface (CLI).
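Respecting the robots exclusion protocol means checking a site's robots.txt rules before fetching a URL. The following is a minimal sketch of that check using Python's standard `urllib.robotparser` module; it is not Heritrix's implementation, and the rules, the agent name `ExampleBot` and the URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from http://<host>/robots.txt before crawling that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, agent, url):
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
```

A crawler would call `allowed(parser, "ExampleBot", url)` for every candidate URL and skip those for which it returns `False`.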
JoBo
JoBo [14] is a pure Java 1.3 crawler with both a CLI, configured through XML, and a graphical user interface (GUI). Starting from a given seed, JoBo can recursively mirror HTTP documents down to a user-defined depth. It also supports download limits and the robots exclusion protocol, and it provides automated form handling and cookie support.
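Recursive mirroring from a seed down to a predefined depth can be pictured as a breadth-first traversal with a depth bound. The sketch below runs over an in-memory link graph instead of the network; the `LINKS` graph and the `crawl` function are hypothetical and do not reflect JoBo's actual code.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "seed"],
    "c": ["d"],
    "d": [],
}


def crawl(seed, max_depth, links):
    """Visit pages breadth-first, following links at most max_depth hops."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for nxt in links.get(url, []):
            if nxt not in seen:  # never enqueue the same page twice
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

With a depth of 1 only the seed and the pages it links to directly are visited; raising the depth to 2 also reaches page `c`.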
GNU Wget
GNU Wget [15] is a free software package written in C for retrieving files over HTTP, HTTPS and FTP. It supports recursive crawling and can convert absolute links to relative ones, so that the downloaded documents link to each other. It also supports automated form filling and cookies. However, it provides only a CLI and has to be configured for each job.
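Converting an absolute link to a relative one amounts to expressing the target path relative to the directory of the page that contains the link. The sketch below illustrates the idea with the Python standard library; it is not how Wget is implemented, the function name `to_relative` is hypothetical, and query strings and fragments are ignored for simplicity.

```python
import posixpath
from urllib.parse import urlsplit


def to_relative(page_url, link_url):
    """Rewrite link_url relative to page_url when both are on the same host."""
    page = urlsplit(page_url)
    link = urlsplit(link_url)
    if (page.scheme, page.netloc) != (link.scheme, link.netloc):
        return link_url  # links to other hosts are left untouched
    # Directory that contains the referring page.
    page_dir = posixpath.dirname(page.path) or "/"
    return posixpath.relpath(link.path or "/", start=page_dir)
```

For a page at `http://example.com/docs/guide/index.html`, a link to `http://example.com/docs/img/logo.png` becomes `../img/logo.png`, which still resolves correctly once both files are stored in the same local directory tree.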
Open WebSpider
Open WebSpider [16] is an open-source multi-threaded Web spider and search engine. The crawler part is written in C and the search part in PHP. The user only gives a seed to Open WebSpider, which then caches and indexes the HTTP documents it finds. After that, the user can use the Web part to search for what he wants. Open WebSpider offers few configuration options, and the crawler provides only CLI settings.
Where Spider
The purpose of the Where Spider [17] software is to provide a database system for storing URL addresses, operated through a GUI. The software uses a pure XML database, which is easy to export and import. It is written in C# on the .NET 1.1 framework. It not only extracts the links but also lets the user browse the documents offline. For crawling, however, the user cannot specify in advance what to include and what to exclude.
Web Scraper Plus+
In short, Web Scraper Plus+ [11] takes data from the Web and puts it into a spreadsheet or database. It has a simple wizard-driven configuration for tasks and offers many crawl customizations, such as cookies, automated form handling, depth control, and a limit on the number of pages. However, the user has to inspect the original HTML source and find which tags contain the data he wants, and the settings are very complex.
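To make the tag-hunting step concrete, the following sketch extracts the text of every tag with a given name and class from an HTML document, using Python's standard `html.parser` module. The sample page, the class names and the `extract` helper are hypothetical; the sketch only illustrates the kind of rule a user has to work out by hand, not the product's own mechanism, and it does not handle nested tags of the same name.

```python
from html.parser import HTMLParser

# Hypothetical page: the wanted data sits in <td class="price"> cells.
PAGE = """
<table>
<tr><td class="name">Widget</td><td class="price">9.50</td></tr>
<tr><td class="name">Gadget</td><td class="price">12.00</td></tr>
</table>
"""


class TagTextExtractor(HTMLParser):
    """Collect the text inside each <tag class="css_class"> element."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._buffer = None  # a list while inside a matching element
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.values.append("".join(self._buffer).strip())
            self._buffer = None


def extract(html, tag, css_class):
    """Return the text of all matching elements in document order."""
    extractor = TagTextExtractor(tag, css_class)
    extractor.feed(html)
    return extractor.values
```

Calling `extract(PAGE, "td", "price")` collects the two price cells, which could then be written to a spreadsheet row by row.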
Teleport Ultra
Teleport Ultra [12] is a very popular all-purpose high-speed tool for getting data from the Internet. It can download all or part of a website over HTTP/HTTPS to the local machine, subject to the user's restrictions, and save it either with the links rewritten as relative links or with the original directory tree structure. It can borrow the browser's cookie cache, which lets the user perform complex authentication in his browser first and then crawl with the resulting cookies. It can also synchronize the offline copy with the remote site.
PhoPicking
PhoPicking [18] downloads images, and only images, from Internet albums. It uses a silver key to describe the wanted images and the download operations. The user can create his own silver key and share it, and he can also use a key shared by someone else, so that both the album images and the crawling operations are shared. Besides, the user can modify an existing key for new albums. However, the setup process is very complex and takes a lot of practice. The user must understand the HTML document source and work out step by step what he really wants: he must point out which tags should be noted and crawled, and which text in the document stands for the author, the title, and the description of the target images.
Google Mini
Google Mini [19] is an integrated hardware and software solution designed to help an organization make the most of its digital assets. Given a set of seeds, Google Mini crawls everything it finds and indexes it in its system. It provides a search interface just like Google Search for the indexed data. It also provides reporting for administrators. The Google Mini summary reports include the search or query results and the broken URLs in the crawling set.
Comparison
Table 2-1 shows the comparison of these existing products and our Smart Crawler.
The table shows that tracking is lacking in all the other products. These programs simply download everything that might be expected. Some of them provide a search interface, but the user cannot quickly tell what has changed since the last crawl. Besides, most products are fed only one seed, an approach we call One-Step, which describes the wanted data only roughly. It is not flexible enough, and it is very likely to crawl something unwanted.
Protocol UI Cookie Form CD rules Indexing Tracking M-Steps Report
Heritrix HT,HS CLI Yes Yes
JoBo HT GUI+CLI Yes Yes Yes
GNU Wget HT,HS,FTP CLI Yes Yes
Open WebSpider HT Web Yes
Where Spider HT GUI Yes Yes
Web Scraper Plus+ HT GUI Yes Yes Yes Yes Yes Yes
Teleport Ultra HT,HS GUI Yes Yes Yes
PhoPicking HT GUI Yes Yes Yes Yes Yes
Google Mini HT,HS Web Yes Yes Yes Yes Yes
Smart Crawler HT,FTP Web Yes Yes Yes Yes Yes Yes Yes
Table 2-1: Comparison of the existing products and our Smart Crawler
“CD rules” stands for Customized Download rules, “HT” for HTTP, “HS” for HTTPS, and “M-Steps” for Multi-Steps.
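The tracking capability discussed in the comparison, that is, telling the user what has changed since the last crawl, can be illustrated by comparing content hashes of two crawl snapshots. The sketch below is only one possible approach under that assumption and is not a description of how Smart Crawler implements tracking; the function names and the sample pages are hypothetical.

```python
import hashlib


def snapshot(pages):
    """Map each URL to a SHA-256 digest of its content."""
    return {
        url: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for url, body in pages.items()
    }


def diff_snapshots(old, new):
    """Report which URLs were added, removed or changed between two crawls."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(u for u in old.keys() & new.keys() if old[u] != new[u])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical results of two consecutive crawls of the same site.
last_crawl = snapshot({"/index.html": "hello", "/news.html": "old news"})
this_crawl = snapshot({"/index.html": "hello", "/news.html": "fresh news",
                       "/about.html": "about us"})
report = diff_snapshots(last_crawl, this_crawl)
```

Storing only digests keeps the tracking data small, at the cost of not being able to show what exactly changed inside a page.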