
Figure 2-2 Overview of the feed space

The figure above depicts the feed space, divided into four columns representing different tasks to do with feeds. Within each column there are specific fields with their respective players, some of whom are further described in the following sections.

2.2.1 Windows RSS Platform

The Windows RSS Platform [13] is Microsoft’s answer to the shift of the web experience from pure browsing to searching and subscribing after the release of Internet Explorer 6.0 in 2001. Though an integral part of IE7, the Windows RSS Platform provides APIs for other applications in the same environment to access feeds and subscriptions, an idea similar to the one proposed by this paper: a platform instead of only a library. More will be discussed in the comparison and discussion section.

Figure 2-3 Architecture of the Windows RSS Platform [14]

2.2.2 Yahoo Pipes

Yahoo Pipes is a web application for non-programmers to aggregate and manipulate feeds [15]. It provides users with a GUI editor to connect inputs and outputs of different functional blocks, each having a specific use such as building URLs, fetching feeds, or replacing text.

Figure 2-4 Yahoo Pipes

2.2.3 Enterprise Solutions

Enterprises have begun to adopt RSS to fight the information overload of their portals and email. Three commercial products that focus on helping enterprises take advantage of feed technology are Attensa Feed Server [16], NewsGator Enterprise Server [17], and KnowNow Enterprise Syndication Solution [18]. All of them share similar features: a central server aggregating different feed sources on behalf of the organization, an easy-to-use interface for managing subscriptions, and delivery of news for reading in email clients, browsers, or mobile devices. However, all of them share the constraint of being built only for feed consumption rather than development, and integration is hard if not totally impossible.

Besides, they are all proprietary platforms selling for thousands of US dollars or more.

2.2.4 Academic Research

Three research projects are directly related to feed technology. FeedEx [19] is a feed exchanging system in which hosts not only fetch feeds but also exchange them with neighbors of similar interests to reduce time lag and increase coverage. Based on Scribe and Pastry, FeedTree [20] provides software for subscribers and publishers to join a structured overlay so that they can distribute feeds by multicast and poll for updates cooperatively. Also based on Pastry, Corona [21] does almost the same thing as FeedTree but focuses more on load balancing of nodes in the overlay to achieve better performance. In short, all of them are P2P-related projects which focus on the scalability of feed dissemination.

Figure 2-5 Corona Architecture

3 System Architecture

3.1 Overview

Figure 3-1 System Architecture

The diagram above depicts the components of Feed Middleware, which will be described in detail in the following sections.

3.2 Feed DB, Storer and Retriever

Feed DB is a database that stores all entries of all feeds and other relevant information. Regardless of what format a feed is in, entries of all subscribed feeds are stored in two different ways. First, they are stored in a normalized form which captures only the essence of an entry: its unique identifier, title, link, description, and timestamp. Second, they are stored in a serialized form which preserves all of their attributes. The rationale behind these redundant stores is that both performance and flexibility are desired, and that storage is inexpensive and easily expanded.

A list of subscribed feeds with their attributes, including last updated time and fetch frequency, is also stored. The following figure shows the database schema.

Figure 3-2 Database schema
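The dual storage scheme could be sketched in SQLite as follows; the table and column names here are illustrative assumptions, not the actual schema of Figure 3-2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- normalized form: only the essence of an entry
CREATE TABLE entries (
    id          TEXT PRIMARY KEY,   -- unique identifier of the entry
    feed_id     INTEGER NOT NULL,
    title       TEXT,
    link        TEXT,
    description TEXT,
    timestamp   TEXT
);
-- serialized form: the full entry with all attributes preserved
CREATE TABLE raw_entries (
    id   TEXT PRIMARY KEY,
    blob TEXT                       -- e.g. a JSON- or pickle-serialized object
);
-- subscribed feeds with their attributes
CREATE TABLE feeds (
    id              INTEGER PRIMARY KEY,
    url             TEXT UNIQUE,
    last_updated    TEXT,
    fetch_frequency INTEGER         -- seconds between fetches
);
""")
```

The redundancy is deliberate: the normalized table serves fast queries, while the serialized table keeps every attribute for flexibility.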

Feed Storer has knowledge of both the object representation and the database schema. First, it filters out old entries having the same ids. Then, it transforms only the updated ones into tuples suitable for insertion into the Feed DB.

Feed Retriever is responsible for retrieving feed entries from the database and formatting them in the form requested by client applications. Frequent retrievals are alleviated by a memory caching system so as to provide fast responses. Since all entries are stored in the database, merging different feeds into a single one can be done by using a SQL SELECT statement with an IN expression that tests for inclusion in a specified set of feed IDs. This mechanism also enables the use of OPML (Outline Processor Markup Language), which is often used as an XML format for subscription lists.
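Merging several feeds with an IN constraint could look like this minimal sketch (the schema and names are assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (feed_id INTEGER, title TEXT, timestamp TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", [
    (1, "digg entry",     "2007-01-02"),
    (2, "slashdot entry", "2007-01-03"),
    (3, "other entry",    "2007-01-01"),
])

def merged_feed(conn, feed_ids, limit=50):
    """Merge several feeds into one by testing feed_id against a set of IDs."""
    placeholders = ",".join("?" * len(feed_ids))
    sql = ("SELECT title FROM entries WHERE feed_id IN (%s) "
           "ORDER BY timestamp DESC LIMIT ?" % placeholders)
    return [row[0] for row in conn.execute(sql, (*feed_ids, limit))]
```

An OPML subscription list reduces to exactly such a set of feed IDs, so the same query serves both cases.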

3.3 Feed Sweeper, Monitor and Fetcher

Feed Sweeper is a scheduled process that constantly examines the status of every feed, marking it dirty if the current time is later than its last updated time plus its fetch frequency. Dirty feeds are then put into a queue for Feed Fetcher to re-fetch.
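The dirty-marking rule could be sketched as follows (the field names are assumptions):

```python
import time

def is_dirty(feed, now=None):
    """A feed is dirty when now > last_updated + fetch_frequency (seconds)."""
    now = time.time() if now is None else now
    return now > feed["last_updated"] + feed["fetch_frequency"]

def sweep(feeds, queue, now=None):
    """Put every dirty feed into the work queue for Feed Fetcher."""
    for feed in feeds:
        if is_dirty(feed, now):
            queue.append(feed["url"])
```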

Instead of guessing whether there is an update for a blog, Feed Monitor leverages the knowledge of ping servers by downloading change logs from them, scanning through the logs for feeds of interest that have been updated, and putting those into the queue for Feed Fetcher to re-fetch.

Feed Fetcher is responsible for fetching feeds, parsing them into objects, and storing them into Feed DB using Feed Storer. It contains a pool of worker threads that perform these steps concurrently in order to achieve higher throughput. Feed Fetcher begins to fetch a feed when notified by Feed Sweeper that a feed has been marked dirty, or by Feed Monitor that feeds have been updated. It uses various HTTP techniques, mentioned in the implementation section later, to reduce bandwidth usage. A hash code is also kept for each feed to check content freshness in addition to the HTTP ETag header, ensuring that further processing happens only for updated feeds. A feed parser is used to parse different XML-based feed formats into a consistent object model.
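The content-hash freshness check could be sketched as follows (function names are illustrative, not the actual implementation):

```python
import hashlib

def content_hash(feed_xml):
    """Hash the raw feed document to detect changes beyond the ETag header."""
    return hashlib.sha1(feed_xml.encode("utf-8")).hexdigest()

def needs_processing(feed_xml, stored_hash):
    """Process a fetched feed only when its content hash differs."""
    return content_hash(feed_xml) != stored_hash
```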

Instead of using mathematical or heuristic methods to dictate the fetch frequencies of feeds, which is complicated and not in the scope of this text, aids are provided for users to determine the frequency of the feeds of interest. Fetch frequency can be divided into different levels. Level 0 is set for feeds whose updates have been sent to ping servers and are in turn noticed by Feed Monitor; its fetch frequency is 1 day. Level 1, with a frequency of 30 minutes, is the default for every feed and is suitable for blogs or infrequently updating sites. People normally do not mind being 30 minutes late to learn some trifles about their friends.

When an update is received from ping servers for a level 1 feed, it is set to level 0, because it can be assumed that subsequent updates will also be received from ping servers, so there is no need to fetch that often. Conversely, if the daily fetch for a level 0 feed finds missed updates, the feed is set back to level 1, because the ping server may not be reliable for that feed any more. Level 2, at 5 minutes, is for news or real-time updating sites. Finally, users can always set the exact fetch frequency directly to values other than these three levels.
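The level-switching rules above could be sketched as follows (names are illustrative):

```python
# Fetch-frequency levels in seconds, per the text above
LEVELS = {0: 86400, 1: 1800, 2: 300}   # 1 day, 30 minutes, 5 minutes

def adjust_level(feed, ping_received, missed_updates):
    """Move a feed between level 0 and level 1 based on ping-server behavior."""
    if feed["level"] == 1 and ping_received:
        feed["level"] = 0          # ping server covers this feed; fetch daily
    elif feed["level"] == 0 and missed_updates:
        feed["level"] = 1          # ping server unreliable; poll every 30 min
    feed["fetch_frequency"] = LEVELS[feed["level"]]
    return feed
```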

3.4 Feed Notifier

Feed Notifier is initiated by Feed Fetcher with only the updated feed entries. It checks the subscription tables in the database to see if anyone is interested in those updates. If one is found, a separate thread is dispatched to push those entries to the respective endpoint using XML-RPC. XML-RPC is a simple way to communicate with a remote entity. It is possible to use more reliable mechanisms such as message-oriented middleware, either directly or through adapters.
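Dispatching a separate thread per interested subscriber could be sketched as follows; the transport callable is injected so the sketch stays independent of XML-RPC:

```python
import threading

def notify_subscribers(subscribers, entries, push):
    """Dispatch one thread per interested subscriber to push updated entries.

    `push` is the transport callable (e.g. an XML-RPC call), injected here
    so the sketch stays transport-agnostic.
    """
    threads = []
    for endpoint in subscribers:
        t = threading.Thread(target=push, args=(endpoint, entries))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```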

3.5 Interfaces

Applications access feeds by sending simple HTTP requests to the Feed Middleware, much like retrieving feeds from any web server. In addition, Feed Middleware allows developers to specify how feeds should be served using arguments. The following tables list all operations provided by Feed Middleware with their descriptions:

Resource: /feed
HTTP Method: GET
Description: retrieve a single feed of entries
Arguments:
    url    a single URL
    type   rss, atom, json
    len    how many entries to retrieve
Example: GET /feed/?url=http://digg.com/rss/index.xml&type=atom&len=50

Table 3-1 Retrieve a single feed of entries

Resource: /feeds
HTTP Method: GET
Description: retrieve a set of feeds of entries
Arguments:
    url    comma-separated list of URLs
    type   rss, atom, json
    len    how many entries to retrieve
Example: GET /feeds/?url=http://digg.com/rss/index.xml,http://rss.slashdot.org/Slashdot/slashdot&type=json&len=100

Table 3-2 Retrieve a set of feeds of entries

Resource: /opml
HTTP Method: GET
Description: retrieve a set of feeds of entries defined by an OPML file
Arguments:
    url    a URL of an OPML file
    type   rss, atom, json
    len    how many entries to retrieve
Example: GET /opml/?url=http://share.opml.org/opml/top100.opml&type=rss

Table 3-3 Retrieve a set of feeds of entries defined by an OPML file

Resource: /sub
HTTP Method: GET
Description: retrieve all subscribed feeds as an OPML file
Example: GET /sub

Table 3-4 Retrieve all subscribed feeds as an OPML file

Resource: /sub
HTTP Method: POST
Description: subscribe to a list of feeds
Arguments:
    url    comma-separated list of URLs
Example: POST /sub/?url=http://digg.com/rss/index.xml,http://rss.slashdot.org/Slashdot/slashdot

Table 3-5 Subscribe to a list of feeds

Resource: /sub
HTTP Method: DELETE
Description: unsubscribe from a list of feeds
Arguments:
    url    comma-separated list of URLs
    id     subscription_id
Example: DELETE /sub/?url=http://digg.com/rss/index.xml,http://rss.slashdot.org/Slashdot/slashdot&id=1

Table 3-6 Unsubscribe from a list of feeds
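Since every operation above is a plain HTTP request, a client only needs to build the corresponding query strings. A minimal sketch, assuming the middleware runs at localhost:8080 (the address and helper names are assumptions):

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080"   # assumed Feed Middleware address

def feed_url(url, fmt="rss", count=50):
    """Build a /feed request as in Table 3-1 (API arguments: url, type, len)."""
    return "%s/feed/?%s" % (BASE, urlencode({"url": url, "type": fmt, "len": count}))

def feeds_url(urls, fmt="json", count=100):
    """Build a /feeds request: the url argument is a comma-separated list."""
    query = urlencode({"url": ",".join(urls), "type": fmt, "len": count})
    return "%s/feeds/?%s" % (BASE, query)
```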

3.6 Program Flow

Assume that there are already some feeds in the database, all with different fetch frequencies. A work queue is maintained for Sweeper and Monitor to communicate with Fetcher in the producer-consumer paradigm. Feed Sweeper is scheduled to put outdated feeds into the queue. Outdated only means that the feed has not been fetched for some specific period of time; it does not necessarily mean that it has been updated.

In contrast, Feed Monitor leverages the update logs of ping servers to put actually updated feeds into the queue. Upon receiving fetch requests from the queue, Feed Fetcher fetches those feeds, parses them into objects, filters out old entries, stores new ones into the Feed DB using Feed Storer, and dispatches notification threads using Feed Notifier.
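The producer-consumer hand-off between Monitor and Fetcher could be sketched with Python's thread-safe queue (a None sentinel, an illustrative choice, stops the consumer):

```python
import queue
import threading

work = queue.Queue()          # shared between Sweeper/Monitor and Fetcher

def monitor(updated_feeds):
    """Producer: Feed Monitor puts actually-updated feeds into the queue."""
    for url in updated_feeds:
        work.put(url)

def fetcher(results):
    """Consumer: Feed Fetcher takes fetch requests until a sentinel arrives."""
    while True:
        url = work.get()
        if url is None:
            break
        results.append("fetched " + url)   # fetch/parse/store would happen here
```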

From the developers' point of view, they only have to send HTTP requests to subscribe or unsubscribe to feeds, or to retrieve them directly in a couple of different ways. If they subscribe to a feed, updates will be pushed to them.

Figure 3-3 Program flow of Feed Middleware

4 Implementation Details

4.1 Overview

As mentioned in two articles on middleware "dark matter" [23][24], Python is one of many tools that solve real-world integration problems when EAI, MOM, CORBA, and J2EE are just too complex and overkill. Python [25] is a dynamic object-oriented programming language that can be used for many kinds of software development. It is well known that Google uses Python intensively for many of its systems. There are also extensive standard libraries and many third-party tools, like the brilliant Universal Feed Parser. Besides, Python is available for Windows, Macintosh, Linux, and many other platforms. For these reasons, Feed Middleware is developed entirely in Python.

4.2 Feed discovery

Feed discovery is getting the feed URL of a website given the site's own URL. This feature is handy when retrieving a feed for the first time because users do not need to know the feed URL in advance. While websites have default names like index.html or index.php for their entry points, there is no similar convention for feeds. However, webmasters associate a feed with a website in a way similar to referencing external stylesheets and scripts: by adding a link tag within the head section of a webpage. Therefore, the following steps can be used to get the feed URL:

• retrieve the HTML file of the website
• use a regular expression to find all link tags
• for each link tag, if its type is "application/rss+xml", get its href attribute
• use the href attribute and the original URL to form the feed URL
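The steps above could be sketched as follows (the regular expressions are illustrative, not the actual implementation):

```python
import re
from urllib.parse import urljoin

LINK_RE = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def discover_feed(page_url, html):
    """Find the feed URL advertised in a page's <head> via <link> tags."""
    for tag in LINK_RE.findall(html):
        attrs = dict(ATTR_RE.findall(tag))
        if attrs.get("type") == "application/rss+xml" and "href" in attrs:
            # combine href with the original URL to form an absolute feed URL
            return urljoin(page_url, attrs["href"])
    return None
```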

4.3 Feed fetching

A feed is just yet another object transferred over HTTP, like an HTML document. Techniques used by browsers and other HTTP clients can be directly employed to speed up fetching and reduce bandwidth usage. One of them is caching with validation. For example, a client issues a request for a feed and the server responds with the XML document in the payload and optional ETag (entity tag) and/or Last-Modified headers. The client may cache the document so that the next time it requests the same feed, it can attach If-None-Match and/or If-Modified-Since headers with the previous values to check whether its cached version is still valid. If it is, a status code 304 Not Modified is returned without a payload; otherwise, a normal response is returned. ETag is a strong validator, a content hash which changes with the content itself. Last-Modified is a weak validator derived implicitly from the last modified time of the content. Both serve as good mechanisms to reduce unnecessary transfers.
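The validation logic can be sketched on both sides of the exchange (helper names are assumptions):

```python
def conditional_headers(cache):
    """Client side: headers for revalidating a cached feed (conditional GET)."""
    headers = {}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers

def server_status(etag, if_none_match):
    """Server side: 304 Not Modified when the client's validator still matches."""
    return 304 if if_none_match == etag else 200
```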

4.4 Feed parsing

There are a number of feed parsers available to tackle the problem of chaotic feed formats. The Windows RSS Platform can be used in .NET environments. For Java, Rome [26] is probably the most promising one, with a strong community and sub-projects that handle other issues such as fetching and storage. Jakarta FeedParser [27] is an alternative for Java with a SAX- instead of DOM-based API. For Python, Universal Feed Parser [28] is the one to use. UFP is chosen not only because the middleware is written in Python, but also because of its liberal parsing, which is very important because feed publishers, spoiled by browsers that accept all kinds of HTML documents, tend to produce ill-formatted feeds. Besides, they may also mix up entities of different formats, which will break parsers that conform strictly to the specifications.

Universal Feed Parser tries to expose as many values of non-standard extensions as possible. For example, each entry of the feed of the famous Web 2.0 news site Digg [29] comes with a digg count (<digg:diggCount>42</digg:diggCount>), that is, the number of votes it gets from the users of Digg. The value can simply be accessed directly as d.entries[i].digg_diggcount. However, as of the latest release of UFP, attribute values are not preserved. Therefore, a modification must be made for extensions like the Buy.com RSS 2.0 Product Module Definition [30], whose product information is formatted as attributes (<product:content price="$2,021.99"/>), so that values are stored in a dictionary that can be accessed by a key composed of the tag name and an underscore (d.entries[i].product_content_['price']).
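The element-to-attribute naming convention described above (namespace prefix, underscore, lower-cased local name) can be illustrated in isolation; this sketch mimics only the naming rule and is not part of UFP itself:

```python
def ufp_attr_name(qualified_tag):
    """Map a namespaced tag like 'digg:diggCount' to UFP-style 'digg_diggcount'."""
    prefix, _, local = qualified_tag.partition(":")
    return ("%s_%s" % (prefix, local)).lower()
```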

4.5 Interfaces

The pull interface follows the REST style and is implemented using the web.py framework. REST (Representational State Transfer) [31] is an architectural style for designing network-based software. The principle of REST is to model application states and functionalities as resources which can be addressed using a universal syntax and interacted with by exchanging representations via a simple and uniform interface (often HTTP). web.py [32] is a framework written in Python for developing web applications with REST in mind. It uses Python classes and member functions to define resources and their respective HTTP method interfaces. A Python tuple is used to map URLs to resources.

Feed output formats can be RSS 2.0 or Atom 1.0 regardless of the feeds' original formats. Other formats can easily be supported by simply defining respective templates. For object output, JSON (JavaScript Object Notation) [33] is used to facilitate object exchange across different languages.

XML-RPC is used for the push interface for its simplicity and its many implementations in different languages. XML-RPC [34] is simply a way to encode and decode method calls, arguments, and return values in XML and transfer them over HTTP. A client registers its XML-RPC endpoint, feeds of interest, and attributes with Feed Middleware. When updates are available for those feeds, only the updated entries are sent to the client via an XML-RPC method call, with those entries as XML-formatted object arguments.
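With Python's standard library, such a push interface could be exercised end to end; the method name notify and the entry fields are assumptions, not the middleware's actual protocol:

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

received = []

def notify(entries):
    """Client-side endpoint: called by Feed Middleware with updated entries."""
    received.extend(entries)
    return True

# the subscriber registers an XML-RPC endpoint and waits for one push
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(notify)
threading.Thread(target=server.handle_request, daemon=True).start()

# Feed Middleware side: push only the updated entries as a method call
port = server.server_address[1]
proxy = ServerProxy("http://localhost:%d/" % port)
proxy.notify([{"title": "new entry", "link": "http://example.org/1"}])
```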

These two interfaces are put into use by two demo applications written in different languages, as shown in the demo chapter.

4.6 Tools and Libraries

Many open source tools and libraries are used in Feed Middleware. They are listed below:

Name                   Usage              License
Python                 Core               Python license
Universal Feed Parser  Parsing feeds      MIT
SQLite                 Embedded database  Public domain
memcached              Memory cache       BSD
web.py                 Web framework      Public domain

Table 4-1 Tools and libraries used

5 Scenario Demonstrations

5.1 Ajax Product Spy

Ajax (Asynchronous JavaScript and XML) is a term coined by Jesse James Garrett [35] referring to the combination of techniques, including the JavaScript XMLHttpRequest (XHR) object, DOM manipulation, and XHTML, involved in the development of interactive web applications. The core of Ajax is XHR [36], which enables JavaScript embedded in a webpage to issue asynchronous HTTP requests without the need to refresh the whole page, resulting in a fluid user experience.

Figure 5-1 Architecture of the Ajax Product Spy

This demo is a single webpage containing product information that updates automatically without refreshes. Such an application can be integrated into an existing internal portal of an enterprise to show its own inventory or that of its competitors.

The webpage contains an Ajax Request object from the Prototype JavaScript Framework [37] that makes repeated requests to Feed Middleware, asking for entries of the products of interest serialized as JSON. The callback function of the request inserts new products into the webpage by updating the DOM tree. Also demonstrated is the flexibility of Feed Middleware in directly returning elements, like product prices and images of the Buy.com RSS 2.0 Product Module Definition [30], without any modifications.

Figure 5-2 Screenshot of the Ajax Product Spy showing product information

Figure 5-3 Screenshot of the Ajax Product Spy after an update

<script>
function fetch() {
    new Ajax.Request('/feed?url=http://localhost:8080/' +
            'demo/buy.xml&type=json&len=50&obj', {
        method: 'get',
        onSuccess: function(transport) {
            // insert the new product entries into the DOM tree here
        }
    });
}
</script>

5.2 Bug Notifier

As mentioned above, polling is used as the underlying technique to enable the publish/subscribe semantics of feeds. In order to receive timely notifications for urgent matters like system outage reports, an event notifier is written to send feed updates to users via instant messages, emails, or SMS messages.

Figure 5-4 Architecture of the Bug Notifier

The Bug Notifier is written in Java to demonstrate that applications using Feed Middleware can be language neutral. It contains a simple XML-RPC server, provided by the Apache XML-RPC implementation [38] for Java, to listen for new entries of subscribed feeds. Besides, it leverages JMSN [39], a Java clone of Microsoft MSN Messenger, to communicate with the MSN network and send feed entries as instant messages to users.

The following scenario shows that when a tester of a system reports a bug, the developer of that system, having subscribed to the bug feed, will be notified by an MSN robot with relevant information leading him to the bug report page. Moreover, since the bug reporting system and its corresponding feed are password protected, this program also shows the capability to handle secured feeds, which is not supported by

