The Robots.txt and Robots Meta tags with regard to Copyright Authorization

3. Expressions for Licensing All Works in a Website

3.2 The Robots.txt and Robots Meta tags with regard to Copyright Authorization

The Robots.txt and Robots Meta tags were both proposed in the 1990s. The Robots.txt is also called the “Robots Exclusion Protocol” (Snyder et al., 1998), “Robot Exclusion Standard” (Koster, 1995) or “Standard for Robot Exclusion” (Koster, 1994), though it is only a widely accepted convention agreed upon by members of a robot mailing list (Koster, 1994), rather than an official standard with formal recognition (Feigin, 2004). Even so, the most widely used search engines, Google (Google, 2008b), Yahoo (Yahoo, 2008b), and MSN (MSN, 2008a), all support the Robots.txt and Robots Meta tags; moreover, both Yahoo (Yahoo, 2008c) and MSN (MSN, 2008b) have tried to introduce amendments to them. As far as websites are concerned, research indicates that, in 2001, around 40% of the websites owned by high-ranking global companies adopted the Robots.txt and Robots Meta tags (Drott, 2002).

3.2.1 Introduction of Robots.txt

The Robots.txt is a file which must reside in the root directory of the website and must be named "robots.txt". A robots.txt file located in a subdirectory or given any other name is invalid, as software robots only check for this file in the root (Koster, 1994). The following examples illustrate several common uses of the Robots.txt:

Table VI. Examples of Robots.txt

     Examples              Meaning

1    User-agent: *         Allow all robots complete access
     Disallow:

2    User-agent: *         Exclude all robots from accessing the entire server
     Disallow: /

3    User-agent: lycra     Allow only the robot called “lycra” complete access;
     Disallow:             exclude all other robots from the entire server

     User-agent: *
     Disallow: /

4    User-agent: *         Exclude all robots from the /tmp and the /log folders
     Disallow: /tmp
     Disallow: /log
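How a compliant robot might locate and evaluate rule sets of this kind can be sketched with Python's standard urllib.robotparser module. This is only a minimal illustration, not a description of how any particular search engine works: the host www.example.edu, the robot name "otherbot" and the helper names robots_txt_url and allowed are assumptions made for the example, and the two rule sets mirror examples 3 and 4 of Table VI.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_txt_url(page_url: str) -> str:
    """Return the only location a robot checks: /robots.txt in the server root."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def allowed(rules: str, agent: str, url: str) -> bool:
    """Parse the robots.txt text in `rules` and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Example 3 of Table VI: an empty Disallow for "lycra", "Disallow: /" for all others.
EXAMPLE_3 = """\
User-agent: lycra
Disallow:

User-agent: *
Disallow: /
"""

# Example 4 of Table VI: every robot is kept out of /tmp and /log only.
EXAMPLE_4 = """\
User-agent: *
Disallow: /tmp
Disallow: /log
"""

if __name__ == "__main__":
    page = "http://www.example.edu/dept/staff/index.html"  # hypothetical page
    print(robots_txt_url(page))                   # http://www.example.edu/robots.txt
    print(allowed(EXAMPLE_3, "lycra", page))      # True
    print(allowed(EXAMPLE_3, "otherbot", page))   # False
    print(allowed(EXAMPLE_4, "otherbot", "http://www.example.edu/tmp/a.html"))  # False
    print(allowed(EXAMPLE_4, "otherbot", page))   # True
```

Under these assumptions, the parser reports that "lycra" may fetch the page under example 3 while "otherbot" may not, because an empty Disallow value places no restriction on the robot it is addressed to.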

3.2.2 The introduction of Robots Meta tags

Sometimes, page creators do not administer their own websites. For example, a staff member in a university may create a personal webpage on the website of his department. In this circumstance, the webmaster with the authority to access the root is someone who works in the computer center of the university; the staff member can neither access the root directory nor use the Robots.txt to exclude software robots. This limitation is addressed by the Robots Meta tags: the “[No]index” tag and the “[No]follow” tag, which are placed within the code of the page itself (Koster, 1997). Some examples are as follows:

Table VII. Examples of Robots Meta tags

Examples                                             Meaning

<meta name="robots" content="noindex">               Restrict all robots from indexing a page

<meta name="robots" content="noindex, nofollow">     Block all robots from both indexing and following links

If the page creator has access to the root directory, he can use a single “Disallow” directive to exclude robots, instead of exhaustively embedding redundant “Noindex” tags in every page hosted on the same server.
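The way a robot could read these page-level tags can be sketched as follows. The snippet uses Python's standard html.parser module; the class name RobotsMetaParser, the function robots_meta_policy and the sample pages are hypothetical, and the sketch handles only the “Noindex” and “Nofollow” values discussed above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html: str) -> tuple:
    """Return (may_index, may_follow) for the given page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


if __name__ == "__main__":
    # Hypothetical page sources used only for illustration.
    noindex_page = '<html><head><meta name="robots" content="noindex"></head></html>'
    closed_page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    plain_page = "<html><head><title>Staff page</title></head></html>"

    print(robots_meta_policy(noindex_page))  # (False, True)
    print(robots_meta_policy(closed_page))   # (False, False)
    print(robots_meta_policy(plain_page))    # (True, True)
```

A page without any Robots Meta tag is treated here as open to both indexing and link following, which is the default behaviour the tags are meant to override.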

3.2.3 Two functions of Robots.txt and Robots Meta tags

3.2.3.1 The original function: voluntary advice

The original idea of the Robots.txt and Robots Meta tags was to offer the Internet community a common facility, supported by the majority of robot authors, to protect websites against unwanted access by their robots (Koster, 1994). They are not “enforced by anybody” and there is “no guarantee that all current and future robots will use them” (Koster, 1994). In other words, under this design concept, the Robots.txt and Robots Meta tags are only a voluntary code; no one will be punished for breaching the access policy.

3.2.3.2 The new function: expressing online copyright authorization

Beyond mere advice, the Robots.txt and Robots Meta tags have both found new roles following a notable US federal case, Field v. Google, Inc. (Field, 2006). This case related to the “Cached link” of Google. In order to allow access when the original page is temporarily inaccessible, or to allow viewers to compare changes made to pages during a specific period, Google’s search results include a link to its own cached copy, which is held in a temporary repository containing the source code of indexed pages (Field, 2006). The plaintiff, Mr. Field, posted 51 copyrighted works on his website and “created a robots.txt file for his site, and set the permissions ... to allow all robots to visit and index all of the pages on the site” (Field, 2006). Although he knew that Robots Meta tags could “instruct Google not to provide Cached link to a given Web page”, Mr. Field consciously decided to use none of them (Field, 2006). As a predictable result, Google routinely used its software robot, GoogleBot (Google, 2008a), to retrieve the plaintiff’s website, indexed his works and provided the Cached link along with the search results. Based on these facts, Mr. Field “alleges that Google directly infringed his copyright when a Google user clicked on the Cached link to the Web pages containing Field's copyrighted works and downloaded a copy of those pages from Google's computers” (Field, 2006). The federal district court in Nevada took into account the fact that Mr. Field did not take any measures, even though he had the opportunity and ability to employ the Robots.txt and Robots Meta tags to exclude software robots or to instruct the search engine not to provide the Cached link. Since Mr. Field “knows the use” and “encourage it”, the court held that he had granted an implied license to Google through his conscious silence (Sieman, 2007). As a result, Google did not infringe Mr. Field’s copyright (Field, 2006).

It is notable that the court in this case inferred the license from the absence of the Robots Meta tags on the basis of two facts. First, because the plaintiff had actually set up a Robots.txt file, Mr. Field clearly had the ability and opportunity to employ the tags to prevent Google's use. Second, and more importantly, Google stops indexing a website in accordance with the tags employed by its webmaster (Google, 2008a). That is to say, without these two conditions, a mere absence of the tags could not by itself give rise to an implied license. On this ground, in a recent Belgian case, Copiepresse v Google (Copiepresse, 2007), the court found that the newspaper publishers' failure to use standard technical exclusion methods such as the Robots.txt and Robots Meta tags did not amount to an implied license (Smith, 2007).

Whether or not the absence of the tags can be seen as an implied license, the foregoing cases support the conclusion that, although the original idea of the Robots.txt and Robots Meta tags was to set up a code of voluntary advice, they are now far from “voluntary recommendations without any enforcement” and have taken on new roles in the context of law. A webmaster who adopts the Robots.txt or Robots Meta tags to set permissions allowing robots to visit should be regarded as granting a license to robots. On the other hand, a webmaster who adopts the “Disallow” directive or the “Noindex” tag should be regarded as expressing an explicit wish to exclude the robots. In addition, a webmaster who “consciously” does not use them may also be regarded as granting “an implied license” to such robots. As a result, any software robot which follows the license to gain access to the website or index the collected data does not infringe the webmaster’s copyright, whereas any robot which disregards the “Disallow” directive or “Noindex” tag and still accesses the website may breach copyright law in terms of this new function. To sum up, the presence or absence of the Robots.txt and Robots Meta tags represents the webmaster’s wishes; any robot deliberately ignoring these wishes may be in breach of the law. That is to say, the court in Field v. Google considered the Robots.txt and Robots Meta tags as instruments which webmasters can use to express their wishes about which robots are allowed, which are excluded and which links should not be followed.

However, even though the Robots.txt and Robots Meta tags are taking on more significant roles today, they have not been fully investigated by researchers. Only a few peer-reviewed academic papers on this topic have been published (Chau et al., 2003) and, as a result, the sporadic amendment proposals are based on personal experience rather than general principles (Conner, 1996; Koster, 1994).
