Expressions of Copyright Authorization Used for

Automatically Acquiring Free Internet Materials







  


Student: LIAO, Hsien Jyh

Advisor: YANG, Chyan

 



A Dissertation

Submitted to Institute of Information Management, College of Management, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Information Management

May 2009

Hsinchu, Taiwan, the Republic of China



Abstract

Internet libraries have gradually become popular in recent years, and the emergence of “Free Content” and “Open Content” has increased the amount of material available to them. Copyright, however, remains one of the most important issues in building a successful Internet library: collecting works legally and economically is a great challenge for librarians. Launching software robots to acquire works on the Internet automatically is efficient but carries high legal risk, because robots cannot automatically comprehend the actual scope of copyright authorization. As a result, libraries that distribute or reproduce the collected works may infringe the authors’ copyrights. An ideal solution to this problem is a scheme that software robots can identify and that can fully express the scope of copyright authorization.

In this thesis, we propose two mechanisms that fulfill both requirements: one is an extension of the Creative Commons license, the CCFE, and the other is a revised edition of Robots.txt and the Robots Meta tags. The CCFE mitigates one of the main disadvantages of the original CC license, namely that machine-readable metadata cannot easily be embedded in digital files. In addition, with some extra commands and tags, Robots.txt and the Robots Meta tags can also be used to express the scope of copyright authorization.


Contents

Abstract
Contents
List of Tables
List of Figures

1. Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Possible Ways to solve the Copyright Problem

2. Literature Review and Terminologies Overview
2.1 Internet Library
2.2 Internet Copyright
2.2.1 The diversities and harmonization of international copyright law
2.2.2 The rights within the term “copyrights”
2.3 The Software Robots
2.4 The DRM and Other Related Measures to Control Copyright
2.5 Open Content and Free Culture

3. Expressions for Licensing All Works in a Website
3.1 Creative Commons License Framework
3.1.1 The basic of the CC license
3.1.2 How to use the CC license in different countries
3.1.3 How to license and mark works with CC license
3.2 The Robots.txt and Robots Meta tags with regard to Copyright Authorization Expression
3.2.1 Introduction of Robots.txt
3.2.2 The introduction of Robots Meta tags
3.2.3 Two functions of Robots.txt and Robots Meta tags
3.3 Uniqueness of Robots.txt and Robots Meta Tags
3.4 Few Deficiencies of Robots.txt and Robots Meta tags in Respect of Copyright Authorization Expression
3.4.1 Some uncertainties with respect to new authorization function
3.4.2 No appropriate tags to cover all copyright rights possibly infringed by software robots
3.5 Adding Tags to Fully Express Copyright Authorization Scope and Dismiss Ambiguous Old Tags

4.1 Introduction
4.2 Showing CC Licensing Information of Work in Part of Website
4.3 Embedding CC Licensing Information in Body of a File
4.4 Storing CC Licensing Information in Name of a File and CC File Extension Protocol (CCFE)
4.4.1 Which attached part to a file is proper to store CC licensing information?
4.4.2 The essential elements of CC licensing and how to express them
4.4.3 CC File Extension Protocol (CCFE)

5. Conclusions
5.1 Comparisons
5.1.1 The comparison of Robots.txt and Robots Meta tags and CC licensing scheme in respect of identically licensing all works
5.1.2 The comparison of showing licensing information in page, embedding information in body and storing information in filename (CCFE)
5.2 Implications of the two methods
5.2.1 The implication to Internet library creator
5.2.2 The implication to copyright law: the scope of fair use
5.2.3 The implication to free culture and open content movement
5.2.4 The implication to the computer and information science researchers
5.3 Future Works and Further Suggestions

List of Tables

Table I A summary of the four models in risks, costs and amounts of collections
Table II An example of the three formats of CC license
Table III Four options in CC license
Table IV Six different choices of the CC license
Table V Different legal code of the same Attribution license in two jurisdictions
Table VI Examples about Robots.txt
Table VII Examples about Robots.txt
Table VIII Possible Copyright Infringement caused by robots and the tag
Table IX Examples about Robots Meta tags
Table X Examples about three new Robots Meta tags
Table XI Examples of CCFE file names and their meanings
Table XII The syntax of the popular file naming systems, URL and CCFE
Table XIII The differences between the two approaches in respect of licensing all works in a page

List of Figures

Figure 1 Libraries in the four quadrants
Figure 2 A snapshot of SourceForge’s “Terms and Conditions of Use”
Figure 3 A snapshot of a document in Scribd
Figure 4 The process of how a software robot works
Figure 5 A sample image with the CC license marker
Figure 6 A part of a Web page containing identical licensing information
Figure 7 A tool offered by the CC website to generate digital-code
Figure 8 The results of CC digital-code generator

1. Introduction

1.1 Research Background

Libraries are important to cultural development, and their influence is gradually increasing in today’s Internet age, because the Internet effectively widens the acquisition of library materials (Hundie, 2003), broadens the accessibility of libraries (Barker, 2001) and encourages communities to share information rather than restrict access to it (McCray et al., 2001). For example, Citeseer (Citeseer, 1997) is a well-known and popular online digital library in which a large number of academic papers related to computer science can be searched (Giles et al, 1998). One important part of Citeseer is its software robot (“crawler” or “spider”), which retrieves and stores all related papers in Adobe Portable Document Format (PDF) or PostScript (PS) format from other Web sites (Raghavan et al, 2001). Citeseer then indexes these documents. Users may search Citeseer for documents pertinent to their area of research and download one or more documents as required.

The first concern of an Internet librarian or library constructor is likely to be the size of the library’s collection. For example, Citeseer focuses only on research papers related to computer science and, in order to acquire as many papers as possible, employs software robots rather than collecting papers on the Internet manually. Generally speaking, an Internet librarian or library constructor prefers to collect as many works as the budget and the chosen subjects allow. In the Internet world, software robots that acquire materials automatically are a popular way to achieve this goal. Moreover, a software robot with a screening ability, such as keyword selection, can also help library constructors choose works that belong to the preset subjects.
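The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword set, the accepted file extensions and the function names `is_candidate` and `screen` are hypothetical and are not taken from Citeseer or any other system discussed here.

```python
from urllib.parse import urlparse

# Hypothetical settings: subject keywords and accepted formats are illustrative only.
SUBJECT_KEYWORDS = {"crawler", "digital library", "information retrieval"}
ACCEPTED_EXTENSIONS = (".pdf", ".ps")


def is_candidate(url: str, text: str) -> bool:
    """Return True if the document looks relevant to the preset subjects."""
    path = urlparse(url).path.lower()
    # Screen by file format first (PDF or PostScript in this sketch).
    if not path.endswith(ACCEPTED_EXTENSIONS):
        return False
    # Then screen by simple keyword selection on the extracted text.
    lowered = text.lower()
    return any(keyword in lowered for keyword in SUBJECT_KEYWORDS)


def screen(documents):
    """Keep only the (url, text) pairs that pass the screening step."""
    return [(url, text) for url, text in documents if is_candidate(url, text)]
```

Note that a filter of this kind says nothing about copyright: it only narrows the collection to the preset subjects, which is why the authorization problem discussed next still has to be solved separately.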

The next concern for an Internet library, as its collection grows, is copyright, which is essential to libraries; in fact, it may be the issue that most concerns librarians, whether in a traditional brick-and-mortar library or a digital library (Lopatin, 2006; McCray et al., 2001). The copyright issue arises when works in the library’s collection are still protected by copyright. According to modern copyright laws, such as 17 U.S.C. 106 and the WIPO Copyright Treaty, the creator of a copyrightable work automatically owns the copyright in the work upon its completion, and no one can reproduce, modify or distribute the work without the owner’s consent (Rao, 2003). That is to say, copyright is one of the important issues that could impede the development of digital libraries, because the dissemination of copyrighted works, one of the basic functions of a library, could result in copyright infringement (Bolin, 2006). In fact, other conditions being equal, the collection of a library that stays free from copyright infringement allegations is definitely smaller than that of a library that disregards copyright issues.

Before discussing the collecting methods and copyright issues in depth, it is helpful to examine several illustrative websites or libraries that acquire their collections via the Internet. We focus especially on what kinds of works these sites hold, how they acquire their collections and how they avoid possible copyright infringement allegations.

The first example is the Internet Archive, also known as the “WayBack Machine”, an archive consisting mainly of copies of past Web pages collected with software robots (Internet Archive, 2009). Because the Internet Archive is a non-commercial organization whose main purpose is preserving historical data on the Internet rather than launching time-consuming negotiations with authors, it relies on ‘fair use’ and other related copyright law exemptions for libraries as defenses against potential copyright infringement allegations (Hirtle, 2003).

The next example is websites that provide Web space for authors to upload their own articles and for contributors to publish others’ works with full permission, such as Scribd and Issuu (Scribd, 2009; Issuu, 2009a). In fact, a website like Scribd or Issuu is an agent or mediator that only offers a platform where right owners and users can interact: right owners can release their works on the library site as long as they grant some copyrights and, accordingly, users can choose works that not only meet their specific purposes but also fall within the scope of authorization. If uploaded files are alleged to infringe any copyright, the webmasters remove all suspected materials as soon as they receive notice (Issuu, 2009b). In other words, a library adopting this strategy counts on licensing from authors as well as on Safe Harbor exemptions, such as 17 U.S.C. 512, because it does not examine precisely whether the contributors have real authority.

The first two examples both rely on exemptions in copyright law. Another straightforward way to avoid potential copyright infringement allegations is to construct a website in which the librarian holds the rights to all collections, so that no one else can assert rights against him. In other words, the librarian may contract with the content owners or right holders and make a proper arrangement of the benefits. For instance, the ACM Digital Library only collects articles subject to its copyright terms (ACM Digital Library, 2009). Nevertheless, because the negotiation process may be costly and direct communication with the numerous authors on the Internet is almost impossible, Internet libraries of this model are all businesses, mainstream publishers or media. For instance, the BBC built a trial site, the BBC Creative Archive, to release more than 500 full TV programs (BBC, 2006).

A similar approach is to focus only on works without copyright protection. For example, Project Gutenberg aims to encourage the creation and distribution of eBooks, mainly works in the public domain (Hart, 2004). That is to say, all collections on this website consist merely of public domain or out-of-copyright works and, as a result, no one can challenge a repository of this kind on copyright grounds.

In fact, present Internet libraries may adopt more than one strategy rather than a pure one. For example, the main materials of Project Gutenberg are in the public domain under US copyright law, while a few materials are subject to authors’ permission¹. Citeseer is another example: it not only employs software robots to collect articles on the Internet but also allows authors to submit their articles to the library (CiteseerX.ist, 2009).

1.2 Research Motivation

As mentioned above, two concerns, how to collect works and how to avoid potential copyright infringement allegations, are very important to the Internet library constructor. The foregoing examples demonstrate several strategies adopted by website constructors with respect to these two concerns. For the first concern, two choices are available to website operators: employing software robots or collecting works manually. As to copyright, one option is to examine the copyright of each work specifically to determine how the work may be used; the other is to rely on copyright exemptions. From the users’ viewpoint, an Internet library is a website, so an Internet library constructor may adopt strategies similar to those of website operators. Therefore, if we take “employing software robots to collect works” and “examining the copyright” as the vertical and horizontal axes, websites can be placed in one of the four quadrants in the following diagram:

Figure 1: Libraries in the four quadrants (axes: Robot / Non-Robot and Exam / Non-exam copyright; quadrant I, ex: IA; quadrant II, ex: Scribd; quadrant III, ex: ACM; quadrant IV)

¹ http://en.wikipedia.org/wiki/Project_Gutenberg

In the first quadrant, a library (Model I Library) relies on the traditional library exemptions to avoid potential copyright infringement allegations. The second (Model II Library) and the third (Model III Library) both depend on licensing from authors, but the Safe Harbor exemption is more important to a Model II Library because libraries of this model, most of the time, do not explicitly verify the correctness of the licenses they obtain and instead disseminate works in good faith. Moreover, libraries that focus merely on public domain materials should be placed in the third quadrant as well.
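The two axes can be read as two yes/no attributes of a library, which makes the four models easy to state precisely. The following Python sketch encodes that reading; the `Library` class, its field names and the function `model_of` are hypothetical names introduced only for illustration, and the mapping follows the descriptions of the models given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    name: str
    uses_robots: bool         # collects works with software robots
    examines_copyright: bool  # explicitly checks the authorization scope of works


def model_of(library: Library) -> str:
    """Map the two attributes onto the four models (quadrants) of this section."""
    if library.uses_robots and not library.examines_copyright:
        return "I"    # robots plus general copyright exemptions
    if not library.uses_robots and not library.examines_copyright:
        return "II"   # contributor uploads plus Safe Harbor
    if not library.uses_robots and library.examines_copyright:
        return "III"  # manual collection with licensing or public domain works
    return "IV"       # robots plus explicit authorization scope


examples = [
    Library("Internet Archive", uses_robots=True, examines_copyright=False),
    Library("Scribd", uses_robots=False, examines_copyright=False),
    Library("ACM Digital Library", uses_robots=False, examines_copyright=True),
]
models = {library.name: model_of(library) for library in examples}
```

With these inputs, `models` maps the Internet Archive to "I", Scribd to "II" and the ACM Digital Library to "III"; the remaining combination, robots together with explicit examination, is quadrant IV.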

These strategies carry different risks of copyright infringement allegations: not surprisingly, libraries of the first and second models face the highest risks, for reasons given in the following sections, whereas the risk of a Model III Library is relatively low. In the real world, however, the lower risk is not free and is in fact relatively expensive: for a library of the third model, the time and money spent on completing negotiations between publishers and authors are significant. By contrast, the cost of confirming copyright authorization scopes in the other two models is relatively low: libraries of the first model pay no attention to it, and libraries of the second model pay almost nothing either, because a Model II library only removes works when it receives notices.

Apart from copyright infringement, another important concern is the way the library builds its collection. As we have seen, libraries of the first model clearly face a higher legal risk than libraries of the other models. In general, the reason for taking such a high risk is that, for the same budget, the total collection of a Model I library is larger than that of the other models, and the size of the collection is usually one of the most critical issues for a library, since it may affect how users regard it. A library of the first model can acquire more works than the others because it employs software robots to collect works on the Internet. Given the huge number of works in libraries of the first model, fair use, or other general copyright exemptions, is the only effective defense that libraries of this model can count on, because the total number of works is too massive for the scope of copyright authorization to be identified explicitly.

On the other hand, libraries of the other two models collect works without any software robot. Since a Model II library depends on the goodwill of contributors or authors, its constructors cannot themselves determine the total size of its collection; as a result, the collection of a Model II library is generally smaller than that of a Model I library.

As to the third model, the collection of a library of this model is relatively limited, because it collects works by hand and human work is generally slower than a software robot that never stops. For example, although the collections of some present libraries, such as the ACM Digital Library (ACM Digital Library, 2009), are relatively large, they are still limited compared with the total number of works on the Internet, because such libraries are constrained by their budgets. A summary of these three models is shown as follows:

Table I. A summary of the four models in risks, costs and amounts of collections

| Model | Copyright Policy | Risk | Cost | Number |
|---|---|---|---|---|
| I | General copyright exemptions: fair use etc. | High | Low | Almost unlimited |
| II | Licensing from authors and the Safe Harbor exemption | Medium | Low | Limited |
| III | Licensing from authors | Low | High | Relatively limited |

Model I and Model II libraries both depend on copyright exemptions; however, the traditional library exemptions do not directly and clearly apply under these circumstances, since their conditions are not satisfied (Bolin, 2006). Furthermore, the great diversity of copyright limitation and exception rules in different national laws increases the risks. For example, the scope of “fair dealing” in the UK is much narrower than that of “fair use” in the US, as the former lacks the general exception of the latter (Cornish et al., 2003a). Moreover, “private use” exceptions are much more common in civil law countries than in common law countries, as civil law countries respect the intellectual creation in the work rather than the benefits of exploiting it. On these two grounds, Internet libraries cannot firmly rely on limitations and exceptions to lawfully access, reproduce or even redistribute works, as the exceptions in individual national legislations are diverse and, under some circumstances, unpredictable. Even if, ignoring these diversities and uncertainties, the copyright exceptions could be applied, ‘fair use’ and similar exceptions inevitably undermine the value of the contents, because future uses are constrained: users of the library cannot be sure of the exact authorization scope of a work.

On the other hand, the simplest way to reduce such high legal risks is to examine the copyright of each work explicitly, for example by obtaining licenses from the authors or by collecting only public domain works. However, a specific analysis of the copyright of a work is very difficult and requires a lot of human and financial resources. As a result, the number of libraries belonging to this model is quite limited.

1.3 Possible Ways to solve the Copyright Problem

Instead of expensive human intervention, there are two other main ways to avoid potential copyright infringement allegations (Lessig, 2006a). The first is the law: for example, a government can grant an entirely new copyright exemption that applies only to Internet libraries, or directly broaden the reach of the fair use exemption. The second is code. In the context of the Internet, code, that is, software or hardware, makes cyberspace what it is and constitutes a set of constraints on how one can behave (Lessig, 2006b). On this ground, designing a new software robot that can precisely identify the authorization scope is a possible way to reduce the risk of copyright infringement allegations.

Both ways could effectively solve the copyright infringement problem. However, a new exemption may inevitably conflict with the present rules of copyright law and is therefore not a proper choice, given its unpredictable consequences. Moreover, a new exemption requires a great deal of research and discussion; in other words, it is very time-consuming. By contrast, a change in the architecture of the Internet may generally be cheaper than granting a new exemption, because producing a new segment of code is much easier.

For all the reasons above, employing software robots to collect works automatically, including copyrighted and out-of-copyright works, while explicitly identifying the authorization scope of the collected works is the best strategy for an Internet library. That is to say, a library of this model, in quadrant IV, could achieve the broadest collection while facing a low risk of copyright infringement.

Nevertheless, this mixed strategy is at present only an ideal: no Internet library so far has launched a software robot able both to collect works automatically and to identify the copyright authorization scope explicitly. In fact, few software robots can differentiate between a copyrighted document and a document that its author has posted for general use; as a result, they simply retrieve all papers via the Internet automatically.

Some technical hurdles impede the advance of the Internet library, especially regarding the ability to identify authorization scope automatically. The first is that the real meaning of such information, especially its legal meaning, is not easy to understand without human intervention. More explicitly, two kinds of difficulty are involved. First, information expressed in natural language cannot be perfectly identified and comprehended by software robots, and their misunderstandings inevitably lead to misjudgment of the copyright authorization scope. Second, vague expressions can also result in misunderstandings. For example, a common phrase such as “Under Copyright Law Protection”, which does not indicate under which nation’s copyright laws, may mislead software robots and create ambiguity.

The second difficulty is that, even if software robots could understand the meaning of authorization information correctly, the exact location of the authorization information for a particular work is not easy to determine. For example, in SourceForge, all programs are under the same GPL license, which is expressed in the “Term of Use” section of the website, as shown in Figure 2.

Figure 2: A snapshot of SourceForge’s “Terms and Conditions of Use”

On the other hand, every document in Scribd is licensed under the same Creative Commons license, as shown in Figure 3. However, as these two figures illustrate, the locations of the authorization information differ: one is on another page and the other is on the same page.

Figure 3: A snapshot of a document in Scribd³
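The location problem can be made concrete with a small Python sketch. It assumes, purely for illustration, that a page marks its license with a link carrying `rel="license"`, a convention commonly used for marking Creative Commons licenses in HTML; it is not claimed that SourceForge or Scribd mark their pages this way, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> and <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_values and href:
            self.licenses.append(href)


def find_license_links(html: str):
    """Return the license URLs declared on the page itself, if any."""
    parser = LicenseLinkParser()
    parser.feed(html)
    parser.close()
    return parser.licenses


# Hypothetical pages: one declares its license in place, the other only links to terms.
same_page = '<p>Some work</p><a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>'
other_page = '<p>Some program</p><a href="/terms">Terms of Use</a>'
```

For `same_page` the robot finds the license URL directly, whereas for `other_page` it finds nothing, even though the linked terms may well grant a license. A robot that only inspects the page holding the work therefore misses authorization information stored elsewhere.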

To solve the two difficulties above, the first suggestion is to build a much more complex software robot: a robot with strong artificial intelligence and high-level information retrieval technology that can find out which piece of information is the relevant one and comprehend the legal meaning of information in natural language. Nevertheless, the technologies in these two areas, artificial intelligence and information retrieval, are very complex and, in fact, a software robot with such abilities does not yet exist.

Therefore, the next suggestion, and a more practical measure, is to offer the authors of works a mechanism that robots can easily understand and that can properly express the copyright authorization scope. More explicitly, a mechanism that fulfills two minimum requirements could be used in such circumstances: first, the mechanism should be fully identifiable by software robots; second, it should be flexible enough to express the copyright authorization scope of works, whatever their type.

³ http://www.scribd.com/doc/3497454/GPL-

Furthermore, we hope to construct a library that not only acquires its collection with software robots but also focuses on free and open works. The reason is that a library of free and open works can effectively encourage the exchange of works on the Internet and, as a result, stimulate the development and preservation of culture. We hope the mechanisms proposed in this thesis will help achieve this goal.

Based on the foregoing discussion, a fixed-term expression is a better proposal than natural language. Moreover, the popularity of a fixed-term expression is very important: search engines, the most common users of robots, support only a few popular fixed-term expressions, and this fact ultimately decides how many users the proposed expression methods will have. In other words, a well-designed but unpopular fixed-term expression is nothing but an unrealistic idea.

In the present Internet world, there are two popular fixed-term expressions: the Creative Commons license (hereafter, the CC license) and Robots.txt together with the Robots Meta tags. Both mechanisms are designed for software robots; that is to say, any further modification of them could easily be understood by software robots. More importantly, both approaches are supported by the robots of popular search engines, such as Google (Google, 2008b), Yahoo (Yahoo, 2008b) and MSN (MSN, 2008a). However, both schemes have drawbacks when it comes to expressing the copyright authorization scope. Although the CC license covers several common copyright authorization choices, it still has some disadvantages and needs further modification, especially for works in certain digital forms. Robots.txt and the Robots Meta tags, on the other hand, are purposely designed for software robots and very easy to use, but they do not focus on expressing explicit copyright authorization. As a result, both candidates need some modification.

Furthermore, with respect to licensing on the Internet, two kinds of people need expressions of copyright authorization scope. The first, not surprisingly, is the author of a work. In addition to licensing individual works differently, an author on the Internet may need to license all works in one website or Web page under the same conditions. For example, in Scribd, all works are licensed under the same CC license, as shown in Figure 3. Therefore, the second kind of people who need expressions of copyright authorization are the webmasters who operate websites or the owners who manage Web pages. In general, a site and all the pages residing in it may be owned by the same person; therefore, we use the term “webmaster” to represent the people who need expressions for licensing all works identically.

This thesis is structured as follows. We first review some primary concepts, such as digital libraries, software robots, Internet copyright issues and related terminology. In the next sections, concerning the two kinds of persons mentioned above, we first turn to the webmasters who authorize all works in the same page. Robots.txt and the Robots Meta tags, as well as the CC license, can be used to license works in the same Web page. Nevertheless, Robots.txt and the Robots Meta tags need a minor amendment to express the copyright authorization scope fully, whereas the CC licensing scheme can be used to license works not only identically but also individually. A new revision of the original CC license is then proposed, which reduces the disadvantages of the original CC license in licensing each particular work. Next, we compare the foregoing revision and amendments, and finally we discuss some unsolved problems and suggest additional issues that invite future research.
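To show what a fixed-term expression looks like from the robot’s side, the following Python sketch checks a Robots.txt file with the standard-library `urllib.robotparser` module. The file content, the robot name `LibraryBot` and the URLs are hypothetical and serve only as an illustration; the sketch covers the existing access-control use of Robots.txt, not the copyright extensions proposed later in this thesis.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = may_fetch(ROBOTS_TXT, "LibraryBot", "http://example.org/papers/a.pdf")
blocked = may_fetch(ROBOTS_TXT, "LibraryBot", "http://example.org/private/a.pdf")
```

Because the vocabulary is fixed (`User-agent`, `Disallow`), the robot needs no natural-language understanding to decide. The same fixed vocabulary, however, expresses only whether a path may be visited, which is why additional tags are needed before it can convey what may be done with the retrieved works.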


2. Literature Review and Terminologies Overview

2.1 Internet Library

The concept or definition of a digital library varies with the perspective taken. In terms of the technology adopted, a digital library may be defined as follows:

Digital Libraries basically store materials in electronic format and manipulate large collections of those materials effectively. Research into digital libraries is research into network information systems, concentrating on how to develop the necessary infrastructure to effectively mass-manipulate the information on the Net. (National Science Foundation, 1999)

This definition was criticized for placing its weight on merely technical aspects (Seadle et al., 2007). On the other hand, with regard to the importance of the organization underlying the collections and the computer systems in which the collections reside, a digital library could be defined as follows:

Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities. (Digital Library Federation, 1998)

Based on this definition, a digital library and a digital archive differ in the nature of the works collected and in their preservation functions (Digital Library Federation, 1998).

On the other hand, the Internet Archive was created as a repository of websites. Several aggregating projects, including Google, MSN, Yahoo, the Internet Archive (Internet Archive, 2009) and several foreign national libraries, have regularly taken snapshots of subsets of the Internet. To allow access when the original page is temporarily inaccessible, or to allow viewers to compare changes made to pages during a specific period, some commercial search engines, such as Google, MSN and Yahoo, always include in their search results a link to their own cached copy, a temporary repository containing the source code of the indexed websites (Field, 2006).

From a librarian’s professional perspective, the features of an Internet library are not limited to digital content and access via the Internet; other facilities, such as online assistance and comprehensive online references, are essential as well (Jones, 2001). On this ground, an Internet archive does not qualify as a digital library. However, the introduction of new technologies, such as search engines and new search algorithms, blurs the line between them to some degree; that is why these two terms may appear as fully equivalent in some cases, such as “The Internet Archive was founded in 1996 to build an ‘Internet library’ that will offer permanent access for researchers and scholars to historical collections that exist in digital format” (Feldman, 2004). In fact, in terms of digitized content and easy accessibility via the Internet, an Internet archive and an Internet library can generally be considered the same thing. In addition, as shown in the following sections, both Internet libraries and Internet archives face the same legal menace, copyright infringement, and the resolution of this threat is identical for both. Therefore, in this thesis it is not necessary to distinguish between the ‘Internet library’ and the ‘Internet archive’; we use ‘Internet library’ to represent them both, and a further explanation will be given in special circumstances.

2.2 Internet Copyright

2.2.1 The diversities and harmonization of international copyright law

In order to design a comprehensive copyright authorization scheme for software robots, the most important and fundamental task is to study the essential components of copyright in the Internet context, especially with regard to access by software robots. On the Internet, the access of software robots is boundless; that is to say, most accesses may cross national borders. From this standpoint, the authorization of copyright inevitably involves the copyright legislation of more than one country. Therefore, the “copyright law” we have to study here is not limited to the national legislation of any specific country; rather, it is international copyright law.

The first critical fact to notice is that copyright legislation differs between nations, as in many other fields of law. The basic ideas, philosophy and principles behind the same term “copyright” are quite different in different countries. In terms of the basic ideas behind copyright, worldwide copyright legislation can generally be classified into two separate systems: the author’s right system and the copyright system. The civil law countries, such as France and Germany, consider that the author’s personality expressed in the work constitutes the basic interest which should be respected and protected. The common law countries, such as the UK, focus instead on the interest in economic exploitation of the work, rather than on the personality of the author. From these separate basic ideas follow several differences which are important in the context of the Internet. For example, both systems require that a work be “original”; however, the criteria for originality differ, at least in theory. In author’s right countries, out of respect for personality, a copyrighted work should represent the creativity or intelligence of the author. In copyright countries, however, the traditional standard of “originality” requires only a sufficient “investment of money, time, and labor”, regardless of creativity or intelligence (Sterling, J.A.L., 2003c).

With increasing international exchange, and especially with the rapid growth of the Internet, the diversity of international copyright legislation has gradually come to be seen as a hurdle that may impede the information society. As a result, many international treaties, such as the Berne Convention, the WIPO Copyright Treaty and the WIPO Performances and Phonograms Treaty, have appeared to harmonize and reduce the differences between the copyright legislation of different countries. The basic infrastructure constructed by those international treaties provides a sound basic scheme which we can use to analyze and discuss the substantial contents of copyright in relation to software robots’ access and authorization.

2.2.2 The rights within the term “copyrights”

“Copyright” is not a single right; instead, it can be seen as a set of rights which, according to Sterling, can generally be classified under the headings of “moral rights” and “economic rights” (Sterling, 2003d). In general, moral rights are those which relate to the protection of the personality of the author as expressed in his or her creations (Cornish et al., 2003b). Economic rights, on the other hand, are those concerning control over the commercial or industrial exploitation of works, and other means of use of the works which involve such acts as reproduction or representation, but do not of themselves necessarily involve prejudice to the reputation of the author or the integrity of the work (Sterling, 2003d). In the Internet environment, the main rights which may be infringed are moral rights and related economic rights, including rights concerned with reproduction and adaptation and rights concerned with communication to the public (Sterling, 2003e).

Note that economic rights are not the only rights that matter in copyright infringement on the Internet; with the appearance of information aggregation services, moral rights gradually play a more significant role. The most noteworthy recent case is a Belgian case, Copiepresse v Google (Copiepresse, 2007). In this case the plaintiff, Copiepresse, represents several French-language Belgian newspapers and asserts that one of the services provided by Google, Google News, infringes the copyright of those newspapers. The software robots of Google News retrieved the titles of those papers, revised the titles and published them on the Google News website without the writers’ consent. Based on this fact, in addition to alleging infringement of economic rights, the plaintiff claimed that the paternity right and the integrity right were infringed as well. The court of first instance upheld the allegation about moral rights; however, the case is still on appeal (Copiepresse, 2009).


To sum up, although copyright legislation in different countries is quite diverse, in light of the international and regional conventions and treaties we can still draw a basic scope of the economic and moral rights, which can be seen as the essentials of copyright. Firstly, the adaptation/modification right and the reproduction right give rise to little controversy, even with regard to “transient copying”. The legal meaning of the distribution right, on the other hand, differs between the US and other countries. However, we can generally use the term “distribution/communication right”, which combines the “communication right” defined in the WIPO treaties and the “distribution right” in the US, to represent the right of authors to control the dissemination of their works on the Internet. Secondly, with regard to moral rights, the paternity right and the integrity right are the two commonly recognized moral rights, while the other three, the divulgation right, the retraction right and the deconstruction right, are only partly recognized. However, because moral rights are inalienable, an authorization scheme does not need to cover them.

2.3 The Software Robots

A software robot, also called a spider, crawler, Web robot, Web agent, Webbot, wanderer or worm, can be defined as a software program issued by its user that traverses the Web to collect data in compliance with the standard HTTP protocol (Cheong, 1996). At the beginning of the process, a software robot follows the initial URLs provided by its user to retrieve the corresponding Web pages. After parsing these collected pages, the robot obtains more URLs and can consequently access more pages. By repeating this process over and over, a software robot will, theoretically, find most of the pages on the Web. Software robots have been shown to be useful in various Web applications. There are four main areas where robots have been widely used (Chau et al., 2003). The first is “Building collections”: software robots have been extensively used to access and collect data from websites in order to create an index for application programs, such as search engines. The second use is “Archiving”: a few projects, like Citeseer (Citeseer, 1997), have tried to archive academic papers on computer science from across the whole Web. The third is “Personal search”: a personal robot tries to search for websites of interest to a particular user. The final use is “Web statistics”: the large number of pages collected by robots is often used to provide useful, interesting statistics about the Web, including the total number of distinct websites on the Web (Netcraft, 2008), the average size of an HTML document, etc. The complete process of how a robot collects data from the Internet is shown in the following diagram:

Figure 4: The process of how a software robot works

In this diagram, the first step involved is “accessing”, where the robot users use their robots to collect data. Step two is “processing”, where the robot offers the collected data for further processing, such as indexing, analysis, etc. As well as these two steps, some robot users, such as search engines or online archives, may provide the processed data to other online viewers in a last “distributing” step, but the last step is optional.
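The accessing and processing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FAKE_WEB` dictionary is a hypothetical stand-in for real HTTP requests, and the names `LinkExtractor`, `crawl` and `page_sizes` are invented for this sketch rather than taken from any robot discussed in this thesis.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "Web": URL -> HTML source (replaces real HTTP fetches).
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": '<a href="http://d.example/">d</a>',
    "http://d.example/": "no links here",
}


def crawl(seed_urls, fetch):
    """Step 1 (accessing): visit pages breadth-first, starting from the seed URLs."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    collected = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        collected[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return collected


def page_sizes(collected):
    """Step 2 (processing): a toy statistic, the size of each collected page."""
    return {url: len(html) for url, html in collected.items()}


if __name__ == "__main__":
    pages = crawl(["http://a.example/"], FAKE_WEB.get)
    print(page_sizes(pages))
```

The optional distributing step is left out; in a real robot it would publish the processed data, which is exactly where the authorization questions studied in this thesis arise.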


2.4 The DRM and Other Related Measures to Control Copyright

Digital rights management (DRM hereafter) refers to technologies employed by right owners and device manufacturers to control, restrict or manage the use of works4. Although the rights concerned are not limited to copyright, the control and management of copyright are the main parts of DRM. Some opponents allege that the use of the word "rights" is misleading and suggest that people should use the term Digital Restrictions Management to show its essential features (Free Software Foundation, 2006).

With regard to the components of DRM, the authoring of policy expressions is one of the key components and, in fact, a main challenge in implementing DRM (LaMacchia, 2002). As a result, tools which can explicitly express the scope of rights granted to users are essential to the implementation of DRM. Among the Rights Expression Languages (REL), ODRL (Open Digital Rights Language) is an XML-based standard REL that can be adopted to describe the rights granted to the user (ODRL Initiative, 2009). However, although ODRL can be used to express the CC license as well (ODRL Initiative, 2005), it still belongs to the DRM family; the “open” here only means that it is an “open” standard or an “open source” project, not that the works licensed by it are open.

In addition to ODRL, a similar tool available to authors or publishers to control and manage the use of works is the Digital Object Identifier (DOI hereafter). The International DOI Foundation (IDF) defines DOI as "a digital identifier for any object of intellectual property"; further, it explains that the DOI is used for "persistently identifying a piece of intellectual property on a digital network and associating it with related current data in a structured extensible way" (International DOI Foundation, 2008). Though DOI can also be used to assist authors or publishers in enforcing their copyrights in their works (Rosenblatt, 1997), we have to notice that getting a new DOI is not “free”; an administrative fee is paid by the agency to the IDF for each allocation. As a result, it is not a proper tool for open works.

4


2.5 Open Content and Free Culture

Rather than control and management, some people believe that the free use and exchange of works can actually stimulate and encourage more cultural development. People who hold this belief are strongly opposed to DRM, because the control and management imposed by DRM contradict the basic principle of “open content” and “free culture”. However, the ideal of “open content” or “free culture” promoted by these groups is only a vague concept; several practical varieties are derived from this basic principle.

Free Software is one of the earliest such movements, and it has influenced not only free culture but also the development of the software industry. Basically, free software shall grant users the freedom to run, copy, distribute, study, change and improve the software (Free Software Foundation, 1999). To embody the concept of Free Software, several licenses have been introduced. The most widespread one may be the GNU General Public License (GPL hereafter), which allows users to run, copy and distribute the software, but requires users to license their modifications subject to the same conditions (Free Software Foundation, 2007). In addition to the GPL, the Berkeley Software Distribution License, which grants users almost every right, is another popular free software license (Open Source Initiative, 2006).

Even though the free software movement evolves and grows rapidly, some difficulties still impede its further development. The most obvious one is that all the licenses are tied to software; other kinds of works, such as images and audio works, are not covered by any free software license. Moreover, the diversity of copyright laws leads to uncertainty about the real legal meaning of terms in copyright laws. For example, the “freedom of distribution” in the GPL, mainly based on US copyright law, needs some explanation when applied in other jurisdictions. Furthermore, the free software licenses basically ask authors to give up their copyrights, and such inflexibility affects their popularity to some extent. On these grounds, Lawrence Lessig, a law professor at Stanford University, designed and promoted a new licensing scheme, Creative Commons, which allows and encourages authors to grant several baseline rights to others. The details of this licensing scheme are given in the next section.

With regard to scholarly works, Open Access is another branch based on the above “free and open” ideal. There are a variety of definitions of "open access"; in fact, this concept is still evolving with the development of the Internet and free culture. However, the following definition, based on the "Budapest Open Access Initiative" (BOAI), is the most influential one to this day (Budapest Open Access Initiative, 2002):

The literature that should be freely accessible online is that which scholars give to the world without expectation of payment. Primarily, this category encompasses their peer-reviewed journal articles, but it also includes any unreviewed preprints that they might wish to put online for comment or to alert colleagues to important research findings. There are many degrees and kinds of wider and easier access to this literature. By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited. . . .

Several key points of Open Access can be derived from this popular definition. The first is that the literature should be freely available. The second is that users should access the works via the Internet; that is to say, all works should be digitized. The third is that all works should be for academic use only. The last concerns copyright law: works subject to Open Access are still under copyright protection. Users are fully permitted to freely copy and distribute the works, subject to the requirements of proper attribution of the author and assurance of the integrity of the work (Bailey, 2006). On the other hand, the Open Content Alliance is a consortium of nonprofit organizations focused on digitizing works without copyright protection and permitting users to freely access the digital content via the Internet (Open Content Alliance, 2009; O'Leary, 2009).


3. Expressions for Licensing All Works in a Website

In this section, we will introduce two schemes, the CC license as well as the Robots.txt and Robots Meta tags, which can be used by webmasters to indicate the copyright authorization scope of the works in a website. Furthermore, some amendments will also be introduced to reduce the disadvantages of the Robots.txt and Robots Meta tags in terms of expressing copyright licensing.

3.1 Creative Commons License Framework

3.1.1 The basic of the CC license

The CC license is a license for granting some or all of the authors’ rights to the public. It is not limited to software or documents; it is designed for a broad range of content, including but not limited to documents, animation files, and other types of information objects. The CC license is now popular on the Web, and the number of documents licensed under it, known as CC licensed documents, has been increasing in recent years. One significant boost to CC licensing is Google’s and Yahoo’s support for searching only CC licensed documents (Google, 2007c; Yahoo, 2007). Since these two systems combined process nearly 80 percent of English language queries worldwide, their support has been a positive step forward for the CC license (ClikZ Network, 2007).

Creative Commons (CC) is the organization that designed the CC license5. It gives authors a way to grant some or all of their copyrights to the public. The first CC licenses appeared in December 2002. The guiding principle of the CC license is to complement copyright law rather than compete with it (Lessig, 2004).

The present CC license can be used for a wide variety of works, including audio, video, images, and texts. There are three ways to express a CC license. The first is called the “Commons Deed”, which is a set of basic, human-readable, plain-language icons that state what a user may do with the content. The second is called the “Legal Code”, which is an authoritative document with formal and explicit legal terms; the “Legal Code” always draws up the clear scope of licensing for the work. The third is the “Digital Code”, which consists of lines of machine-readable metadata or a “digital signature” of the license. A software robot can process these metadata and tag the document as governed by the CC license. The key point is that an author may use one of these ways, or mix and match them to suit the author’s needs. Table II shows an example of the CC license of a document in all three ways.

5

Table II. An example of the three formats of CC license

Commons Deed6 Legal Code7 Digital Code8

With respect to the scope of copyright authorization, simply speaking, the CC license has four options: Attribution (by)9, No Derivative Works (nd)10, Share Alike (sa)11 and No Commercial Use (nc)12. The characteristics and meanings of these four options are shown in the following table:

Table III: Four options in CC license13

Options | Abbreviation | Characteristics and Meanings
Attribution | by | The licensee must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
No Derivative Works | nd | The licensee may not alter, transform, or build upon this work.
Share Alike | sa | If the licensee alters, transforms, or builds upon this work, he/she may distribute the resulting work only under the same, similar or a compatible license.
No Commercial Use | nc | The licensee may not use this work for commercial purposes.

6 http://creativecommons.org/licenses/by-sa/3.0/us/
7 http://creativecommons.org/licenses/by-sa/3.0/us/legalcode
8 http://creativecommons.org/license/work-html-popup?license_code=by-nc
9 http://creativecommons.org/licenses/by/3.0/
10 http://creativecommons.org/licenses/by-nd/3.0/
11 http://creativecommons.org/licenses/by-sa/3.0/
12 http://creativecommons.org/licenses/by-nc/3.0
13 http://creativecommons.org/about/licenses

These four options can be combined to form the six available licenses shown in the following table14:

Table IV: Six different choices of the CC license15

CC licenses | Abbreviation
Attribution | by
Attribution, No Derivative Works | by-nd
Attribution, No Commercial Use, No Derivative Works | by-nc-nd
Attribution, No Commercial Use | by-nc
Attribution, Share Alike | by-sa
Attribution, No Commercial Use, Share Alike | by-nc-sa

14 http://creativecommons.org/licenses/

15


To explore the legal meanings of the four options: the first one, Attribution (by), simply re-emphasizes the importance and the inalienable character of the author’s moral rights. The second and third options, No Derivative Works (nd) and Share Alike (sa), both relate to the modification right: the former prohibits any modification, while the latter permits further modification under certain conditions. The last option, No Commercial Use (nc), only indicates one critical licensing condition and is not tied to any specific right within copyright. In light of these legal meanings, apart from the moral rights and the modification right, activities involving the other two major economic rights, the reproduction right and the distribution/communication right, are subject to the conditions set by the four basic options. For example, under the “Attribution, No Commercial Use” license, the licensee may not reproduce or disseminate the works for commercial purposes16.
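The way the four options combine into six licenses, and what each combination means for a licensee, can be sketched as follows. The combination rule used here (Attribution always present, No Commercial Use optional, at most one of nd and sa) is an assumption inferred from Table IV, and the function names and permission keys are hypothetical labels chosen for this illustration, not legal terms.

```python
from itertools import product


def cc_license_codes():
    """Enumerate the license abbreviations that the four options can form.

    Assumed rule: "by" is always present, "nc" is optional, and "nd" and
    "sa" exclude each other (one forbids modification, the other allows it).
    """
    codes = []
    for nc, derivative in product(("", "nc"), ("", "nd", "sa")):
        parts = ["by"] + [p for p in (nc, derivative) if p]
        codes.append("-".join(parts))
    return codes


def permissions(code):
    """Translate a license abbreviation into simple yes/no conditions."""
    options = set(code.split("-"))
    return {
        "attribution_required": "by" in options,
        "commercial_use": "nc" not in options,
        "modification": "nd" not in options,
        "share_alike_required": "sa" in options,
    }


if __name__ == "__main__":
    for code in cc_license_codes():
        print(code, permissions(code))
```

A robot that has recognized the abbreviation of a license could use such a mapping to decide, for instance, whether a collected work may be redistributed by a commercial service.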

The CC license is considered much easier to use and understand than other licenses, such as the GPL (Lin et al., 2006). In addition, the CC license’s official Web site provides an online “wizard” to help authors choose the most appropriate license: the author answers three questions about the rights he or she wants to grant17.

3.1.2 How to use the CC license in different countries

Based on the above discussion and analysis of copyright legislation in different countries, it is easy to see that the philosophies, structures and scopes are quite diverse; moreover, the same term in two different legislations, such as “distribution right”, may carry different meanings. In the context of the Internet, such differences give rise to difficulties in the exercise of copyrights, including both economic rights and moral rights, especially when some rights are recognized in some countries and not in others. The introduction of international conventions may lessen such inconsistency; however, the guidelines proposed in the international conventions are quite limited. From this perspective, the designer of a copyright authorization scheme for software robots has two options:

16 http://creativecommons.org/licenses/by-nc/2.0/uk/legalcode

17

In light of the previous section, the rights relevant to licensing, such as the modification, distribution and reproduction rights, have almost the same legal meanings in different countries. On this ground, the first option is to deal only with the minimum rights provided for in the international conventions. This approach may meet the basic requirements of a copyright authorization scheme, but cannot satisfy the needs of some complicated situations. Another main drawback of this approach is that legal interpretation is still inevitable when cross-border conflicts appear.

The other approach gives up providing a single uniform tool and instead tries to provide a distinct license for each jurisdiction. The CC license adopts this second approach. In fact, it uses different legal terms in different countries, porting the six basic licenses to accommodate local copyright and private law. For example, the legal codes of the same Attribution license are quite different in Hong Kong and England, as shown in Table V. To sum up, by using different legal codes to explain the real licensing scope in substance, the CC license framework can generally be said to provide a set of relatively good tools for fully expressing diverse copyright authorization scopes.

Table V: Different legal code of the same Attribution license in two jurisdictions

Legal Code in Hong Kong18 | Legal Code in England19
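The per-jurisdiction porting can be seen in the addresses of the legal codes themselves. The sketch below builds such an address from a license abbreviation, a version and a jurisdiction code. The URL pattern is an assumption read off the footnoted addresses in this chapter (for example the Hong Kong legal code of the Attribution license), and `legal_code_url` is a hypothetical helper, not part of any CC tool.

```python
def legal_code_url(license_code, version, jurisdiction):
    """Build the address of a ported legal code.

    Assumed pattern, inferred from the URLs cited in the footnotes:
    http://creativecommons.org/licenses/<code>/<version>/<jurisdiction>/legalcode
    """
    base = "http://creativecommons.org/licenses"
    return "/".join([base, license_code, version, jurisdiction, "legalcode"])


if __name__ == "__main__":
    # Same basic license, different jurisdiction codes, different legal texts.
    print(legal_code_url("by", "3.0", "hk"))
    print(legal_code_url("by-nc", "2.0", "uk"))
```

The point of the sketch is only that the abbreviation stays the same across jurisdictions while the jurisdiction segment selects a different legal text.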

18 http://creativecommons.org/licenses/by/3.0/hk/legalcode

19


3.1.3 How to license and mark works with CC license

After choosing one of the six different CC licenses, the next and most technical step is to adopt an appropriate way to mark the work, so that others understand which license has been chosen and what the scope of authorization is. The marking methods vary with the type of work.

The most common way is to place a CC marker, a line or graphic stating the CC license, on the work or somewhere near the work, for example embedded in a website to indicate that all works in the website are CC licensed20. An ideal CC marker should contain the Commons Deed and a full URL21. A full URL is necessary because the Deed cannot show the specific jurisdiction of the license. This general method is suitable for almost any type of work, including text, image, audio and video files and even physical media22. An example of a CC marker is shown in the following figure:

20 http://creativecommons.org.tw/static/technology/webpage
21 http://wiki.creativecommons.org/Marking#Crediting_in_Images
22 http://wiki.creativecommons.org/Marking_Audio


Figure 5: A sample image with the CC license marker23

There are also several other ways for different types of works. For example, for audio works, a brief sound clip, or “audio bumper”, at the beginning or end of the work, consisting of the name of the license, the full URL of the license and a copyright notice stating the author’s name, date and copyright information, is also an effective way24. A video bumper is a visual notice, often embedded at the beginning or end of a video work, which includes information similar to that of the audio bumper25. Moreover, for longer plain text works, a full segment of legal code embedded within the work can replace the combination of the Commons Deed and a full URL26.

But we have to keep in mind that there are three ways to state the same CC license: the deed, the legal code and the digital-code. In general, the digital-code of the CC license takes the form of HTML tags embedded in the body of a CC licensed document27. The following example shows a section of the digital-code for the “Attribution-Noncommercial-Share Alike” CC license:

23 http://wiki.creativecommons.org/Marking_Image
24 http://wiki.creativecommons.org/Marking_Audio
25 http://wiki.creativecommons.org/Marking_Vedio
26 http://wiki.creativecommons.org/Marking_Text

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">
<img alt="Creative Commons License" style="border-width:0" src="http://i.creativecommons.org/l/by-nc-sa/3.0/us/88x31.png" />
</a>
<br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License</a>.

In the above code, “by-nc-sa” in the link http://creativecommons.org/licenses/by-nc-sa/3.0/us/ expresses that this Web page is licensed under the CC “Attribution, No Commercial Use, Share Alike” conditions. A webmaster can directly embed this segment of HTML code in his page or website to indicate the authorization scope of the works in the site.

The following figure demonstrates a part of a Web page where the segment of above code is embedded:

Figure 6: A part of a Web page containing identical licensing information

More precisely, the above part of a page can be divided into two parts: the first is the deed icons and the second is a URL which links to the Web page containing all essential information about the specific license, including the meanings of the deed and a link pointing to the legal code. For a robot, the second part is more important, because it contains all the necessary licensing information. We will explain this point in section 4.4.2.
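How a robot might read the licensing information from the URL part of the digital-code can be sketched with the Python standard library. This is a minimal illustration, assuming that the license link carries rel="license" and that its path has the shape licenses/<options>/<version>/<jurisdiction>, as in the sample digital-code of this section; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LicenseLinkFinder(HTMLParser):
    """Collects the href of every <a rel="license"> link in a page."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").split()
        href = attrs.get("href")
        if tag == "a" and "license" in rel_values and href:
            if href not in self.license_urls:  # the sample repeats the link
                self.license_urls.append(href)


def describe_license(url):
    """Split a CC license URL into options, version and jurisdiction."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if not parsed.netloc.endswith("creativecommons.org"):
        return None
    if len(parts) < 3 or parts[0] != "licenses":
        return None
    return {
        "options": parts[1].split("-"),
        "version": parts[2],
        "jurisdiction": parts[3] if len(parts) > 3 else None,
    }


def find_cc_licenses(html):
    """Return a description of every CC license declared in the page."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    found = (describe_license(url) for url in finder.license_urls)
    return [item for item in found if item is not None]


SAMPLE = (
    '<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">'
    '<img alt="Creative Commons License" /></a>'
    '<br />This work is licensed under a <a rel="license" '
    'href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">CC License</a>.'
)

if __name__ == "__main__":
    print(find_cc_licenses(SAMPLE))
```

Note that the sketch relies only on the URL and never on the deed icon, which matches the observation that the second part is the one that matters to a robot.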

27 http://wiki.creativecommons.org/Frequently_Asked_Questions#What_is_the_Commons_Deed.3F_What_is_the_legal_code.3F_What_does_the_html.2Fmetadata_do.3F


However, the syntax of the CC digital-code is too complex for most people to write by hand. Judging from the sample above, non-programmers will be baffled by the syntax of the snippet; it is meant for an indexing subsystem, not a human. CC’s designers are aware of this issue, and the CC license’s Web site provides a user-friendly tool that generates the needed digital-code. Once the code has been produced, the author only needs to cut and paste it into his or her files, although some authors may still be uncomfortable with this extra step.

The tool offered by the CC website to generate the digital-code is shown below:


Figure 7: A tool offered by the CC website to generate digital-code28

As shown in the above figure, a page owner has to answer three questions: the first two are about the authorization conditions and the third is about the jurisdiction. The following figure is the resulting page, where the page owner can find the generated digital-code and further instructions on how to embed the code in his page.

28


Figure 8: The results of CC digital-code generator29
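The question-and-answer flow of the generator described above can be sketched as follows. This is not the code of the CC website: the exact wording of the two condition questions, the accepted answer values, the default version and the simplified HTML output are all assumptions made for illustration, and `choose_license` and `digital_code` are hypothetical names.

```python
from html import escape


def choose_license(allow_commercial, allow_modifications):
    """Map the two condition answers to a license abbreviation.

    allow_modifications is assumed to be one of "yes", "no" or
    "share-alike" (modifications allowed only under the same license).
    """
    parts = ["by"]
    if not allow_commercial:
        parts.append("nc")
    if allow_modifications == "no":
        parts.append("nd")
    elif allow_modifications == "share-alike":
        parts.append("sa")
    elif allow_modifications != "yes":
        raise ValueError("allow_modifications must be yes, no or share-alike")
    return "-".join(parts)


def digital_code(allow_commercial, allow_modifications, jurisdiction, version="3.0"):
    """Produce a simplified digital-code snippet for the chosen license."""
    code = choose_license(allow_commercial, allow_modifications)
    url = f"http://creativecommons.org/licenses/{code}/{version}/{jurisdiction}/"
    return f'<a rel="license" href="{escape(url)}">Creative Commons License</a>'


if __name__ == "__main__":
    # No commercial use, modifications allowed under share-alike, US jurisdiction.
    print(digital_code(False, "share-alike", "us"))
```

With the three answers used in the example, the generated link points to the same license as the sample digital-code shown earlier in this section.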

3.2 The Robots.txt and Robots Meta tags with regard to Copyright Authorization Expression

The Robots.txt and Robots Meta tags were both proposed in the 1990s. The Robots.txt is also called the “Robots Exclusion Protocol” (Snyder et al., 1998), the “Robot Exclusion Standard” (Koster, 1995) or the “Standard for Robot Exclusion” (Koster, 1994), though it is only a widely accepted convention agreed upon by members of a robot mailing list (Koster, 1994), rather than an official standard with formal recognition (Feigin, 2004). Even so, the most widespread search engines, Google (Google, 2008b), Yahoo (Yahoo, 2008b), and MSN (MSN, 2008a), all support the Robots.txt and Robots Meta tags; moreover, both Yahoo (Yahoo, 2008c) and MSN (MSN, 2008b) have tried to introduce some amendments to them. As far as websites are concerned, research indicates that, in 2001, around 40% of the websites owned by globally high-ranking companies adopted the Robots.txt and Robots Meta tags (Drott, 2002).

3.2.1 Introduction of Robots.txt

The Robots.txt is a file which must reside in the root directory of a website and must be named "robots.txt". A robots.txt file located in a subdirectory or saved under any other name is invalid, as software robots only check for this file in the root (Koster, 1994). The following examples illustrate several common uses of the Robots.txt:

Table VI. Examples of Robots.txt

Example 1:
User-agent: *
Disallow:
Meaning: Allow all robots complete access.

Example 2:
User-agent: *
Disallow: /
Meaning: Exclude all robots from accessing the entire server.

Example 3:
User-agent: lycra
Disallow:

User-agent: *
Disallow: /
Meaning: Allow access only to the robot called “lycra” and exclude all other robots.

Example 4:
User-agent: *
Disallow: /tmp
Disallow: /log
Meaning: Exclude all robots from the /tmp and the /log folders.
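How a software robot interprets such rules can be sketched with Python's standard urllib.robotparser module. The sketch below is illustrative only: the robot name "otherbot", the host example.com and the helper parse_rules are assumptions introduced for the example, and a real robot would download the file from the root of the target site rather than read it from a string. Under this parser, the third example admits the robot called lycra and turns every other robot away.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The four examples of Table VI, written as robots.txt file contents.
EXAMPLE_1 = "User-agent: *\nDisallow:\n"
EXAMPLE_2 = "User-agent: *\nDisallow: /\n"
EXAMPLE_3 = "User-agent: lycra\nDisallow:\n\nUser-agent: *\nDisallow: /\n"
EXAMPLE_4 = "User-agent: *\nDisallow: /tmp\nDisallow: /log\n"

if __name__ == "__main__":
    url = "http://example.com/index.html"  # assumed URL, for illustration

    # Example 1: an empty Disallow value permits everything.
    print(parse_rules(EXAMPLE_1).can_fetch("otherbot", url))

    # Example 2: "Disallow: /" shuts every robot out of the whole server.
    print(parse_rules(EXAMPLE_2).can_fetch("otherbot", url))

    # Example 3: the record naming "lycra" takes precedence for that robot.
    rules = parse_rules(EXAMPLE_3)
    print(rules.can_fetch("lycra", url), rules.can_fetch("otherbot", url))

    # Example 4: only paths beginning with /tmp or /log are excluded.
    rules = parse_rules(EXAMPLE_4)
    print(rules.can_fetch("otherbot", "http://example.com/tmp/a.html"))
    print(rules.can_fetch("otherbot", "http://example.com/docs/a.html"))
```

Note that the rules are matched by path prefix, so "Disallow: /tmp" also covers any path that merely begins with /tmp.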

3.2.2 Introduction of Robots Meta tags

Sometimes page creators do not administer their own websites. For example, a staff member in a university may create his personal webpage on the website of his department. In this circumstance, the webmaster with the authority to access the root directory is someone who works in the university's computer center; the staff member can neither access the root directory nor use the Robots.txt to exclude software robots. This limitation is addressed by the Robots Meta tags, namely the “[No]index” tag and the “[No]follow” tag, which are embedded in the code of the page itself (Koster, 1997). Some examples are as follows:

Table VII. Examples of Robots Meta tags

Example 1:
<Meta Name="MY_ROBOTS" content="noindex">
Meaning: Restrict the software robot called “MY_ROBOTS” from indexing a page.

Example 2:
<Meta Name="ROBOTS" content="noindex">
Meaning: Restrict all robots from indexing a page.

Example 3:
<Meta Name="MY_ROBOTS" content="nofollow">
Meaning: Restrict the software robot called “MY_ROBOTS” from following links on a page.

Example 4:
<Meta Name="ROBOTS" content="noindex,nofollow">
Meaning: Block all robots from both indexing a page and following its links.
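The way a robot might honor these tags can be sketched with Python's standard html.parser module. This is a minimal illustration under stated assumptions, not the behavior of any particular search engine: the class RobotsMetaParser and the function robots_meta_permissions are hypothetical names, and the sketch assumes that a tag applies to a robot when its name attribute is either "ROBOTS" or the robot's own name, compared case-insensitively.

```python
from html.parser import HTMLParser
from typing import Dict


class RobotsMetaParser(HTMLParser):
    """Collects robots directives addressed to one robot (hypothetical class)."""

    def __init__(self, robot_name: str):
        super().__init__()
        self.robot_name = robot_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        name = attr.get("name", "").lower()
        # Assumption: "robots" addresses every robot; otherwise match by name.
        if name in ("robots", self.robot_name):
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_meta_permissions(html: str, robot_name: str) -> Dict[str, bool]:
    """Report whether the named robot may index the page and follow its links."""
    parser = RobotsMetaParser(robot_name)
    parser.feed(html)
    parser.close()
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


if __name__ == "__main__":
    page = '<html><head><Meta Name="ROBOTS" content="noindex,nofollow"></head></html>'
    print(robots_meta_permissions(page, "MY_ROBOTS"))
```

In this sketch a page without any Robots Meta tag yields permission for both indexing and following, which mirrors the default behavior that the tags are designed to override.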

If the page creator does have access to the root directory, he can use a single “Disallow” directive to exclude robots, instead of embedding redundant “Noindex” tags in every page hosted on the same server.

3.2.3 Two functions of Robots.txt and Robots Meta tags

3.2.3.1 The original function: voluntary advice

The original idea of the Robots.txt and Robots Meta tags was to offer the Internet community a common facility, supported by the majority of robot authors, for protecting websites against unwanted access by robots (Koster, 1994). They are not “enforced by anybody and no guarantee that all current and future robots will use them” (Koster, 1994). In other words, under this design concept, the Robots.txt and Robots Meta tags are only a voluntary code; no one will be punished for breaching the access policy.

3.2.3.2 The new function: expressing online copyright authorization

Beyond mere advice, a notable recent US federal case, Field v. Google, Inc. (Field, 2006), has given the Robots.txt and Robots Meta tags a new role. The case concerned Google’s “Cached link”. To allow access when the original page is temporarily inaccessible, or to let viewers compare changes made to a page over a period of time, Google’s search results include a link to its own cached copy of the page, held in a temporary repository that stores the source code of indexed websites (Field, 2006). The plaintiff, Mr. Field, posted 51 copyrighted works on his website and “created a robots.txt file for his site, and set the permissions ... to allow all robots to visit and index all of the pages on the site” (Field, 2006). Although he knew that Robots Meta tags could “instruct Google not to provide Cached link to a given Web page”, he consciously decided not to use them (Field, 2006). As a predictable result, Google’s software robot, GoogleBot (Google, 2008a), routinely crawled the plaintiff’s website, indexed his works and provided the Cached link alongside the search results. Based on these facts, Mr. Field “alleges that Google directly infringed his copyright when a Google user clicked on the Cached link to the Web pages containing Field's copyrighted works and downloaded a copy of those pages from Google's computers” (Field, 2006). Mr. Field took no measures even though he had the opportunity and ability to employ the Robots.txt and Robots Meta tags to exclude software robots or to instruct the search engine not to provide the Cached link. Taking this into account, the federal district court in Nevada held that, since Mr. Field “knows the use” and “encourage it”, he had granted Google an implied license through his conscious silence (Sieman, 2007). As a result, Google did not infringe Mr. Field’s copyright (Field, 2006).

It is worth noting that the court in this case inferred a license from the absence of the Robots Meta tags on the basis of two facts. First, because the plaintiff had actually set up the Robots.txt, Mr. Field clearly had the ability and opportunity to employ the tags to keep Google out. Second, and more importantly, Google stops indexing a website in accordance with the tags employed by its webmaster (Google, 2008a). That is to say, without these two conditions, the mere absence of the tags would not in itself give rise to an implied license. Consistent with this view, in a recent Belgian case, Copiepresse v Google (Copispresse, 2007), the court found that the newspaper publishers' failure to use standard technical exclusion methods such as the Robots.txt and Robots Meta tags did not amount to an implied license (Smith, 2007).

Whether or not the absence of the tags can be seen as an implied license, the foregoing cases support the conclusion that, although the Robots.txt and Robots Meta tags were originally conceived as a code of voluntary advice, they are no longer mere “voluntary recommendations without any enforcement”; they have taken on a new role in the context of law. A webmaster who adopts the Robots.txt or Robots Meta tags to set permissions allowing robots to visit should be regarded as granting a license to those robots. On the other hand, a webmaster who adopts the “Disallow” directive or the “Noindex” tag should be regarded as expressing an explicit wish to exclude robots. In addition, a webmaster who “consciously” does
