
Chapter 2 Related Works

2.1 Patent classification

Patent classification schemes assign patent documents to classes. In recent years, a considerable number of such schemes have been proposed (e.g., Kim & Choi, 2007; Kohonen et al., 2000; Lai & Wu, 2005; Larkey, 1999; Richter & MacFarlane, 2005; Cong & Tong, 2008; Cong & Loh, 2010; Trappey et al., 2006). The features extracted from patent documents for classification purposes can be divided into three types: content features, citation information, and metadata. The parts of a patent document are shown in Appendix C.

2.1.1 Content-based patent classification

Since patent classification is formulated as a text categorization problem that involves assigning a patent document to the correct class, most studies consider only patent content information to address the problem (e.g., Loh et al., 2006). In content-based patent classification approaches, the content of a patent document $d_p$ is represented by a vector of term weights, $\vec{d_p} = \langle w_{1p}, \ldots, w_{|T|p} \rangle$, where $T$ is the set of terms. The most popular term weighting function is term frequency / inverse document frequency (tfidf), developed by Salton and Buckley (1988). It is defined as follows:

$$\mathrm{tfidf}(t_k, d_p) = \#(t_k, d_p) \times \log\left(\frac{N}{n_{t_k}}\right), \tag{1}$$

where $\#(t_k, d_p)$ denotes the number of times term $t_k$ occurs in patent document $d_p$ (the term frequency), and $\log(N / n_{t_k})$ is the inverse document frequency, in which $N$ is the total number of patent documents and $n_{t_k}$ is the number of patent documents in which $t_k$ occurs.
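As a rough illustration of the tfidf weighting defined above, the following sketch computes the weights for a tiny corpus. The function name `tfidf_vectors`, the toy terms, and the assumption that documents are already tokenized are all hypothetical choices made for this example; they do not come from the cited studies.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)  # N: total number of documents
    # n_tk: number of documents in which each term occurs at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)  # #(t_k, d_p): raw term counts
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

# Hypothetical toy corpus: three "patents" already split into terms.
toy_corpus = [
    ["laser", "diode", "laser"],
    ["laser", "sensor"],
    ["battery", "sensor"],
]
toy_vectors = tfidf_vectors(toy_corpus)
```

Note that a term occurring in every document receives a weight of zero under this definition, since its inverse document frequency is log(1).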

The similarity of two patent documents is defined as the cosine value (Yang, 1994) of their respective term vectors, as shown in Eq. 2:

$$\mathrm{Sim}(q, p) = \frac{\vec{d_q} \cdot \vec{d_p}}{\lVert \vec{d_q} \rVert \, \lVert \vec{d_p} \rVert}, \tag{2}$$

where $q$ is the query patent document to be classified, and $p$ is a patent document in the training patent dataset.


Based on the similarity of patent documents, the kNN classifier selects the k nearest neighbors of a query patent and predicts its class by majority vote: the class to which most of the neighboring patents belong is chosen as the class of the query patent.
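The cosine similarity and the kNN majority vote described above can be sketched as follows. This is a minimal illustration, assuming that each patent is stored as a sparse `{term: weight}` dictionary; the function names, the toy weights, and the class names "optics" and "energy" are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(vec_q, vec_p):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * vec_p.get(term, 0.0) for term, w in vec_q.items())
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_p = math.sqrt(sum(w * w for w in vec_p.values()))
    if norm_q == 0.0 or norm_p == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_q * norm_p)

def knn_classify(query_vec, training_set, k):
    """training_set is a list of (vector, class_label) pairs."""
    ranked = sorted(
        training_set,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    # Majority vote among the k most similar training patents.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training patents with made-up term weights.
training = [
    ({"laser": 1.0, "diode": 0.5}, "optics"),
    ({"laser": 0.8, "lens": 0.7}, "optics"),
    ({"battery": 1.0, "anode": 0.6}, "energy"),
    ({"battery": 0.9, "sensor": 0.4}, "energy"),
]
query = {"laser": 1.0, "lens": 0.2}
predicted = knn_classify(query, training, k=3)
```

In this sketch, ties in the vote are broken by whichever class is encountered first among the ranked neighbors; the cited studies may handle ties differently.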

Instead of using the full text of a patent document as the basis for classification, some approaches classify patent documents by considering normative sections, such as the abstract, background, and results (Kim & Choi, 2007; Fall, 2003, 2004; Larkey, 1999; Cong & Tong, 2008; Loh et al., 2006; Trappey et al., 2006). Several studies regard the patent document’s abstract as the most informative feature (Larkey, 1999; Liang et al., 2003; Loh et al., 2006).

2.1.2 Citation-based patent classification

In real-world applications, patent documents are linked through citations that imply the connections and relationships between the citer and the cited. Approaches that utilize citations have been proposed (Lai & Wu, 2005; Li et al., 2007). These studies demonstrate that citation-based patent classification performs better than content-based classification. In our work, we also consider the citation relationships between patent documents when constructing the patent ontology network.

2.1.2.1 Co-citation patent classification

The co-citation approach (Lai & Wu, 2005) classifies a query patent according to the majority vote of the classes of its cited patents. For example, suppose a query patent cites five documents in the basic patent set. If three of the cited patents belong to class C1 and the other two belong to class C2, the query patent will be assigned to class C1. Note that the classes used by the co-citation approach are not the well-known UPC (United States Patent Classification) or IPC (International Patent Classification) categories; they are the groups obtained by clustering the basic patents according to the co-citation frequency and linkage strength of each pair.
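The five-citation example above can be reproduced with a short sketch of the voting step. The patent identifiers and the function name are hypothetical, and the clustering that produces the classes in the first place is not shown.

```python
from collections import Counter

def classify_by_cited_patents(cited_ids, class_of):
    """Assign the class held by most cited patents; None if none are labeled."""
    votes = Counter(class_of[pid] for pid in cited_ids if pid in class_of)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical basic patent set: three patents in C1, two in C2.
class_of = {"P1": "C1", "P2": "C1", "P3": "C1", "P4": "C2", "P5": "C2"}
assigned = classify_by_cited_patents(["P1", "P2", "P3", "P4", "P5"], class_of)
```

Here `assigned` is the class C1, matching the example in the text. Cited patents outside the basic patent set are simply ignored, which is an assumption of this sketch.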

2.1.2.2 Citation network patent classification

In Li et al.’s (2007) approach, every patent has its own citation network in which each cited node is labeled with its classification class. A patent’s class is determined by evaluating the similarity between its citation network and those of other patents already classified into UPC categories. The network similarity, or graph similarity, of two patents is calculated by comparing their random walk paths. This approach adopts a three-stage kernel-based technique for patent classification: data acquisition and parsing, kernel construction, and classifier training. Li et al. (2007) use a support vector machine (SVM) as the kernel machine. In their approach, the kernel value, namely the similarity of a patent pair, is calculated as in Eq. 3:

$$K(G_{p_i}, G_{p_j}) = \sum_{h} \sum_{h'} l(h, h')\, P(h \mid G_{p_i})\, P(h' \mid G_{p_j}), \tag{3}$$

where $G_{p_i}$ and $G_{p_j}$ represent the citation networks associated with two patents $p_i$ and $p_j$; $h$ and $h'$ are random walk paths in the respective graphs; and $P(h \mid G_{p_i})$ and $P(h' \mid G_{p_j})$ denote the probabilities of those random walk paths in the citation networks. $l(h, h')$ is defined as follows:

$$l(h, h') = \begin{cases} 1, & \text{if } h \text{ and } h' \text{ are identical} \\ 0, & \text{otherwise} \end{cases}$$
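To make the kernel above more concrete, the following sketch sums the product of path probabilities over pairs of identical random walk paths in two small graphs. Several details are assumptions made only for this illustration and are not taken from Li et al. (2007): walks start at a uniformly chosen node, stop with a fixed probability `stop` at each node (and always at dead ends), move uniformly to a successor otherwise, are truncated at `max_nodes` nodes, and two paths count as identical when their sequences of node labels match.

```python
from collections import defaultdict

def label_path_probabilities(successors, labels, stop=0.5, max_nodes=4):
    """Map each label sequence to the probability that a random walk emits it."""
    probs = defaultdict(float)
    nodes = list(labels)
    # Each frontier entry: (current node, label sequence so far, probability mass)
    frontier = [(n, (labels[n],), 1.0 / len(nodes)) for n in nodes]
    for _ in range(max_nodes):
        next_frontier = []
        for node, seq, mass in frontier:
            nxt = successors.get(node, [])
            p_stop = stop if nxt else 1.0  # dead ends always terminate
            probs[seq] += mass * p_stop
            for m in nxt:
                next_frontier.append(
                    (m, seq + (labels[m],), mass * (1.0 - p_stop) / len(nxt))
                )
        frontier = next_frontier  # walks longer than max_nodes are dropped
    return dict(probs)

def random_walk_kernel(graph_a, graph_b, **kwargs):
    """Sum P(h|G_a) * P(h'|G_b) over pairs of identical label sequences."""
    probs_a = label_path_probabilities(*graph_a, **kwargs)
    probs_b = label_path_probabilities(*graph_b, **kwargs)
    return sum(p * probs_b[seq] for seq, p in probs_a.items() if seq in probs_b)

# Hypothetical two-node citation graph: node "a" (label X) cites "b" (label Y).
graph = ({"a": ["b"]}, {"a": "X", "b": "Y"})
self_similarity = random_walk_kernel(graph, graph)
```

Because of the truncation, the path probabilities in larger graphs need not sum to one, so the value is only an approximation of the full double sum.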

For each class, the SVM classifier generates a classification model. The kernel matrix is an augmented matrix that contains the patent similarity vectors of all patents in the training set and their respective class labels. The class label of each patent indicates whether the patent belongs to a specific class: the label is 1 if the patent belongs to the class, and -1 otherwise.

This is the so-called one-against-rest model, which allows the SVM to handle multiclass problems. For each class, the trained SVM model can be used to predict whether a query patent belongs to that class. The final class is then determined by a winner-takes-all strategy over the SVM models of all classes.
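The labeling and the final decision step of the one-against-rest scheme can be sketched as follows. SVM training itself is omitted; the decision scores below are made-up numbers standing in for the outputs of the per-class models, and both function names are hypothetical.

```python
def one_against_rest_labels(class_labels, target_class):
    """Return +1 for patents in target_class and -1 for all others."""
    return [1 if c == target_class else -1 for c in class_labels]

def winner_takes_all(decision_scores):
    """Pick the class whose binary model gives the highest decision score."""
    return max(decision_scores, key=decision_scores.get)

# Hypothetical training labels for four patents.
training_classes = ["C1", "C2", "C1", "C3"]
labels_for_c1 = one_against_rest_labels(training_classes, "C1")

# Hypothetical decision values of the three per-class models for one query.
scores = {"C1": -0.2, "C2": 0.7, "C3": 0.1}
final_class = winner_takes_all(scores)
```

One binary label vector is built per class, so a problem with three classes trains three models on the same kernel matrix with different labels.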

2.1.3 Metadata-based patent classification

Metadata is defined as “information that describes data”. The metadata in a patent document, such as inventors’ names and assignees’ names, may be correlated with the document’s content and can be used for classification purposes. Richter & MacFarlane (2005) showed that patent classification based on a document’s metadata can improve the accuracy of the results. Their approach uses metadata, such as the inventor’s name, the applicant’s name, and the IPC code, to help classify commercial intellectual property. Because the approach considers text, inventor, and IPC metadata simultaneously, it yields a better classification result. Patent documents are mapped into vectors of terms, inventors’ names, and IPCs. For the text, the weights of terms are calculated by the tfidf approach (Salton & Buckley, 1988); the weight of each inventor is calculated as $\sqrt{1/\#inv}$, where $\#inv$ is the total number of inventors of the patent; and the weight of each IPC code is calculated as $\sqrt{1/(\#ipc + 1)}$, where $\#ipc$ is the number of IPC codes assigned to the patent. Note that the primary IPC is weighted twice as high as the other IPCs assigned to the patent. After compiling the vectors, the similarity between two patent documents can be calculated. The kNN classifier is then used to identify the class of the query patent based on the similarity (cosine value) of patent documents.
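The metadata weights described above might be computed as in the sketch below. It assumes that the inventor and IPC weights are the square roots of 1/#inv and 1/(#ipc + 1) respectively, and that "weighted twice as high" means the primary IPC receives double the base IPC weight. The function names, inventor names, and IPC codes are hypothetical.

```python
import math

def inventor_weights(inventors):
    """Give every inventor of a patent the weight sqrt(1 / #inv)."""
    weight = math.sqrt(1.0 / len(inventors))
    return {name: weight for name in inventors}

def ipc_weights(primary_ipc, other_ipcs):
    """Give each IPC code sqrt(1 / (#ipc + 1)); the primary code gets double."""
    n_ipc = 1 + len(other_ipcs)  # #ipc: all codes assigned to the patent
    base = math.sqrt(1.0 / (n_ipc + 1))
    weights = {code: base for code in other_ipcs}
    weights[primary_ipc] = 2.0 * base  # assumed reading of "twice as high"
    return weights

# Hypothetical patent with four inventors and three IPC codes.
inv = inventor_weights(["Inventor A", "Inventor B", "Inventor C", "Inventor D"])
ipc = ipc_weights("G06F", ["H04L", "G06N"])
```

The resulting dictionaries can be merged with the tfidf term weights into a single sparse vector per patent before the cosine similarity is computed.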

One limitation of the above method is that it only works well when the inventors of a query patent also exist in the training set. The method does not utilize indirect relationships to help classify patents developed by new inventors who are not included in the training set. In contrast, our method constructs a patent ontology network; thus, indirect relationships can be used to classify patent documents more flexibly and accurately.

