libact: Pool-based Active Learning in Python


Yao-Yuan Yang^∗, Shao-Chuan Lee^†, Yu-An Chung^‡, Tung-En Wu^§, Si-An Chen^¶, and Hsuan-Tien Lin^‖

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Abstract

libact is a Python package designed to make active learning easier for general users. The package not only implements several popular active learning strategies, but also features the active-learning-by-learning meta-algorithm, which helps users automatically select the best strategy on the fly. Furthermore, the package provides a unified interface for implementing more strategies, models and application-specific labelers. The package is open-source on GitHub, and can be easily installed from the Python Package Index repository.

1 Introduction

libact is a Python package that provides an easy-to-use environment for solving active learning problems. In recent years, several machine learning packages like scikit-learn [1] have successfully attracted many users by providing unified interfaces for accessing a wide range of machine learning algorithms. To the best of our knowledge, there is yet to be a similar package for active learning in Python. libact follows scikit-learn in designing a framework of unified interfaces for active learning. The design is non-trivial, as active learning problems come with an interactive and more sophisticated learning protocol than traditional unsupervised and supervised learning problems.

In libact, we implement many state-of-the-art active learning algorithms, while leaving room in the interface for extension to other algorithms. Furthermore, we address a common user need of algorithm/parameter selection during active learning by implementing the active-learning-by-learning (ALBL) meta-algorithm [2]. ALBL can smartly validate several different active learning algorithms on the fly, and match the best of those algorithms in performance. The inclusion of ALBL greatly helps users with automatic algorithm/parameter selection.

∗b01902066@ntu.edu.tw

†b01902010@csie.ntu.edu.tw

‡b01902040@csie.ntu.edu.tw

§r00942129@ntu.edu.tw

¶r05922089@ntu.edu.tw

‖htlin@csie.ntu.edu.tw

arXiv:1710.00379v1 [cs.LG] 1 Oct 2017


Concrete applications, such as a recent paper on medical concept recognition [3], have started adopting libact, which is designed to grow continuously with user needs. We provide a mature development environment that includes an issue tracker, continuous integration, unit testing, an automatic code analyzer and a document generator. New developers can easily join the libact project and help bring future research on active learning to more users.

2 Interfaces and Usage

We consider a pool-based active learning problem, which consists of a set of labeled examples, a set of unlabeled examples, a supervised learning model, and a labeling oracle [4]. In each iteration of active learning, the algorithm (also called a query strategy) queries the oracle to label an unlabeled example for the model. The goal is to improve the model rapidly with only a few queries. Based on the components above, we designed the following four interfaces for libact, which allow users to easily try different active learning algorithms, learning models or labeling oracles for their needs.

Dataset. The Dataset class maintains both the labeled and unlabeled examples. The examples are stored in an attribute named data, which is a Python list of (feature, label) tuples. The index of an example in data serves as its identifier. For an unlabeled example, the label field is set to None. After retrieving the label for a certain example from the oracle, the Dataset.update method takes in the example index and its newly retrieved label, and replaces the original entry of (feature, None) with (feature, retrieved label). The Dataset class also maintains a list of callback functions that are triggered each time an example is updated. These callbacks are mainly used by active learning algorithms that need to update their internal state after querying the oracle. Users may also register new callback functions with the Dataset.on_update method if their applications have other needs.
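The storage and callback behavior described above can be illustrated with a minimal sketch. This is a simplified stand-in written for this explanation, not libact's own implementation: the class name MiniDataset, the callback signature (entry_id, label) and the helper unlabeled_ids are assumptions.

```python
class MiniDataset:
    """Toy stand-in for the Dataset interface described above (not libact's code)."""

    def __init__(self, X, y):
        # Store examples as (feature, label) tuples; None marks "unlabeled".
        self.data = list(zip(X, y))
        self._callbacks = []

    def on_update(self, callback):
        # Register a function to be called as callback(entry_id, label).
        self._callbacks.append(callback)

    def update(self, entry_id, label):
        # Replace (feature, None) with (feature, label), then notify listeners.
        feature, _ = self.data[entry_id]
        self.data[entry_id] = (feature, label)
        for callback in self._callbacks:
            callback(entry_id, label)

    def unlabeled_ids(self):
        # Hypothetical helper: indices whose label is still None.
        return [i for i, (_, lbl) in enumerate(self.data) if lbl is None]


dataset = MiniDataset(X=[[0.0], [1.0], [2.0]], y=[0, None, None])
seen = []  # records every update the callback is told about
dataset.on_update(lambda entry_id, label: seen.append((entry_id, label)))
dataset.update(1, 1)
```

Using the list index as the identifier keeps updates cheap and lets a query strategy refer to an example without copying its feature vector.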

QueryStrategy. The QueryStrategy class is the interface for active learning algorithms. Each QueryStrategy object is associated with a Dataset object. When a QueryStrategy object is initialized, it automatically registers its QueryStrategy.update method as a callback function with the associated Dataset, so that it is informed of any Dataset updates. The key method that decides the example to be queried is QueryStrategy.make_query, which returns the identifier of an unlabeled example. By overriding this key method, libact currently implements a diverse spectrum of algorithms as sub-classes of QueryStrategy:

• Binary Classification: Density Weighted Uncertainty Sampling [5], Hinted Sampling with SVM [HintSVM, 6], QBC [7], QUIRE [8], Random Sampling (as a baseline), Uncertainty Sampling [9, with multi-class support], Variance Reduction [10], and ALBL [for algorithm/parameter selection, as discussed in Section 1, 2]

• Multi-class Classification: Active Learning With Cost Embedding [11, with cost-sensitive support], Hierarchical Sampling [12], Expected Error Reduction [4]


• Multi-label Classification: Adaptive Active Learning [13], Binary Minimization [14], Maximal Loss Reduction with Maximal Confidence [15], Multi-label AL With Auxiliary Learner [16]
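To make the role of make_query concrete, here is a minimal sketch of one strategy from the list, uncertainty sampling in its least-confident form. It assumes a plain list of (feature, label) tuples and a hypothetical predict_proba callable rather than libact's actual classes. The names LeastConfidentStrategy and toy_proba are illustrative only.

```python
class LeastConfidentStrategy:
    """Sketch of a query strategy: pick the unlabeled example whose top
    class probability is lowest (least-confident uncertainty sampling)."""

    def __init__(self, data, predict_proba):
        # data: list of (feature, label) tuples, label None if unlabeled.
        # predict_proba: callable mapping a feature to class probabilities.
        self.data = data
        self.predict_proba = predict_proba

    def make_query(self):
        unlabeled = [i for i, (_, lbl) in enumerate(self.data) if lbl is None]
        if not unlabeled:
            raise ValueError("no unlabeled examples left to query")
        # Confidence of an example = probability of its most likely class.
        return min(unlabeled,
                   key=lambda i: max(self.predict_proba(self.data[i][0])))


def toy_proba(feature):
    # Hypothetical 1-D model: P(class 1) rises linearly from 0 to 1 on [0, 4].
    p1 = min(max(feature[0] / 4.0, 0.0), 1.0)
    return [1.0 - p1, p1]


data = [([0.0], 0), ([1.0], None), ([2.1], None), ([4.0], 1)]
strategy = LeastConfidentStrategy(data, toy_proba)
query_id = strategy.make_query()  # index of the example nearest the boundary
```

The method returns an identifier rather than the example itself, matching the convention that the index in data identifies an example.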

Model. The Model class is the interface for supervised learning models. The interface mimics the one in scikit-learn [1] to give easy access to many popular learning models implemented in other Python packages. The Model.train method takes the labeled examples from a Dataset and learns from them; the Model.predict, Model.predict_real and Model.predict_proba methods output the label predictions, the confidence levels, and the probability estimates on some feature vectors, respectively; the SklearnAdapter and SklearnProbaAdapter classes convert scikit-learn models for libact to use.
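The adapter idea can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the code of SklearnAdapter: the class FitPredictAdapter is hypothetical, it accepts a plain list of (feature, label) tuples instead of a Dataset, and MajorityClass is a stand-in estimator so the example runs without scikit-learn installed.

```python
class FitPredictAdapter:
    """Sketch of an adapter exposing train/predict on top of a
    scikit-learn-style estimator (anything with fit and predict)."""

    def __init__(self, estimator):
        self._estimator = estimator

    def train(self, data):
        # Use only the labeled (feature, label) pairs for fitting.
        labeled = [(x, y) for x, y in data if y is not None]
        X = [x for x, _ in labeled]
        y = [lbl for _, lbl in labeled]
        self._estimator.fit(X, y)
        return self

    def predict(self, features):
        return self._estimator.predict(features)


class MajorityClass:
    """Hypothetical stand-in estimator that always predicts the most
    frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


data = [([0.0], 1), ([1.0], 1), ([2.0], 0), ([3.0], None)]
model = FitPredictAdapter(MajorityClass()).train(data)
predictions = model.predict([[5.0], [6.0]])
```

Any object with fit and predict methods, including a scikit-learn classifier, could be passed in place of MajorityClass in this sketch.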

Labeler. The Labeler class represents the oracle in the given active learning problem and can be easily customized to support different applications. A custom Labeler should override the Labeler.label method, which takes in an unlabeled example and returns the retrieved label. As concrete examples, libact includes two labelers, IdealLabeler and InteractiveLabeler. IdealLabeler simulates what research papers do when conducting experiments: using a fully-labeled dataset as the backbone of the oracle. InteractiveLabeler serves as a human-computer interface that shows the feature vector of a given example as an image on the screen and fetches the human-provided label through the command line.
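As a rough illustration of an oracle backed by a fully-labeled dataset, the sketch below looks up the ground-truth label of a queried feature vector. It conveys the idea behind IdealLabeler but is not its implementation. The class name LookupLabeler and the dictionary-based lookup are assumptions.

```python
class LookupLabeler:
    """Sketch of an 'ideal' oracle backed by a fully labeled dataset."""

    def __init__(self, features, labels):
        # Features are converted to tuples so they can serve as dict keys.
        self._answers = {tuple(x): y for x, y in zip(features, labels)}

    def label(self, feature):
        # Return the ground-truth label recorded for this feature vector.
        return self._answers[tuple(feature)]


oracle = LookupLabeler(features=[[0.0, 1.0], [2.0, 3.0]], labels=["neg", "pos"])
answer = oracle.label([2.0, 3.0])
```

Replacing the dictionary lookup with a prompt to a human annotator would turn the same interface into an interactive labeler.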

Usage. With the interfaces of libact, the high-level usage can be sketched in the pseudo code below.¹ The first four lines declare the necessary components of the problem, and can be replaced with concrete implementations of QueryStrategy, Labeler and Model. Line 5 loops over the quota iterations of pool-based active learning, and lines 6 to 9 implement the usual query and update steps.

    1  dataset = Dataset(X, y)
    2  query_strategy = QueryStrategy(dataset)
    3  labeler = Labeler()
    4  model = Model()
    5  for _ in range(quota):
    6      query_id = query_strategy.make_query()
    7      lbl = labeler.label(dataset.data[query_id][0])
    8      dataset.update(query_id, lbl)
    9      model.train(dataset)
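For readers who want to run the loop without installing anything, the following self-contained sketch mirrors the same query, label, update and train steps on a toy one-dimensional pool. It does not use libact: the random-sampling make_query, the list-based oracle y_true and the class-mean threshold model are all assumptions chosen to keep the example short.

```python
import random

# Fully labeled 1-D pool; the "oracle" reveals labels from y_true on request.
X = [[0.5], [1.0], [1.5], [2.5], [3.0], [3.5]]
y_true = [0, 0, 0, 1, 1, 1]

# Start with one labeled example per class; the rest are unlabeled (None).
data = [(x, None) for x in X]
data[0] = (X[0], y_true[0])
data[5] = (X[5], y_true[5])

rng = random.Random(0)
quota = 4


def make_query():
    # Random Sampling baseline: pick any unlabeled index.
    return rng.choice([i for i, (_, lbl) in enumerate(data) if lbl is None])


def train():
    # Toy model: threshold halfway between the two class means.
    mean = {}
    for c in (0, 1):
        values = [x[0] for x, lbl in data if lbl == c]
        mean[c] = sum(values) / len(values)
    return (mean[0] + mean[1]) / 2.0


for _ in range(quota):
    query_id = make_query()                    # ask the strategy
    lbl = y_true[query_id]                     # ask the oracle
    data[query_id] = (data[query_id][0], lbl)  # update the dataset
    threshold = train()                        # retrain the model

predictions = [int(x[0] > threshold) for x in X]
```

With a quota of four and four unlabeled examples, every example ends up labeled, so the final model does not depend on the order in which queries were made.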

The following experiments² demonstrate the usefulness of the ALBL meta-algorithm for algorithm selection on three datasets (heart, australian, diabetes), downloaded from the dataset page of LIBSVM [17]. ALBL is used to choose between Uncertainty Sampling, Random Sampling, QUIRE and HintSVM. We use a linear SVM from scikit-learn as the Model. Figure 1 shows that the error rates achieved by ALBL and by the best strategy are very close to each other on every dataset, which verifies the usefulness of ALBL for strategy selection.
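The general idea of choosing among strategies on the fly can be illustrated with a multi-armed bandit. The sketch below is a generic EXP3-style selector and is explicitly not the ALBL algorithm of [2] or libact's implementation: the class Exp3Selector, the arm names, the parameter gamma and the simulated payoff rates are assumptions made for illustration.

```python
import math
import random


class Exp3Selector:
    """Generic EXP3 bandit over a fixed set of arms (here: strategy names)."""

    def __init__(self, arms, gamma=0.1, seed=0):
        self.arms = list(arms)
        self.gamma = gamma
        self.weights = [1.0] * len(self.arms)
        self._rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.arms)
        # Mix the weight-proportional distribution with uniform exploration.
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def choose(self):
        probs = self.probabilities()
        return self._rng.choices(range(len(self.arms)), weights=probs)[0]

    def update(self, arm, reward):
        # reward in [0, 1]; dividing by the arm's selection probability
        # gives an unbiased estimate of its reward.
        probs = self.probabilities()
        estimate = reward / probs[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / len(self.arms))


selector = Exp3Selector(["uncertainty", "random", "quire"], gamma=0.2)
# Hypothetical rewards: pretend arm 0 helps the model more often than the others.
payoff = [0.9, 0.3, 0.3]
reward_rng = random.Random(1)
for _ in range(500):
    arm = selector.choose()
    reward = 1.0 if reward_rng.random() < payoff[arm] else 0.0
    selector.update(arm, reward)

probs = selector.probabilities()
best = selector.arms[probs.index(max(probs))]
```

In an active learning setting, the reward would be some estimate of how much the queried label improved the model. Here it is simulated, so the sketch only shows how selection probability shifts toward the arm that pays off most often.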

¹ See examples/plot.py in the libact repository for runnable code.

² See examples/albl_plot.py for details.


Figure 1: Comparison between ALBL and other strategies. Panels: (a) heart, (b) australian, (c) diabetes.

3 Package Specialties

The source code of libact is hosted on GitHub (https://github.com/ntucllab/libact) with an issue tracker so that all users and developers can report problems when using this package. The package is publicly available under a looser form of the BSD license, and can be installed from the Python Package Index repository with pip install libact. The dependencies are listed in requirements.txt to facilitate installation, and API references are written in numpy-style docstrings. Documentation is automatically built with Sphinx and Pygments and hosted on http://libact.readthedocs.org/. We have also written unit tests and set up a continuous integration testing environment on Travis CI (https://travis-ci.org/ntucllab/libact). This significantly reduces integration problems and facilitates more rapid project development.

One peer package on active learning is jclal [18], a broad-coverage Java library that also solves general active learning problems. Other current libraries either come with a narrower coverage of active learning algorithms [19] or have not been deeply documented [20]. jclal enjoys the specialty of supporting stream-based active learning, along with user-friendly utilities in its library design. On the other hand, libact enjoys the specialty of supporting algorithm/parameter selection (ALBL) and cost-sensitive active learning, along with broader coverage of active learning algorithms for binary and multiclass classification. Also, libact can be easily integrated with the growing machine learning ecosystem in Python.

4 Conclusion

libact is a Python package designed to make active learning easy. It contains interfaces to integrate with applications and other packages, implementations of diverse active learning strategies to achieve decent performance, an automatic strategy selection routine via the ALBL meta-algorithm to assist users, and infrastructure (an issue tracker, documentation and continuous integration) to keep the package growing.


5 Acknowledgements

This package is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under award number FA2386-15-1-4012, and by the Ministry of Science and Technology of Taiwan under number MOST 103-2221-E-002-149-MY3.

References

[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[2] W.-N. Hsu and H.-T. Lin. Active learning by learning. In AAAI, pages 2659–2665, 2015.

[3] G. Stanovsky, D. Gruhl, and P. Mendes. Recognizing mentions of adverse drug reaction in social media using knowledge-infused recurrent models. In EACL, pages 142–151, 2017.

[4] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.

[5] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In ICML, 2004.

[6] C.-L. Li, C.-S. Ferng, and H.-T. Lin. Active learning using hint information. Neural Computation, 27(8):1738–1765, 2015.

[7] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In COLT, pages 287–294, 1992.

[8] S.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative and representative examples. In NIPS, pages 892–900, 2010.

[9] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In SIGIR, pages 3–12, 1994.

[10] A. I. Schein and L. H. Ungar. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, 2007.

[11] K.-H. Huang and H.-T. Lin. A novel uncertainty sampling algorithm for cost-sensitive multiclass active learning. In ICDM, pages 925–930, 2016.

[12] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In ICML, pages 208–215, 2008.

[13] X. Li and Y. Guo. Active learning with multi-label SVM classification. In IJCAI, 2013.


[14] K. Brinker. On active learning in multi-label classification. In From Data and Information Analysis to Knowledge Engineering, pages 206–213. 2006.

[15] B. Yang, J.-T. Sun, T. Wang, and Z. Chen. Effective multi-label active learning for text classification. In KDD, pages 917–926, 2009.

[16] C.-W. Hung and H.-T. Lin. Multi-label active learning with auxiliary learner. In ACML, pages 315–330, 2011.

[17] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[18] O. Reyes, E. Pérez, M. del Carmen Rodríguez-Hernández, H. Fardoun, and S. Ventura. JCLAL: a Java framework for active learning. Journal of Machine Learning Research, 17(95):1–5, 2016.

[19] J. Ramey. Active learning in R. https://github.com/ramhiser/activelearning, 2015.

[20] D. Santos and A. Carvalho. Comparison of active learning strategies and proposal of a multiclass hypothesis space search. In Hybrid Artificial Intelligence Systems, pages 618–629. 2014.