
PREPRINT

Preprint, Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/~mathlib/preprint/2011-19.pdf

ROC Representation for the Discriminability of Multi-Classification Markers

Yun-Jhong Wu and Chin-Tsang Chiang

December 30, 2011


Abstract

For a multi-classification problem, this article shows that only the receiver operating characteristic (ROC) representation and its induced measures are well-defined assessments of the discriminability of markers. With the convexity and compactness of the performance set, a parameterization system is further employed to characterize the corresponding optimal ROC manifolds. Furthermore, their connection with the decision space yields computational simplicity for the manifolds, and some practically meaningful geometric features of optimal ROC manifolds are stressed. These results enable us to show that a proper ROC subspace is the sufficient and necessary condition for the existence of the hypervolume under the ROC manifolds (HUM). This work provides working scientists with an extension of ROC analysis to multi-classification tasks in a theoretically sound manner.

Index Terms: Discriminability, Hypervolume, Manifold, Optimal classification, Receiver operating characteristic, Utility.

1 Introduction

Receiver operating characteristic (ROC) curves are a popular measure for assessing the performance of binary classification procedures and have been extended to ROC surfaces for ternary classification and to ROC manifolds for general multi-classification (see [1], [2], and [3]). However, ROC surfaces and manifolds are generally ill-posed for multi-classification tasks because of their loose definitions and concerns about their existence. The analogue of the area under the ROC curve (AUC) for multi-classification, named the hypervolume under the ROC manifolds (HUM), has been mentioned in the literature, but its existence is in doubt. To clarify these problems and demonstrate their applied importance, we propose a theoretical unification of ROC manifolds.

The extension of ROC analysis to multi-classification was initially developed for sequential classification procedures, which are of particular interest for their practical and theoretical simplicity. This algorithm reduces a multi-classification task to a series of binary classifications of the form $G = k$ versus $G \in \{k+1, \ldots, K\}$ in the order $k = 1, \ldots, K$. The first systematic study of ternary ROC can be traced back to [4]. For a univariate marker, the author constructed an ROC manifold for ternary classifications to visualize a space spanned by $\{p_{1\sigma(1)}(\hat{G}, Y), p_{2\sigma(2)}(\hat{G}, Y), p_{3\sigma(3)}(\hat{G}, Y)\}$, where $\sigma$ is a permutation function and $p_{jk}(\hat{G}, Y) := P(\hat{G}(Y) = j \mid G = k)$ with $\hat{G}$ a deterministic sequential classifier and $Y$ a univariate marker, and introduced the HUM (also called the volume under the ROC surface, VUS, in some of the literature on ternary classification). By utilizing a metric between each $G$ and $Y$, [1] developed a classification rule to accommodate a multivariate marker. Although this work provided a perspective for extending ROC analysis to multi-classification problems, the sequential evaluation is of limited applicability and, as we indicate in this article, its non-optimality can lead to a lack of explainability.

Indeed, ROC analysis provides an illuminating insight into assessing the discriminability of markers. For a typical $K$-classification task, a pair of a classifier and a marker $(\hat{G}, \mathbf{Y})$ is usually considered, where $\hat{G}$ is a function mapping a general classification marker $\mathbf{Y}$ to a distribution with support $\{1, \ldots, K\}$. It is rational to adopt the performance probabilities $p_{jk}(\hat{G}, \mathbf{Y})$ to draw a comparison. A performance $\mathbf{p} = (p_{11}, p_{12}, \ldots, p_{1K}, p_{21}, \ldots, p_{KK})^\top$ can be plotted in a general ROC space
$$\mathcal{R} = \Big\{\mathbf{p} \in \bigotimes_{j,k=1}^{K}[0,1]_{jk} : \sum_{j=1}^{K} p_{jk} = 1,\ 0 \le p_{jk} \le 1\Big\},$$

where $\bigotimes$ denotes the Cartesian product operator. The space $\mathcal{R}$ is the smallest space sufficient to represent all possible performances. Since usually not all $p_{jk}$ are simultaneously at issue for working scientists, in this article an index set $S$ of the $p_{jk}$'s of interest is used to indicate an ROC subspace $\mathcal{R}_S$. Similarly, operators or sets subscripted by $S$ are understood to be restricted to $\mathcal{R}_S$. As is well known for binary classification tasks, the performances of a series of classifiers in $\mathcal{R}_{\{p_{11}, p_{12}\}}$, which form (though not necessarily) an ROC curve, can be plotted as a representation of the discriminability of classification procedures. However, this type of representation is not straightforward for arbitrary $K$-classification tasks. The analysis in this work begins with the concept of proper assessments of the discriminability of markers, which naturally brings up the ROC representation.
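For instance, when $K = 2$, the constraints $p_{11} + p_{21} = 1$ and $p_{12} + p_{22} = 1$ leave two free coordinates, so a performance is determined by the pair $(p_{11}, p_{12})$, and $\mathcal{R}_{\{p_{11}, p_{12}\}}$ can be identified with the unit square $[0,1]^2$, with $p_{11}$ the true positive rate and $p_{12}$ the false positive rate.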

2 ROC Representation

The performance of a classification procedure can easily be represented by a function of performance probabilities $f(\mathbf{p}(\hat{G}, \mathbf{Y}))$. To assess the discriminability of $\mathbf{Y}$ in the sense of "fairness", the evaluation should depend only on the marker, i.e., have the form $f(\mathbf{Y})$, which is invariant to the classifiers we choose. We will show that the performance-probability-based evaluation is equivalent to the ROC representation.


Figure 1: The performance set $\phi(\mathcal{C})$ for binary classification (each point denotes a performance of $\mathbf{Y}$ with respect to one classifier).

2.1 Performance Sets

To further simplify algebraic manipulations, $p_{jk}(\hat{G}, \mathbf{Y})$ for a fixed $\mathbf{Y}$ is denoted briefly by $p_{jk}(\hat{G})$, and the performance function is defined as
$$\phi(\hat{G}) = (p_{11}(\hat{G}), p_{12}(\hat{G}), \ldots, p_{1K}(\hat{G}), p_{21}(\hat{G}), \ldots, p_{KK}(\hat{G}))^\top,$$
which still depends on $\hat{G}$. To eliminate the role of the employed classifier in an accuracy measure of the discriminability of $\mathbf{Y}$, it is rational to consider the performance set
$$\phi(\mathcal{C}) = \{\phi(\hat{G}) : \hat{G} \in \mathcal{C}\},$$
where the set $\mathcal{C}$ consists of all deterministic and randomized classifiers (see Figure 1). This set contains the performances of all existing classifiers with respect to one given $\mathbf{Y}$ and conveys all information about the classification capacity of the specified marker. As a representation of discriminating ability, $\phi(\mathcal{C})$ depends only on $\mathbf{Y}$. The following theorem simplifies the representation.
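Although $\phi(\mathcal{C})$ is treated abstractly here, its structure is easy to visualize numerically. The following minimal sketch is ours, not part of the original development; the discrete marker and its conditional probabilities are illustrative assumptions. It enumerates all deterministic classifiers of a finite-valued marker and collects their performance vectors; randomized classifiers realize exactly the convex hull of these points, in line with Theorem 2.1 below.

```python
from itertools import product
import numpy as np

K, m = 2, 3                                   # number of classes, marker values
f = np.array([[0.6, 0.3, 0.1],                # f[k, y] = P(Y = y | G = k + 1)
              [0.1, 0.3, 0.6]])

performances = []
for rule in product(range(K), repeat=m):      # deterministic rule: y -> class
    G_hat = np.array(rule)
    # p[j, k] = P(G_hat = j + 1 | G = k + 1): mass of the y's sent to class j
    p = np.array([[f[k][G_hat == j].sum() for k in range(K)]
                  for j in range(K)])
    performances.append(p.flatten())          # ordering (p11, p12, p21, p22)

vertices = np.unique(np.round(performances, 12), axis=0)
print(vertices)   # phi(C) is the convex hull of these extreme performances
```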


Theorem 2.1. A performance set $\phi(\mathcal{C})$ is compact and convex, and so is $\phi_S(\mathcal{C})$ for any $S$.

To characterize a convex and compact set, it suffices to portray its boundary, since the features of the set are determined by it. Thus, one can estimate just the boundary set rather than the whole performance set or an arbitrary subset of $\phi(\mathcal{C})$. After elucidating the perspective of optimality, we show that this boundary set is related to the optimality of classifiers, again owing to convexity.

2.2 Parametrized Optimal ROC Manifolds

A parameterization system is helpful for analyzing and computing the boundary $\partial\phi(\mathcal{C})$ of the performance set in both theory and practice. Intuitively, a classifier $\hat{G}$ is considered better than another if it attains higher true probabilities or lower false probabilities. Thus, classifiers partially ranked by a dominance relation are naturally introduced.

Definition 2.1. A classifier $\hat{G}_1$ dominates another classifier $\hat{G}_2$ in $S$, denoted by $\hat{G}_1 \succeq_S \hat{G}_2$, if $p_{kk}(\hat{G}_1) \ge p_{kk}(\hat{G}_2)$ and $p_{jk}(\hat{G}_1) \le p_{jk}(\hat{G}_2)$ for all $p_{kk}, p_{jk} \in S$ with $j \neq k$. A classifier $\hat{G}_1$ strictly dominates another classifier $\hat{G}_2$ in $S$, denoted by $\hat{G}_1 \succ_S \hat{G}_2$, if at least one of the above inequalities is strict.

A classifier is customarily said to be admissible if no classifier strictly dominates it. With the dominance relation as a partial ordering on $\mathcal{C}$, the compactness of $\phi_S(\mathcal{C})$ ensures that each chain $\hat{G}_1 \preceq_S \hat{G}_2 \preceq_S \hat{G}_3 \preceq_S \cdots$ has a greatest element in $\phi(\mathcal{C})$ and, by Zorn's lemma (see, e.g., [5]), all classifiers in the chain are dominated by an admissible classifier. It is easy to see that the performance of an admissible classifier belongs to the boundary $\partial\phi_S(\mathcal{C})$ of $\phi_S(\mathcal{C})$ relative to $\mathcal{R}_S$. In theoretical developments, a parametrized representation of $\partial\phi_S(\mathcal{C})$ is convenient for exploring the properties of the set of performances of admissible classifiers. At the same time, practitioners should be intrigued by the form of these admissible classifiers.

Fortunately, maximization of expected utility, an optimization criterion, is instrumental in simultaneously achieving these theoretical and practical interests.

The utility of $\hat{G}$, as in decision theory, can be defined as $U(\hat{G}) = \sum_{j,k} u_{jk} 1(\hat{G} = j, G = k)$, and its expectation is
$$E[U(\hat{G})] = \sum_{j,k} u_{jk} P(\hat{G} = j, G = k), \qquad (1)$$

where the utility values $u_{jk}$ satisfy $u_{kk} \ge 0$ and $u_{jk} \le 0$ for $j \neq k$. We note that the positive parameters $p_k = P(G = k)$, $k = 1, \ldots, K$, can be absorbed into the $u_{jk}$'s and, hence, the expected utility of a classifier simplifies to $u^\top\phi(\hat{G})$, where $u = (u_{11}, u_{12}, \ldots, u_{1K}, u_{21}, \ldots, u_{KK})^\top$. Since maximizing $u^\top\phi(\hat{G})$ is equivalent to maximizing $(cu)^\top\phi(\hat{G})$ for all $c > 0$, the condition $\|u\| = 1$ is imposed to obtain scale invariance. To locate $u$ in a subspace containing $\phi(\mathcal{C}) - K^{-1}\mathbf{1}_{K^2}$, a further constraint $\sum_{j=1}^{K} u_{jk} = K^{-1}$ for all $k = 1, \ldots, K$ is imposed. Thus, the standardized $u$ involves $K^2 - K - 1$ free utility values. In particular, in $\mathcal{R}_S$, the $u_{jk}$'s are naturally set to zero for all $p_{jk} \notin S$, and the number of free utility values reduces to
$$\#S - \#\{k : \{p_{1k}, \ldots, p_{Kk}\} \subset S\} - 1.$$
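For example, with $K = 3$ and $S = \{p_{11}, p_{22}, p_{33}\}$, no complete column $\{p_{1k}, p_{2k}, p_{3k}\}$ lies in $S$, so the count gives $3 - 0 - 1 = 2$ free utility values, matching the two-dimensional ROC surfaces of ternary classification shown in Figure 2.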

Interestingly, utility is synonymous with the negative Bayes risk, with positive and negative utility values treated as gains and losses, respectively, in classification. For any given $u$, we can find a corresponding classifier $\hat{G}_u$, said to be a utility classifier, with expected utility
$$E[U(\hat{G}_u)] = \sup_{\hat{G} \in \mathcal{C}} u^\top\phi(\hat{G}).$$
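Because $u^\top\phi$ is linear and $\phi(\mathcal{C})$ is the convex hull of the deterministic performances, the supremum above is attained at one of the enumerated vertices. Continuing the numerical sketch of Section 2.1 (the utility values below are arbitrary choices satisfying $u_{kk} \ge 0$ and $u_{jk} \le 0$):

```python
# Continuing the sketch above: u'phi is linear, so its supremum over the
# convex hull phi(C) is attained at a deterministic vertex.
u = np.array([0.7, -0.2, -0.1, 0.6])           # (u11, u12, u21, u22)
phi_Gu = max(performances, key=lambda p: u @ p)
print(phi_Gu, u @ phi_Gu)                      # phi(G_u) on the boundary of phi(C)
```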


Figure 2: Examples of optimal ROC manifolds for ternary classification: (a) $\mathcal{M}_{\{p_{11},p_{22},p_{33}\}}$; (b) $\mathcal{M}_{\{p_{12},p_{23},p_{31}\}}$.

As expected, the convexity and compactness of $\phi(\mathcal{C})$ imply that $u^\top\phi(\mathcal{C})$ is a closed interval, which leads to $\phi(\hat{G}_u) \in \partial\phi(\mathcal{C})$ and the existence of $\hat{G}_u$. To justify characterizing $\phi(\hat{G}_u)$ as the performance of an admissible classifier, it remains to establish the equivalence between the maximization-of-expected-utility criterion and admissibility.

Theorem 2.2. A classifier $\hat{G}$ is admissible in $S$ if and only if it is a utility classifier in $\mathcal{R}_S$.

With the implication of Theorem 2.1, the optimal ROC manifold
$$\mathcal{M}_S := \{\phi_S(\hat{G}) : \hat{G} \text{ is admissible in } S\}$$
fully represents the discriminability of a multi-classification marker (see Figure 2 for examples). Before giving further characterizations of parameterized optimal ROC manifolds, we illustrate their interpretability and computability in terms of decision spaces.

Remark 2.1. In convex analysis, the Minkowski functional is another way to represent the boundary of a convex set. It has also been employed to construct (non-optimal) ROC manifolds (e.g., [6]). With this device, the set $\partial\phi_S(\mathcal{C})$ can also be parameterized as
$$\partial\phi_S(\mathcal{C})(\mathbf{p}) = \mathbf{p}\,\sup\Big\{c : c\mathbf{p} + \tfrac{1}{n}\mathbf{1} \in \phi_S(\mathcal{C})\Big\} + \tfrac{1}{n}\mathbf{1},$$
where $\mathbf{p}$ is a unit vector in $\mathcal{R}_S$. Although this approach gives a concise illustration of the manifoldness of $\partial\phi(\mathcal{C})$, the Minkowski functional does not carry a direct statistical interpretation and is inconvenient for capturing the local structure of optimal ROC manifolds. For this purpose, one still relies on its relation with the support function, which requires more algebraic work.

2.3 Connection with Decision Space

The decision space, spanned by likelihood ratios, has been mentioned in some applied fields (e.g., [7]) and yields computational simplicity for ROC manifolds. Let $f_k(y)$ denote the density function of $\mathbf{Y}$ given $G = k$, and let $\mathbf{L}(y) = (L_{1K}(y), \ldots, L_{(K-1)K}(y))^\top$ with $L_{jk}(y) = f_j(y)/f_k(y)$. In addition, the sets $\{y \in \mathcal{Y} : L_{kK}(y) = c\}$, $k = 1, \ldots, K-1$, are assumed to have measure zero for all $c > 0$.

From the expected utility in (1), we can derive an explicit form of the optimal classifiers with an argument slightly different from [8]. By using the equality $U(\hat{G}) = \sum_{k=1}^{K} U(\hat{G})1(\hat{G}(\mathbf{Y}) = k)$ and iterated expectation, the expected utility of $\hat{G}$ can also be expressed as
$$E[U(\hat{G})] = E\Big[\sum_{k=1}^{K} E[U(\hat{G})1(\hat{G}(\mathbf{Y}) = k) \mid \mathbf{Y}]\Big].$$

This decomposition facilitates constructing a utility classifier by maximizing the conditional expected utility pointwise over $\mathcal{Y}$. For each $y \in \mathcal{Y}$, we set $P(\hat{G}(\mathbf{Y}) = k \mid \mathbf{Y} = y) = 1$ if
$$E[U(\hat{G})1(\hat{G}(\mathbf{Y}) = k) \mid \mathbf{Y} = y] \ge \max_{1 \le j \le K} E[U(\hat{G})1(\hat{G}(\mathbf{Y}) = j) \mid \mathbf{Y} = y]. \qquad (2)$$
By using (1) and absorbing $p_i$ into the $u_{ki}$'s, the inequality in (2) can be rewritten as
$$\min_{1 \le j \le K} \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) \ge 0.$$

Thus, the utility classifier $\hat{G}_u$ satisfies $P(\hat{G}_u(\mathbf{Y}) = k \mid \mathbf{Y} = y) = 1$ if
$$\mathbf{L}(y) \in D_k(u) = \bigcap_{j \neq k} \Big\{\mathbf{L}(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) \ge 0,\ y \in \mathcal{Y}\Big\}. \qquad (3)$$
It is clear that $\{D_k(u)\}_{k=1}^{K}$ is a partition of the space spanned by the likelihood ratio scores $\mathbf{L}(\mathbf{Y})$, and the intersection $\bigcap_{j \neq k}\{\mathbf{L}(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji}) p_i L_{iK}(y) = 0,\ y \in \mathcal{Y}\}$ is a critical point $c(u)$. Such a decision space $D$ has been proposed by researchers in psychology and radiology to describe classifiers. Once it is ascertained that all admissible classifiers can be manifested as combinations of linear classifiers in the decision space, it should be highlighted that the transformation $\mathbf{L}(\mathbf{Y})$ is an optimal marker for $K$-classification.

At first sight, expressing classifiers through the decision space might seem to introduce some complexity, but these doubts can be fully clarified. First, even if the $D_k(u)$'s in (3) overlap, the intersection is a subset of $\partial D_k(u)$ with probability measure zero; thus, optimal classifiers are still well-defined. Second, the dimensionality of the decision space can be lower than that of the original markers; in fact, the minimal sufficiency of the statistic $\mathbf{L}(\mathbf{Y})$ for $(\mathbf{Y}, G)$ ensures the invariance of the performance of classifiers, which is evidenced by the fact that
$$p_{jk}(\hat{G}) = E[E[P(\hat{G}(\mathbf{Y}) = j \mid \mathbf{Y}) \mid \mathbf{L}(\mathbf{Y})] \mid G = k] = E[P(\hat{G}(\mathbf{Y}) = j \mid \mathbf{L}(\mathbf{Y})) \mid G = k]. \qquad (4)$$
It follows from (4) that there exists a corresponding $\hat{G}(\mathbf{L}(\mathbf{Y}))$ with the same performance as $\hat{G}(\mathbf{Y})$. Finally, the decision space can be extended to include infinite values; the conditional distributions of $\mathbf{Y}$ given $G$ are easily transformed into the generalized decision space even without the assumption of a common support for $\{f_k\}_{k=1}^{K}$.

From Theorem 2.2, a classifier with maximum true probabilities can be shown to be a utility classifier with $u_{jk} = 0$ for all $j \neq k$, and vice versa. For a non-degenerate case with $u_{kk} > 0$, one can further simplify (3) as
$$D_k(u) = \bigcap_{j \neq k} \Big\{\mathbf{L}(y) : L_{kj}(y) \ge \frac{u_{jj}}{u_{kk}},\ y \in \mathcal{Y}\Big\}$$
with an explicit critical point $c(u) = u_{KK} \cdot (u_{11}^{-1}, \ldots, u_{(K-1)(K-1)}^{-1})^\top$. In practice, it is easier to use $c(u)$ as $K - 1$ threshold values in $D$ to represent or index an optimal classifier when $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$, which ensures that $c(u)$ is a bijective function of $u$.
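As a concrete illustration of this diagonal-utility case, a minimal numerical sketch follows; the Gaussian class-conditional densities and utility values are our illustrative assumptions, not examples from the paper. The utility classifier reduces to $\hat{G}_u(y) = \arg\max_k u_{kk} f_k(y)$, whose regions $D_k(u)$ are delimited by the likelihood-ratio thresholds collected in $c(u)$:

```python
import numpy as np
from scipy.stats import norm

means = [0.0, 1.5, 3.0]                        # assumed class-conditional N(mu_k, 1)
u_diag = np.array([0.5, 0.3, 0.2])             # (u11, u22, u33), all positive

def classify(y):
    # G_u(y) = argmax_k u_kk f_k(y); ties occur with probability zero
    y = np.atleast_1d(y)
    scores = np.stack([u * norm.pdf(y, loc=mu)
                       for u, mu in zip(u_diag, means)])
    return scores.argmax(axis=0) + 1

# Monte Carlo approximation of the performance probabilities p_jk(G_u)
rng = np.random.default_rng(0)
for k, mu in enumerate(means, start=1):
    labels = classify(rng.normal(mu, 1.0, size=50_000))
    p_col = np.bincount(labels, minlength=4)[1:] / labels.size
    print(f"(p_1{k}, p_2{k}, p_3{k}) =", p_col.round(3))
```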

3 Characterization of Optimal ROC Manifolds

Admissibility involves global optimality: for a comparison based on a specific $p_{jk} \in S$, an admissible classifier in $S$, if it exists, has the highest $p_{jk}(\hat{G})$ for $j = k$ or the lowest $p_{jk}(\hat{G})$ for $j \neq k$ among all classifiers with fixed values of the other performance probabilities in $S$. From the geometric perspective, $\phi_S(\hat{G})$ is the highest or the lowest point over the domain spanned by $S \setminus \{p_{jk}\}$. As one can see, the optimality of classifiers is of both theoretical and practical importance. Some researchers have worked on the construction of ROC manifolds; however, without optimality of the classifiers, the so-called ROC manifold could be an arbitrary subset of $\phi_S(\mathcal{C})$ rather than a manifold in the geometric sense. Consequently, few features of such ROC manifold sets can be pinpointed, and estimation of these sets and related summary measures may lead to ambiguous and more complicated situations. With this motivation, we introduce optimal ROC manifolds for multi-classification as an extension of optimal ROC curves for binary classification.

For this problem, our first strategy is to show that the set of performances of admissible classifiers is a manifold; roughly speaking, the structure of such a set is locally similar to Euclidean space. A parametric system is further employed to investigate its geometric properties. The parametric mechanism developed for $\mathcal{M}_S$ is mainly based on expected utility, or support functions in the terminology of convex analysis. The hyperplanes in $\mathcal{R}_S$ can be expressed as
$$H_S(r, u) = \{\mathbf{p} \in \mathcal{R}_S : u^\top\mathbf{p} = r\}.$$
The set $\phi_S(\mathcal{C})$ can then be rewritten as $\bigcup_{r \in \mathbb{R}} (H_S(r, u) \cap \phi_S(\mathcal{C}))$, and $r$ can be treated as the expected utility of classifiers $\hat{G}$ with $\phi_S(\hat{G}) \in H_S(r, u) \cap \phi_S(\mathcal{C})$ (see Figure 3). A parametric version of $\mathcal{M}_S$ is then established as

$$\mathcal{M}_S(u) = H_S\Big(\sup_{\hat{G} \in \mathcal{C}} u^\top\phi_S(\hat{G}),\ u\Big) \cap \phi_S(\mathcal{C}), \qquad (5)$$

which is a nonempty set in $\mathcal{R}_S$. To reveal the structural similarity between $\mathcal{M}_S$ and Euclidean space, its local structure can be explicitly depicted by showing that $\mathcal{M}_S(u)$ is a bijective and bicontinuous function from the utility values to $\mathcal{M}_S$. First, one has $P(\mathbf{L}(\mathbf{Y}) \in D \setminus \bigcup_{k=1}^{K} (D_k(u_1) \cap D_k(u_2))) = 0$ for the singular case $\phi_S(\hat{G}_{u_1}) = \phi_S(\hat{G}_{u_2})$, $u_1 \neq u_2$. Such pairs $\{u_1, u_2\}$ are at most countably many and, hence, ignorable because of the continuity of $\mathcal{M}_S(u)$.

Figure 3: The support function/utility as a parameterization system for $\mathcal{M}_S$

Second, the minimal sufficiency of $\mathbf{L}(\mathbf{Y})$ suggests that the set of non-informative markers $\{\mathbf{L}(\mathbf{Y}) : \hat{G}_{u_1}(\mathbf{Y}) \neq \hat{G}_{u_2}(\mathbf{Y})\}$ has probability measure zero when $\phi_S(\hat{G}_{u_1}) \neq \phi_S(\hat{G}_{u_2})$. Thus, the classifier $\hat{G}_\lambda = 1(W_\lambda = 1)\hat{G}_{u_1} + 1(W_\lambda = 0)\hat{G}_{u_2}$ with $W_\lambda \sim \mathrm{Bernoulli}(\lambda)$ is not a utility classifier for $0 < \lambda < 1$. These two characteristics confirm the injectivity of $\mathcal{M}_S(u)$ except at countably many singular points. As for the bicontinuity of $\mathcal{M}_S(u)$, it further follows from convexity: the continuous differentiability of $\mathcal{M}_S(u)$ implies that the normal vector $u_1$ of $\mathcal{M}_S$ at $\mathcal{M}_S(u_1)$ converges to $u_2$ as $\mathcal{M}_S(u_1)$ moves toward $\mathcal{M}_S(u_2)$, so that $\mathcal{M}_S^{-1}(u)$ is continuous except at the singular points. With the parametric system in (5), an optimal ROC manifold $\mathcal{M}_S$ is indeed an $s$-dimensional manifold in the geometric sense, where $s$ is the number of free utility values, and is regular almost everywhere.

The above parameterization supplies some intrinsic characterization of $\mathcal{M}_S$ through its convexity and the expression $p_{jk}(\hat{G}_u) = \int_{\{y :\, \mathbf{L}(y) \in D_j(u)\}} f_k(y)\,dy$. An optimal ROC manifold, endowed with the convexity of $\phi(\mathcal{C})$, is at least Lipschitz continuous; moreover, the differential structure and smoothness (meaning infinite differentiability) of optimal ROC manifolds are stated below.

Theorem 3.1. Suppose that all of the $i$th-order partial derivatives of the $f_{L_k}$'s exist, where $f_{L_k}$ is the density function of $\mathbf{L}(\mathbf{Y})$ given $G = k$. Then, $\mathcal{M}_S(u)$ is $i$-times differentiable.

It follows from Theorem 3.1 that $\mathcal{M}_S$ is smooth if and only if the $f_{L_k}$'s are smooth. For binary classification tasks, an optimal ROC curve $\mathcal{M}_S$ with $S = \{p_{11}, p_{22}\}$ can be expressed as a function of, for instance, $p_{11}$:
$$\mathcal{M}_S(p_{11}) = F_{L_2}(F_{L_1}^{-1}(1 - p_{11})),$$
where $F_{L_k}$ is the cumulative distribution function of $\mathbf{L}(\mathbf{Y})$ given $G = k$, $k = 1, 2$. This particularly simple form greatly facilitates modeling ROC curves in practice. Yet even for markers with commonly used distributions, closed-form expressions of ROC manifolds as functions of some $p_{jk}$'s seem unattainable for $K \ge 3$, and modeling $\mathcal{M}_S$ becomes intricate. Admittedly, $\mathcal{M}_S$ can be regarded as a well-behaved function $\mathcal{M}_S(\mathbf{p}_{S \setminus \{p_{jk}\}})$, $p_{jk} \in S$, on the domain given by the projection of $\phi(\mathcal{C})$ onto $\mathcal{R}_{S \setminus \{p_{jk}\}}$. Since the classifiers with performances located in $\phi_S(\mathcal{C}) \cap ([0,1]_{p_{jk}} \otimes \mathbf{p})$ for fixed $\mathbf{p} \in \mathcal{R}_{S \setminus \{p_{jk}\}}$ form a chain under $\succeq_S$, all of them can be shown to be dominated by a unique admissible classifier $\hat{G}_0$ with its performance in the same set; thus, $\mathcal{M}_S(\mathbf{p}_{S \setminus \{p_{jk}\}}) = p_{jk}(\hat{G}_0)$. Besides, in practice, researchers might be interested in exploring trade-offs among the $p_{jk}$'s. With our parameterization, the supporting hyperplane $H_S(\sup_{\hat{G} \in \mathcal{C}} u^\top\phi_S(\hat{G}), u)$ is the tangent space at $\mathcal{M}_S(u)$, and a parameterized curve along $\mathcal{M}_S$ has a tangent vector lying in this tangent space, hence normal to $u$. In view of this fact, $\partial p_{jk}(\hat{G}_u)/\partial p_{j'k'}(\hat{G}_u) = -u_{j'k'}/u_{jk}$ can be treated as the trade-off between $p_{jk}$ and $p_{j'k'}$ at $\mathcal{M}_S(u)$.
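For instance (a sketch under an assumed location-shift Gaussian marker, not an example from the paper), with $\mathbf{Y} \mid G = 1 \sim N(\mu, 1)$ and $\mathbf{Y} \mid G = 2 \sim N(0, 1)$, the likelihood ratio is increasing in $y$, so the optimal classifiers are the threshold rules $\{\hat{G} = 1\} = \{y > c\}$ and the optimal ROC curve takes the closed form $p_{22} = \Phi(\mu + \Phi^{-1}(1 - p_{11}))$:

```python
import numpy as np
from scipy.stats import norm

mu = 1.5                                       # assumed separation between classes
p11 = np.linspace(0.01, 0.99, 8)
# threshold c solves p11 = P(Y > c | G = 1) = 1 - Phi(c - mu)
c = mu + norm.ppf(1.0 - p11)
p22 = norm.cdf(c)                              # M_S(p11) = P(Y <= c | G = 2)
for a, b in zip(p11, p22):
    print(f"p11 = {a:.2f} -> p22 = {b:.3f}")
```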

4 Existence of Hypervolumes

In an $\mathcal{R}_S$ with more than three performance probabilities under consideration, optimal ROC manifolds are difficult to visualize. A summary index based on optimal ROC manifolds therefore becomes practically important for drawing comparisons between markers. Traditionally, the AUC is the most widely used accuracy measure for binary classification tasks. The HUM for multi-classification, as an analogue of the AUC, has been proposed and applied in the foregoing literature. However, no clear progress has been made on a fundamental problem: the existence of the HUM. In fact, the HUM does not exist, or is ill-behaved, in general.

For binary classification, $\phi(\mathcal{C})$, being two-dimensional, separates $\mathcal{R}_{\{p_{11}, p_{12}\}}$ and thus guarantees the existence of the optimal AUC. Similarly, by the continuity of optimal ROC manifolds, it is possible that $\mathcal{M}_S$ separates $\mathcal{R}_S$ in some specific situations of practical interest; this separation is necessary for the set under $\mathcal{M}_S$ to have a volume form, denoted by $V_S$. The following results, which further characterize optimal ROC manifolds, establish when the HUM can be treated as an appropriate summary index.

Theorem 4.1. For $K \ge 3$, suppose that $\mathcal{R}_S$ contains two coordinates $p_{ij}$ and $p_{ik}$ for some $i$ and $j \neq k$, and that the $f_{L_k}$'s have a common support. Then, there exists a continuous mapping $\mathbf{p}_S : [0,1] \mapsto \mathcal{R}_S$ with $\mathbf{p}_S(0) = \mathrm{vec}[\delta_{j'k'}]$ and $\mathbf{p}_S(1) = \mathrm{vec}[1((j', k') = (j', \sigma(j')))]$ for arbitrary $\sigma$ with $\sigma(k) \neq k$ such that $\{\mathbf{p}_S(t) : t \in [0,1]\} \cap \phi_S(\mathcal{C}) = \emptyset$.

Due to the closedness of $\{(t, \mathbf{p}_S(t)) : t \in [0,1]\}$ and $\phi(\mathcal{C})$, the distance between $\phi_S(\mathcal{C})$ and $\partial\mathcal{R}_S$ can be shown to be positive. In general, an optimal ROC manifold cannot enclose a set with finite hypervolume. It follows from Theorem 4.1 that both the optimal and the non-optimal HUM can be well-defined only if $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$ for some permutation function $\sigma$.
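When the permutation condition holds, e.g., $S = \{p_{11}, p_{22}, p_{33}\}$, the HUM is well-defined. As a hedged numerical sketch (an ordered Gaussian marker of our choosing, for which the admissible classifiers reduce to two-threshold rules), the HUM admits the classical Monte Carlo estimate $P(Y_1 < Y_2 < Y_3)$ familiar from [1] and [4]:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y1 = rng.normal(0.0, 1.0, n)    # marker given G = 1
y2 = rng.normal(1.5, 1.0, n)    # marker given G = 2
y3 = rng.normal(3.0, 1.0, n)    # marker given G = 3
hum = np.mean((y1 < y2) & (y2 < y3))
print(f"estimated HUM = {hum:.3f}")   # 1/3! ~ 0.167 for a non-informative marker
```

The printed baseline $1/K! = 1/6$ matches the non-informative value $V_S = 1/K!$ computed in the proof of Theorem 4.3 below.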

The next focus of this section is the case of degenerate $\mathcal{M}_S$. When $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$ for $K \ge 3$ contains both true and false probabilities, an admissible classifier in $S$ must be of the type $p_{jk}(\hat{G}) = 0$ for $j \neq k$. More explicitly, the dimension of $\mathcal{M}_S(u)$ can be shown to be less than $K - 1$ from the property $\{\phi_S(\hat{G}) \in \mathcal{M}_S\} \subset \{\phi_{\tilde{S}}(\hat{G}) \in \mathcal{M}_{\tilde{S}}\}$ for $\tilde{S} = \{p_{kk} \in S\}$. Thus, such an $\mathcal{M}_S$ is unable to create a separation in $\mathcal{R}_S$. The non-degenerate cases are summarized by the following theorem.

Theorem 4.2. Suppose that $S = \{p_{11}, \ldots, p_{KK}\}$ or $S = \{p_{k\sigma(k)} : \sigma(k) \neq k\}_{k=1}^{K}$ for $K \ge 3$ and some $\sigma$. Then, $\mathcal{M}_S$ separates $\mathcal{R}_S$ into two regions.

Indeed, the condition in Theorem 4.2 gives more, namely the essential ingredients for $V_S$ to be well-behaved. One might think that the HUM can be defined as the hypervolume under $\mathcal{M}_S$ only on the domain of $\mathcal{M}_S$, in a sense similar to the partial AUC. Unfortunately, although this view would circumvent the question of whether $\mathcal{M}_S$ can actually separate $\mathcal{R}_S$, the induced accuracy measure is still problematic and meaningless in practice. Specifically, in the $\mathcal{R}_S$ spanned by all false probabilities, [9] provided an argument elucidating that the HUMs of perfect and useless markers might both be zero. Further, for arbitrary $S$, we characterize precisely the condition under which this undesirable phenomenon occurs and relate it to that in Theorem 4.2.

Theorem 4.3. For $K \ge 3$, the HUM $V_S$ under $\mathcal{M}_S(\mathbf{p}_{S \setminus \{p_{jk}\}})$ has both of the following features:
(i) (near-perfect marker) $V_S \to 0$ as $p_{jk} \to \delta_{jk}$ for all $p_{jk} \in S$;
(ii) (non-informative marker) $V_S = 0$ when $p_{jk_1} = p_{jk_2}$ for all $p_{jk_1}, p_{jk_2} \in S$;
if and only if neither $S = \{p_{11}, \ldots, p_{KK}\}$ nor $S = \{p_{k\sigma(k)} : \sigma(k) \neq k\}$ for some $\sigma$.

Thus, the HUM is a rational summary index of discriminability if and only if the performance probabilities of interest span an ROC subspace satisfying the conditions of the previous theorem.

5 Conclusion

For measuring the discriminability of $K$-classification markers, this article provides a theoretical framework showing that a proper measure based on performance probabilities is exactly the corresponding optimal ROC manifold. Through a parameterization by the utility-maximization criterion, optimal ROC manifolds are verified to be manifolds in the sense of differential geometry. This ensures some practically desirable features, such as smoothness and differentiability, and can directly support work on modeling ROC manifolds. Moreover, we give sufficient and necessary conditions for the existence and good behavior of the HUM. In practice, this justifies the usefulness of the HUM as a summary index for the discriminability of markers when researchers are especially interested in performance probabilities associated with a suitable ROC subspace. We believe that this article establishes a scientific groundwork for further developments in multi-classification analysis and in a more general ROC analysis.

A Appendix: Proofs of Theorems

A.1 Proof of Theorem 2.1

Proof. For arbitrary classifiers $\hat{G}_1$ and $\hat{G}_2$, the randomized mixture $\hat{G}_\lambda = 1(W_\lambda = 1)\hat{G}_1 + 1(W_\lambda = 0)\hat{G}_2$ with $W_\lambda \sim \mathrm{Bernoulli}(\lambda)$, $0 \le \lambda \le 1$, is also a classifier. The convexity of $\phi(\mathcal{C})$ is then a direct result of
$$\phi(\hat{G}_\lambda) = \lambda\phi(\hat{G}_1) + (1 - \lambda)\phi(\hat{G}_2).$$

Let $\{\hat{G}_k\}$ be a sequence of classifiers with $\phi(\hat{G}_k)$ converging to some point $\mathbf{p}_0$. Since the $\hat{G}_k$'s have the finite support $\{1, \ldots, K\}$, for any $\varepsilon > 0$ there exists a positive constant $M_\varepsilon$ such that
$$P\Big(\sup_{k} \|(\hat{G}_k, \mathbf{Y})\| > M_\varepsilon\Big) < \varepsilon.$$

By the Prohorov’s theorem [10] and the dominated convergence theorem, there exists a subsequence {( bGki, Y)} converging in distribution to ( bG0, Y) with φ( bG0) = p0. This fact

(18)

further implies the closedness of φ(C). Together with the boundedness of R, the compactness of φ(C) is directly obtained.

A.2 Proof of Theorem 2.2

Proof. It follows from Theorem 2.1 that $\phi_S(\mathcal{C})$ is a convex set. Together with $\phi_S(\hat{G}) \in \partial\phi_S(\mathcal{C})$, there exists a hyperplane containing $\phi_S(\hat{G})$ but no interior point of $\phi_S(\mathcal{C})$. By standardizing the normal vector of this hyperplane as a utility $u$, $\hat{G}$ can be treated as a utility classifier. Conversely, suppose not; that is, some $\hat{G} \succ_S \hat{G}_u$. Since $u_{jk} p_{jk}(\hat{G}_u) < u_{jk} p_{jk}(\hat{G})$ for some $(j, k)$, we have $u^\top\phi_S(\hat{G}_u) < u^\top\phi_S(\hat{G})$, which contradicts the fact that $\hat{G}_u$ is a utility classifier.

A.3 Proof of Theorem 4.1

Proof. For $\tilde{S} \subset S$, if $\{\mathbf{p}_{\tilde{S}}(t) : t \in [0,1]\} \cap \phi_{\tilde{S}}(\mathcal{C}) = \emptyset$, then any $\mathbf{p}_S(t)$ with projection $\mathbf{p}_{\tilde{S}}(t)$ onto $\mathcal{R}_{\tilde{S}}$ has no intersection with $\phi_S(\mathcal{C})$. The rest of this proof ignores the degenerate $\mathcal{M}_S$ because its dimension is less than $K - 1$. For these reasons, it suffices to investigate the case $\#S = K + 1$ with $\{p_{k\sigma(k)}\}_{k=1}^{K} \subset S$. For some $p_{ij}$ and $p_{i\sigma(i)} \in S$ with $\sigma(i) \neq j$, we define
$$\mathbf{p}_S(t) = \sum_{\ell=0}^{1} (-1)^\ell \big[(1 - 2t)\,\mathbf{p}_S(\ell) - 2(\ell - t)\,\mathbf{p}_S(0.5)\big]\, 1_{[0.5\ell,\, 0.5(1+\ell))}(t)$$
with $\mathbf{p}_S(0.5) = \mathrm{vec}[1 - \delta_{i'j'} + (2\delta_{i'j'} - 1)1_{\{(i,j), (i,\sigma(i))\}}((i', j'))]$. Since $f_{L_j}$ and $f_{L_{\sigma(i)}}$ have a common support, no classifier satisfies $p_{i\sigma(i)}(\hat{G}) \neq p_{ij}(\hat{G})$ and $p_{i'j}(\hat{G}) = \delta_{i'j}$ for $i' \neq i$. Thus, $\{\mathbf{p}_S(t) : t \in [0, 0.5]\} \cap \phi_S(\mathcal{C}) = \emptyset$. Similarly, no classifier can have $p_{ij}(\hat{G}) = \delta_{ij}$ and $p_{i\sigma(i)}(\hat{G}) = 1 - \delta_{i\sigma(i)}$, which implies that $\{\mathbf{p}_S(t) : t \in [0.5, 1]\} \cap \phi_S(\mathcal{C}) = \emptyset$.


A.4 Proof of Theorem 4.3

Proof. As in the argument for Theorem 4.1, we only need to consider non-degenerate $\mathcal{M}_S$. Suppose that $S$ is not one of the sets stated in the theorem, so that there exists $\{p_{jk_1}, p_{jk_2}\} \subset S$. Since
$$\{\mathbf{p}_S : \mathbf{p} \text{ is under } \mathcal{M}_S\} \subset \{\mathbf{p}_{\tilde{S}} : \mathbf{p} \text{ is under } \mathcal{M}_{\tilde{S}}\} \otimes \Big(\bigotimes_{p_{j'k'} \in S \setminus \tilde{S}} [0,1]_{p_{j'k'}}\Big)$$
for $\tilde{S} \subset S$, an upper bound of $V_S$ is obtained from the inequality $V_S \le V_{\tilde{S}}$. Further, one has $V_S \le \min_{k_1 \neq k_2,\, p_{jk_1}, p_{jk_2} \in S} V_{\{p_{jk_1}, p_{jk_2}\}}$, and $V_{\{p_{jk_1}, p_{jk_2}\}}$ of a near-perfect marker must approach zero. As for a non-informative marker, we directly obtain $V_{\{p_{jk_1}, p_{jk_2}\}} = 0$ because $p_{jk_1}(\hat{G}) = p_{jk_2}(\hat{G})$; thus, $V_S \le V_{\{p_{jk_1}, p_{jk_2}\}}$ further ascertains that $V_S = 0$.

Conversely, given a perfect marker and $S = \{p_{11}, \ldots, p_{KK}\}$, the corresponding $V_S$ is the hypervolume of a unit $K$-cube and equals 1, while $V_S = 0$ for $S = \{p_{k\sigma(k)} : \sigma(k) \neq k\}_{k=1}^{K}$. For a non-informative marker, one can calculate $V_S = 1/K!$, the hypervolume under the hyperplane $\{\mathbf{p} \in \mathcal{R}_{\{p_{k\sigma(k)}\}_{k=1}^{K}} : \sum_{k=1}^{K} p_{k\sigma(k)} = 1\}$, for any $S$ satisfying the condition. This is precisely the assertion of the theorem.

References

[1] D. Mossman, "Three-way ROCs," Med. Decis. Making, vol. 19, no. 1, pp. 78–89, Jan. 1999.

[2] S. Dreiseitl, L. Ohno-Machado, and M. Binder, "Comparing three-class diagnostic tests by three-way ROC analysis," Med. Decis. Making, vol. 20, no. 3, pp. 323–331, Sep. 2000.

[3] X. He, C. Metz, B. Tsui, J. Links, and E. Frey, "Three-class ROC analysis—a decision theoretic approach under the ideal observer framework," IEEE Trans. Med. Imag., vol. 25, no. 5, pp. 571–581, May 2006.

[4] B. K. Scurfield, "Multiple-event forced-choice tasks in the theory of signal detectability," J. Math. Psych., vol. 40, no. 3, pp. 253–269, Sep. 1996.

[5] P. R. Halmos, Naive Set Theory. New York: Springer, 1998.

[6] C. M. Schubert, S. N. Thorsen, and M. E. Oxley, "The ROC manifold for classification systems," Pattern Recognit., vol. 44, no. 2, pp. 350–362, Feb. 2011.

[7] B. K. Scurfield, "Generalization of the theory of signal detectability to n-event m-dimensional forced-choice tasks," J. Math. Psych., vol. 42, no. 1, pp. 5–31, Mar. 1998.

[8] D. C. Edwards, C. E. Metz, and M. A. Kupinski, "Ideal observers and optimal ROC hypersurfaces in N-class classification," IEEE Trans. Med. Imag., vol. 23, no. 7, pp. 891–895, Jul. 2004.

[9] D. C. Edwards, C. E. Metz, and R. M. Nishikawa, "The hypervolume under the ROC hypersurface of 'near-guessing' and 'near-perfect' observers in N-class classification tasks," IEEE Trans. Med. Imag., vol. 24, no. 3, pp. 293–299, Mar. 2005.

[10] A. W. van der Vaart, Asymptotic Statistics. Cambridge: Cambridge University Press, 1998.
