PREPRINT
Preprint, Department of Mathematics, National Taiwan University (國立臺灣大學數學系)
www.math.ntu.edu.tw/~mathlib/preprint/2011-19.pdf
ROC Representation for the Discriminability of Multi-Classification Markers
Yun-Jhong Wu and Chin-Tsang Chiang
December 30, 2011
Abstract
For multi-classification problems, this article shows that the receiver operating characteristic (ROC) representation and its induced measures are the only well-defined assessments of the discriminability of markers. Using the convexity and compactness of the performance set, a parameterization system is employed to characterize the corresponding optimal ROC manifolds. Its connection with the decision space makes the manifolds simple to compute, and some practically meaningful geometric features of optimal ROC manifolds are highlighted. These results enable us to show that a proper ROC subspace is a necessary and sufficient condition for the existence of the hypervolume under the ROC manifold (HUM). This work provides working scientists with a theoretically sound extension of ROC analysis to multi-classification tasks.
Index Terms: Discriminability, Hypervolume, Manifold, Optimal classification, Receiver operating characteristic, Utility.
1 Introduction
Receiver operating characteristic (ROC) curves are a popular tool for assessing the performance of binary classification procedures and have been extended to ROC surfaces for ternary classification and to ROC manifolds for general multi-classification (see [1], [2], and [3]). However, ROC surfaces and manifolds are generally ill-posed for multi-classification tasks because of their loose definition and doubts about their existence. The analog of the area under the ROC curve (AUC) for multi-classification, named the hypervolume under the ROC manifold (HUM), has been mentioned in the literature, but its existence is in doubt. To clarify these problems and demonstrate the applied importance of ROC manifolds, we propose a theoretical unification of them.
The extension of ROC analysis to multi-classification was first developed for sequential classification procedures, which have attracted interest for their practical and theoretical simplicity. Such a procedure reduces a multi-classification task to a series of binary classifications of the form $G = k$ versus $G \in \{k+1, \ldots, K\}$, in the order $k = 1, \ldots, K$. The first systematic study of ternary ROC analysis can be traced back to [4]. For a univariate marker, the author constructed an ROC manifold for ternary classification to visualize the space spanned by $\{p_{1\sigma(1)}(\hat{G}, Y), p_{2\sigma(2)}(\hat{G}, Y), p_{3\sigma(3)}(\hat{G}, Y)\}$, where $\sigma$ is a permutation function and $p_{jk}(\hat{G}, Y) := P(\hat{G}(Y) = j \mid G(Y) = k)$, with $\hat{G}$ a deterministic sequential classifier and $Y$ a univariate marker, and introduced the HUM (also called the volume under the ROC surface, VUS, in some of the literature on ternary classification). By using a metric between each $G$ and $Y$, [1] developed a classification rule that accommodates a multivariate marker. Although this line of work provided a way to extend ROC analysis to multi-classification problems, sequential evaluation is of limited applicability and, as we show in this article, its non-optimality can lead to a lack of interpretability.
Indeed, ROC analysis gives illuminating insight into assessing the discriminability of markers. For a typical $K$-classification task, one considers a pair $(\hat{G}, Y)$ of a classifier and a marker, where $\hat{G}$ is a function mapping a general classification marker $Y$ to a distribution with support $\{1, \ldots, K\}$. It is reasonable to use the performance probabilities $p_{jk}(\hat{G}, Y)$ as the basis for comparison. A performance $p = (p_{11}, p_{12}, \ldots, p_{1K}, p_{21}, \ldots, p_{KK})^{\top}$ can be plotted in the general ROC space
\[
\mathcal{R} = \Big\{ p \in \bigotimes_{j,k=1}^{K} [0,1]_{j,k} : \sum_{j=1}^{K} p_{jk} = 1 \text{ for } k = 1, \ldots, K,\ 0 \le p_{jk} \le 1 \Big\},
\]
where $\otimes$ denotes the Cartesian product operator. The space $\mathcal{R}$ is the smallest space that suffices to represent all possible performances. Since usually not all $p_{jk}$ are of interest to working scientists at the same time, in this article an index set $S$ of the $p_{jk}$'s of interest is used to indicate an ROC subspace $\mathcal{R}_S$. Similarly, operators or sets subscripted by $S$ are restricted to $\mathcal{R}_S$. As is well known for binary classification tasks, the performances of a series of classifiers in $\mathcal{R}_{\{p_{11}, p_{12}\}}$, which often but not necessarily form an ROC curve, can be plotted as a representation of the discriminability of classification procedures. However, the appropriate type of representation is not straightforward for arbitrary $K$-classification tasks. The analysis in this work begins with the concept of a proper assessment of the discriminability of markers, which naturally leads to the ROC representation.
2 ROC Representation
The performance of a classification procedure can readily be represented by a function of the performance probabilities, $f(p(\hat{G}, Y))$. To assess the discriminability of $Y$ fairly, the evaluation should depend only on the marker, i.e., have the form $f(Y)$, which is invariant to the choice of classifier. We will show that such a performance-probability-based evaluation is equivalent to the ROC representation.
Figure 1: Performance set $\phi(C)$ for binary classification. Each point denotes the performance of $Y$ with respect to one classifier.
2.1 Performance Sets
To simplify the algebra, $p_{jk}(\hat{G}, Y)$ for a fixed $Y$ is abbreviated to $p_{jk}(\hat{G})$, and the performance function is defined as
\[
\phi(\hat{G}) = (p_{11}(\hat{G}), p_{12}(\hat{G}), \ldots, p_{1K}(\hat{G}), p_{21}(\hat{G}), \ldots, p_{KK}(\hat{G}))^{\top},
\]
which still depends on $\hat{G}$. To eliminate the role of the particular classifier employed from a measure of the discriminability of $Y$, it is reasonable to consider the performance set
\[
\phi(C) = \{\phi(\hat{G}) : \hat{G} \in C\},
\]
where the set $C$ consists of all deterministic and randomized classifiers (see Figure 1). This set contains the performances of all classifiers with respect to one given $Y$ and conveys all the information about the classification capacity of the specified marker. As a representation of discriminating ability, $\phi(C)$ depends only on $Y$. The following theorem simplifies this representation.
Theorem 2.1. A performance set $\phi(C)$ is compact and convex, and so is $\phi_S(C)$ for any $S$.
To characterize a compact convex set, it suffices to describe its boundary, because the boundary determines the set. Thus, one only needs to estimate the boundary rather than the whole performance set or an arbitrary subset of $\phi(C)$. After clarifying the notion of optimality, we show that, again because of convexity, this boundary is related to the optimality of classifiers.
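As a hedged numerical illustration of the convexity stated in Theorem 2.1, the following Python sketch uses a hypothetical marker with four possible values and three classes. The densities, the two rules `g1` and `g2`, and the helper names `performance` and `mix` are assumptions made for illustration and are not objects from this article. The sketch checks that randomizing between two classifiers yields the corresponding convex combination of their performances.

```python
def performance(classifier, densities):
    """p[j][k] = P(G_hat = j | G = k) for a marker with finite support.

    classifier[y][j] = P(G_hat = j | Y = y); densities[k][y] = P(Y = y | G = k).
    """
    K, m = len(densities), len(classifier)
    return [[sum(classifier[y][j] * densities[k][y] for y in range(m))
             for k in range(K)] for j in range(K)]


def mix(c1, c2, lam):
    """Randomized classifier that follows c1 with probability lam, else c2."""
    return [[lam * a + (1 - lam) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(c1, c2)]


# Hypothetical three-class marker taking four values (illustrative numbers).
densities = [[0.7, 0.2, 0.1, 0.0],
             [0.1, 0.6, 0.2, 0.1],
             [0.0, 0.1, 0.3, 0.6]]
g1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]  # one deterministic rule
g2 = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]  # another deterministic rule
lam = 0.3

p1, p2 = performance(g1, densities), performance(g2, densities)
p_mix = performance(mix(g1, g2, lam), densities)
# Convex combination of the two performance matrices.
p_comb = [[lam * a + (1 - lam) * b for a, b in zip(r1, r2)]
          for r1, r2 in zip(p1, p2)]
# Largest entrywise difference between the two; zero up to rounding.
max_gap = max(abs(a - b) for r1, r2 in zip(p_mix, p_comb) for a, b in zip(r1, r2))
```

The discrete marker is chosen only so that every probability is an exact finite sum; the same linearity argument applies to continuous markers.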
2.2 Parametrized Optimal ROC Manifolds
A parametrization system helps us to analyze and compute the boundary $\partial\phi(C)$ of the performance set, both in theory and in practice. Intuitively, a classifier $\hat{G}$ is considered better than another if it has higher true probabilities or lower false probabilities. Thus, it is natural to rank classifiers partially by a dominance relation.
Definition 2.1. A classifier $\hat{G}_1$ dominates another classifier $\hat{G}_2$ in $S$, denoted by $\hat{G}_1 \succeq_S \hat{G}_2$, if $p_{kk}(\hat{G}_1) \ge p_{kk}(\hat{G}_2)$ and $p_{jk}(\hat{G}_1) \le p_{jk}(\hat{G}_2)$ for all $p_{kk}, p_{jk} \in S$ with $j \neq k$. A classifier $\hat{G}_1$ strictly dominates another classifier $\hat{G}_2$ in $S$, denoted by $\hat{G}_1 \succ_S \hat{G}_2$, if at least one of the above inequalities is strict.
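The dominance relation of Definition 2.1 can be sketched in a few lines of Python. This is only an illustration: the function name `dominates`, the dictionary representation of a performance, and the three example performances are assumptions, not part of the article.

```python
def dominates(p1, p2, S, strict=False):
    """Check whether performance p1 dominates p2 on the index set S.

    p1, p2: dicts mapping (j, k) to p_jk; S: iterable of (j, k) pairs.
    True probabilities (j == k) must not be smaller, false ones not larger.
    With strict=True, at least one inequality must also be strict.
    """
    gains = [(p1[j, k] - p2[j, k]) if j == k else (p2[j, k] - p1[j, k])
             for (j, k) in S]
    if any(g < 0 for g in gains):
        return False
    return any(g > 0 for g in gains) if strict else True


# Hypothetical performances restricted to the true probabilities.
S = [(1, 1), (2, 2), (3, 3)]
pa = {(1, 1): 0.7, (2, 2): 0.6, (3, 3): 0.9}
pb = {(1, 1): 0.7, (2, 2): 0.5, (3, 3): 0.9}
pc = {(1, 1): 0.8, (2, 2): 0.4, (3, 3): 0.9}

a_beats_b = dominates(pa, pb, S, strict=True)               # pa is better on p22
incomparable = not dominates(pa, pc, S) and not dominates(pc, pa, S)
```

The pair `pa`, `pc` shows why the relation is only a partial ordering: each is better on one coordinate, so neither dominates the other.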
A classifier is customarily said to be admissible if no classifier strictly dominates it. With the dominance relation as a partial ordering on $C$, the compactness of $\phi_S(C)$ ensures that each chain of classifiers $\hat{G}_1, \hat{G}_2, \hat{G}_3, \ldots$ under the dominance relation in $S$ has a greatest element in $\phi(C)$ and, by Zorn's lemma (see, e.g., [5]), all classifiers in the chain are dominated by an admissible classifier. It is easy to see that the performance of an admissible classifier belongs to the boundary $\partial\phi_S(C)$ of $\phi_S(C)$ relative to $\mathcal{R}_S$. For theoretical development, a parametrized representation of $\partial\phi_S(C)$ is convenient for exploring the properties of the set of performances of admissible classifiers. At the same time, practitioners will be interested in the form of these admissible classifiers. Fortunately, maximization of expected utility, as an optimization criterion, serves both the theoretical and the practical interest.
The utility of $\hat{G}$, as in decision theory, can be defined as $U(\hat{G}) = \sum_{j,k} u_{jk} 1(\hat{G} = j, G = k)$, and its expectation is
\[
E[U(\hat{G})] = \sum_{j,k} u_{jk} P(\hat{G} = j, G = k), \tag{1}
\]
where the utility values $u_{jk}$ satisfy $u_{kk} \ge 0$ and $u_{jk} \le 0$ for $j \neq k$. We note that the positive parameters $p_k = P(G = k)$, $k = 1, \ldots, K$, can be absorbed into the $u_{jk}$'s and, hence, the expected utility of a classifier simplifies to $u^{\top}\phi(\hat{G})$, where $u = (u_{11}, u_{12}, \ldots, u_{1K}, u_{21}, \ldots, u_{KK})^{\top}$. Since maximizing $u^{\top}\phi(\hat{G})$ is equivalent to maximizing $(cu)^{\top}\phi(\hat{G})$ for all $c > 0$, the condition $\|u\| = 1$ is imposed to obtain scale invariance. To locate $u$ in a subspace containing $\phi(C) - K^{-1}\mathbf{1}_{K^2}$, the further constraint $\sum_{j=1}^{K} u_{jk} = K^{-1}$ for all $k = 1, \ldots, K$ is imposed. Thus, the standardized $u$ has $K^2 - K - 1$ free utility values. In particular, in $\mathcal{R}_S$, the $u_{jk}$'s are naturally set to zero for all $p_{jk} \notin S$, and the number of free utility values reduces to
\[
\#S - \#\{k : \{p_{1k}, \ldots, p_{Kk}\} \subset S\} - 1.
\]
Interestingly, expected utility is synonymous with the negative Bayes risk, with positive and negative utility values treated as gains and losses, respectively, in classification. For any given $u$, we can find a corresponding classifier $\hat{G}_u$, called a utility classifier, with expected utility
\[
E[U(\hat{G}_u)] = \sup_{\hat{G} \in C} u^{\top}\phi(\hat{G}).
\]
Figure 2: Examples of optimal ROC manifolds for ternary classification: (a) $M_{\{p_{11}, p_{22}, p_{33}\}}$; (b) $M_{\{p_{12}, p_{23}, p_{31}\}}$.
As expected, the convexity and compactness of $\phi(C)$ imply that $u^{\top}\phi(C)$ is a closed interval, which gives the existence of $\hat{G}_u$ and $\phi(\hat{G}_u) \in \partial\phi(C)$. To justify characterizing $\phi(\hat{G}_u)$ as the performance of an admissible classifier, it remains to establish the equivalence between the expected-utility maximization criterion and admissibility.
Theorem 2.2. A classifier $\hat{G}$ is admissible in $S$ if and only if it is a utility classifier in $\mathcal{R}_S$.
By Theorem 2.1, the optimal ROC manifold
\[
M_S := \{\phi_S(\hat{G}) : \hat{G} \text{ is admissible in } S\}
\]
fully represents the discriminability of a multi-classification marker (see Figure 2 for examples). Before characterizing parameterized optimal ROC manifolds further, we illustrate their interpretability and computability in terms of decision spaces.
Remark 2.1. In convex analysis, Minkowski's functional is another way to represent the boundary of a convex set. It has also been employed to construct (non-optimal) ROC manifolds (e.g., [6]). With this tool, the set $\partial\phi_S(C)$ can also be parameterized as
\[
\partial\phi_S(C)(p) = p \sup\{c : cp + \tfrac{1}{n}\mathbf{1} \in \phi_S(C)\} + \tfrac{1}{n}\mathbf{1},
\]
where $p$ is a unit vector in $\mathcal{R}_S$. Although this approach gives a concise illustration of the manifold structure of $\partial\phi(C)$, Minkowski's functional does not yet have a direct statistical interpretation and is inconvenient for capturing the local structure of optimal ROC manifolds. For that purpose, one still has to rely on its relation to the support function, which requires more algebraic work.
2.3 Connection with Decision Space
The decision space, spanned by likelihood ratios, has been mentioned in some applied fields (e.g., [7]) and makes ROC manifolds simple to compute. Let $f_k(y)$ denote the density function of $Y$ given $G = k$, and let $L(y) = (L_{1K}(y), \ldots, L_{(K-1)K}(y))^{\top}$ with $L_{jk}(y) = f_j(y)/f_k(y)$. In addition, the sets $\{y \in \mathcal{Y} : L_{kK}(y) = c\}$, $k = 1, \ldots, K-1$, are assumed to have measure zero for all $c > 0$.
From the expected utility in (1), we can derive an explicit form of the optimal classifiers with an argument slightly different from that of [8]. By using the equality $U(\hat{G}) = \sum_{k=1}^{K} U(\hat{G}) 1(\hat{G}(Y) = k)$ and iterated expectation, the expected utility of $\hat{G}$ can also be expressed as
\[
E[U(\hat{G})] = E\Big[\sum_{k=1}^{K} E[U(\hat{G}) 1(\hat{G}(Y) = k) \mid Y]\Big].
\]
This decomposition makes it possible to construct a utility classifier by maximizing the conditional expected utility pointwise over $\mathcal{Y}$. For each $y \in \mathcal{Y}$, we set $P(\hat{G}(Y) = k \mid Y = y) = 1$ if
\[
E[U(\hat{G}) 1(\hat{G}(Y) = k) \mid Y = y] \ge \max_{1 \le j \le K} E[U(\hat{G}) 1(\hat{G}(Y) = j) \mid Y = y]. \tag{2}
\]
By using (1) and absorbing $p_i$ into the $u_{ki}$'s, the inequality in (2) can be rewritten as
\[
\min_{1 \le j \le K} \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) \ge 0.
\]
Thus, the utility classifier $\hat{G}_u$ satisfies $P(\hat{G}_u(Y) = k \mid Y = y) = 1$ if
\[
L(y) \in D_k(u) = \bigcap_{j \neq k} \Big\{ L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) \ge 0,\ y \in \mathcal{Y} \Big\}. \tag{3}
\]
It is clear that $\{D_k(u)\}_{k=1}^{K}$ is a partition of the space spanned by the likelihood ratio scores $L(Y)$, and the intersection $\bigcap_{j \neq k} \{L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) = 0,\ y \in \mathcal{Y}\}$, with the $p_i$'s absorbed into the utility values as in (3), is a critical point $c(u)$. Such a decision space $D$ has been proposed by researchers in psychology and radiology to describe classifiers. Since all admissible classifiers can be expressed as combinations of linear classifiers in the decision space, it should be highlighted that the transformation $L(Y)$ is an optimal marker for $K$-classification.
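The pointwise construction behind (2) and (3) can be checked numerically. The following Python sketch is a minimal illustration under stated assumptions: the marker takes finitely many values, the priors are already absorbed into the utility matrix, and the utilities are not standardized as in the text. The names `expected_utility` and `utility_rule` and all numbers are hypothetical. The sketch compares the pointwise rule with a brute-force search over all deterministic classifiers.

```python
from itertools import product


def expected_utility(rule, u, densities):
    """u^T phi for a deterministic rule; rule[y] is the class chosen at value y.

    With priors absorbed into u, this equals the sum over y and i of
    u[rule[y]][i] * f_i(y), where densities[i][y] = f_i(y).
    """
    K = len(densities)
    return sum(u[rule[y]][i] * densities[i][y]
               for y in range(len(rule)) for i in range(K))


def utility_rule(u, densities):
    """Pointwise maximizer: at each y pick k with the largest sum_i u[k][i] f_i(y)."""
    K, m = len(densities), len(densities[0])

    def score(k, y):
        return sum(u[k][i] * densities[i][y] for i in range(K))

    return [max(range(K), key=lambda k: score(k, y)) for y in range(m)]


# Hypothetical three-class marker with four values; classes are 0-indexed here.
densities = [[0.7, 0.2, 0.1, 0.0],
             [0.1, 0.6, 0.2, 0.1],
             [0.0, 0.1, 0.3, 0.6]]
# Illustrative utilities: nonnegative on the diagonal, nonpositive elsewhere.
u = [[1.0, -0.2, -0.5],
     [-0.3, 0.8, -0.1],
     [-0.4, -0.2, 1.2]]

rule = utility_rule(u, densities)
value = expected_utility(rule, u, densities)
# Exhaustive search over all 3**4 deterministic rules for comparison.
best = max(expected_utility(r, u, densities) for r in product(range(3), repeat=4))
```

Dividing each score by $f_K(y)$ does not change the maximizer, which is why the same rule can be written in terms of the likelihood ratios $L_{iK}(y)$ as in (3).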
At first sight, using the decision space to express classifiers might appear to introduce some complications, but these concerns can be fully resolved. First, even if the $D_k(u)$'s in (3) overlap, the intersection is a subset of $\partial D_k(u)$ with probability measure zero. Thus, optimal classifiers are still well defined. Second, the dimension of the decision space may be lower than that of the original markers; in fact, the minimal sufficiency of the statistic $L(Y)$ for $(Y, G)$ ensures the invariance of the performance of classifiers, as shown by
\[
\begin{aligned}
p_{jk}(\hat{G}) &= E[E[P(\hat{G}(Y) = j \mid Y) \mid L(Y)] \mid G = k] \\
&= E[P(\hat{G}(Y) = j \mid L(Y)) \mid G = k].
\end{aligned} \tag{4}
\]
It follows from (4) that there exists a corresponding $\hat{G}^*(L(Y))$ with the same performance as $\hat{G}(Y)$. Finally, the decision space can be extended to include infinite values; the conditional distributions of $Y$ given $G$ are easily transformed into the generalized decision space even without the assumption of a common support for $\{f_k\}_{k=1}^{K}$.
From Theorem 2.2, a classifier with maximum true probabilities can be shown to be a utility classifier with $u_{jk} = 0$ for all $j \neq k$, and vice versa. For the non-degenerate case with $u_{kk} > 0$, the inequalities in (3) reduce to $u_{kk} L_{kK}(y) \ge u_{jj} L_{jK}(y)$, so that, with $L_{jk}(y) = L_{jK}(y)/L_{kK}(y)$, (3) simplifies to
\[
D_k(u) = \bigcap_{j \neq k} \Big\{ L(y) : L_{jk}(y) \le \frac{u_{kk}}{u_{jj}},\ y \in \mathcal{Y} \Big\}
\]
with the explicit critical point $c(u) = u_{KK} \cdot (u_{11}^{-1}, \ldots, u_{(K-1)(K-1)}^{-1})^{\top}$. In practice, it is easier to use $c(u)$ as $K - 1$ threshold values in $D$ to represent or index an optimal classifier when $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$, which ensures that $c(u)$ is a bijective function of $u$.
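For the diagonal case, where (3) with $u_{jk} = 0$ for $j \neq k$ assigns class $k$ when $u_{kk} L_{kK}(y) \ge u_{jj} L_{jK}(y)$ for all $j$, the critical point and the resulting rule can be sketched as follows. The function names `critical_point` and `classify` and the utility values are assumptions for illustration only.

```python
def critical_point(u_diag):
    """c(u) = u_KK * (1/u_11, ..., 1/u_(K-1)(K-1)) for positive diagonal utilities."""
    return [u_diag[-1] / ukk for ukk in u_diag[:-1]]


def classify(L, u_diag):
    """Return the 1-based class k that maximizes u_kk * L_kK.

    L holds the K - 1 likelihood ratios (L_1K, ..., L_(K-1)K) at one marker
    value; L_KK = 1 is appended internally.
    """
    scores = [ukk * l for ukk, l in zip(u_diag, list(L) + [1.0])]
    return 1 + max(range(len(scores)), key=scores.__getitem__)


u_diag = [2.0, 4.0, 1.0]          # hypothetical (u_11, u_22, u_33)
c = critical_point(u_diag)        # the two thresholds in the decision space
labels = [classify([1.0, 0.1], u_diag),
          classify([0.4, 0.5], u_diag),
          classify([0.1, 0.1], u_diag)]
```

At $L(y) = c(u)$ all scores $u_{kk} L_{kK}(y)$ coincide, which is what makes $c(u)$ the meeting point of the decision regions in this sketch.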
3 Characterization of Optimal ROC Manifolds
Admissibility involves global optimality: for a comparison based on a specific $p_{jk} \in S$, an admissible classifier in $S$, if it exists, has the highest $p_{jk}(\hat{G})$ for $j = k$, or the lowest $p_{jk}(\hat{G})$ for $j \neq k$, among all classifiers with fixed values of the other performance probabilities in $S$. Geometrically, $\phi_S(\hat{G})$ is the highest or the lowest point over the domain spanned by $S \setminus \{p_{jk}\}$. The optimality of classifiers is therefore of both theoretical and practical importance. Some researchers have worked on the construction of ROC manifolds; however, without optimality of the classifiers, the so-called ROC manifold could be an arbitrary subset of $\phi_S(C)$ rather than a manifold in the geometric sense. As a result, few features of such ROC manifold sets can be pinned down, and estimation of these sets and of related summary measures may become ambiguous and more complicated. With this motivation, we introduce optimal ROC manifolds for multi-classification as an extension of optimal ROC curves for binary classification.
Our first step is to show that the set of performances of admissible classifiers is a manifold; roughly speaking, the structure of such a set is locally similar to Euclidean space. A parametric system is then employed to investigate its geometric properties. The parametric mechanism developed for $M_S$ is mainly based on expected utility, or support functions in the terminology of convex analysis. The hyperplanes in $\mathcal{R}_S$ can be expressed as
\[
H_S(r, u) = \{p \in \mathcal{R}_S : u^{\top} p = r\}.
\]
The set $\phi_S(C)$ can then be rewritten as $\bigcup_{r \in \mathbb{R}} (H_S(r, u) \cap \phi_S(C))$, and $r$ can be treated as the expected utility of the classifiers $\hat{G}$ with $\phi_S(\hat{G}) \in H_S(r, u) \cap \phi_S(C)$ (see Figure 3). A parametric version of $M_S$ is then established as
\[
M_S(u) = H_S\Big(\sup_{\hat{G} \in C} u^{\top}\phi_S(\hat{G}),\, u\Big) \cap \phi_S(C), \tag{5}
\]
which is a nonempty set in $\mathcal{R}_S$. To reveal the structural similarity between $M_S$ and Euclidean space, its local structure can be described explicitly by showing that $M_S(u)$ is a bijective and bicontinuous function from utility values to $M_S$. First, one has $P(L(Y) \in D \setminus \bigcup_{k=1}^{K} (D_k(u_1) \cap D_k(u_2))) = 0$ for the singular case $\phi_S(\hat{G}_{u_1}) = \phi_S(\hat{G}_{u_2})$, $u_1 \neq u_2$. Such pairs $\{u_1, u_2\}$ are at most countably many and, hence, negligible because of the continuity of $M_S(u)$. Second, the minimal sufficiency of $L(Y)$ suggests that the set of non-informative markers $\{L(Y) : \hat{G}_u(Y) \neq \hat{G}^*_u(Y)\}$ has probability measure zero for $\phi_S(\hat{G}_u) \neq \phi_S(\hat{G}^*_u)$. Thus, the classifier $\hat{G}_\lambda = 1(W_\lambda = 1)\hat{G} + 1(W_\lambda = 0)\hat{G}^*$ with $W_\lambda \sim \mathrm{Bernoulli}(\lambda)$ is not a utility classifier for $0 < \lambda < 1$. These two characteristics confirm the injectivity of $M_S(u)$ except at countably many singular points. The bicontinuity of $M_S(u)$, in turn, follows from its convexity. The continuous differentiability of $M_S(u)$ implies that the normal vector $u_1$ of $M_S$ at $M_S(u_1)$ converges to $u_2$ as $M_S(u_1)$ moves toward $M_S(u_2)$. It follows that $M_S^{-1}(u)$ is continuous except at the singular points. With the parametric system in (5), an optimal ROC manifold $M_S$ is indeed an $s$-dimensional manifold in the geometric sense, where $s$ is the number of free utility values, and is regular almost everywhere.
Figure 3: Supporting function/utility as a parametrization system for $M_S$
The above parameterization supplies some intrinsic characterizations of $M_S$ through its convexity and the expression $p_{jk}(\hat{G}_u) = \int_{L(y) \in D_j(u)} f_k(y)\,dy$. An optimal ROC manifold, by the convexity of $\phi(C)$, is at least Lipschitz continuous; moreover, the differential structure and smoothness (meaning infinite differentiability) of optimal ROC manifolds are stated below.
Theorem 3.1. Suppose that all of the $i$th-order partial derivatives of the $f_{Lk}$'s exist, where $f_{Lk}$ is the density function of $L(Y)$ given $G = k$. Then $M_S(u)$ is $i$-times differentiable.
It follows from Theorem 3.1 that $M_S$ is smooth if and only if the $f_{Lk}$'s are smooth. For binary classification tasks, an optimal ROC curve $M_S^*$ with $S = \{p_{11}, p_{22}\}$ can be expressed as a function of, for instance, $p_{11}$:
\[
M_S^*(p_{11}) = F_{L2}(F_{L1}^{-1}(1 - p_{11})),
\]
where $F_{Lk}$ is the cumulative distribution function of $L(Y)$ given $G = k$, $k = 1, 2$. This particularly simple form makes it easy for practitioners to model ROC curves. Even for markers with commonly used distributions, closed-form expressions of ROC manifolds as functions of some of the $p_{jk}$'s seem to be unattainable. In addition, modeling $M_S$ becomes intricate for $K \ge 3$. Nevertheless, $M_S$ can be regarded as a well-behaved function $M_S^*(p_{S \setminus \{p_{jk}\}})$, $p_{jk} \in S$, on the domain given by the projection of $\phi(C)$ onto $\mathcal{R}_{S \setminus \{p_{jk}\}}$. Since classifiers with performance located in $\phi_S(C) \cap ([0,1]_{p_{jk}} \otimes p^*)$ for fixed $p^* \in \mathcal{R}_{S \setminus \{p_{jk}\}}$ form a chain under the dominance relation in $S$, all of them can be shown to be dominated by a unique admissible classifier $\hat{G}_0$ with its performance in the same set. Thus, it is straightforward to obtain $M_S^*(p_{S \setminus \{p_{jk}\}}) = p_{jk}(\hat{G}_0)$. Moreover, in practice, researchers might be interested in exploring the trade-off among the $p_{jk}$'s. With our parameterization, the supporting hyperplane $H_S(\psi_S(u), u)$ is the tangent space at $M_S(u)$, and a parameterized curve along $M_S$ has a tangent vector lying in the tangent space and hence normal to $u$. Consequently, $\partial p_{jk}(\hat{G}_u)/\partial p_{j'k'}(\hat{G}_u) = -u_{j'k'}/u_{jk}$ can be treated as the trade-off between $p_{jk}$ and $p_{j'k'}$ at $M_S(u)$.
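To make the binary formula concrete, the following Python sketch evaluates the optimal ROC curve under an assumed model that is not taken from this article: $Y \mid G = k \sim N(\mu_k, \sigma^2)$ with $\mu_1 > \mu_2$ and a common variance. In that case the likelihood ratio $f_1/f_2$ is increasing in $y$, so thresholding $Y$ is equivalent to thresholding $L(Y)$, and the composition $F_{L2}(F_{L1}^{-1}(1 - p_{11}))$ can be computed with the normal distribution functions of $Y$ directly. The function name `optimal_p22` and the default parameters are hypothetical.

```python
from statistics import NormalDist


def optimal_p22(p11, mu1=1.0, mu2=0.0, sigma=1.0):
    """Optimal p22 for a given p11 in (0, 1) under the equal-variance normal model.

    The rule 'classify as 1 when Y >= t' is a likelihood-ratio rule here
    because f1/f2 is increasing in y when mu1 > mu2.
    """
    t = NormalDist(mu1, sigma).inv_cdf(1.0 - p11)  # P(Y >= t | G = 1) = p11
    return NormalDist(mu2, sigma).cdf(t)           # P(Y <  t | G = 2)


# A few points on the curve, from a small to a large true probability p11.
curve = [(p, optimal_p22(p)) for p in (0.1, 0.5, 0.9)]
```

The curve is decreasing in $p_{11}$, which reflects the trade-off between the two true probabilities described above.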
4 Existence of Hypervolumes
In an $\mathcal{R}_S$ with more than three $p_{jk}$'s under consideration, optimal ROC manifolds are difficult to visualize. Thus, a summary index based on optimal ROC manifolds becomes practically important for comparing markers. Traditionally, the AUC is the most widely used accuracy measure for binary classification tasks. The HUM for multi-classification, as an analogue of the AUC, has been proposed and used in the earlier literature. However, no clear progress has been made on a fundamental problem: the existence of the HUM. In fact, the HUM in general either does not exist or is ill-behaved.
For binary classification, $\phi(C)$, with dimension 2, separates $\mathcal{R}_{\{p_{11}, p_{21}\}}$ and thus guarantees the existence of the optimal AUC. Similarly, with the continuity of optimal ROC manifolds, it is possible that $M_S$ separates $\mathcal{R}_S$ in some specific situations of practical interest; the separation is necessary for the set under $M_S$ to have a volume, denoted by $V_S$. The following results, which further characterize optimal ROC manifolds, determine when the HUM can be treated as an appropriate summary index.
Theorem 4.1. For $K \ge 3$, suppose that $\mathcal{R}_S$ contains two coordinates $p_{ij}$ and $p_{ik}$ for some $i$ and $j \neq k$, and that the $f_{Lk}$'s have a common support. Then there exists a continuous mapping $p_S : [0, 1] \mapsto \mathcal{R}_S$ with $p_S(0) = \operatorname{vec}[\delta_{j'k'}]$ and $p_S(1) = \operatorname{vec}[1((j', k') = (j', \sigma(j')))]$ for arbitrary $\sigma$ with $\sigma(k) \neq k$ such that $\{p_S(t) : t \in [0, 1]\} \cap \phi_S(C) = \emptyset$.
Due to the closedness of $\{(t, p_S(t)) : t \in [0, 1]\}$ and $\phi(C)$, the distance between $\phi_S(C)$ and $\partial\mathcal{R}_S$ can be shown to be positive. In general, therefore, an optimal ROC manifold cannot enclose a set with finite hypervolume. It follows from Theorem 4.1 that both the optimal and the non-optimal HUM can be well defined only if $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$ for some permutation function $\sigma$.
The next focus of this section is the case of degenerate $M_S$. When $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$ for $K \ge 3$ contains both true and false probabilities, an admissible classifier in $S$ must be of the type $p_{jk}(\hat{G}) = 0$ for $j \neq k$. More explicitly, the dimension of $M_S(u)$ can be shown to be less than $K - 1$ from the property $\{\phi_S(\hat{G}) \in M_S\} \subset \{\phi_{\tilde{S}}(\hat{G}) \in M_{\tilde{S}}\}$ for $\tilde{S} = \{p_{kk} \in S\}$. Thus, $M_S$ is unable to create a separation in $\mathcal{R}_S$. The situation is summarized concretely by the following theorem.
Theorem 4.2. Suppose that $S = \{p_{11}, \ldots, p_{KK}\}$ or $S = \{p_{k\sigma(k)} : \sigma(k) \neq k\}_{k=1}^{K}$ for $K \ge 3$ and some $\sigma$. Then $M_S$ separates $\mathcal{R}_S$ into two regions.
Indeed, the condition in Theorem 4.2 gives more, namely the essential ingredients for $V_S$ to be well behaved. One might think that the HUM can be defined as the hypervolume under $M_S$ only on the domain of $M_S$, in a sense similar to the partial AUC. Unfortunately, although this view would circumvent the question of whether $M_S$ actually separates $\mathcal{R}_S$, the induced accuracy measure is still problematic and meaningless in practice. Specifically, in the $\mathcal{R}_S$ spanned by all false probabilities, [9] provided an argument showing that the HUMs of perfect and useless markers might both be zero. Further, for arbitrary $S$, we characterize precisely the condition under which this undesirable phenomenon occurs and relate it to the condition in Theorem 4.2.
Theorem 4.3. For $K \ge 3$, the HUM $V_S$ under $M_S^*(p_{S \setminus \{p_{jk}\}})$ has both of the features
(i) (Near-perfect marker) $V_S \to 0$ as $p_{jk} \to \delta_{jk}$ for all $p_{jk} \in S$;
(ii) (Non-informative marker) $V_S = 0$ when $p_{jk_1} = p_{jk_2}$ for all $p_{jk_1}, p_{jk_2} \in S$;
if and only if neither $S = \{p_{11}, \ldots, p_{KK}\}$ nor $S = \{p_{k\sigma(k)} : \sigma(k) \neq k\}$ for some $\sigma$.
Thus, the HUM is a rational summary index of discriminability if and only if the performance probabilities of interest span an ROC subspace $\mathcal{R}_S$ with $S$ of one of the two forms excluded in the previous theorem, that is, the forms of Theorem 4.2.
5 Conclusion
For measuring the discriminability of $K$-classification markers, this article provides a theoretical framework showing that the proper measure based on performance probabilities is exactly the corresponding optimal ROC manifold. Through the parameterization given by the utility-maximization criterion, optimal ROC manifolds are verified to be manifolds in the sense of differential geometry. This ensures some practically desirable features, such as smoothness and differentiability, and can directly support the modeling of ROC manifolds. Moreover, we give necessary and sufficient conditions for the existence and good behavior of the HUM. In practice, this justifies the usefulness of the HUM as a summary index for the discriminability of markers when researchers are specifically interested in performance probabilities that correspond to a suitable ROC subspace. We believe that this article establishes a scientific groundwork for further development of multi-classification analysis and of a more general ROC analysis.
A Appendix: Proof of Theorems
A.1 Proof of Theorem 2.1
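Proof. For two arbitrary classifiers $\hat{G}_1$ and $\hat{G}_2$ and $0 \le \lambda \le 1$, the randomized mixture $\hat{G}_\lambda = 1(W_\lambda = 1)\hat{G}_1 + 1(W_\lambda = 0)\hat{G}_2$, with $W_\lambda \sim \mathrm{Bernoulli}(\lambda)$ independent of $(Y, G)$, is also a classifier. Moreover, the convexity of $\phi(C)$ is a direct result of
\[
\phi(\hat{G}_\lambda) = \lambda\phi(\hat{G}_1) + (1 - \lambda)\phi(\hat{G}_2).
\]
Let $\{\hat{G}_k\}$ be a sequence of classifiers with $\phi(\hat{G}_k)$ converging to some point $p_0$. Since the $\hat{G}_k$'s have the finite support $\{1, \ldots, K\}$, for any $\varepsilon > 0$ there exists a positive constant $M_\varepsilon$ such that
\[
P\big(\sup_k \|(\hat{G}_k, Y)\| > M_\varepsilon\big) < \varepsilon.
\]
By Prohorov's theorem [10] and the dominated convergence theorem, there exists a subsequence $\{(\hat{G}_{k_i}, Y)\}$ converging in distribution to $(\hat{G}_0, Y)$ with $\phi(\hat{G}_0) = p_0$. This fact further implies the closedness of $\phi(C)$. Together with the boundedness of $\mathcal{R}$, the compactness of $\phi(C)$ follows directly.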
A.2 Proof of Theorem 2.2
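Proof. It follows from Theorem 2.1 that $\phi_S(C)$ is a convex set. Since $\phi_S(\hat{G}) \in \partial\phi_S(C)$ for an admissible $\hat{G}$, there exists a hyperplane containing $\phi_S(\hat{G})$ but no interior point of $\phi_S(C)$. By standardizing the normal vector of this hyperplane as the utility $u$, $\hat{G}$ can be treated as a utility classifier. Conversely, suppose that a utility classifier $\hat{G}_u$ is not admissible; that is, $\hat{G}^* \succ_S \hat{G}_u$ for some $\hat{G}^*$. Since $u_{jk}p_{jk}(\hat{G}_u) < u_{jk}p_{jk}(\hat{G}^*)$ for some $(j, k)$, we have $u^{\top}\phi_S(\hat{G}_u) < u^{\top}\phi_S(\hat{G}^*)$, which contradicts that $\hat{G}_u$ is a utility classifier.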
A.3 Proof of Theorem 4.1
Proof. For $\tilde{S} \subset S$, if $\{p_{\tilde{S}}(t) : t \in [0, 1]\} \cap \phi_{\tilde{S}}(C) = \emptyset$, any $p_S(t)$ with the projection $p_{\tilde{S}}(t)$ onto $\mathcal{R}_{\tilde{S}}$ has no intersection with $\phi_S(C)$. The rest of this proof ignores degenerate $M_S$ because its dimension is less than $K - 1$. For these reasons, it suffices to investigate the case of $\#S = K + 1$ with $\{p_{k\sigma(k)}\}_{k=1}^{K} \subset S$. For some $p_{ij}$ and $p_{i\sigma(i)} \in S$ with $\sigma(i) \neq j$, we define
\[
p_S(t) = \sum_{\ell=0}^{1} (-1)^{\ell} [(1 - 2t)p_S(0.5) - 2(\ell - t)p_S(\ell)]\, 1_{[0.5\ell,\, 0.5(1+\ell))}(t)
\]
with $p_S(0.5) = \operatorname{vec}[1 - \delta_{i'j'} + (2\delta_{i'j'} - 1) 1_{\{(i,j),(i,\sigma(i))\}}((i', j'))]$. Since $f_{Lj}$ and $f_{L\sigma(i)}$ have a common support, no classifier satisfies $p_{i\sigma(i)}(\hat{G}) \neq p_{ij}(\hat{G})$ and $p_{i'j}(\hat{G}) = \delta_{i'j}$ for $i' \neq i$. Thus, $\{p_S(t) : t \in [0, 0.5]\} \cap \phi_S(C) = \emptyset$. Similarly, no classifier can have $p_{ij}(\hat{G}) = \delta_{ij}$ and $p_{i\sigma(i)}(\hat{G}) = 1 - \delta_{i\sigma(i)}$. This further implies that $\{p_S(t) : t \in [0.5, 1]\} \cap \phi_S(C) = \emptyset$.
A.4 Proof of Theorem 4.3
Proof. As in the proof of Theorem 4.1, we only need to consider non-degenerate $M_S$. Suppose that $S$ is not one of the sets stated in the theorem, so that there exists $\{p_{jk_1}, p_{jk_2}\} \subset S$. Since
\[
\{p_S : p \text{ is under } M_S\} \subset \{p_{\tilde{S}} : p \text{ is under } M_{\tilde{S}}\} \otimes \Big(\bigotimes_{p_{j'k'} \in S \setminus \tilde{S}} [0, 1]_{p_{j'k'}}\Big)
\]
for $\tilde{S} \subset S$, an upper bound on $V_S$ is given by the inequality $V_S \le V_{\tilde{S}}$. Further, one has $V_S \le \min_{k_1 \neq k_2,\ p_{jk_1}, p_{jk_2} \in S} V_{\{p_{jk_1}, p_{jk_2}\}}$, and $V_{\{p_{jk_1}, p_{jk_2}\}}$ of a near-perfect marker must approach zero. As for a non-informative marker, we directly obtain $V_{\{p_{jk_1}, p_{jk_2}\}} = 0$ because $p_{jk_1}(\hat{G}) = p_{jk_2}(\hat{G})$. Thus, $V_S \le V_{\{p_{jk_1}, p_{jk_2}\}}$ further shows that $V_S = 0$.
Conversely, given a perfect marker and $S = \{p_{11}, \ldots, p_{KK}\}$, the corresponding $V_S$ is the hypervolume of a unit $K$-cube and is equal to 1, and $V_S = 0$ for $S = \{p_{k\sigma(k)} : \sigma(k) \neq k\}_{k=1}^{K}$. For a non-informative marker, one can calculate $V_S = 1/K!$, the hypervolume under the hyperplane $\{p^* \in \mathcal{R}_{\{p_{k\sigma(k)}\}_{k=1}^{K}} : \sum_{k=1}^{K} p^*_{k\sigma(k)} = 1\}$, for any $S$ satisfying the condition. This is precisely the assertion of the theorem.
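The value $1/K!$ used in the proof of Theorem 4.3 is the volume of the region of the unit $K$-cube lying under the hyperplane on which the coordinates sum to one. As a rough check, the following Python sketch approximates that volume with a midpoint rule on a regular grid. The function name `volume_under_simplex` and the grid size are assumptions for illustration, and the estimate is accurate only up to the grid resolution.

```python
from itertools import product
from math import factorial


def volume_under_simplex(K, n=50):
    """Midpoint-rule estimate of the volume of {p in [0,1]^K : sum(p) <= 1}.

    The unit cube is cut into n**K equal cells, and a cell is counted when
    its midpoint lies on or below the hyperplane sum(p) = 1.
    """
    inside = sum(1 for cell in product(range(n), repeat=K)
                 if sum((i + 0.5) / n for i in cell) <= 1.0)
    return inside / n ** K


estimate = volume_under_simplex(3)   # ternary case
exact = 1 / factorial(3)             # 1/K! with K = 3
```

For $K = 3$ the estimate agrees with $1/6$ to about four decimal places at this grid size; larger $K$ would call for a coarser grid or a sampling method, since the number of cells grows as $n^K$.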
References
[1] D. Mossman, “Three-way ROCs,” Med. Decis. Making, vol. 19, no. 1, pp. 78–89, Jan. 1999.
[2] S. Dreiseitl, L. Ohno-Machado, and M. Binder, “Comparing three-class diagnostic tests by three-way ROC analysis,” Med. Decis. Making, vol. 20, no. 3, pp. 323–331, Sep. 2000.
[3] X. He, C. Metz, B. Tsui, J. Links, and E. Frey, “Three-class ROC analysis—a decision theoretic approach under the ideal observer framework,” IEEE Trans. on Med. Imag.,
vol. 25, no. 5, pp. 571–581, May 2006.
[4] B. K. Scurfield, “Multiple-event forced-choice tasks in the theory of signal detectability,” J. Math. Psych., vol. 40, no. 3, pp. 253–269, Sep. 1996.
[5] P. R. Halmos, Naive Set Theory. New York: Springer, 1998.
[6] C. M. Schubert, S. N. Thorsen, and M. E. Oxley, “The ROC manifold for classification systems,” Pattern Recognit., vol. 44, no. 2, pp. 350–362, Feb. 2011.
[7] B. K. Scurfield, “Generalization of the theory of signal detectability to n-event m-dimensional forced-choice tasks,” J. Math. Psych., vol. 42, no. 1, pp. 5–31, Mar. 1998.
[8] D. C. Edwards, C. E. Metz, and M. A. Kupinski, “Ideal observers and optimal ROC hypersurfaces in n-class classification,” IEEE Trans. on Med. Imag., vol. 23, no. 7, pp. 891–895, Jul. 2004.
[9] D. C. Edwards, C. E. Metz, and R. M. Nishikawa, “The hypervolume under the ROC hypersurface of “near-guessing” and “near-perfect” observers in n-class classification tasks,” IEEE Trans. on Med. Imag., vol. 24, no. 3, pp. 293–299, Mar. 2005.
[10] A. W. v. d. Vaart, Asymptotic Statistics. Cambridge: Cambridge University Press, 1998.