**PREPRINT**

Department of Mathematics, National Taiwan University

### www.math.ntu.edu.tw/~mathlib/preprint/2011-19.pdf

## ROC Representation for the Discriminability of Multi-Classification Markers

### Yun-Jhong Wu and Chin-Tsang Chiang

### December 30, 2011


Abstract

For a multi-classification problem, this article shows that the receiver operating characteristic (ROC) representation and its induced measures are the only well-defined assessments of the discriminability of markers. Using the convexity and compactness of the performance set, a parameterization system is employed to characterize the corresponding optimal ROC manifolds. Furthermore, the connection with the decision space makes the manifolds computationally simple, and some practically meaningful geometric features of optimal ROC manifolds are highlighted. These results show that a proper ROC subspace is the necessary and sufficient condition for the existence of the hypervolume under the ROC manifold (HUM). This work provides working scientists with a theoretically sound extension of ROC analysis to multi-classification tasks.

Index Terms: Discriminability, Hypervolume, Manifold, Optimal classification, Receiver operating characteristic, Utility.

### 1 Introduction

Receiver operating characteristic (ROC) curves are a popular tool for assessing the performance of binary classification procedures and have been extended to ROC surfaces for ternary classification and to ROC manifolds for general multi-classification (see [1], [2], and [3]). However, ROC surfaces and manifolds are generally ill-posed for multi-classification tasks because of their loose definitions and doubts about their existence. An analogue of the area under the ROC curve (AUC) for multi-classification, named the hypervolume under the ROC manifold (HUM), has been mentioned in the literature, but its existence is also in doubt. To clarify these problems and demonstrate its applied importance, we propose a theoretical unification of ROC manifolds.

Extension of ROC analysis to multi-classification was initially developed for sequential classification procedures, which are of interest for their practical and theoretical simplicity. A sequential algorithm reduces a multi-classification task to a series of binary classifications of the form $G = k$ versus $G \in \{k+1, \ldots, K\}$ in order $k = 1, \ldots, K$. The first systematic study of ternary ROC analysis can be traced back to [4]. For a univariate marker, the author constructed an ROC manifold for ternary classification to visualize the space spanned by $\{p_{1\sigma(1)}(\hat{G}, Y), p_{2\sigma(2)}(\hat{G}, Y), p_{3\sigma(3)}(\hat{G}, Y)\}$, where $\sigma$ is a permutation function and $p_{jk}(\hat{G}, Y) := P(\hat{G}(Y) = j \mid G(Y) = k)$ with $\hat{G}$ a deterministic sequential classifier and $Y$ a univariate marker, and introduced the HUM (also called the volume under the ROC surface, VUS, in some of the literature on ternary classification). By utilizing a metric between each $G$ and $Y$, [1] developed a classification rule to accommodate a multivariate marker. Although this work provided a perspective for extending ROC analysis to multi-classification problems, sequential evaluation is of limited applicability and, as we indicate in this article, its non-optimality can lead to a lack of explainability.

Indeed, ROC analysis gives an illuminating insight into assessing the discriminability of markers. For a typical $K$-classification task, one usually considers a pair $(\hat{G}, \mathbf{Y})$ of a classifier and a marker, where $\hat{G}$ is a function mapping a general classification marker $\mathbf{Y}$ to a distribution with support $\{1, \ldots, K\}$. It is rational to adopt the performance probabilities $p_{jk}(\hat{G}, \mathbf{Y})$ to draw a comparison. A performance $p = (p_{11}, p_{12}, \ldots, p_{1K}, p_{21}, \ldots, p_{KK})^{\top}$ can be plotted in a general ROC space
$$\mathcal{R} = \Big\{p \in \bigotimes_{j,k=1}^{K} [0,1]_{jk} : \sum_{j=1}^{K} p_{jk} = 1, \; 0 \le p_{jk} \le 1\Big\},$$
where $\otimes$ denotes the Cartesian product operator. The space $\mathcal{R}$ is the smallest space sufficient to represent all possible performances. Since working scientists are usually not interested in all $p_{jk}$'s simultaneously, in this article an index set $S$ of the $p_{jk}$'s of interest is used to indicate an ROC subspace $\mathcal{R}_S$. Similarly, operators or sets subscripted by $S$ are restricted to $\mathcal{R}_S$. As is well known for binary classification tasks, the performances of a series of classifiers in $\mathcal{R}_{\{p_{11}, p_{12}\}}$, which form (though not necessarily) an ROC curve, can be plotted as a representation of the discriminability of classification procedures. However, this type of representation is not straightforward for arbitrary $K$-classification tasks. The analysis in this work begins with the concept of proper assessments of the discriminability of markers, which naturally brings up the ROC representation.
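For concreteness, the performance vector $p$ can be estimated empirically from predicted and true labels. The following is a minimal sketch (the function name and data are illustrative, not from the paper):

```python
import numpy as np

def performance_vector(pred, true, K):
    """Empirical performance probabilities p_jk = P(G_hat = j | G = k),
    flattened as (p_11, ..., p_1K, p_21, ..., p_KK)."""
    p = np.zeros((K, K))
    for k in range(1, K + 1):
        in_class_k = (true == k)
        for j in range(1, K + 1):
            p[j - 1, k - 1] = np.mean(pred[in_class_k] == j)
    return p.reshape(-1)

pred = np.array([1, 1, 2, 3, 2, 3, 1, 2])
true = np.array([1, 1, 2, 3, 3, 3, 2, 2])
p = performance_vector(pred, true, K=3)
# each column of the 3 x 3 matrix sums to 1, so p lies in the ROC space R
```

Each column of the reshaped matrix sums to one, which is exactly the constraint defining $\mathcal{R}$.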

### 2 ROC Representation

The performance of a classification procedure can be easily represented by a function of performance probabilities $f(p(\hat{G}, \mathbf{Y}))$. To assess the discriminability of $\mathbf{Y}$ "fairly", the evaluation should depend only on the marker, i.e., it should have the form $f(\mathbf{Y})$, invariant to the classifier we choose. We will show that the performance-probability-based evaluation is equivalent to the ROC representation.

Figure 1: Performance set $\varphi(\mathcal{C})$ for binary classification (each point denotes the performance of $Y$ with respect to one classifier).

### 2.1 Performance Sets

To simplify algebraic manipulations, $p_{jk}(\hat{G}, \mathbf{Y})$ for a fixed $\mathbf{Y}$ is briefly denoted by $p_{jk}(\hat{G})$, and the performance function is defined as
$$\varphi(\hat{G}) = (p_{11}(\hat{G}), p_{12}(\hat{G}), \ldots, p_{1K}(\hat{G}), p_{21}(\hat{G}), \ldots, p_{KK}(\hat{G}))^{\top},$$
which still depends on $\hat{G}$. To eliminate the role of the employed classifier in an accuracy measure of the discriminability of $\mathbf{Y}$, it is rational to consider the performance set
$$\varphi(\mathcal{C}) = \{\varphi(\hat{G}) : \hat{G} \in \mathcal{C}\},$$
where the set $\mathcal{C}$ consists of all deterministic and randomized classifiers (see Figure 1). This set contains the performances of all existing classifiers with respect to one given $\mathbf{Y}$ and conveys all information about the classification capacity of the specified marker. As a representation of discriminating ability, $\varphi(\mathcal{C})$ depends only on $\mathbf{Y}$. The following theorem gives a simplification of the representation.

Theorem 2.1. A performance set $\varphi(\mathcal{C})$ is compact and convex, and so is $\varphi_S(\mathcal{C})$ for any $S$.

To characterize a convex and compact set, it suffices to portray its boundary, because the features of the set are determined by it. Thus, one can estimate just the boundary set rather than the whole performance set or an arbitrary subset of $\varphi(\mathcal{C})$. After elucidating the perspective of optimality, we show that this boundary set is related to the optimality of classifiers, again because of convexity.
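The convexity can also be checked numerically: a $\mathrm{Bernoulli}(\lambda)$ randomization of two classifiers has, by linearity of expectation, a performance equal to the convex combination of their performances. A small sketch (function names and data are illustrative, not from the paper):

```python
import numpy as np

def perf_probs(cond, true, K):
    """p_jk = E[P(G_hat = j | Y) | G = k] for a classifier given as an
    (n, K) matrix of conditional class probabilities; returns a K x K matrix."""
    return np.column_stack([cond[true == k].mean(axis=0) for k in range(1, K + 1)])

rng = np.random.default_rng(0)
n, K, lam = 300, 3, 0.3
true = rng.integers(1, K + 1, size=n)
G1 = np.eye(K)[rng.integers(0, K, size=n)]   # a deterministic classifier (one-hot rows)
G2 = np.eye(K)[rng.integers(0, K, size=n)]   # another deterministic classifier
G_lam = lam * G1 + (1 - lam) * G2            # Bernoulli(lam) randomization, marginalized
# perf_probs(G_lam, ...) equals lam * perf_probs(G1, ...) + (1 - lam) * perf_probs(G2, ...)
```

The identity holds exactly because the performance probabilities are linear in the conditional class probabilities of the classifier.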

### 2.2 Parametrized Optimal ROC Manifolds

A parameterization system is helpful for analyzing and computing the boundary $\partial\varphi(\mathcal{C})$ of the performance set in both theory and practice. Intuitively, a classifier $\hat{G}$ is considered better than another if it attains higher true probabilities or lower false probabilities. Thus, classifiers partially ranked by a dominance relation are naturally introduced.

Definition 2.1. A classifier $\hat{G}_1$ dominates another classifier $\hat{G}_2$ in $S$, denoted by $\hat{G}_1 \succeq_S \hat{G}_2$, if $p_{kk}(\hat{G}_1) \ge p_{kk}(\hat{G}_2)$ and $p_{jk}(\hat{G}_1) \le p_{jk}(\hat{G}_2)$ for all $p_{kk}$ and $p_{jk} \in S$ with $j \ne k$. A classifier $\hat{G}_1$ strictly dominates $\hat{G}_2$ in $S$, denoted by $\hat{G}_1 \succ_S \hat{G}_2$, if at least one of the above inequalities is strict.

A classifier is customarily said to be admissible if no classifier strictly dominates it. With the dominance relation as a partial ordering on $\mathcal{C}$, the compactness of $\varphi_S(\mathcal{C})$ ensures that each chain $\hat{G}_1 \preceq_S \hat{G}_2 \preceq_S \hat{G}_3 \preceq_S \cdots$ has a greatest element in $\varphi(\mathcal{C})$ and, by Zorn's lemma (see, e.g., [5]), all classifiers in the chain are dominated by an admissible classifier. It is easy to see that the performance of an admissible classifier belongs to the boundary $\partial\varphi_S(\mathcal{C})$ of $\varphi_S(\mathcal{C})$ relative to $\mathcal{R}_S$. In theoretical development, a parameterized representation of $\partial\varphi_S(\mathcal{C})$ is convenient for exploring the properties of the set of performances of admissible classifiers. At the same time, practitioners should be intrigued by the form of these admissible classifiers.

Fortunately, maximization of expected utility, an optimization criterion, is instrumental to simultaneously achieve the theoretical and practical interests.

The utility of $\hat{G}$, as in decision theory, can be defined as $U(\hat{G}) = \sum_{j,k} u_{jk} 1(\hat{G} = j, G = k)$, and its expectation is
$$E[U(\hat{G})] = \sum_{j,k} u_{jk} P(\hat{G} = j, G = k), \qquad (1)$$

where the utility values $u_{jk}$ satisfy $u_{kk} \ge 0$ and $u_{jk} \le 0$ for $j \ne k$. We note that the positive parameters $p_k = P(G = k)$, $k = 1, \ldots, K$, can be absorbed into the $u_{jk}$'s and, hence, the expected utility of a classifier simplifies to $u^{\top}\varphi(\hat{G})$, where $u = (u_{11}, u_{12}, \ldots, u_{1K}, u_{21}, \ldots, u_{KK})^{\top}$. Since maximizing $u^{\top}\varphi(\hat{G})$ is equivalent to maximizing $(cu)^{\top}\varphi(\hat{G})$ for all $c > 0$, the condition $\|u\| = 1$ is imposed to obtain scale invariance. To locate $u$ in a subspace containing $\varphi(\mathcal{C}) - K^{-1}\mathbf{1}_{K^2}$, the further constraint $\sum_{j=1}^{K} u_{jk} = K^{-1}$ for all $k = 1, \ldots, K$ is imposed. Thus, the standardized $u$ includes $K^2 - K - 1$ free utility values. In particular, in $\mathcal{R}_S$, the $u_{jk}$'s are naturally set to zero for all $p_{jk} \notin S$, and the number of free utility values reduces to
$$\#S - \#\{k : \{p_{1k}, \ldots, p_{Kk}\} \subset S\} - 1.$$

Interestingly, utility is synonymous with the negative Bayes risk, with positive and negative utility values treated as gain and loss, respectively, in classification. For any given $u$, we can find a corresponding classifier $\hat{G}_u$, called a utility classifier, with expected utility
$$E[U(\hat{G}_u)] = \sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi(\hat{G}).$$
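For a univariate binary marker, a utility classifier can be found numerically by maximizing $u^{\top}\varphi(\hat{G})$ over a family of threshold rules. A sketch under an assumed binormal model (the setup and numbers are illustrative, not from the paper):

```python
import numpy as np
from math import erf, sqrt

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))  # standard normal CDF

def phi_t(t):
    """Population performance (p11, p12, p21, p22) of the threshold rule
    G_t(y) = 1 if y < t else 2, for Y|G=1 ~ N(0,1) and Y|G=2 ~ N(1,1)."""
    p11 = Phi(t)
    p22 = 1.0 - Phi(t - 1.0)
    return np.array([p11, 1.0 - p22, 1.0 - p11, p22])

# reward only correct decisions; the scale of u is irrelevant by the noted invariance
u = np.array([0.5, 0.0, 0.0, 0.5])
ts = np.linspace(-3.0, 4.0, 7001)
t_star = ts[np.argmax([u @ phi_t(t) for t in ts])]
# t_star approximates 0.5, the likelihood-ratio (Bayes) threshold for these utilities
```

The grid maximizer lands at the point where the two class-conditional densities balance, i.e., on the boundary $\partial\varphi(\mathcal{C})$.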

Figure 2: Examples of optimal ROC manifolds for ternary classification: (a) $\mathcal{M}_{\{p_{11}, p_{22}, p_{33}\}}$; (b) $\mathcal{M}_{\{p_{12}, p_{23}, p_{31}\}}$.

As expected, the convexity and compactness of $\varphi(\mathcal{C})$ imply that $u^{\top}\varphi(\mathcal{C})$ is a closed interval, which leads to $\varphi(\hat{G}_u) \in \partial\varphi(\mathcal{C})$ and the existence of $\hat{G}_u$. To justify characterizing $\varphi(\hat{G}_u)$ as the performance of an admissible classifier, it remains to establish the equivalence between the maximization-of-expected-utility criterion and admissibility.

Theorem 2.2. A classifier $\hat{G}$ is admissible in $S$ if and only if it is a utility classifier in $\mathcal{R}_S$.

With the implication of Theorem 2.1, the optimal ROC manifold
$$\mathcal{M}_S := \{\varphi_S(\hat{G}) : \hat{G} \text{ is admissible in } S\}$$
fully represents the discriminability of a multi-classification marker (see Figure 2 for examples). Before giving further characterization of parameterized optimal ROC manifolds, we illustrate their interpretability and computability in terms of decision spaces.

Remark 2.1. In convex analysis, the Minkowski functional is another way to represent the boundary of a convex set. It has also been employed to construct (non-optimal) ROC manifolds (e.g., [6]). With this tool, the set $\partial\varphi_S(\mathcal{C})$ can also be parameterized as
$$\partial\varphi_S(\mathcal{C})(p) = p \sup\{c : cp + K^{-1}\mathbf{1} \in \varphi_S(\mathcal{C})\} + K^{-1}\mathbf{1},$$
where $p$ is a unit vector in $\mathcal{R}_S$. Although this approach gives a concise illustration of the manifoldness of $\partial\varphi(\mathcal{C})$, the Minkowski functional is not yet directly affiliated with a figurative statistical meaning and is inconvenient for capturing the local structure of optimal ROC manifolds. For this purpose, one still relies on its relation with the support function, which requires more algebraic work.

### 2.3 Connection with Decision Space

The decision space, spanned by likelihood ratios, has been mentioned in some applied fields (e.g., [7]) and gives computational simplicity to ROC manifolds. Let $f_k(y)$ denote the density function of $\mathbf{Y}$ given $G = k$ and $L(y) = (L_{1K}(y), \ldots, L_{(K-1)K}(y))^{\top}$ with $L_{jk}(y) = f_j(y)/f_k(y)$. In addition, the sets $\{y \in \mathcal{Y} : L_{kK}(y) = c\}$, $k = 1, \ldots, K-1$, are assumed to have measure zero for all $c > 0$.
From the expected utility in (1), we can derive an explicit form of the optimal classifiers with an argument slightly different from that of [8]. By using the equality $U(\hat{G}) = \sum_{k=1}^{K} U(\hat{G}) 1(\hat{G}(\mathbf{Y}) = k)$ and iterated expectation, the expected utility of $\hat{G}$ can also be expressed as
$$E[U(\hat{G})] = E\Big[\sum_{k=1}^{K} E[U(\hat{G}) 1(\hat{G}(\mathbf{Y}) = k) \mid \mathbf{Y}]\Big].$$

This decomposition facilitates constructing a utility classifier by maximizing the conditional expected utility pointwise over $\mathcal{Y}$. For each $y \in \mathcal{Y}$, we set $P(\hat{G}(\mathbf{Y}) = k \mid \mathbf{Y} = y) = 1$ if
$$E[U(\hat{G}) 1(\hat{G}(\mathbf{Y}) = k) \mid \mathbf{Y} = y] \ge \max_{1 \le j \le K} E[U(\hat{G}) 1(\hat{G}(\mathbf{Y}) = j) \mid \mathbf{Y} = y]. \qquad (2)$$
By using (1) and absorbing $p_i$ into the $u_{ki}$'s, the inequality in (2) can be rewritten as
$$\min_{1 \le j \le K} \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) \ge 0.$$

Thus, the utility classifier $\hat{G}_u$ satisfies $P(\hat{G}_u(\mathbf{Y}) = k \mid \mathbf{Y} = y) = 1$ if
$$L(y) \in D_k(u) = \bigcap_{j \ne k} \Big\{L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) \ge 0, \; y \in \mathcal{Y}\Big\}. \qquad (3)$$

It is clear that $\{D_k(u)\}_{k=1}^{K}$ is a partition of the space spanned by the likelihood ratio scores $L(\mathbf{Y})$, and the intersection $\bigcap_{j \ne k} \{L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji}) L_{iK}(y) = 0, \; y \in \mathcal{Y}\}$ is a critical point $c(u)$. Such a decision space $\mathcal{D}$ has been proposed by researchers in psychology and radiology to describe classifiers. Once it is ascertained that every admissible classifier can be manifested as a combination of linear classifiers in the decision space, it should be highlighted that the transformation $L(\mathbf{Y})$ is an optimal marker for $K$-classification.
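The partition in (3) is easy to evaluate: the minimum condition holds exactly for the class $k$ whose score $\sum_i u_{ki} L_{iK}(y)$ is largest, since the $u_{ji}$ terms cancel in the comparison. A minimal sketch (all names and numbers are illustrative):

```python
import numpy as np

def utility_classifier(L, u):
    """Evaluate the decision-space partition in (3): given likelihood-ratio
    scores L = (L_1K(y), ..., L_(K-1)K(y)) and a K x K utility matrix u
    (rows indexed by the decision k), return k = argmax_k sum_i u[k,i] * L_iK(y),
    using L_KK(y) = 1. Classes are labeled 1..K."""
    L_full = np.append(np.asarray(L, dtype=float), 1.0)
    scores = u @ L_full
    return int(np.argmax(scores)) + 1

u = np.diag([0.4, 0.3, 0.3])       # reward correct decisions only
# a point y where f_1 dominates: L_13(y) = 2.0, L_23(y) = 0.5  ->  class 1
k = utility_classifier([2.0, 0.5], u)
```

Ties on the boundary sets $\partial D_k(u)$ occur with probability zero under the stated measure-zero assumption, so the argmax rule is well defined almost everywhere.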

At first sight, utilizing the decision space to express classifiers might seem to involve some complexities, but these doubts can be fully clarified. First, even if the $D_k(u)$'s in (3) overlap, the intersection is a subset of $\partial D_k(u)$ with probability measure zero; thus, optimal classifiers are still well defined. Second, the dimensionality of the decision space can be lower than that of the original markers; in fact, the minimal sufficiency of the statistic $L(\mathbf{Y})$ for $(\mathbf{Y}, G)$ ensures the invariance of the performance of classifiers, which is evidenced by the fact that
$$p_{jk}(\hat{G}) = E[E[P(\hat{G}(\mathbf{Y}) = j \mid \mathbf{Y}) \mid L(\mathbf{Y})] \mid G = k] = E[P(\hat{G}(\mathbf{Y}) = j \mid L(\mathbf{Y})) \mid G = k]. \qquad (4)$$
It follows from (4) that there exists a corresponding $\hat{G}^{*}(L(\mathbf{Y}))$ with the same performance as $\hat{G}(\mathbf{Y})$. Finally, the decision space can be extended to include infinite values; the conditional distributions of $\mathbf{Y}$ given $G$ are easily transformed into the generalized decision space even without the assumption of a common support for $\{f_k\}_{k=1}^{K}$.

From Theorem 2.2, a classifier with maximum true probabilities can be shown to be a utility classifier with $u_{jk} = 0$ for all $j \ne k$, and vice versa. For a non-degenerate case with $u_{kk} > 0$, one can further simplify (3) as
$$D_k(u) = \bigcap_{j \ne k} \Big\{L(y) : L_{jk}(y) \le \frac{u_{kk}}{u_{jj}}, \; y \in \mathcal{Y}\Big\}$$
with an explicit critical point $c(u) = u_{KK} \cdot (u_{11}^{-1}, \ldots, u_{(K-1)(K-1)}^{-1})^{\top}$. Practically, it is easier to use $c(u)$ as $K-1$ threshold values in $\mathcal{D}$ to represent or index an optimal classifier when $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$, which ensures that $c(u)$ is a bijective function of $u$.
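For diagonal utilities, classification therefore reduces to comparing each likelihood-ratio score against the corresponding coordinate of $c(u)$. A sketch of this threshold representation (names and numbers are illustrative, not from the paper):

```python
import numpy as np

def threshold_classifier(L, c):
    """Classify from likelihood-ratio scores L = (L_1K(y), ..., L_(K-1)K(y))
    with thresholds c = u_KK * (1/u_11, ..., 1/u_(K-1)(K-1)): since
    u_kk * L_kK = u_KK * (L_kK / c_k), pick the class maximizing L_kK / c_k,
    with class K carrying the score L_KK / c_K = 1. Classes labeled 1..K."""
    scores = np.append(np.asarray(L, dtype=float) / np.asarray(c, dtype=float), 1.0)
    return int(np.argmax(scores)) + 1

# e.g. u = diag(0.2, 0.8, 0.4) gives c(u) = 0.4 * (1/0.2, 1/0.8) = (2.0, 0.5)
c = np.array([2.0, 0.5])
```

Sweeping the $K-1$ thresholds then traces out exactly the admissible classifiers indexed by $c(u)$.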

### 3 Characterization of Optimal ROC Manifolds

Admissibility involves global optimality; to make a comparison based on a specific $p_{jk} \in S$, an admissible classifier in $S$, if it exists, has the highest $p_{jk}(\hat{G})$ for $j = k$ or the lowest $p_{jk}(\hat{G})$ for $j \ne k$ among all classifiers with fixed values of the other performance probabilities in $S$. From the geometric perspective, $\varphi_S(\hat{G})$ is the highest or the lowest point on the domain spanned by $S \setminus \{p_{jk}\}$. As one can see, the optimality of classifiers carries both theoretical and practical importance. Some researchers have worked on the construction of ROC manifolds; however, without optimality of the classifiers, the so-called ROC manifold could be an arbitrary subset of $\varphi_S(\mathcal{C})$ rather than a manifold in the sense of geometry. Therefore, few features of such ROC manifold sets can be pinpointed, and estimation of ROC manifold sets and related summary measures may lead to ambiguous and more complicated situations. With this motivation, we introduce optimal ROC manifolds for multi-classification as an extension of optimal ROC curves for binary classification.

For this problem, our first strategy is to show that the set of performances of admissible classifiers is a manifold; roughly speaking, the structure of such a set is locally similar to Euclidean space. A parametric system is further employed to investigate its geometric properties. The developed parametric mechanism for $\mathcal{M}_S$ is mainly based on expected utility, or support functions in the terminology of convex analysis. The hyperplanes in $\mathcal{R}_S$ can be expressed as
$$H_S(r, u) = \{p \in \mathcal{R}_S : u^{\top} p = r\}.$$
The set $\varphi_S(\mathcal{C})$ can then be rewritten as $\bigcup_{r \in \mathbb{R}} (H_S(r, u) \cap \varphi_S(\mathcal{C}))$, and $r$ can be treated as the expected utility of the classifiers $\hat{G}$ with $\varphi_S(\hat{G}) \in H_S(r, u) \cap \varphi_S(\mathcal{C})$ (see Figure 3). A parametric version of $\mathcal{M}_S$ is then established as
$$\mathcal{M}_S(u) = H_S\Big(\sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi_S(\hat{G}), u\Big) \cap \varphi_S(\mathcal{C}), \qquad (5)$$
which is a nonempty set in $\mathcal{R}_S$. To reveal the structural similarity between $\mathcal{M}_S$ and Euclidean space, its local structure can be explicitly depicted by showing that $\mathcal{M}_S(u)$ is a bijective and bicontinuous function from utility values to $\mathcal{M}_S$. First, one has $P(L(\mathbf{Y}) \in \mathcal{D} \setminus \bigcup_{k=1}^{K} (D_k(u_1) \cap D_k(u_2))) = 0$ for the singular case $\varphi_S(\hat{G}_{u_1}) = \varphi_S(\hat{G}_{u_2})$, $u_1 \ne u_2$. Such pairs $\{u_1, u_2\}$ are at most countably many and, hence, ignorable because of the continuity of

Figure 3: Support function/utility as a parameterization system for $\mathcal{M}_S$

$\mathcal{M}_S(u)$. Second, the minimal sufficiency of $L(\mathbf{Y})$ implies that the set of non-informative markers $\{L(\mathbf{Y}) : \hat{G}_u(\mathbf{Y}) \ne \hat{G}^{*}_u(\mathbf{Y})\}$ has probability measure zero when $\varphi_S(\hat{G}_u) \ne \varphi_S(\hat{G}^{*}_u)$. Thus, the classifier $\hat{G}_{\lambda} = 1(W_{\lambda} = 1)\hat{G} + 1(W_{\lambda} = 0)\hat{G}^{*}$ with $W_{\lambda} \sim \mathrm{Bernoulli}(\lambda)$ is not a utility classifier for $0 < \lambda < 1$. These two characteristics confirm the injectivity of $\mathcal{M}_S(u)$ except at countably many singular points. As for the bicontinuity of $\mathcal{M}_S(u)$, it is further ensured by convexity: the continuous differentiability of $\mathcal{M}_S(u)$ implies that the normal vector $u_1$ of $\mathcal{M}_S$ at $\mathcal{M}_S(u_1)$ converges to $u_2$ as $\mathcal{M}_S(u_1)$ moves toward $\mathcal{M}_S(u_2)$. It follows that $\mathcal{M}_S^{-1}(u)$ is continuous except at the singular points. With the parametric system in (5), an optimal ROC manifold $\mathcal{M}_S$ is indeed an $s$-dimensional manifold in the sense of geometry, where $s$ is the number of free utility values, and is regular almost everywhere.

The above parameterization supplies some intrinsic characterization of $\mathcal{M}_S$ through its convexity and the expression $p_{jk}(\hat{G}_u) = \int_{L(y) \in D_j(u)} f_k(y)\, dy$. An optimal ROC manifold, endowed with the convexity of $\varphi(\mathcal{C})$, is at least Lipschitz continuous; moreover, the differential structure and smoothness (meaning infinite differentiability) of optimal ROC manifolds are stated below.

Theorem 3.1. Suppose that all $i$th-order partial derivatives of the $f_{L_k}$'s exist, where $f_{L_k}$ is the density function of $L(\mathbf{Y})$ given $G = k$. Then $\mathcal{M}_S(u)$ is $i$-times differentiable.

It follows from Theorem 3.1 that $\mathcal{M}_S$ is smooth if and only if the $f_{L_k}$'s are smooth. For binary classification tasks, an optimal ROC curve $\mathcal{M}_S^{*}$ with $S = \{p_{11}, p_{22}\}$ can be expressed as a function of, for instance, $p_{11}$:
$$\mathcal{M}_S^{*}(p_{11}) = F_{L_2}(F_{L_1}^{-1}(1 - p_{11})),$$
where $F_{L_k}$ is the cumulative distribution function of $L(\mathbf{Y})$ given $G = k$, $k = 1, 2$. This particularly simple form greatly helps practitioners to model ROC curves. Even for markers with regularly used distributions, however, closed-form expressions of ROC manifolds as functions of some $p_{jk}$'s seem unattainable, and modeling $\mathcal{M}_S$ becomes intricate for $K \ge 3$. Admittedly, $\mathcal{M}_S$ can be regarded as a well-behaved function $\mathcal{M}_S^{*}(p_{S \setminus \{p_{jk}\}})$, $p_{jk} \in S$, on the domain given by the projection of $\varphi(\mathcal{C})$ onto $\mathcal{R}_{S \setminus \{p_{jk}\}}$. Since classifiers with performance located in $\varphi_S(\mathcal{C}) \cap ([0,1]_{p_{jk}} \otimes p^{*})$ for fixed $p^{*} \in \mathcal{R}_{S \setminus \{p_{jk}\}}$ form a chain under $\succeq_S$, all of them can be shown to be dominated by a unique admissible classifier $\hat{G}_0$ with its performance in the same set. Thus, it is straightforward to obtain $\mathcal{M}_S^{*}(p_{S \setminus \{p_{jk}\}}) = p_{jk}(\hat{G}_0)$. Besides, in practice, researchers might be interested in exploring a trade-off among the $p_{jk}$'s. With our constructed parameterization, the supporting hyperplane $H_S(\psi_S(u), u)$ is a tangent space at $\mathcal{M}_S(u)$, and a parameterized curve along $\mathcal{M}_S$ has a tangent vector lying in the tangent space and hence normal to $u$. It follows that $\partial p_{jk}(\hat{G}_u)/\partial p_{j'k'}(\hat{G}_u) = -u_{j'k'}/u_{jk}$ can be treated as the trade-off between $p_{jk}$ and $p_{j'k'}$ at $\mathcal{M}_S(u)$.
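In the binormal case the optimal ROC curve above has a closed form: with $Y \mid G=1 \sim N(0,1)$ and $Y \mid G=2 \sim N(\mu, 1)$, the likelihood ratio is monotone in $y$, and the curve reduces to $p_{22} = 1 - \Phi(\Phi^{-1}(p_{11}) - \mu)$. A sketch (this specific model is an assumption for illustration, and the bisection inverse is a hypothetical helper, not from the paper):

```python
from math import erf, sqrt

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))  # standard normal CDF

def Phi_inv(q, lo=-10.0, hi=10.0):
    """Bisection inverse of Phi (illustrative helper)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimal_roc(p11, mu=1.0):
    """Optimal binary ROC value p22 = M*_S(p11) for the binormal model
    Y|G=1 ~ N(0,1), Y|G=2 ~ N(mu,1); equals F_L2(F_L1^{-1}(1 - p11))."""
    return 1.0 - Phi(Phi_inv(p11) - mu)
```

The curve is decreasing in $p_{11}$, reflecting the trade-off between the two true probabilities; for $\mu = 0$ (a non-informative marker) it reduces to the line $p_{22} = 1 - p_{11}$.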

### 4 Existence of Hypervolumes

In an ROC subspace $\mathcal{R}_S$ with more than three $p_{jk}$'s under consideration, optimal ROC manifolds are difficult to visualize. Thus, a summary index based on optimal ROC manifolds becomes practically important for drawing comparisons between markers. Traditionally, the AUC is the most widely used accuracy measure for binary classification tasks. The HUM for multi-classification, as an analogue of the AUC, has been proposed and used in the foregoing literature. However, no clear progress has been made on a radical problem: the existence of the HUM. In fact, the HUM does not exist, or is ill-behaved, in general.

For binary classification, $\varphi(\mathcal{C})$, being two-dimensional, separates $\mathcal{R}_{\{p_{11}, p_{12}\}}$ and thereby guarantees the existence of the optimal AUC. Similarly, with the continuity of optimal ROC manifolds, it is possible that $\mathcal{M}_S$ separates $\mathcal{R}_S$ in some specific situations of practical interest; this separation is necessary for the set under $\mathcal{M}_S$ to have a volume form, denoted by $V_S$. The following results, which further characterize optimal ROC manifolds, determine when the HUM can be treated as an appropriate summary index.

Theorem 4.1. For $K \ge 3$, suppose that $\mathcal{R}_S$ contains two coordinates $p_{ij}$ and $p_{ik}$ for some $i$ and $j \ne k$, and that the $f_{L_k}$'s have a common support. Then there exists a continuous mapping $p_S : [0,1] \to \mathcal{R}_S$ with $p_S(0) = \mathrm{vec}[\delta_{j'k'}]$ and $p_S(1) = \mathrm{vec}[1((j', k') = (j', \sigma(j')))]$ for arbitrary $\sigma$ with $\sigma(k) \ne k$ such that $\{p_S(t) : t \in [0,1]\} \cap \varphi_S(\mathcal{C}) = \emptyset$.

Due to the closedness of $\{(t, p_S(t)) : t \in [0,1]\}$ and $\varphi(\mathcal{C})$, the distance between $\varphi_S(\mathcal{C})$ and $\partial\mathcal{R}_S$ can be shown to be positive. In general, an optimal ROC manifold cannot enclose a set with finite hypervolume. It follows from Theorem 4.1 that both the optimal and the non-optimal HUM can be well defined only if $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$ for some permutation function $\sigma$.
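When $S = \{p_{11}, \ldots, p_{KK}\}$ and a univariate marker orders the classes naturally, a commonly used empirical version of the HUM (for $K = 3$, the VUS) is the proportion of correctly ordered cross-class triples, estimating $P(Y_1 < Y_2 < Y_3)$; this estimator is standard in the ternary ROC literature rather than specific to this paper. A sketch with simulated data:

```python
import numpy as np

def empirical_hum(y1, y2, y3):
    """Empirical P(Y1 < Y2 < Y3) over all cross-class triples: the HUM (VUS)
    of a univariate marker for S = {p_11, p_22, p_33} under natural ordering."""
    a = y1[:, None, None]
    b = y2[None, :, None]
    c = y3[None, None, :]
    return float(np.mean((a < b) & (b < c)))

rng = np.random.default_rng(1)
# non-informative marker: identical class distributions give HUM near 1/3! = 1/6
h0 = empirical_hum(rng.normal(0, 1, 60), rng.normal(0, 1, 60), rng.normal(0, 1, 60))
# well-separated marker: HUM close to 1
h1 = empirical_hum(rng.normal(0, 1, 60), rng.normal(3, 1, 60), rng.normal(6, 1, 60))
```

The baselines $1/K!$ (useless marker) and $1$ (perfect marker) match the degenerate values discussed in Section 4 for this choice of $S$.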

The next focus of this section is the case of degenerate $\mathcal{M}_S$. When $S = \{p_{k\sigma(k)}\}_{k=1}^{K}$ for $K \ge 3$ contains both true and false probabilities, an admissible classifier in $S$ must be of the type $p_{jk}(\hat{G}) = 0$ for $j \ne k$. More explicitly, the dimension of $\mathcal{M}_S(u)$ can be shown to be less than $K - 1$ from the property $\{\varphi_S(\hat{G}) \in \mathcal{M}_S\} \subset \{\varphi_{\tilde{S}}(\hat{G}) \in \mathcal{M}_{\tilde{S}}\}$ for $\tilde{S} = \{p_{kk} \in S\}$. Thus, $\mathcal{M}_S$ is unable to create a separation in $\mathcal{R}_S$. This conclusion is concretely summarized in the following theorem.

Theorem 4.2. Suppose that $S = \{p_{11}, \ldots, p_{KK}\}$ or $S = \{p_{k\sigma(k)} : \sigma(k) \ne k\}_{k=1}^{K}$ for $K \ge 3$ and some $\sigma$. Then $\mathcal{M}_S$ separates $\mathcal{R}_S$ into two regions.

Indeed, the condition in Theorem 4.2 gives more, namely the essential ingredients for the well-behavedness of $V_S$. One might think that the HUM could be defined as the hypervolume under $\mathcal{M}_S$ only on the domain of $\mathcal{M}_S$, in a sense similar to the partial AUC. Unfortunately, the induced accuracy measure is still problematic and meaningless in practice, although this view would circumvent the question of whether $\mathcal{M}_S$ can actually separate $\mathcal{R}_S$. Specifically, in the $\mathcal{R}_S$ spanned by all false probabilities, [9] provided an argument elucidating that the HUMs of perfect and useless markers might both be zero. Further, for arbitrary $S$, we clearly characterize the condition for the occurrence of this undesirable phenomenon and relate it to that in Theorem 4.2.

Theorem 4.3. For $K \ge 3$, the HUM $V_S$ under $\mathcal{M}_S^{*}(p_{S \setminus \{p_{jk}\}})$ has both of the features

(i) (Near-perfect marker) $V_S \to 0$ as $p_{jk} \to \delta_{jk}$ for all $p_{jk} \in S$;

(ii) (Non-informative marker) $V_S = 0$ when $p_{jk_1} = p_{jk_2}$ for all $p_{jk_1}, p_{jk_2} \in S$;

if and only if neither $S = \{p_{11}, \ldots, p_{KK}\}$ nor $S = \{p_{k\sigma(k)} : \sigma(k) \ne k\}$ for some $\sigma$.

Thus, the HUM is a rational summary index of discriminability if and only if the performance probabilities of interest span an ROC subspace satisfying the conditions in the previous theorems.

### 5 Conclusion

As a measure of the discriminability of $K$-classification markers, this article provides a theoretical framework showing that a proper measure based on performance probabilities is exactly the corresponding optimal ROC manifold. Through the parameterization by the utility-maximization criterion, optimal ROC manifolds are verified to be manifolds in the sense of differential geometry. This ensures some practically desirable features, such as smoothness and differentiability, and can directly support work on modeling ROC manifolds. Moreover, we give the necessary and sufficient conditions for the existence and well-behavedness of the HUM.

In practice, this justifies the usefulness of the HUM as a summary index for the discriminability of markers when researchers are especially interested in performance probabilities with respect to a suitable ROC subspace. We believe that this article establishes a scientific groundwork for further developments in multi-classification analysis and in a more general ROC analysis.

### A Appendix: Proof of Theorems

### A.1 Proof of Theorem 2.1

Proof. For two arbitrary classifiers $\hat{G}_1$ and $\hat{G}_2$, the randomized mixture $\hat{G}_{\lambda} = 1(W_{\lambda} = 1)\hat{G}_1 + 1(W_{\lambda} = 0)\hat{G}_2$ with $W_{\lambda} \sim \mathrm{Bernoulli}(\lambda)$ is also a classifier. The convexity of $\varphi(\mathcal{C})$ is then a direct result of
$$\varphi(\hat{G}_{\lambda}) = \lambda\varphi(\hat{G}_1) + (1 - \lambda)\varphi(\hat{G}_2).$$

Let $\{\hat{G}_k\}$ be a sequence of classifiers with $\varphi(\hat{G}_k)$ converging to some point $p_0$. Since the $\hat{G}_k$'s have the finite support $\{1, \ldots, K\}$, for any $\varepsilon > 0$ there exists a positive constant $M_{\varepsilon}$ such that
$$P\Big(\sup_k \|(\hat{G}_k, \mathbf{Y})\| > M_{\varepsilon}\Big) < \varepsilon.$$
By Prohorov's theorem [10] and the dominated convergence theorem, there exists a subsequence $\{(\hat{G}_{k_i}, \mathbf{Y})\}$ converging in distribution to some $(\hat{G}_0, \mathbf{Y})$ with $\varphi(\hat{G}_0) = p_0$. This fact implies the closedness of $\varphi(\mathcal{C})$. Together with the boundedness of $\mathcal{R}$, the compactness of $\varphi(\mathcal{C})$ follows.

### A.2 Proof of Theorem 2.2

Proof. It follows from Theorem 2.1 that $\varphi_S(\mathcal{C})$ is a convex set. Together with $\varphi_S(\hat{G}) \in \partial\varphi_S(\mathcal{C})$, there exists a hyperplane containing $\varphi_S(\hat{G})$ but no interior point of $\varphi_S(\mathcal{C})$. By standardizing the normal vector of this hyperplane as a utility $u$, $\hat{G}$ can be treated as a utility classifier. Conversely, suppose not; that is, some $\hat{G}^{*} \succ_S \hat{G}_u$. Since $u_{jk} p_{jk}(\hat{G}_u) < u_{jk} p_{jk}(\hat{G}^{*})$ for some $(j, k)$, we have $u^{\top}\varphi_S(\hat{G}_u) < u^{\top}\varphi_S(\hat{G}^{*})$, which contradicts the assumption that $\hat{G}_u$ is a utility classifier.

### A.3 Proof of Theorem 4.1

Proof. For $\tilde{S} \subset S$, if $\{p_{\tilde{S}}(t) : t \in [0,1]\} \cap \varphi_{\tilde{S}}(\mathcal{C}) = \emptyset$, then any $p_S(t)$ with projection $p_{\tilde{S}}(t)$ onto $\mathcal{R}_{\tilde{S}}$ has no intersection with $\varphi_S(\mathcal{C})$. The rest of this proof ignores the degenerate $\mathcal{M}_S$ because its dimension is less than $K - 1$. For these reasons, it suffices to investigate the case of $\#S = K + 1$ with $\{p_{k\sigma(k)}\}_{k=1}^{K} \subset S$. For some $p_{ij}$ and $p_{i\sigma(i)} \in S$ with $\sigma(i) \ne j$, we define
$$p_S(t) = \sum_{\ell=0}^{1} (-1)^{\ell} \big[(1 - 2t)p_S(0.5) - 2(\ell - t)p_S(\ell)\big] 1_{[0.5\ell,\, 0.5(1+\ell))}(t)$$
with $p_S(0.5) = \mathrm{vec}[1 - \delta_{i'j'} + (2\delta_{i'j'} - 1)1_{\{(i,j),(i,\sigma(i))\}}((i', j'))]$. Since $f_{L_j}$ and $f_{L_{\sigma(i)}}$ have a common support, no classifier satisfies $p_{i\sigma(i)}(\hat{G}) \ne p_{ij}(\hat{G})$ and $p_{i'j}(\hat{G}) = \delta_{i'j}$ for $i' \ne i$. Thus, $\{p_S(t) : t \in [0, 0.5]\} \cap \varphi_S(\mathcal{C}) = \emptyset$. Similarly, no classifier can have $p_{ij}(\hat{G}) = \delta_{ij}$ and $p_{i\sigma(i)}(\hat{G}) = 1 - \delta_{i\sigma(i)}$. It further implies that $\{p_S(t) : t \in [0.5, 1]\} \cap \varphi_S(\mathcal{C}) = \emptyset$.

### A.4 Proof of Theorem 4.3

Proof. As in the proof of Theorem 4.1, we only need to consider non-degenerate $\mathcal{M}_S$. Suppose that $S$ is not one of the sets stated in the theorem, so there exists $\{p_{jk_1}, p_{jk_2}\} \subset S$. Since
$$\{p_S : p \text{ is under } \mathcal{M}_S\} \subset \{p_{\tilde{S}} : p \text{ is under } \mathcal{M}_{\tilde{S}}\} \otimes \Big(\bigotimes_{p_{j'k'} \in S \setminus \tilde{S}} [0,1]_{p_{j'k'}}\Big)$$
for $\tilde{S} \subset S$, an upper bound of $V_S$ is available through the inequality $V_S \le V_{\tilde{S}}$. In particular, $V_S \le \min_{k_1 \ne k_2,\, p_{jk_1}, p_{jk_2} \in S} V_{\{p_{jk_1}, p_{jk_2}\}}$, and $V_{\{p_{jk_1}, p_{jk_2}\}}$ for a near-perfect marker must approach zero. As for a non-informative marker, we directly obtain $V_{\{p_{jk_1}, p_{jk_2}\}} = 0$ because $p_{jk_1}(\hat{G}) = p_{jk_2}(\hat{G})$. Thus, $V_S \le V_{\{p_{jk_1}, p_{jk_2}\}}$ further ascertains that $V_S = 0$.

Conversely, given a perfect marker and $S = \{p_{11}, \ldots, p_{KK}\}$, the corresponding $V_S$ is the hypervolume of a unit $K$-cube and equals 1, while $V_S = 0$ for $S = \{p_{k\sigma(k)} : \sigma(k) \ne k\}_{k=1}^{K}$. For a non-informative marker, one can calculate $V_S = 1/K!$, the hypervolume under the hyperplane $\{p^{*} \in \mathcal{R}_{\{p_{k\sigma(k)}\}_{k=1}^{K}} : \sum_{k=1}^{K} p^{*}_{k\sigma(k)} = 1\}$, for any $S$ satisfying the condition. This is precisely the assertion of the theorem.

### References

[1] D. Mossman, “Three-way ROCs,” Med. Decis. Making, vol. 19, no. 1, pp. 78–89, Jan. 1999.

[2] S. Dreiseitl, L. Ohno-Machado, and M. Binder, “Comparing three-class diagnostic tests by three-way ROC analysis,” Med. Decis. Making, vol. 20, no. 3, pp. 323–331, Sep. 2000.

[3] X. He, C. Metz, B. Tsui, J. Links, and E. Frey, “Three-class ROC analysis—a decision theoretic approach under the ideal observer framework,” IEEE Trans. on Med. Imag., vol. 25, no. 5, pp. 571–581, May 2006.

[4] B. K. Scurfield, “Multiple-event forced-choice tasks in the theory of signal detectability,” J. Math. Psych., vol. 40, no. 3, pp. 253–269, Sep. 1996.

[5] P. R. Halmos, Naive Set Theory. New York: Springer, 1998.

[6] C. M. Schubert, S. N. Thorsen, and M. E. Oxley, “The ROC manifold for classification systems,” Pattern Recognit., vol. 44, no. 2, pp. 350–362, Feb. 2011.

[7] B. K. Scurfield, “Generalization of the theory of signal detectability to n-event m- dimensional forced-choice tasks,” J. Math. Psych., vol. 42, no. 1, pp. 5–31, Mar. 1998.

[8] D. C. Edwards, C. E. Metz, and M. A. Kupinski, “Ideal observers and optimal ROC hypersurfaces in n-class classification,” IEEE Trans. on Med. Imag., vol. 23, no. 7, pp. 891–895, Jul. 2004.

[9] D. C. Edwards, C. E. Metz, and R. M. Nishikawa, “The hypervolume under the ROC hypersurface of “near-guessing” and “near-perfect” observers in n-class classification tasks,” IEEE Trans. on Med. Imag., vol. 24, no. 3, pp. 293–299, Mar. 2005.

[10] A. W. van der Vaart, Asymptotic Statistics. Cambridge: Cambridge University Press, 1998.