## Supplementary: On the Use of Unrealistic Predictions in Hundreds of Papers Evaluating Graph Representations

Li-Chung Lin^{1}, Cheng-Hung Liu^{1}, Chih-Ming Chen^{2}, Kai-Chin Hsu^{3}, I-Feng Wu^{4}, Ming-Feng Tsai^{2}, and Chih-Jen Lin^{1}

^{1}National Taiwan University

^{2}National Chengchi University

^{3}University of Southern California

^{4}ASUS Intelligent Cloud Services

r08922141@ntu.edu.tw, ericliu8168@gmail.com, 104761501@nccu.edu.tw, kaichinh@usc.edu, ifengwu1518@gmail.com, mftsai@nccu.edu.tw, cjlin@csie.ntu.edu.tw

### 1 Proofs of the Main Paper

In this section we give the proofs for our main paper. To facilitate the discussion, we let $i$ be the index of test instances and $j$ be the index of labels.

We further assume that for test instance $i$,

$$K_i: \text{true number of labels}, \qquad \hat{K}_i: \text{predicted number of labels}. \tag{10}$$

Define

$$\text{Micro-F1} = \frac{2 \times \text{TP sum}}{\text{TP sum} + \text{FP sum} + \text{TP sum} + \text{FN sum}}, \tag{11}$$

where “sum” means the accumulation of prediction results over all binary problems. Next we prove an upper bound of Micro-F1.

### 1.1 Proof of Theorem 1

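Proof. Let $\text{TP}_i$, $\text{FP}_i$, and $\text{FN}_i$ be, respectively, the numbers of true positives, false positives, and false negatives among the predictions for instance $i$. The definitions of $K_i$ and $\hat{K}_i$ in (10) imply

$$\text{TP}_i + \text{FP}_i = \hat{K}_i \quad \text{and} \quad \text{TP}_i + \text{FN}_i = K_i. \tag{12}$$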

From (11) and (12),

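$$\text{Micro-F1} = \frac{2 \times \text{TP sum}}{\sum_{i=1}^{l} (K_i + \hat{K}_i)} = \frac{2 \times \sum_{i=1}^{l} \text{TP}_i}{\sum_{i=1}^{l} (K_i + \hat{K}_i)}. \tag{13}$$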

From (12), $\text{TP}_i \le \hat{K}_i$ and $\text{TP}_i \le K_i$, so the result in (7) follows from

$$\text{TP}_i \le \min(\hat{K}_i, K_i). \tag{14}$$

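The second inequality in (7) easily follows from $\min(\hat{K}_i, K_i) \le \hat{K}_i$ and $\min(\hat{K}_i, K_i) \le K_i$, which together give $2 \min(\hat{K}_i, K_i) \le K_i + \hat{K}_i$. Finally, when $\hat{K}_i = K_i$,

$$\frac{2 \times \sum_{i=1}^{l} \min(\hat{K}_i, K_i)}{\sum_{i=1}^{l} (K_i + \hat{K}_i)} = \frac{2 \times \sum_{i=1}^{l} K_i}{\sum_{i=1}^{l} (K_i + K_i)} = 1 \tag{15}$$

achieves the maximum.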
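The bound can be checked numerically on a toy example. The sketch below is only an illustration: the function name `micro_f1_and_bound` and the label sets are hypothetical, and each instance is assumed to be given as a set of true labels and a set of predicted labels, so that the set sizes play the roles of $K_i$ and $\hat{K}_i$.

```python
def micro_f1_and_bound(true_sets, pred_sets):
    """Return (Micro-F1, upper bound) for per-instance label sets.

    Micro-F1 follows (13): 2 * sum_i TP_i / sum_i (K_i + K_hat_i).
    The bound replaces TP_i by min(K_hat_i, K_i), as in (14).
    """
    pairs = list(zip(true_sets, pred_sets))
    tp_sum = sum(len(t & p) for t, p in pairs)            # sum of TP_i
    denom = sum(len(t) + len(p) for t, p in pairs)        # sum of K_i + K_hat_i
    min_sum = sum(min(len(t), len(p)) for t, p in pairs)  # sum of min(K_hat_i, K_i)
    if denom == 0:
        return 0.0, 0.0  # convention chosen here for the degenerate case
    return 2 * tp_sum / denom, 2 * min_sum / denom


# Hypothetical toy data: three test instances.
true_sets = [{0, 1}, {2}, {1, 3, 4}]
pred_sets = [{0}, {2, 3}, {1, 3, 5}]
f1, bound = micro_f1_and_bound(true_sets, pred_sets)
print(f1, bound)
```

Here the sum of $\text{TP}_i$ is 4, the sum of $\min(\hat{K}_i, K_i)$ is 5, and the denominator is 12, so Micro-F1 is 2/3 and the bound is 5/6.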

### 1.2 Proof of Theorem 2

Proof. By the assumption on decision values, selecting the top ones leads to

$$\text{TP}_i = \begin{cases} \hat{K}_i & \text{if } \hat{K}_i \le K_i, \\ K_i & \text{otherwise.} \end{cases}$$

Thus the upper bound of $\text{TP}_i$ in (14) is achieved. Then (9) follows from (13). Moreover, from (15) and (9), Micro-F1 = 1 when $\hat{K}_i = K_i$.
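A small sketch can illustrate the case analysis. The theorem statement is in the main paper, so this sketch assumes that the assumption on decision values means every true label receives a higher decision value than every other label; the function `top_k_predict` and the scores are hypothetical.

```python
def top_k_predict(scores, k):
    """Return the indices of the k largest decision values."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


# Hypothetical decision values: the true labels 0, 2, 4 outrank labels 1 and 3.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
true_labels = {0, 2, 4}  # K_i = 3

# TP_i for each predicted number of labels K_hat_i = 0, 1, ..., 5.
tp_by_k = [len(top_k_predict(scores, k_hat) & true_labels) for k_hat in range(6)]
print(tp_by_k)
```

For every value of $\hat{K}_i$ the count equals $\min(\hat{K}_i, K_i)$, which is the case distinction used in the proof.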

### 1.3 Proof of Theorem 3

Proof. By properties of multi-class problems mentioned before the theorem statement,

$$K_i = \hat{K}_i = 1. \tag{16}$$

The TP sum in (13) is now the same as the number of correct predictions among all instances. With (16), the proof follows from (13).
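Written out, and assuming $l$ denotes the number of test instances as in the summation limits of (13), the substitution of (16) into (13) reads as follows. The right-hand side is the fraction of instances whose single predicted label is correct.

```latex
\[
\text{Micro-F1}
= \frac{2 \sum_{i=1}^{l} \text{TP}_i}{\sum_{i=1}^{l} (K_i + \hat{K}_i)}
= \frac{2 \sum_{i=1}^{l} \text{TP}_i}{\sum_{i=1}^{l} (1 + 1)}
= \frac{1}{l} \sum_{i=1}^{l} \text{TP}_i ,
\qquad \text{TP}_i \in \{0, 1\}.
\]
```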

| Dataset | #nodes | #edges | Average degree per node | Edge type |
|---|---|---|---|---|
| BlogCatalog^{1} | 10,312 | 333,983 | 64.78 | undirected |
| Flickr^{1} | 80,513 | 5,899,882 | 146.56 | undirected |
| YouTube^{1} | 1,138,499 | 2,990,443 | 5.25 | undirected |
| PPI^{2} | 56,944 | 409,358 | 14.38 | undirected |

^{1}http://leitang.net/social_dimension.html

^{2}http://snap.stanford.edu/graphsage/

Table 4: Dataset Statistics

### 2 Details of Generating Embedding Vectors

Table 4 shows the data sources used in the experiments. The data are first converted to the formats required by the various graph embedding toolkits. Since the objective of graph embedding models is to learn an informative embedding for each node from the observed edges of the given graph, the label data are not used in this stage. The learned embeddings are then used in node classification. Because the modeling process is unsupervised (no labels for the downstream tasks are used), the hyperparameters have to be picked carefully. To obtain consistent and comparable results, we have collected the hyperparameter settings used in several representative papers in Tables 5, 6, and 7, and stated the values we set. In our experiments, all embeddings are generated by the toolkits released by the authors of the respective methods, which are listed in the second row of each table.

### 3 Additional Implementation Details

For training logistic regression, we use the Newton method (option -s 0) in LIBLINEAR (Fan et al., 2008), with stopping tolerance -e 0.0001. For methods that use LIBLINEAR’s parameter-selection functionality (e.g., cost-sensitive), the parameter -C is used. Otherwise, either a C value is specified by -c (e.g., cost-sensitive-simple), or the default cost -c 1 is used. For all cross validations, three folds were used.
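As a rough illustration of how these options combine, the following sketch assembles option strings for LIBLINEAR's `train` program. The helper `liblinear_train_options` is hypothetical and not part of LIBLINEAR; only the flags `-s`, `-e`, `-c`, `-C`, and `-v` are LIBLINEAR options. Passing `-v 3` together with `-C` is an assumption based on the statement that three folds were used for all cross validations.

```python
def liblinear_train_options(cost=1.0, search_c=False, cv_folds=None):
    """Build an option string for LIBLINEAR's train program (hypothetical helper).

    -s 0      : L2-regularized logistic regression (primal solver)
    -e 0.0001 : stopping tolerance
    -c cost   : fixed regularization parameter C
    -C        : let LIBLINEAR search for C by cross validation
    -v n      : n-fold cross validation
    """
    opts = ["-s", "0", "-e", "0.0001"]
    if search_c:
        opts.append("-C")
    else:
        opts += ["-c", f"{cost:g}"]
    if cv_folds is not None:
        opts += ["-v", str(cv_folds)]
    return " ".join(opts)


default_opts = liblinear_train_options()                            # default cost
search_opts = liblinear_train_options(search_c=True, cv_folds=3)    # parameter search
fixed_cv_opts = liblinear_train_options(cost=0.01, cv_folds=3)      # CV at a given C
print(default_opts, search_opts, fixed_cv_opts, sep="\n")
```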

### 3.1 Execution Environment for Graph Representation Learning

The upstream graph representation learning task is performed on one machine with an Intel Xeon Gold 6230 CPU @ 2.10GHz. The software versions used are Python 3.8.10 and g++ 9.3.0.

### 3.2 Execution Environment for Classification Task

The downstream classification task is performed by distributing the workload to four machines with Intel Xeon E5-2620 v4 @ 2.10GHz and nine machines with Intel Xeon E5-2620 0 @ 2.00GHz. The software versions used on all machines are Python 3.9.6, g++ 11.1.0 and MATLAB R2019b.

### 4 Complete Experimental Results

In Tables 8, 9 and 10 we compare, respectively, the Macro-F1, Micro-F1 and Instance-F1 of all the training/prediction methods applied to the embedding vectors generated by the three representation learning methods.

| Setting | Dimension | Window Size | Walk Length | Walk Times | NS or HS^{1} |
|---|---|---|---|---|---|
| Defaults in the tool by DeepWalk authors^{2} | 64 | 5 | 40 | 10 | HS |
| Perozzi et al. (2014) | 128 | 10 | - | 80 | HS |
| Tang et al. (2015) | 128 | 10 | 40 | 40 | HS |
| Grover and Leskovec (2016) | 128 | 10 | 80 | 10 | - |
| Khosla et al. (2021) | 128 | 10 | 40 | 80 | - |
| Our Setting | 128 | 10 | 40 | 80 | HS |

^{1}NS denotes Negative Sampling, and HS denotes Hierarchical Softmax.

^{2}https://github.com/phanein/deepwalk. This is the source code provided by Perozzi et al. (2014).

Table 5: DeepWalk’s hyperparameters

| Setting | Dimension | Window Size | Walk Length | Walk Times | (p, q) |
|---|---|---|---|---|---|
| Defaults in the tool by Node2vec authors^{1} | 128 | 10 | 80 | 10 | (1, 1) |
| Grover and Leskovec (2016) | 128 | 10 | - | 80 | p, q ∈ {0.25, 0.5, 1, 2, 4} |
| Khosla et al. (2021) | 128 | 10 | 40 | 80 | p, q ∈ {0.25, 0.5, 1, 2, 4} |
| Our Setting | 128 | 10 | 40 | 80 | (4, 0.25)^{2} |

^{1}https://github.com/aditya-grover/node2vec. This is the source code provided by Grover and Leskovec (2016).

^{2}According to the assumption of Node2vec, (4, 0.25) encourages the model to generate DFS-style random walks, while (0.25, 4) generates BFS-style random walks. The DFS one is generally better, so we adopt it in the main paper.

Table 6: Node2vec’s hyperparameters

| Setting | Dimension (1st + 2nd) | Sample Times | Negative Samples |
|---|---|---|---|
| Defaults in the tool by LINE authors^{1} | 100 + 100 | - | 5 |
| Tang et al. (2015) | 128 + 128^{†} | 10 Billion | 5 |
| Grover and Leskovec (2016) | 64 + 64 | 10 × \|E\|^{‡} | 5 |
| Khosla et al. (2021) | 64 + 64 | 10 Billion | 5 |
| Our Setting | 64 + 64 | 10 Billion | 5 |

^{1}https://github.com/tangjianpku/LINE. This is the source code provided by Tang et al. (2015).

^{†}Most subsequent works set the dimension to 64 instead of 128 for each module (i.e., LINE-1st and LINE-2nd). The purpose is to use the same 128 dimensions as other models.

^{‡}|E| represents the number of edges in the given graph.

Table 7: LINE’s hyperparameters

We further observe that

• LINE has a larger improvement with one-vs-rest-basic-C over one-vs-rest-basic. This is because for binary logistic regression, there is a C value under which complete underfitting occurs (Chu et al., 2015), given by

$$C < \frac{1}{l \max_i \|\boldsymbol{x}_i\|^2}.$$

For DeepWalk and Node2vec, no normalization is done, and the resulting embedding vectors have L2-norms that vary with the data set. In our experiments, we observed that

$$10 < \max_i \|\boldsymbol{x}_i\|^2 < 500.$$

For the LINE setting, as shown in Table 7, we choose LINE-1st + LINE-2nd, so each embedding vector is a concatenation of two unit L2-norm vectors, each with half the embedding vector dimension. Thus, we have

$$\max_i \|\boldsymbol{x}_i\|^2 = 2.$$

We see that LINE is the closest to underfitting with C = 1 in one-vs-rest-basic, and therefore LINE also benefits from one-vs-rest-basic-C more than the other methods.
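To make the comparison concrete, the sketch below evaluates the threshold $1 / (l \max_i \|\boldsymbol{x}_i\|^2)$ on two tiny, made-up sets of vectors. The helper names `underfit_threshold` and `unit` and all numbers are hypothetical; the only point is that concatenating two unit-norm halves gives a squared norm of 2 and therefore a larger threshold than unnormalized vectors with larger norms.

```python
import math


def underfit_threshold(vectors):
    """Return 1 / (l * max_i ||x_i||^2) for a list of l vectors."""
    l = len(vectors)
    max_sq_norm = max(sum(v * v for v in x) for x in vectors)
    return 1.0 / (l * max_sq_norm)


def unit(x):
    """Scale a vector to unit L2-norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


# LINE-like vectors: concatenation of two unit-norm halves, squared norm 2.
line_like = [unit([1.0, 2.0]) + unit([3.0, -1.0]),
             unit([0.5, 0.5]) + unit([-2.0, 1.0])]
# Unnormalized vectors: squared norms 26 and 10.
raw = [[3.0, 4.0, 0.0, 1.0], [1.0, 1.0, 2.0, 2.0]]

line_threshold = underfit_threshold(line_like)  # 1 / (2 * 2)
raw_threshold = underfit_threshold(raw)         # 1 / (2 * 26)
print(line_threshold, raw_threshold)
```

A larger threshold means that a fixed C = 1 sits closer to the underfitting region, which matches the observation for LINE.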

### 4.1 Choice of CV Splits

A simple way to implement cost-sensitive-simple is as follows:

• For each (C, t)

– Split data into folds

– Sequentially validate each fold

Under each (C, t), a standard CV procedure is conducted, and one can call the CV functionality in a package such as LIBLINEAR for the inner loop in the pseudocode. However, if we decide to use the same data splits in all CV processes, we generally need to maintain the data split ourselves. The procedure is as follows:

• Split data into folds

• For each (C, t)

– Sequentially validate each fold

Therefore, maintaining the same data splits reduces flexibility in implementation.
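The two procedures can be sketched as one loop with a switch. This is a minimal illustration, not the implementation used in the experiments: `make_folds`, `grid_cv`, and the callback `evaluate(C, t, train_idx, valid_idx)` are hypothetical names, and the callback stands in for training a classifier and returning a validation score.

```python
import random


def make_folds(n, k, seed):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def grid_cv(n, grid, k, evaluate, shared_splits=True, seed=0):
    """Return {(C, t): mean validation score} over k folds.

    shared_splits=True  : split once, reuse the folds for every (C, t).
    shared_splits=False : draw a new split for every (C, t).
    """
    shared = make_folds(n, k, seed) if shared_splits else None
    results = {}
    for g, (C, t) in enumerate(grid):
        folds = shared if shared_splits else make_folds(n, k, seed + 1 + g)
        scores = []
        for f in range(k):
            valid_idx = folds[f]
            train_idx = [i for h in range(k) if h != f for i in folds[h]]
            scores.append(evaluate(C, t, train_idx, valid_idx))
        results[(C, t)] = sum(scores) / k
    return results


# Dummy callback that only records which validation folds it was given.
seen = []


def record(C, t, train_idx, valid_idx):
    seen.append((C, t, tuple(valid_idx)))
    return 0.0


results = grid_cv(9, [(1, 0.5), (1, 1.0)], 3, record, shared_splits=True)
```

With `shared_splits=True` the caller owns the folds, so a package's built-in CV routine cannot simply be called per (C, t); with `shared_splits=False` each grid point is an independent standard CV run.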

However, as was mentioned previously, we suspect that different data splits may cause a higher variation of the results. We conduct a comparison here to see if this conjecture is true. In the tables, we denote the method of having different data splits for each (C, t) as cost-sensitive-simple-rand.

We observe that the difference between cost-sensitive-simple and cost-sensitive-simple-rand is very small in all cases. This robustness against the choice of CV splits allows flexibility in implementation.

### 4.2 Choice of (C, t) values for cost-sensitive-simple

For the cost-sensitive family of methods, there is a problem of choosing the grid of (C, t) values. A denser grid generally yields better performance at the cost of longer running time. Here we discuss how the final grid is chosen for cost-sensitive-simple.

In Tables 11, 12, and 13 we compare, respectively, the Macro-F1, Micro-F1 and Instance-F1 of choosing various ranges of C values with the same range of t values.^{1} We observe that in most cases there is a marginal improvement as the number of C values increases. However, the training time for higher C values is drastically longer, in some cases up to 70 times longer than for C = 1. Due to the insignificant improvement at a high cost, we chose to use a single value C = 1.

Parambath et al. (2014) gave a theoretical bound relating the density of t and the performance of cost-sensitive. However, the bound is very loose and cannot be used as a practical rule to choose a proper range of t. In Tables 14, 15 and 16 we compare the Macro-F1, Micro-F1 and Instance-F1 of different numbers of t values chosen equidistantly from the range (0, 1] with C = 1. We observe that Macro-F1 is slightly better with more t values. However, Micro-F1 and Instance-F1 are usually better with a smaller number of t values. In the end we choose seven t values for cost-sensitive-simple based on a balance between Macro-F1 performance and running time. The justification for emphasizing Macro-F1 is given in Section 4.3.
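For illustration, the grids compared above can be generated as follows. The exact spacing is not spelled out in the text, so this sketch assumes that n equidistant values in (0, 1] means t = 1/n, 2/n, ..., 1; the helper names `t_grid` and `build_grid` are hypothetical.

```python
def t_grid(n):
    """n equidistant values in (0, 1], assumed here to be 1/n, 2/n, ..., 1."""
    return [k / n for k in range(1, n + 1)]


def build_grid(c_values, n_t):
    """All (C, t) pairs for the given C values and n_t values of t."""
    return [(c, t) for c in c_values for t in t_grid(n_t)]


final_grid = build_grid([1], 7)                       # single C, seven t values
dense_grid = build_grid([0.01, 0.1, 1, 10, 100], 7)   # five C values, seven t values
print(len(final_grid), len(dense_grid))
```

The denser grid has five times as many (C, t) pairs to cross-validate, in addition to the longer training time per pair at large C.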

### 4.3 The Macro-F1 and Micro-F1 Tradeoff

In this section we explain that observations from Tables 14–16 may be related to the different best t value settings for Macro-F1 and Micro-F1. Note that optimizing Macro-F1 and Micro-F1 is often a tradeoff. Because Macro-F1 is the average of label-wise F1 scores, the performance on a rare label is equally important to the performance on a frequent label. In contrast, Micro-F1 is the overall F1 score by considering all labels together. To confirm this difference, we consider another cost-sensitive method to optimize Micro-F1 (Parambath et al., 2014):

• cost-sensitive-simple-micro: The same (C, t) pair is used for all labels and the best is chosen by cross-validation Micro-F1 score. The grid of (C, t) pairs is the same as cost-sensitive-simple.

In Tables 17, 18 and 19 we compare cost-sensitive-simple and cost-sensitive-simple-micro by checking their Macro-F1, Micro-F1 and Instance-F1. We use C = 1 and consider seven t values chosen equidistantly from (0, 1]. We observe that cost-sensitive-simple-micro is significantly better at optimizing Micro-F1, but significantly worse at optimizing Macro-F1. The same observations hold for other settings of choosing t values.

We also observe that Instance-F1 has the same trend as Micro-F1. This may be because, similar to Micro-F1, Instance-F1 is defined without specifically checking the predictions of rare labels.
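A toy computation shows how the three measures react to a missed rare label. This is only a sketch under stated assumptions: Instance-F1 is taken to be the average of per-instance F1 scores (a common definition, not restated in this supplement), F1 is set to 0 when its denominator is 0, and the function names and data are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 here when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def counts(pairs):
    """Return (TP, FP, FN) for an iterable of (true, predicted) 0/1 pairs."""
    pairs = list(pairs)
    tp = sum(1 for a, b in pairs if a == 1 and b == 1)
    fp = sum(1 for a, b in pairs if a == 0 and b == 1)
    fn = sum(1 for a, b in pairs if a == 1 and b == 0)
    return tp, fp, fn


def f1_scores(y_true, y_pred):
    """Return (Macro-F1, Micro-F1, Instance-F1) for 0/1 label matrices."""
    n_labels = len(y_true[0])
    rows = list(zip(y_true, y_pred))
    per_label = [counts((t[j], p[j]) for t, p in rows) for j in range(n_labels)]
    per_instance = [counts(zip(t, p)) for t, p in rows]
    macro = sum(f1(*c) for c in per_label) / n_labels
    micro = f1(*[sum(c[k] for c in per_label) for k in range(3)])
    instance = sum(f1(*c) for c in per_instance) / len(rows)
    return macro, micro, instance


# Label 0 is frequent, label 1 is rare and never predicted.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
macro, micro, instance = f1_scores(y_true, y_pred)
print(macro, micro, instance)
```

Missing the single rare label halves Macro-F1 (0.5), while Micro-F1 (8/9) and Instance-F1 (11/12) stay high, which is consistent with the tradeoff discussed in this section.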

### 5 Papers or Libraries Assuming the Number of Labels is Known in the Prediction Stage

In this section we list papers or libraries assuming the number of labels is known, either explicitly stated through text or visible in their evaluation code. We also include a section of likely suspects that implicitly cite these papers or libraries in their evaluation procedures.

^{1}Seven values chosen equidistantly from (0, 1].

| Dataset | Training and prediction method | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | unrealistic | 0.276 ± 0.005 | 0.294 ± 0.004 | 0.239 ± 0.008 |
| | one-vs-rest-basic | 0.190 ± 0.010 | 0.203 ± 0.010 | 0.150 ± 0.003 |
| | one-vs-rest-basic-C | 0.208 ± 0.008 | 0.220 ± 0.008 | 0.195 ± 0.007 |
| | one-vs-rest-no-empty | 0.241 ± 0.011 | 0.265 ± 0.007 | 0.204 ± 0.011 |
| | thresholding | 0.269 ± 0.006 | 0.283 ± 0.006 | 0.221 ± 0.006 |
| | cost-sensitive | 0.270 ± 0.003 | 0.283 ± 0.006 | 0.250 ± 0.007 |
| | cost-sensitive-no-empty | 0.268 ± 0.004 | 0.280 ± 0.007 | 0.251 ± 0.007 |
| | cost-sensitive-simple | 0.266 ± 0.005 | 0.275 ± 0.005 | 0.251 ± 0.007 |
| | cost-sensitive-simple-rand | 0.261 ± 0.006 | 0.276 ± 0.005 | 0.251 ± 0.009 |
| Flickr | unrealistic | 0.304 ± 0.003 | 0.306 ± 0.001 | 0.258 ± 0.005 |
| | one-vs-rest-basic | 0.195 ± 0.003 | 0.191 ± 0.002 | 0.128 ± 0.004 |
| | one-vs-rest-basic-C | 0.209 ± 0.001 | 0.208 ± 0.002 | 0.188 ± 0.003 |
| | one-vs-rest-no-empty | 0.256 ± 0.001 | 0.259 ± 0.002 | 0.206 ± 0.004 |
| | thresholding | 0.299 ± 0.004 | 0.302 ± 0.002 | 0.264 ± 0.002 |
| | cost-sensitive | 0.297 ± 0.000 | 0.301 ± 0.002 | 0.279 ± 0.003 |
| | cost-sensitive-no-empty | 0.298 ± 0.000 | 0.301 ± 0.002 | 0.284 ± 0.006 |
| | cost-sensitive-simple | 0.294 ± 0.001 | 0.296 ± 0.002 | 0.279 ± 0.002 |
| | cost-sensitive-simple-rand | 0.293 ± 0.001 | 0.295 ± 0.002 | 0.279 ± 0.003 |
| YouTube | unrealistic | 0.397 ± 0.007 | 0.415 ± 0.009 | 0.412 ± 0.005 |
| | one-vs-rest-basic | 0.213 ± 0.003 | 0.240 ± 0.006 | 0.230 ± 0.004 |
| | one-vs-rest-basic-C | 0.217 ± 0.004 | 0.242 ± 0.006 | 0.246 ± 0.005 |
| | one-vs-rest-no-empty | 0.263 ± 0.003 | 0.284 ± 0.006 | 0.281 ± 0.004 |
| | thresholding | 0.358 ± 0.005 | 0.374 ± 0.006 | 0.377 ± 0.004 |
| | cost-sensitive | 0.360 ± 0.005 | 0.375 ± 0.003 | 0.374 ± 0.004 |
| | cost-sensitive-no-empty | 0.359 ± 0.005 | 0.374 ± 0.003 | 0.375 ± 0.004 |
| | cost-sensitive-simple | 0.349 ± 0.005 | 0.369 ± 0.005 | 0.371 ± 0.005 |
| | cost-sensitive-simple-rand | 0.349 ± 0.005 | 0.368 ± 0.005 | 0.372 ± 0.004 |
| PPI | unrealistic | 0.483 ± 0.003 | 0.442 ± 0.003 | 0.504 ± 0.003 |
| | one-vs-rest-basic | 0.181 ± 0.001 | 0.148 ± 0.000 | 0.232 ± 0.002 |
| | one-vs-rest-basic-C | 0.183 ± 0.001 | 0.150 ± 0.000 | 0.243 ± 0.002 |
| | one-vs-rest-no-empty | 0.181 ± 0.001 | 0.148 ± 0.000 | 0.232 ± 0.002 |
| | thresholding | 0.482 ± 0.002 | 0.457 ± 0.002 | 0.498 ± 0.001 |
| | cost-sensitive | 0.482 ± 0.002 | 0.461 ± 0.002 | 0.495 ± 0.002 |
| | cost-sensitive-no-empty | 0.482 ± 0.002 | 0.461 ± 0.002 | 0.495 ± 0.002 |
| | cost-sensitive-simple | 0.481 ± 0.002 | 0.460 ± 0.002 | 0.494 ± 0.002 |
| | cost-sensitive-simple-rand | 0.481 ± 0.002 | 0.460 ± 0.002 | 0.494 ± 0.002 |

Table 8: Macro-F1 of various training/prediction techniques on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method (excluding unrealistic) is bold-faced.

| Dataset | Training and prediction method | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | unrealistic | 0.417 ± 0.005 | 0.426 ± 0.006 | 0.406 ± 0.007 |
| | one-vs-rest-basic | 0.334 ± 0.007 | 0.348 ± 0.012 | 0.296 ± 0.005 |
| | one-vs-rest-basic-C | 0.344 ± 0.006 | 0.355 ± 0.012 | 0.335 ± 0.005 |
| | one-vs-rest-no-empty | 0.390 ± 0.003 | 0.404 ± 0.006 | 0.370 ± 0.007 |
| | thresholding | 0.390 ± 0.003 | 0.396 ± 0.010 | 0.353 ± 0.005 |
| | cost-sensitive | 0.366 ± 0.007 | 0.371 ± 0.003 | 0.341 ± 0.004 |
| | cost-sensitive-no-empty | 0.351 ± 0.010 | 0.360 ± 0.007 | 0.324 ± 0.009 |
| | cost-sensitive-simple | 0.351 ± 0.006 | 0.362 ± 0.017 | 0.337 ± 0.004 |
| | cost-sensitive-simple-rand | 0.353 ± 0.010 | 0.363 ± 0.006 | 0.332 ± 0.002 |
| Flickr | unrealistic | 0.416 ± 0.002 | 0.420 ± 0.005 | 0.409 ± 0.004 |
| | one-vs-rest-basic | 0.283 ± 0.003 | 0.288 ± 0.002 | 0.271 ± 0.004 |
| | one-vs-rest-basic-C | 0.291 ± 0.002 | 0.296 ± 0.002 | 0.289 ± 0.006 |
| | one-vs-rest-no-empty | 0.377 ± 0.002 | 0.382 ± 0.004 | 0.373 ± 0.004 |
| | thresholding | 0.370 ± 0.002 | 0.376 ± 0.003 | 0.364 ± 0.001 |
| | cost-sensitive | 0.352 ± 0.003 | 0.358 ± 0.002 | 0.354 ± 0.005 |
| | cost-sensitive-no-empty | 0.343 ± 0.006 | 0.355 ± 0.006 | 0.348 ± 0.013 |
| | cost-sensitive-simple | 0.355 ± 0.003 | 0.360 ± 0.004 | 0.357 ± 0.004 |
| | cost-sensitive-simple-rand | 0.356 ± 0.002 | 0.359 ± 0.005 | 0.356 ± 0.006 |
| YouTube | unrealistic | 0.470 ± 0.008 | 0.482 ± 0.008 | 0.480 ± 0.004 |
| | one-vs-rest-basic | 0.287 ± 0.008 | 0.313 ± 0.007 | 0.315 ± 0.004 |
| | one-vs-rest-basic-C | 0.290 ± 0.009 | 0.314 ± 0.007 | 0.325 ± 0.006 |
| | one-vs-rest-no-empty | 0.382 ± 0.005 | 0.399 ± 0.004 | 0.402 ± 0.004 |
| | thresholding | 0.387 ± 0.006 | 0.409 ± 0.003 | 0.412 ± 0.005 |
| | cost-sensitive | 0.374 ± 0.010 | 0.400 ± 0.006 | 0.403 ± 0.004 |
| | cost-sensitive-no-empty | 0.372 ± 0.008 | 0.400 ± 0.005 | 0.404 ± 0.004 |
| | cost-sensitive-simple | 0.365 ± 0.006 | 0.390 ± 0.007 | 0.400 ± 0.005 |
| | cost-sensitive-simple-rand | 0.368 ± 0.010 | 0.390 ± 0.003 | 0.401 ± 0.004 |
| PPI | unrealistic | 0.641 ± 0.001 | 0.626 ± 0.001 | 0.647 ± 0.001 |
| | one-vs-rest-basic | 0.449 ± 0.001 | 0.433 ± 0.001 | 0.479 ± 0.001 |
| | one-vs-rest-basic-C | 0.458 ± 0.001 | 0.441 ± 0.001 | 0.489 ± 0.002 |
| | one-vs-rest-no-empty | 0.449 ± 0.001 | 0.433 ± 0.001 | 0.479 ± 0.001 |
| | thresholding | 0.535 ± 0.002 | 0.482 ± 0.002 | 0.553 ± 0.001 |
| | cost-sensitive | 0.533 ± 0.002 | 0.495 ± 0.002 | 0.548 ± 0.002 |
| | cost-sensitive-no-empty | 0.533 ± 0.002 | 0.495 ± 0.002 | 0.548 ± 0.002 |
| | cost-sensitive-simple | 0.529 ± 0.002 | 0.495 ± 0.002 | 0.547 ± 0.002 |
| | cost-sensitive-simple-rand | 0.529 ± 0.002 | 0.494 ± 0.002 | 0.548 ± 0.002 |

Table 9: Micro-F1 of various training/prediction techniques on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method (excluding unrealistic) is bold-faced.

| Dataset | Training and prediction method | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | unrealistic | 0.422 ± 0.004 | 0.433 ± 0.006 | 0.408 ± 0.005 |
| | one-vs-rest-basic | 0.255 ± 0.004 | 0.265 ± 0.009 | 0.222 ± 0.004 |
| | one-vs-rest-basic-C | 0.267 ± 0.003 | 0.275 ± 0.009 | 0.256 ± 0.003 |
| | one-vs-rest-no-empty | 0.403 ± 0.003 | 0.416 ± 0.006 | 0.389 ± 0.006 |
| | thresholding | 0.347 ± 0.006 | 0.354 ± 0.012 | 0.315 ± 0.002 |
| | cost-sensitive | 0.340 ± 0.006 | 0.348 ± 0.003 | 0.314 ± 0.005 |
| | cost-sensitive-no-empty | 0.352 ± 0.011 | 0.363 ± 0.011 | 0.325 ± 0.015 |
| | cost-sensitive-simple | 0.327 ± 0.004 | 0.335 ± 0.011 | 0.313 ± 0.006 |
| | cost-sensitive-simple-rand | 0.327 ± 0.003 | 0.334 ± 0.007 | 0.311 ± 0.004 |
| Flickr | unrealistic | 0.406 ± 0.002 | 0.410 ± 0.004 | 0.394 ± 0.004 |
| | one-vs-rest-basic | 0.211 ± 0.002 | 0.212 ± 0.002 | 0.185 ± 0.002 |
| | one-vs-rest-basic-C | 0.217 ± 0.001 | 0.220 ± 0.002 | 0.207 ± 0.003 |
| | one-vs-rest-no-empty | 0.386 ± 0.002 | 0.390 ± 0.004 | 0.376 ± 0.004 |
| | thresholding | 0.352 ± 0.002 | 0.353 ± 0.005 | 0.325 ± 0.002 |
| | cost-sensitive | 0.342 ± 0.001 | 0.343 ± 0.003 | 0.326 ± 0.002 |
| | cost-sensitive-no-empty | 0.345 ± 0.008 | 0.357 ± 0.008 | 0.349 ± 0.018 |
| | cost-sensitive-simple | 0.344 ± 0.006 | 0.345 ± 0.005 | 0.329 ± 0.008 |
| | cost-sensitive-simple-rand | 0.346 ± 0.005 | 0.343 ± 0.007 | 0.328 ± 0.008 |
| YouTube | unrealistic | 0.454 ± 0.007 | 0.467 ± 0.006 | 0.464 ± 0.003 |
| | one-vs-rest-basic | 0.245 ± 0.009 | 0.262 ± 0.008 | 0.254 ± 0.004 |
| | one-vs-rest-basic-C | 0.247 ± 0.010 | 0.263 ± 0.008 | 0.263 ± 0.005 |
| | one-vs-rest-no-empty | 0.412 ± 0.005 | 0.427 ± 0.005 | 0.428 ± 0.003 |
| | thresholding | 0.389 ± 0.006 | 0.400 ± 0.006 | 0.398 ± 0.006 |
| | cost-sensitive | 0.380 ± 0.011 | 0.399 ± 0.006 | 0.394 ± 0.004 |
| | cost-sensitive-no-empty | 0.388 ± 0.010 | 0.414 ± 0.005 | 0.416 ± 0.006 |
| | cost-sensitive-simple | 0.372 ± 0.009 | 0.388 ± 0.007 | 0.388 ± 0.004 |
| | cost-sensitive-simple-rand | 0.377 ± 0.012 | 0.388 ± 0.006 | 0.388 ± 0.006 |
| PPI | unrealistic | 0.577 ± 0.002 | 0.565 ± 0.001 | 0.583 ± 0.002 |
| | one-vs-rest-basic | 0.449 ± 0.001 | 0.438 ± 0.001 | 0.464 ± 0.001 |
| | one-vs-rest-basic-C | 0.457 ± 0.001 | 0.445 ± 0.002 | 0.475 ± 0.001 |
| | one-vs-rest-no-empty | 0.449 ± 0.001 | 0.438 ± 0.001 | 0.464 ± 0.001 |
| | thresholding | 0.503 ± 0.002 | 0.456 ± 0.002 | 0.517 ± 0.001 |
| | cost-sensitive | 0.501 ± 0.002 | 0.467 ± 0.002 | 0.513 ± 0.002 |
| | cost-sensitive-no-empty | 0.501 ± 0.002 | 0.467 ± 0.002 | 0.513 ± 0.002 |
| | cost-sensitive-simple | 0.496 ± 0.002 | 0.466 ± 0.002 | 0.512 ± 0.002 |
| | cost-sensitive-simple-rand | 0.497 ± 0.002 | 0.466 ± 0.002 | 0.513 ± 0.002 |

Table 10: Instance-F1 of various training/prediction techniques on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method (excluding unrealistic) is bold-faced.

| Dataset | Choice of C values | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | 1 | 0.266 ± 0.005 | 0.275 ± 0.005 | 0.251 ± 0.007 |
| | 1, 10, 100 | 0.266 ± 0.005 | 0.276 ± 0.006 | 0.249 ± 0.006 |
| | 0.01, 0.1, 1, 10, 100 | 0.266 ± 0.004 | 0.282 ± 0.005 | 0.250 ± 0.007 |
| Flickr | 1 | 0.294 ± 0.001 | 0.296 ± 0.002 | 0.279 ± 0.002 |
| | 1, 10, 100 | 0.293 ± 0.001 | 0.295 ± 0.003 | 0.278 ± 0.002 |
| | 0.01, 0.1, 1, 10, 100 | 0.297 ± 0.002 | 0.299 ± 0.002 | 0.279 ± 0.002 |
| YouTube | 1 | 0.349 ± 0.005 | 0.369 ± 0.005 | 0.371 ± 0.005 |
| | 1, 10, 100 | 0.348 ± 0.005 | 0.368 ± 0.005 | 0.371 ± 0.004 |
| | 0.01, 0.1, 1, 10, 100 | 0.357 ± 0.004 | 0.374 ± 0.005 | 0.374 ± 0.004 |
| PPI | 1 | 0.481 ± 0.002 | 0.460 ± 0.002 | 0.494 ± 0.002 |
| | 1, 10, 100 | 0.481 ± 0.002 | 0.460 ± 0.002 | 0.494 ± 0.002 |
| | 0.01, 0.1, 1, 10, 100 | 0.481 ± 0.002 | 0.460 ± 0.002 | 0.494 ± 0.002 |

Table 11: Macro-F1 of various choices of C values for cost-sensitive-simple on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

| Dataset | Choice of C values | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | 1 | 0.351 ± 0.006 | 0.362 ± 0.017 | 0.337 ± 0.004 |
| | 1, 10, 100 | 0.355 ± 0.006 | 0.367 ± 0.017 | 0.336 ± 0.009 |
| | 0.01, 0.1, 1, 10, 100 | 0.353 ± 0.010 | 0.371 ± 0.011 | 0.342 ± 0.007 |
| Flickr | 1 | 0.355 ± 0.003 | 0.360 ± 0.004 | 0.357 ± 0.004 |
| | 1, 10, 100 | 0.355 ± 0.003 | 0.361 ± 0.005 | 0.360 ± 0.005 |
| | 0.01, 0.1, 1, 10, 100 | 0.358 ± 0.003 | 0.361 ± 0.003 | 0.358 ± 0.005 |
| YouTube | 1 | 0.365 ± 0.006 | 0.390 ± 0.007 | 0.400 ± 0.005 |
| | 1, 10, 100 | 0.365 ± 0.006 | 0.391 ± 0.008 | 0.401 ± 0.005 |
| | 0.01, 0.1, 1, 10, 100 | 0.372 ± 0.007 | 0.395 ± 0.006 | 0.400 ± 0.004 |
| PPI | 1 | 0.529 ± 0.002 | 0.495 ± 0.002 | 0.547 ± 0.002 |
| | 1, 10, 100 | 0.529 ± 0.002 | 0.495 ± 0.002 | 0.547 ± 0.002 |
| | 0.01, 0.1, 1, 10, 100 | 0.529 ± 0.002 | 0.495 ± 0.002 | 0.547 ± 0.002 |

Table 12: Micro-F1 of various choices of C values for cost-sensitive-simple on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

| Dataset | Choice of C values | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | 1 | 0.327 ± 0.004 | 0.335 ± 0.011 | 0.313 ± 0.006 |
| | 1, 10, 100 | 0.329 ± 0.003 | 0.340 ± 0.010 | 0.307 ± 0.007 |
| | 0.01, 0.1, 1, 10, 100 | 0.333 ± 0.004 | 0.350 ± 0.006 | 0.311 ± 0.004 |
| Flickr | 1 | 0.344 ± 0.006 | 0.345 ± 0.005 | 0.329 ± 0.008 |
| | 1, 10, 100 | 0.344 ± 0.005 | 0.345 ± 0.005 | 0.330 ± 0.008 |
| | 0.01, 0.1, 1, 10, 100 | 0.344 ± 0.006 | 0.345 ± 0.005 | 0.331 ± 0.007 |
| YouTube | 1 | 0.372 ± 0.009 | 0.388 ± 0.007 | 0.388 ± 0.004 |
| | 1, 10, 100 | 0.373 ± 0.010 | 0.389 ± 0.010 | 0.388 ± 0.005 |
| | 0.01, 0.1, 1, 10, 100 | 0.378 ± 0.009 | 0.392 ± 0.008 | 0.384 ± 0.006 |
| PPI | 1 | 0.496 ± 0.002 | 0.466 ± 0.002 | 0.512 ± 0.002 |
| | 1, 10, 100 | 0.496 ± 0.002 | 0.466 ± 0.002 | 0.513 ± 0.002 |
| | 0.01, 0.1, 1, 10, 100 | 0.497 ± 0.002 | 0.466 ± 0.002 | 0.512 ± 0.002 |

Table 13: Instance-F1 of various choices of C values for cost-sensitive-simple on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

### 5.1 Papers Explicitly Assuming the Number of Labels is Known

1. Tang and Liu (2009a): The paper is at https://dl.acm.org/doi/pdf/10.1145/1557019.1557109. “It has been shown that different thresholding strategies lead to quite different performance (Fan and Lin, 2007; Tang et al., 2009). To avoid the affection of thresholding, we assume the number of labels on the test data are already known and check how the top ranking predictions match with the true labels. Two commonly used measures Micro-F1 and Macro-F1 are adopted to evaluate the classification performance.”

2. Tang and Liu (2009b): The paper is at https://dl.acm.org/doi/pdf/10.1145/1645953.1646094. “Note that our prediction problem is essentially multi-label. It is empirically shown that thresholding can affect the final prediction performance drastically (Fan and Lin, 2007; Tang et al., 2009). For evaluation purpose, we assume the number of labels of unobserved nodes is already known and check the match of the top-ranking labels with the truth. Such a scheme has been adopted for other multi-label evaluation works Liu et al. (2006). We randomly sample a portion of nodes as labeled and report the average performance of 10 runs in terms of Micro-F1 and Macro-F1. We use the same setting as in Tang and Liu (2009a) for the baseline methods for Flickr and BlogCatalog, thus the performance on the two data sets are reported here directly.”

3. Tang et al. (2010): The paper is at https://dl.acm.org/doi/pdf/10.1145/1964858.1964861. “We follow the same evaluation procedure as in Tang and Liu (2009a,b).”

4. Menon and Elkan (2010): The paper is at https://link.springer.com/content/pdf/10.1007/s10618-010-0189-3.pdf. “Following Tang and Liu (2009a), for multilabel data we assume that the number of labels are known, and we measure how well the predicted score for each tag agrees with the true label. Agreement is measured using the F1 micro and macro scores.”

| Dataset | # of t values | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | 4 | 0.265 ± 0.004 | 0.274 ± 0.010 | 0.245 ± 0.010 |
| | 7 | 0.266 ± 0.005 | 0.275 ± 0.005 | 0.251 ± 0.007 |
| | 10 | 0.264 ± 0.007 | 0.274 ± 0.009 | 0.252 ± 0.012 |
| Flickr | 4 | 0.291 ± 0.002 | 0.294 ± 0.004 | 0.276 ± 0.002 |
| | 7 | 0.294 ± 0.001 | 0.296 ± 0.002 | 0.279 ± 0.002 |
| | 10 | 0.294 ± 0.002 | 0.297 ± 0.003 | 0.280 ± 0.003 |
| YouTube | 4 | 0.345 ± 0.004 | 0.365 ± 0.005 | 0.371 ± 0.005 |
| | 7 | 0.349 ± 0.005 | 0.369 ± 0.005 | 0.371 ± 0.005 |
| | 10 | 0.350 ± 0.005 | 0.370 ± 0.004 | 0.372 ± 0.004 |
| PPI | 4 | 0.477 ± 0.002 | 0.459 ± 0.002 | 0.490 ± 0.002 |
| | 7 | 0.481 ± 0.002 | 0.460 ± 0.002 | 0.494 ± 0.002 |
| | 10 | 0.482 ± 0.002 | 0.461 ± 0.002 | 0.495 ± 0.002 |

Table 14: Macro-F1 of various number of t values for cost-sensitive-simple on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

| Dataset | # of t values | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | 4 | 0.377 ± 0.008 | 0.385 ± 0.005 | 0.356 ± 0.005 |
| | 7 | 0.351 ± 0.006 | 0.362 ± 0.017 | 0.337 ± 0.004 |
| | 10 | 0.334 ± 0.009 | 0.349 ± 0.010 | 0.320 ± 0.011 |
| Flickr | 4 | 0.374 ± 0.002 | 0.376 ± 0.002 | 0.373 ± 0.002 |
| | 7 | 0.355 ± 0.003 | 0.360 ± 0.004 | 0.357 ± 0.004 |
| | 10 | 0.352 ± 0.007 | 0.361 ± 0.003 | 0.356 ± 0.005 |
| YouTube | 4 | 0.392 ± 0.004 | 0.408 ± 0.006 | 0.410 ± 0.003 |
| | 7 | 0.365 ± 0.006 | 0.390 ± 0.007 | 0.400 ± 0.005 |
| | 10 | 0.362 ± 0.009 | 0.400 ± 0.007 | 0.406 ± 0.004 |
| PPI | 4 | 0.517 ± 0.002 | 0.491 ± 0.002 | 0.540 ± 0.002 |
| | 7 | 0.529 ± 0.002 | 0.495 ± 0.002 | 0.547 ± 0.002 |
| | 10 | 0.533 ± 0.002 | 0.495 ± 0.002 | 0.547 ± 0.002 |

Table 15: Micro-F1 of various number of t values for cost-sensitive-simple on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

| Dataset | # of t values | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | 4 | 0.338 ± 0.009 | 0.344 ± 0.003 | 0.319 ± 0.003 |
| | 7 | 0.327 ± 0.004 | 0.335 ± 0.011 | 0.313 ± 0.006 |
| | 10 | 0.321 ± 0.007 | 0.330 ± 0.006 | 0.304 ± 0.007 |
| Flickr | 4 | 0.353 ± 0.001 | 0.351 ± 0.002 | 0.337 ± 0.002 |
| | 7 | 0.344 ± 0.006 | 0.345 ± 0.005 | 0.329 ± 0.008 |
| | 10 | 0.341 ± 0.004 | 0.343 ± 0.001 | 0.329 ± 0.004 |
| YouTube | 4 | 0.393 ± 0.006 | 0.404 ± 0.005 | 0.399 ± 0.005 |
| | 7 | 0.372 ± 0.009 | 0.388 ± 0.007 | 0.388 ± 0.004 |
| | 10 | 0.373 ± 0.008 | 0.399 ± 0.006 | 0.395 ± 0.005 |
| PPI | 4 | 0.485 ± 0.002 | 0.462 ± 0.002 | 0.505 ± 0.002 |
| | 7 | 0.496 ± 0.002 | 0.466 ± 0.002 | 0.512 ± 0.002 |
| | 10 | 0.501 ± 0.002 | 0.467 ± 0.002 | 0.513 ± 0.003 |

Table 16: Instance-F1 of various number of t values for cost-sensitive-simple on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

5. Tang and Liu (2011): The paper is at https://link.springer.com/content/pdf/10.1007/s10618-010-0210-x.pdf. “It has been shown that different thresholding strategies lead to quite different performances (Fan and Lin, 2007; Tang et al., 2009). To avoid the effect of thresholding, we assume the number of labels on the test data is already known and check how the top-ranking predictions match with the true labels.”

6. Wang and Sukthankar (2011): The paper is at https://ieeexplore.ieee.org/abstract/document/6113224. “Since our classification problem is essentially a multi-label task, during the prediction procedure, we assume that the number of labels for the unlabeled nodes is already known and assign the labels according to the top-ranking class. Such a scheme has been adopted for multi-label evaluation in social network datasets (Tang and Liu, 2009a,b).”

7. Tang et al. (2012): The paper is at https://ieeexplore.ieee.org/abstract/document/5710923. “It is empirically shown that thresholding can affect the final prediction performance drastically (Fan and Lin, 2007; Tang et al., 2009). For evaluation purposes, we assume the number of labels of unobserved nodes is already known, and check whether the top-ranking predicted labels match with the actual labels.”

8. Wang and Sukthankar (2013): The paper is at https://dl.acm.org/doi/abs/10.1145/2487575.2487610. “Since our problem is essentially a multi-label classification task, we assume that the number of labels for the unlabeled nodes is already known (e.g., based on the output of a separate classifier) and assign the labels according to the top-ranking set of classes at the conclusion of the inference process. Such a scheme has been adopted for multi-label evaluation in social network datasets (Tang and Liu, 2009b; Wang and Sukthankar, 2011).”

| Dataset | Training and prediction method | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | cost-sensitive-simple | 0.266 ± 0.005 | 0.275 ± 0.005 | 0.251 ± 0.007 |
| | cost-sensitive-simple-micro | 0.251 ± 0.006 | 0.259 ± 0.008 | 0.231 ± 0.008 |
| Flickr | cost-sensitive-simple | 0.294 ± 0.001 | 0.296 ± 0.002 | 0.279 ± 0.002 |
| | cost-sensitive-simple-micro | 0.279 ± 0.002 | 0.283 ± 0.002 | 0.255 ± 0.003 |
| YouTube | cost-sensitive-simple | 0.349 ± 0.005 | 0.369 ± 0.005 | 0.371 ± 0.005 |
| | cost-sensitive-simple-micro | 0.324 ± 0.003 | 0.350 ± 0.005 | 0.351 ± 0.003 |
| PPI | cost-sensitive-simple | 0.481 ± 0.002 | 0.460 ± 0.002 | 0.494 ± 0.002 |
| | cost-sensitive-simple-micro | 0.408 ± 0.001 | 0.358 ± 0.002 | 0.446 ± 0.002 |

Table 17: Macro-F1 for cost-sensitive-simple and cost-sensitive-simple-micro on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

| Dataset | Training and prediction method | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | cost-sensitive-simple | 0.351 ± 0.006 | 0.362 ± 0.017 | 0.337 ± 0.004 |
| | cost-sensitive-simple-micro | 0.398 ± 0.004 | 0.409 ± 0.008 | 0.383 ± 0.006 |
| Flickr | cost-sensitive-simple | 0.355 ± 0.003 | 0.360 ± 0.004 | 0.357 ± 0.004 |
| | cost-sensitive-simple-micro | 0.395 ± 0.003 | 0.398 ± 0.002 | 0.397 ± 0.002 |
| YouTube | cost-sensitive-simple | 0.365 ± 0.006 | 0.390 ± 0.007 | 0.400 ± 0.005 |
| | cost-sensitive-simple-micro | 0.417 ± 0.005 | 0.432 ± 0.004 | 0.433 ± 0.002 |
| PPI | cost-sensitive-simple | 0.529 ± 0.002 | 0.495 ± 0.002 | 0.547 ± 0.002 |
| | cost-sensitive-simple-micro | 0.561 ± 0.001 | 0.545 ± 0.002 | 0.576 ± 0.002 |

Table 18: Micro-F1 for cost-sensitive-simple and cost-sensitive-simple-micro on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

| Dataset | Training and prediction method | DeepWalk | Node2vec | LINE |
|---|---|---|---|---|
| BlogCatalog | cost-sensitive-simple | 0.327 ± 0.004 | 0.335 ± 0.011 | 0.313 ± 0.006 |
| | cost-sensitive-simple-micro | 0.343 ± 0.016 | 0.347 ± 0.006 | 0.342 ± 0.004 |
| Flickr | cost-sensitive-simple | 0.344 ± 0.006 | 0.345 ± 0.005 | 0.329 ± 0.008 |
| | cost-sensitive-simple-micro | 0.357 ± 0.002 | 0.360 ± 0.003 | 0.343 ± 0.002 |
| YouTube | cost-sensitive-simple | 0.372 ± 0.009 | 0.388 ± 0.007 | 0.388 ± 0.004 |
| | cost-sensitive-simple-micro | 0.417 ± 0.006 | 0.427 ± 0.006 | 0.418 ± 0.003 |
| PPI | cost-sensitive-simple | 0.496 ± 0.002 | 0.466 ± 0.002 | 0.512 ± 0.002 |
| | cost-sensitive-simple-micro | 0.529 ± 0.001 | 0.514 ± 0.002 | 0.537 ± 0.002 |

Table 19: Instance-F1 for cost-sensitive-simple and cost-sensitive-simple-micro on embedding vectors generated by some representation learning methods. Each entry is the average and standard deviation of five 80/20 training/testing splits. The score of the best training/prediction method is bold-faced.

9. Wang et al. (2013): The paper is at https://link.springer.com/article/10.1007/s10115-012-0555-0. “We follow the same evaluation procedure as in Tang and Liu (2009a,b).^{2}”

10. Li et al. (2016b): The paper is at https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152857.

“To evaluate the performance of different methods, dataset will be divided into training set and testing set randomly. The labels of nodes in the training set are known, and the labels of nodes in the testing set are unknown. All the classification methods need to utilize the training set to predict the labels of all nodes in the testing set. For handling multi-label classification task, we apply the typically used strategy, which assumes the number of labels of unlabeled nodes is already known and check the match of the top-ranking labels with the truth (Wang and Sukthankar, 2013; Tang and Liu, 2011, 2009b).”

11. Li et al. (2016a): The paper is at https://aclanthology.org/P16-1095.pdf. “As the datasets are not only multi-class but also multi-label, we usually need a thresholding method to test the results. But literature gives a negative opinion of arbitrarily choosing thresholding methods because of the considerably different performances. To avoid this, we assume that the number of the labels is already known in all the test processes.”

12. Nandanwar and Murty (2016): The paper is at https://dl.acm.org/doi/abs/10.1145/2939672.2939782. “For each node in the test set, decision values are obtained from the respective class models. We assign s most probable classes to the node using these decision values, where s is equal to the number of labels assigned to the node originally.”

^{2} http://leitang.net/social_dimension.html

13. Rizos et al. (2017): The paper is at https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173347. “Following Tang and Liu (2009a); Devooght et al. (2014), we assume the true number of labels for each vertex to be known.”

14. Qiu et al. (2018): The paper is at https://dl.acm.org/doi/abs/10.1145/3159652.3159706. “To avoid the thresholding effect (Tang et al., 2009), we assume that the number of labels for test data is given (Perozzi et al., 2014; Tang et al., 2009). We repeat the prediction procedure 10 times and evaluate the performance in terms of average Micro-F1 and average Macro-F1 (Tsoumakas et al., 2010).”

15. Zhang et al. (2018d): The paper is at https://www.sciencedirect.com/science/article/pii/S0925231218302467. “For evaluation purpose, we assume that the number of labels for the unlabeled nodes is already known and we assign the labels according to the top-ranking set of class membership probabilities. Such a scheme has been adopted for multi-label evaluation in social network datasets (Wang and Sukthankar, 2013; Tang and Liu, 2009a).”

16. Zhang et al. (2018c): The paper is at https://www.sciencedirect.com/science/article/abs/pii/S0950705118303186. “For evaluation purpose, we assume that the number of labels for the unlabeled nodes is already known and assign the labels according to the top-ranking set of class-membership probabilities. Such a scheme has been adopted for multi-label evaluation in social network datasets (Tang et al., 2012)”

17. Zhang et al. (2018b): The paper is at https://link.springer.com/chapter/10.1007/978-3-030-02922-7_25.

“if the probability of a test instance belonging to a label is more than 0.5, MLKNN assigns this class label to this test instance. We made a minor change. For fair comparison, in all methods, we assume that the number of labels for all unlabeled nodes are already known and we assign the labels based on the classes with highest probability. Such a scheme has been adopted for multilabel evaluation in social network data sets (Tang and Liu, 2009a; Wang and Sukthankar, 2013) and improve the performance of MLKNN.”

18. Goyal and Ferrara (2018): The paper is at https://www.sciencedirect.com/science/article/abs/pii/S0950705118301540. “For data sets with multiple labels per node, we assume that we know how many labels to predict.”

19. Ye et al. (2018): The paper is at https://link.springer.com/article/10.1007/s41019-018-0062-8. “In multi-label prediction phrase, the goal is to find the most probable classes for each unlabeled node. Since most methods yield a ranking of labels rather than an exact assignment, a threshold is often required. To avoid the affection of introducing a threshold, we assign s most probable classes to a node, where s is the number of labels assigned to the node originally.”

20. Nandanwar and Murty (2018): The paper is at https://ojs.aaai.org/index.php/AAAI/article/view/11787. “In our multilabel setting, each node has a different number of labels assigned to it. For an unlabeled node, we predict a set of labels using predicted class membership probabilities. The cardinality of the predicted label set is equal to that of its original label set.”

21. Berberidis et al. (2018): The paper is at https://ieeexplore.ieee.org/abstract/document/8622130. “Similar to Grover and Leskovec (2016) and Perozzi et al. (2014), during evaluation the number of labels per sampled node is known, and check how many of them are in the top predictions.”

22. Yin and Wei (2019): The paper is at https://dl.acm.org/doi/abs/10.1145/3292500.3330860. “To avoid the thresholding effect (Tang et al., 2009), we assume that the number of labels for test data is given (Perozzi et al., 2014). The performance of each method is evaluated in terms of average Micro-F1 and average Macro-F1 (Tsoumakas et al., 2010), and we only report Micro-F1 as we experience similar behaviors with Macro-F1. Table 7 and Table 8 show the node classification results on BlogCatalog and Flickr. Surprisingly, DeepWalk outperforms all successors other than STRAP on both datasets.”

23. Schlötterer et al. (2019): The paper is at https://ieeexplore.ieee.org/abstract/document/8816899. “Since BlogCatalog is multi-label, we first obtain the number of actual labels to predict for each sample from the test set. Then we predict the k most probable classes, where k is the number of labels to predict. This is a common choice in the evaluation setup of the reproduced methods. All methods report the Macro-F1 score, except for Walklets, reporting Micro-F1, which we follow in the reproduction.”

24. Qiu et al. (2019): The paper is at https://dl.acm.org/doi/abs/10.1145/3308558.3313446. “To avoid the thresholding effect, we take the assumption that was made in DeepWalk, LINE, and node2vec, that is, the number of labels for vertices in the test data is given Grover and Leskovec (2016); Perozzi et al. (2014); Tang et al. (2009).”

25. Berberidis et al. (2019): The paper is at https://ieeexplore.ieee.org/abstract/document/8590776. “Similar to Grover and Leskovec (2016) and Perozzi et al. (2014), during evaluation of accuracy the number of labels per sampled node is known, and check how many of them are in the top predictions.”

26. Berberidis and Giannakis (2019): The paper is at http://www.mlgworkshop.org/2019/papers/MLG2019_paper_40.pdf. “In the testing phase, we sorted the predicted class probabilities for each node in decreasing order, and extracted the top-ki ranking labels, were ki is the true number of labels of node vi. We then computed the Micro- and Macro-averaged F1 scores of the predicted labels.”

27. Berberidis and Giannakis (2021): The paper is at https://ieeexplore.ieee.org/abstract/document/8778744.

“In the testing phase, we sorted the predicted class probabilities for each node in decreasing order, and extracted the top-ki ranking labels, were ki is the true number of labels of node vi. We then computed the Micro- and Macro-averaged F1 scores of the predicted labels.”

28. Postăvaru et al. (2020): The paper is at https://arxiv.org/abs/2010.06992. “The other datasets have multiple labels per node, and we are using a One-vs-The-Rest ensemble. When evaluating, we assume the number of correct labels, K, is known and select the top K probabilities from the ensemble.”

29. Yue et al. (2020): The paper is at https://academic.oup.com/bioinformatics/article/36/4/1241/5581350?login=true. “We assign top αi predictions to the node i as its predicted labels, where αi is the number of golden labels of the node i in the testing set. Accuracy, Macro-F1 and Micro-F1 are used to evaluate the performance of different embedding methods on the testing set.”

30. Devooght et al. (2014): The paper is at https://dl.acm.org/doi/abs/10.1145/2566486.2567986. “Each element of the BlogCatalog and Flickr datasets can have multiple labels, and the presented algorithms produce for each element a ranking of the most probable labels. This ranking is compared to the ground-truth using the micro and macro-average of the F-measure, respectively noted microF1 an macroF1 [20, 26], a pair of metrics well-known in information retrieval. The F-measure is the harmonic mean of precision and recall. The micro- and macro-average are two ways of computing the F-measure when dealing with multiple labels. The microF1 is defined using directly the global recall (ρ) and precision (π) of the results (as in Tang and Liu (2009a,b); Tang et al. (2010), ρ and π are computed on the ranking truncated to the true number of labels)”

31. Krohn-Grimberghe et al. (2012): The paper is at https://dl.acm.org/doi/10.1145/2124295.2124317.

“As suggested in Tang and Liu (2009b), we remove the dependency on the length of the top-N list for performance evaluation w.r.t. Micro-F1 and Macro-F1 by assuming that the number of true instances is known at prediction time.”

32. Santos et al. (2018): The paper is at https://dl.acm.org/doi/pdf/10.1145/3201603. “We have considered two different evaluation measures: Precision at 1 (P@1) measures the percentage of nodes for which the category with the highest predicted score is among the observed labels for this node. For monolabel classification, this should be the target label, while for multilabel classification, this could be any of the target labels.

Precision at k (P@k) is the proportion of correct labels in the set of k labels with the highest predicted scores. For the monolabel dataset DBLP, we only make use of Precision at 1 (P@1). For the multilabels dataset, P@k will denote an average over all the node types, with k set to the number of categories a node belongs to. We optimized the different models with regard to microaverage, but we report both microaverage and macroaverage precision P@•”

33. Sun et al. (2021): The paper is at https://link.springer.com/chapter/10.1007/978-981-15-8411-4_219.

“The prediction step is the same as the experimental procedure in NetMF (Qiu et al., 2018). We divide the datasets randomly into training and testing parts. The training ratio ranges from 10 to 90%. The rest data is the testing part. The multi-label classification task is implemented by one-vs-rest logistic regression model.

The result of the testing step is a list of labels’ ranking, instead of a fixed label. The number of labels is given in the experiment. The experiments repeat ten times and compute the average of macro-F1 and micro-F1 score, respectively as evaluation.”
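The scheme quoted in the papers above can be sketched in a few lines. The following is a minimal illustration in plain Python, not code from any of the cited papers: the function names, the toy decision values, and the 0.5 threshold are all assumptions made for the example. It contrasts top-k selection using the true label count of each test node with a fixed threshold that needs no ground-truth information, and computes Micro-F1 and Macro-F1 for both.

```python
# Illustrative sketch only; all names and data here are hypothetical.

def predict_top_k(scores, k):
    """Return the set of k label indices with the highest decision values."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])

def predict_threshold(scores, threshold=0.5):
    """Return every label whose decision value exceeds the threshold."""
    return {j for j, s in enumerate(scores) if s > threshold}

def f1_scores(true_sets, pred_sets, num_labels):
    """Compute (Micro-F1, Macro-F1) from per-instance label sets."""
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    for truth, pred in zip(true_sets, pred_sets):
        for j in pred & truth:
            tp[j] += 1
        for j in pred - truth:
            fp[j] += 1
        for j in truth - pred:
            fn[j] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(num_labels)) / num_labels
    return micro, macro

# Toy data (made up): three test nodes, four labels.
y_true = [{0, 1}, {2}, {0, 3}]
scores = [
    [0.90, 0.40, 0.10, 0.20],
    [0.20, 0.30, 0.60, 0.10],
    [0.70, 0.10, 0.35, 0.30],
]

# Scheme used in the quoted papers: the true label count is known at prediction time.
oracle_preds = [predict_top_k(s, len(t)) for s, t in zip(scores, y_true)]
# Alternative that uses no ground-truth information: a fixed threshold.
threshold_preds = [predict_threshold(s, 0.5) for s in scores]

oracle_micro, oracle_macro = f1_scores(y_true, oracle_preds, 4)
threshold_micro, threshold_macro = f1_scores(y_true, threshold_preds, 4)
```

On this toy input the top-k scheme gives a Micro-F1 of 0.8 and the threshold gives 0.75. The numbers only illustrate the mechanics and say nothing about real datasets.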

### 5.2 Papers Using Number of Labels in Their Code

34. Tang et al. (2015): The paper is at https://dl.acm.org/doi/abs/10.1145/2736277.2741093. The code is at https://github.com/tangjianpku/LINE. Predictions were done at linux/evaluate/program/score.cpp. The following segment shows that the number of labels is used.

    FILE *fi = fopen(candidate_file, "rb");
    while (fscanf(fi, "%d", &len) == 1)
    {
        v_nlabels.push_back(len);
        for (int k = 0; k != len; k++)
        {
            fscanf(fi, "%d", &lb);
            if (lb2id[lb] == 0) lb2id[lb] = ++id_size;
            id = lb2id[lb];
            truth[id].insert(test_size);
        }
        test_size++;
    }
    fclose(fi);

From the above we can see that v_nlabels[k] represents the number of true labels for instance k.

    for (int k = 0; k != test_size; k++)
    {
        fscanf(fi, "%d", &tmp);
        for (int i = 0; i != label_size; i++)
        {
            id = pst2id[i];
            ranked_list[i].id = id;
            ranked_list[i].value = prob;
        }
        sort(ranked_list, ranked_list + label_size);
        int n = v_nlabels[k];
        for (int i = 0; i != n; i++)
        {
            id = ranked_list[i].id;
            predict[id].insert(k);
        }
    }
    fclose(fi);

After ranking the probability values, the code predicts as many labels for each instance as it truly has, which uses ground-truth information. These predictions are then used to evaluate the Macro- and Micro-F1 scores.
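One consequence of predicting exactly as many labels as each instance truly has can be read off from (12) and (13) in Section 1. The short derivation below is added as an illustration and assumes only that the predicted number of labels equals the true number for every test instance i, with l denoting the number of test instances as in (13).

```latex
% Assumption: \hat{K}_i = K_i for every test instance i (true label counts are used).
\begin{align*}
\mathrm{FP}_i &= \hat{K}_i - \mathrm{TP}_i = K_i - \mathrm{TP}_i = \mathrm{FN}_i
  && \text{by (12)}, \\
\text{Micro-F1} &= \frac{2 \sum_{i=1}^{l} \mathrm{TP}_i}{\sum_{i=1}^{l} \bigl( K_i + \hat{K}_i \bigr)}
  = \frac{\sum_{i=1}^{l} \mathrm{TP}_i}{\sum_{i=1}^{l} K_i}
  && \text{by (13)}.
\end{align*}
```

Under this assumption the total numbers of false positives and false negatives are equal, so Micro-F1 coincides with both the micro-averaged precision and the micro-averaged recall.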

35. Perozzi et al. (2014): The paper is at https://dl.acm.org/doi/abs/10.1145/2623330.2623732. The code is at https://github.com/phanein/deepwalk. Predictions were done at example_graphs/scoring.py. The following code shows that the number of labels is used.

    clf = TopKRanker(LogisticRegression())
    clf.fit(X_train, y_train_)

    # find out how many labels should be predicted
    top_k_list = [len(l) for l in y_test]
    preds = clf.predict(X_test, top_k_list)

    results = {}
    averages = ["micro", "macro"]
    for average in averages:
        results[average] = f1_score(mlb.fit_transform(y_test), mlb.fit_transform(preds),
                                    average=average)

36. Dalmia et al. (2018): The paper is at https://dl.acm.org/doi/abs/10.1145/3184558.3191523. The code is at https://github.com/ganeshjawahar/interpretNode. Predictions were done at downstream/nodeclass/scoring.py, which follows the same approach as Perozzi et al. (2014).

    clf = TopKRanker(LogisticRegression())
    clf.fit(X_train, y_train)

    # find out how many labels should be predicted
    top_k_list = [len(l) for l in y_test]
    preds = clf.predict(X_test, top_k_list)

37. Khosla et al. (2020): The paper is at https://link.springer.com/chapter/10.1007/978-3-030-46150-8_24. The code is at https://git.l3s.uni-hannover.de/khosla/nerd. For evaluation, their predictions used the number of true labels, which can be found in evaluation/functions.py. The following segment shows that the number of labels is used.

    def __get_f1(predictions, y, number_of_labels):
        # find the indices (labels) with the highest probabilities (ascending order)
        pred_sorted = numpy.argsort(predictions, axis=1)
        # the true number of labels for each node
        num_labels = numpy.sum(y, axis=1)
        # we take the best k label predictions for all nodes, where k is the true number of labels
        pred_reshaped = []
        for pr, num in zip(pred_sorted, num_labels):
            pred_reshaped.append(pr[-num:].tolist())
        # convert back to binary vectors
        pred_transformed = MultiLabelBinarizer(range(number_of_labels)).fit_transform(
            pred_reshaped)
        f1_micro = f1_score(y, pred_transformed, average='micro')
        f1_macro = f1_score(y, pred_transformed, average='macro')
        return f1_micro, f1_macro
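The slice `pr[-num:]` in the segment above works because the indices are sorted in ascending order of score, so the last `num` entries are the highest-scoring labels. The sketch below reproduces the same indexing with plain Python lists, which follow the same slicing rules; the helper names and scores are hypothetical. It also shows an edge case that the quoted code does not discuss: when the count is 0, the slice `[-0:]` is the same as `[0:]` and keeps every label. The guarded variant is only one possible way to avoid that and is not taken from any cited repository.

```python
# Illustrative only: plain Python lists mimic the slicing used in the segment above.

def argsort_ascending(scores):
    """Indices that order `scores` from lowest to highest."""
    return sorted(range(len(scores)), key=lambda j: scores[j])

def top_k_by_negative_slice(scores, k):
    """Mirror `pr[-k:]`: keep the last k indices of the ascending ranking."""
    return argsort_ascending(scores)[-k:]

def top_k_guarded(scores, k):
    """Same selection for k > 0, but return no labels when k is 0."""
    ranking = argsort_ascending(scores)
    return ranking[len(ranking) - k:]

scores = [0.1, 0.7, 0.2, 0.6]  # made-up decision values for four labels

two_best = top_k_by_negative_slice(scores, 2)        # labels 3 and 1
zero_requested = top_k_by_negative_slice(scores, 0)  # [-0:] keeps all four labels
zero_guarded = top_k_guarded(scores, 0)              # empty list
```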

38. Rahman et al. (2020): The paper is at https://ieeexplore.ieee.org/abstract/document/9338414. The code is at https://github.com/HipGraph/Force2Vec. When computing node classification scores, a similar OneVsRestClassifier class is defined as in Perozzi et al. (2014), which takes in the number of labels for each entry and subsequently uses it for prediction.

    #this class is defined similar to deepwalk paper.
    class MyClass(OneVsRestClassifier):
        def prediction(self, X, nclasses):
            ps = np.asarray(super(MyClass, self).predict_proba(X))
            #print(ps)
            predlabels = []
            for i, k in enumerate(nclasses):
                ps_ = ps[i, :]
                labels = self.classes_[ps_.argsort()[-k:]].tolist()
                predlabels.append(labels)
            return predlabels

    modelLR = MyClass(LogisticRegression(random_state=0)).fit(trainX, trainY)
    ncs = [len(x) for x in testY]
    predictedY = modelLR.prediction(testX, ncs)

39. Liang et al. (2021): The paper is at https://ojs.aaai.org/index.php/ICWSM/article/view/18067. The code is at https://github.com/jiongqian/MILE. In eval_embed.py, the multilabel classifier function uses the same strategy as Perozzi et al. (2014).

    clf = OneVsRestClassifier(LogisticRegression(), n_jobs=12)  # for multilabel scenario.
    #penalty='l2'
    clf.fit(X_train, y_train)
    y_pred_proba = clf.predict_proba(X_test)
    y_pred = []
    for inst in range(len(X_test)):
        # assume it has the same number of labels as the truth. Same strtegy is used in DeepWalk and Node2Vec paper.
        y_pred.append(y_pred_proba[inst, :].argsort()[::-1][:sum(y_test[inst, :])])

40. Cui et al. (2020): The paper is at https://ieeexplore.ieee.org/abstract/document/9377928. The code is at https://github.com/7733com/MLANE. Again, as seen in src/utils.py, a OneVsRestClassifier is used to make predictions taking into account a top_k_list containing the number of labels for each entry.

    top_k_list = [len(l) for l in Y]
    Y_ = self.predict(X, top_k_list)

The following four papers (41, 42, 43, and 44) all contain the same two lines of code as above in their respective scoring or evaluation functions, differing only in the variable names used:

41. Zhuo et al. (2019): The paper is at https://www.hindawi.com/journals/cin/2019/8106073/. The code is at https://github.com/JhuoW/CAHNE and evaluation is done in classify.py.

42. Zhang et al. (2019a): The paper is at https://www.ijcai.org/proceedings/2019/594. The code is at https://github.com/THUDM/ProNE and evaluation is done in classifier.py.

43. Yang et al. (2019): The paper is at https://dl.acm.org/doi/abs/10.1145/3292500.3330951. The code is at https://github.com/eXascaleInfolab/NodeSketch and evaluation is done in scoring_kernel_hamming.py.

44. Akbas and Aktas (2019): The paper is at https://ieeexplore.ieee.org/abstract/document/9006142. The code is at https://github.com/esraabil/NECL and evaluation is done in classify.py.

45. Wang et al. (2020b): The paper is at https://dl.acm.org/doi/abs/10.1145/3340531.3412041. The code is at https://github.com/SoftWiser-group/RankNE/. scoring.py’s evaluation function contains the following, which follows the trend from the above papers:

    # find out how many labels should be predicted
    top_k_list = [np.sum(y_test_[i]) for i in range(y_test_.shape[0])]
    # print('top_k_list', top_k_list[:10])
    preds = clf.predict(X_test, top_k_list)

46. Li et al. (2021): The paper is at https://www.jair.org/index.php/jair/article/view/12567. The code is at https://github.com/RingBDStack/RWNE/. code/eval_classify.py’s classify_thread_body function contains the following, almost identical to the above:

    # find out how many labels should be predicted
    top_k_list = [np.sum(Y_test[i]) for i in range(np.size(Y_test,axis=0))]
    clf = TopKRanker(LogisticRegression())
    clf.fit(X_train, Y_train)
    preds = clf.predict(X_test, top_k_list)

47. Xiao et al. (2020): The paper is at https://epubs.siam.org/doi/abs/10.1137/1.9781611976236.67. The code is at https://github.com/HKUST-KnowComp/vertex-reinforced-random-walk. Evaluation is done inside cogdl/tasks/unsupervised_node_classification.py’s evaluate function:

    # find out how many labels should be predicted
    top_k_list = list(map(int, y_test.sum(axis=1).T.tolist()[0]))
    preds = clf.predict(X_test, top_k_list)

48. Cen et al. (2021): The paper is at https://arxiv.org/abs/2103.00959. The code is at https://github.com/thudm/cogdl. Predictions were done at cogdl/tasks/unsupervised_node_classification.py. The following code shows that the number of labels is used in the prediction function.

    class TopKRanker(OneVsRestClassifier):
        def predict(self, X, top_k_list):
            assert X.shape[0] == len(top_k_list)
            probs = np.asarray(super(TopKRanker, self).predict_proba(X))
            all_labels = sp.lil_matrix(probs.shape)
            for i, k in enumerate(top_k_list):
                probs_ = probs[i, :]
                labels = self.classes_[probs_.argsort()[-k:]].tolist()
                for label in labels:
                    all_labels[i, label] = 1
            return all_labels

From the code below we can see the number of labels is passed into the prediction function above.

    clf = TopKRanker(LogisticRegression(solver="liblinear"))
    clf.fit(X_train, y_train)
    # find out how many labels should be predicted
    top_k_list = list(map(int, y_test.sum(axis=1).T.tolist()[0]))
    preds = clf.predict(X_test, top_k_list)
    result = f1_score(y_test, preds, average="micro")
    all_results[train_percent].append(result)

49. Liu et al. (2020b): The paper is at https://dl.acm.org/doi/abs/10.1145/3340531.3411910. The code is at https://github.com/smufang/meta-tail2vec/. The following can be found in multilabel_task.py:

    # find out how many labels should be predicted, same as deepwalk
    top_k_list = [len(l) for l in y_test__]
    preds = clf.predict(X_test, top_k_list)

50. Chanpuriya et al. (2021): The paper is at http://proceedings.mlr.press/v139/chanpuriya21a/chanpuriya21a.pdf. The code is at https://github.com/konsotirop/Invert_Embeddings. In predict.py, the function construct_indicator utilizes the number of labels to generate predictions:

    def construct_indicator(y_score, y):
        # rank the labels by the scores directly
        num_label = np.sum(y, axis=1, dtype=np.int)
        y_sort = np.fliplr(np.argsort(y_score, axis=1))
        y_pred = np.zeros_like(y, dtype=np.int)
        for i in range(y.shape[0]):
            # print(type(i), num_label.shape, num_label[i])
            for j in range(num_label[i]):
                y_pred[i, y_sort[i, j]] = 1
        return y_pred

The y_pred returned by the above is then used to calculate Micro- and Macro-F1 scores in the main predict_cv function.

    y_score = clf.predict_proba(X_test)
    y_pred = construct_indicator(y_score, y_test)
    mi = f1_score(y_test, y_pred, average="micro")
    ma = f1_score(y_test, y_pred, average="macro")

51. Lutov et al. (2019): The paper is at https://ieeexplore.ieee.org/abstract/document/9006038. The code is at https://github.com/eXascaleInfolab/GraphEmbEval. This evaluation framework similarly uses a scoring_classif.py that includes a TopKRanker class predicting labels according to known numbers.

    class TopKRanker(OneVsRestClassifier):
        def predict(self, gram_test, top_k_list):
            assert gram_test.shape[0] == len(top_k_list)
            probs = super(TopKRanker, self).predict_proba(gram_test)
            if not isinstance(probs, np.ndarray):
                probs = probs.toarray()
            all_labels = []
            for i, k in enumerate(top_k_list):
                probs_ = probs[i]
                # Fetch test labels
                labels = self.classes_[probs_.argsort()[-k:]].tolist()
                all_labels.append(labels)
            return all_labels

52. Chen et al. (2018): The paper is at https://ojs.aaai.org/index.php/AAAI/article/view/11849. The code is at https://github.com/GTmac/HARP. The scoring function inside src/scoring.py contains the following:

    # find out how many labels should be predicted
    top_k_list = [len(l) for l in y_test]
    preds = clf.predict(X_test, top_k_list)

53. Yang et al. (2017): The paper is at https://www.ijcai.org/proceedings/2017/544. The code is at https://github.com/thunlp/NEU. The main program neu.m calls the evaluate function and by default generates predictions through evaluate_by_ranking_via_instance, shown below:

    function [pred] = evaluate_by_ranking_via_instance(pred, Y)
    % rank the labels by the scores directly
    [n, k] = size(Y);
    [val, index] = sort(full(pred), 2, 'descend');
    numlabel = sum(Y,2); % the number of labels for each instance
    clear val;
    pred = construct_indicator(index, numlabel);
    pred = sparse(pred);

In addition, the main evaluation function contains the following comment:

% suppose we know the number of labels for ground truth

54. Tsitsulin et al. (2021): The paper is at https://dl.acm.org/doi/abs/10.14778/3447689.3447713. The code is at https://github.com/xgfs/FREDE. DeepWalk’s performance in the paper is evaluated using the known number of labels, as seen inside src/classification.py’s evaluate_deepwalk function:

    clf = TopKRanker(LogisticRegression(solver='liblinear'))
    clf.fit(X_train, Y_train)
    top_k_list = [l.nnz for l in Y_test]
    preds = clf.predict(X_test, top_k_list)

55. Liu et al. (2021): The paper is at https://ieeexplore.ieee.org/abstract/document/9380483. The code is at https://github.com/Qidong-Liu/TriATNE. Inside eval/functions.py, the get_f1 function uses the true number of labels:

    # the true number of labels for each node
    num_labels = numpy.sum(y, axis=1)
    # we take the best k label predictions for all nodes, where k is the true number of labels
    pred_reshaped = []
    pred_set = set()
    for pr, num in zip(pred_sorted, num_labels):
        pred_reshaped.append(pr[-num:].tolist())
        pred_set.update(pr[-num:])

56. Sheikh et al. (2019): The paper is at https://link.springer.com/article/10.1007/s00607-018-0622-9. The code is at https://github.com/snash4/GAT2VEC. Inside src/GAT2VEC/evaluation/classification.py, the following function explicitly uses label information:

    def fit_and_predict_multilabel(self, clf, X_train, X_test, y_train, y_test):
        """ predicts and returns the top k labels for multi-label classification k depends on the number of labels in y_test."""
        clf.fit(X_train, y_train)
        y_pred_probs = clf.predict_proba(X_test)
        pred_labels = []
        nclasses = y_test.shape[1]
        top_k_labels = [np.nonzero(label)[0].tolist() for label in y_test]
        for i in range(len(y_test)):
            k = len(top_k_labels[i])
            probs_ = y_pred_probs[i, :]
            labels_ = tuple(np.argsort(probs_).tolist()[-k:])
            pred_labels.append(labels_)

57. Mitra et al. (2020): The paper is at https://epubs.siam.org/doi/abs/10.1137/1.9781611976236.55. The code is at https://github.com/sonaidgr8/USS_NMF. Inside main_algo.py, the construct_indicator function is used to generate predictions by ranking labels’ scores:

    def construct_indicator(y_score, y):
        # rank the labels by the scores directly
        num_label = np.sum(y, axis=1, dtype=np.int)
        y_sort = np.fliplr(np.argsort(y_score, axis=1))
        y_pred = np.zeros_like(y, dtype=np.int)
        for i in range(y.shape[0]):
            for j in range(num_label[i]):
                y_pred[i, y_sort[i, j]] = 1
        return y_pred

The above is used in the main function as follows:

y_pred = construct_indicator(predictions, labels[pred_ids, :])

58. Zhu et al. (2021): The paper is at https://epubs.siam.org/doi/abs/10.1137/1.9781611976700.19. The code is at https://github.com/GemsLab/PhUSION. Inside src/eval/predict.py, the exact same evaluation structure is followed as in the previous paper, differing only in variable names used.

59. Liu et al. (2020a): The paper is at https://ieeexplore.ieee.org/abstract/document/9121704. The code is at https://github.com/Qidong-Liu/ANE. In eval/functions.py, the get_f1 function explicitly uses the true number of labels in prediction:

    # the true number of labels for each node
    num_labels = numpy.sum(y, axis=1)
    # we take the best k label predictions for all nodes, where k is the true number of labels
    pred_reshaped = []
    pred_set = set()
    for pr, num in zip(pred_sorted, num_labels):
        pred_reshaped.append(pr[-num:].tolist())
        pred_set.update(pr[-num:])

60. Chen et al. (2017): The paper is at https://arxiv.org/abs/1702.05764. The code is downloaded from https://users.ece.cmu.edu/~sihengc/code.zip. In scoring.py, the same strategy is again used as in Perozzi et al. (2014), with topk set to true in all evaluations done in run.sh.

    if topk:
        # find out how many labels should be predicted
        top_k_list = [len(l) for l in y_test]
        preds = clf.predict(X_test, top_k_list)
        preds = label2onehot(preds, labels_matrix.toarray().shape[1])
    else:
        preds = clf.predict(X_test)

61. Zhang and Xu (2019): The paper is at https://arxiv.org/abs/1912.00303. The code is at https://github.com/hanzh015/MANELA. To score node classification, the scoring.py program is modified from the original DeepWalk paper, using the number of labels in prediction:

    # find out how many labels should be predicted
    top_k_list = [len(l) for l in y_test]
    preds = clf.predict(X_test, top_k_list)

62. Schumacher et al. (2020): The paper is at https://arxiv.org/abs/2005.10039. The code is at https://github.com/SGDE2020/embedding_stability. Inside downstream_classification/classify.py, regardless of the classifier used, multilabel predictions for both training and testing sets are done by first ranking probabilities in descending order and then selecting the top k, where k is equal to the number of true labels. The following is from the get_predictions function, where k is np.count_nonzero(labels).

    if classification_type == ClassificationType.MULTILABEL:
        if isinstance(train_proba, list):
            train_proba = list(true_prop(train_proba))
            train_proba = np.array(train_proba).T
            test_proba = list(true_prop(test_proba))
            test_proba = np.array(test_proba).T
        target_classes = node_labels_train.shape[-1]
        train_pred = np.zeros((node_labels_train.shape[0], target_classes))
        test_pred = np.zeros((node_labels_test.shape[0], target_classes))
        for i, labels in enumerate(node_labels_train):
            train_pred[i][np.argsort(-train_proba[i])[0:np.count_nonzero(labels)]] = 1
        for i, labels in enumerate(node_labels_test):
            test_pred[i][np.argsort(-test_proba[i])[0:np.count_nonzero(labels)]] = 1

63. Qiu et al. (2021): The paper is at https://dl.acm.org/doi/abs/10.1145/3448016.3457329. The code is at https://github.com/xptree/LightNE. In LightNE/predict.py, the function below shows that the number of labels is used to construct prediction results.

    def construct_indicator(y_score, y):
        # rank the labels by the scores directly
        num_label = y.sum(axis=1, dtype=np.int32)
        # num_label = np.sum(y, axis=1, dtype=np.int)
        y_sort = np.fliplr(np.argsort(y_score, axis=1))
        #y_pred = np.zeros_like(y_score, dtype=np.int32)
        row, col = [], []
        for i in range(y_score.shape[0]):
            row += [i] * num_label[i, 0]
            col += y_sort[i, :num_label[i, 0]].tolist()
            #for j in range(num_label[i, 0]):
            #    y_pred[i, y_sort[i, j]] = 1
        y_pred = sp.csr_matrix(
            ([1] * len(row), (row, col)),
            shape=y.shape, dtype=np.bool_)
        return y_pred

The code below shows that the above function is used when evaluating multi-label classification results.

    y_score = clf.predict_proba(X_test)
    y_pred = construct_indicator(y_score, y_test)
    mi = f1_score(y_test, y_pred, average="micro")
    ma = f1_score(y_test, y_pred, average="macro")

64. Cheng et al. (2019): The paper is at https://link.springer.com/article/10.1007/s11390-019-1934-8. The code is at https://github.com/daweicheng/BHONEM. In method/classify.py, the class below shows that the classifier takes the number of labels as an argument.

    class TopKRanker(OneVsRestClassifier):
        def predict(self, X, top_k_list):
            probs = numpy.asarray(super(TopKRanker, self).predict_proba(X))
            all_labels = []
            for i, k in enumerate(top_k_list):
                probs_ = probs[i, :]
                labels = self.classes_[probs_.argsort()[-k:]].tolist()
                probs_[:] = 0
                probs_[labels] = 1
                all_labels.append(probs_)
            return numpy.asarray(all_labels)

The code below shows that the prediction function takes the number of labels as input.

    top_k_list = [len(l) for l in Y]
    Y_ = self.predict(X, top_k_list)

65. Liu et al. (2018): The paper is at http://www.mlgworkshop.org/2018/papers/MLG2018_paper_5.pdf. The code is at https://github.com/uestcnlp/GEN. The code below shows that the number of labels is used to construct new prediction results.

    predlist = softmax[i][0]
    top_k = len(y_test[h])
    top_k_list = heapq.nlargest(top_k, range(len(predlist)), predlist.__getitem__)
    top_k_list = [j for j in top_k_list]
    preds[h] = top_k_list

66. He et al. (2019): The paper is at https://dl.acm.org/doi/pdf/10.1145/3357384.3358061. The code is at https://github.com/HKUST-KnowComp/HeteSpaceyWalk. The code below shows that the number of labels is used to reconstruct prediction results. Inside eval_classify.py, the TopKRanker class takes the number of labels as an argument and returns that many predicted labels for each entry.

    # find out how many labels should be predicted
    top_k_list = [np.sum(Y_test[i]) for i in range(np.size(Y_test,axis=0))]
    clf = TopKRanker(LogisticRegression())
    clf.fit(X_train, Y_train)
    preds = clf.predict(X_test, top_k_list)

67. Gu et al. (2020): The paper is at https://proceedings.neurips.cc/paper/2020/file/8b5c8441a8ff8e151b191c53c1842a38-Paper.pdf. The code is at https://github.com/SwiftieH/IGNN/tree/main/nodeclassification.