Compute the multi-label metrics precision, recall, F1 and R-precision for subject indexing results.

compute_set_retrieval_scores(
  predicted,
  gold_standard,
  k = NULL,
  mode = "doc-avg",
  compute_bootstrap_ci = FALSE,
  n_bt = 10L,
  doc_groups = NULL,
  label_groups = NULL,
  graded_relevance = FALSE,
  rename_metrics = FALSE,
  seed = NULL,
  propensity_scored = FALSE,
  label_distribution = NULL,
  cost_fp_constant = NULL,
  replace_zero_division_with = options::opt("replace_zero_division_with"),
  drop_empty_groups = options::opt("drop_empty_groups"),
  ignore_inconsistencies = options::opt("ignore_inconsistencies"),
  verbose = options::opt("verbose"),
  progress = options::opt("progress")
)

compute_set_retrieval_scores_dplyr(
  predicted,
  gold_standard,
  k = NULL,
  mode = "doc-avg",
  compute_bootstrap_ci = FALSE,
  n_bt = 10L,
  doc_groups = NULL,
  label_groups = NULL,
  graded_relevance = FALSE,
  rename_metrics = FALSE,
  seed = NULL,
  propensity_scored = FALSE,
  label_distribution = NULL,
  cost_fp_constant = NULL,
  ignore_inconsistencies = FALSE,
  verbose = FALSE,
  progress = FALSE
)

Arguments

predicted

Multi-label prediction results. Expects a data.frame with columns "label_id", "doc_id".

gold_standard

Expects a data.frame with columns "label_id", "doc_id".

k

An integer limit on the number of predictions per document to consider. Requires a column "score" in the predicted input.
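
For example, to restrict evaluation to the top 3 predictions per document (a minimal sketch; scores are illustrative, gold as defined in the Examples below):

pred_scored <- tibble::tribble(
  ~doc_id, ~label_id, ~score,
  "A", "a", 0.9,
  "A", "d", 0.6,
  "A", "f", 0.2,
  "B", "a", 0.8,
  "B", "e", 0.3,
)
compute_set_retrieval_scores(pred_scored, gold, k = 3)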

mode

One of the following aggregation modes: "doc-avg", "subj-avg", "micro".

compute_bootstrap_ci

A logical indicator for computing bootstrap confidence intervals (CIs).

n_bt

An integer number of resamples to be used for bootstrapping.

doc_groups

A two-column data.frame with a column "doc_id" and a second column defining groups of documents to stratify results by. It is recommended that groups are of type factor so that levels are not implicitly dropped during bootstrap replications.

label_groups

A two-column data.frame with a column "label_id" and a second column defining groups of labels to stratify results by. Results in each stratum will restrict gold standard and predictions to the specified label groups as if the vocabulary was consisting of the label group only. All modes "doc-avg", "subj-avg", "micro" are supported within label strata. Nevertheless, mixing mode = "doc-avg" with fine-grained label strata can result in many missing values on document-level results. Also rank-based thresholding (e.g. top 5) will result in inhomogeneous numbers of labels per document within the defined label strata. mode = "subj-avg" or mode = "micro" can be more appropriate in these circumstances.

graded_relevance

A logical indicator for graded relevance. Defaults to FALSE for binary relevance. If set to TRUE, the predicted data.frame should contain a numeric column "relevance" with values in the range \([0, 1]\) (see the graded-relevance example below).

rename_metrics

If set to TRUE, the metric names in the output are renamed depending on the other arguments:

graded_relevance == TRUE

prefixed with "g-" to indicate that metrics are computed with graded relevance.

propensity_scored == TRUE

prefixed with "ps-" to indicate that metrics are computed with propensity scores.

!is.null(k)

suffixed with "@k" to indicate that metrics are limited to top k predictions.

seed

Pass a seed to make bootstrap replication reproducible.

propensity_scored

Logical, whether to use propensity scores as weights.

label_distribution

Expects a data.frame with columns "label_id", "label_freq", "n_docs". label_freq corresponds to the number of occurences a label has in the gold standard. n_docs corresponds to the total number of documents in the gold standard.

cost_fp_constant

Constant cost assigned to false positives. cost_fp_constant must be a numeric value > 0 or one of 'max', 'min', 'mean' (computed with reference to the gold_standard label distribution). Defaults to NULL, i.e. label weights are applied to false positives in the same way as to false negatives and true positives.
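
For example (a sketch; it is assumed here that label weights, and hence cost_fp_constant, take effect in propensity-scored evaluation):

compute_set_retrieval_scores(
  pred, gold,
  propensity_scored = TRUE,
  label_distribution = label_distr, # as in the sketch above
  cost_fp_constant = "mean"
)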

replace_zero_division_with

In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. By default, CASIMiR removes these missing values from macro averages, leading to a smaller support (the count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL, overwritable using option 'casimir.replace_zero_division_with' or environment variable 'R_CASIMIR_REPLACE_ZERO_DIVISION_WITH')
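
For example, to mimic implementations that default these cases to 0 (a sketch):

compute_set_retrieval_scores(pred, gold, replace_zero_division_with = 0)
# or globally, via the documented option:
options(casimir.replace_zero_division_with = 0)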

drop_empty_groups

Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE, overwritable using option 'casimir.drop_empty_groups' or environment variable 'R_CASIMIR_DROP_EMPTY_GROUPS')

ignore_inconsistencies

If set to TRUE, warnings about data inconsistencies are silenced. (Defaults to FALSE, overwritable using option 'casimir.ignore_inconsistencies' or environment variable 'R_CASIMIR_IGNORE_INCONSISTENCIES')

verbose

Verbose reporting of computation steps for debugging. (Defaults to FALSE, overwritable using option 'casimir.verbose' or environment variable 'R_CASIMIR_VERBOSE')

progress

Display progress bars for iterated computations (like bootstrap CIs or precision-recall curves). (Defaults to FALSE, overwritable using option 'casimir.progress' or environment variable 'R_CASIMIR_PROGRESS')

Value

A data.frame with columns "metric", "mode", "value", "support" and optional grouping variables supplied in doc_groups or label_groups. Here, support is defined for each mode as:

mode == "doc-avg"

The number of tested documents.

mode == "subj-avg"

The number of labels contributing to the subj-average.

mode == "micro"

The number of doc-label pairs contributing to the denominator of the respective metric, e.g. \(tp + fp\) for precision, \(tp + fn\) for recall, \(tp + (fp + fn)/2\) for F1 and \(\min(tp + fp, tp + fn)\) for R-precision.
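
For instance, the documented output columns can be inspected directly (a sketch; pred and gold as defined in the Examples below):

res <- compute_set_retrieval_scores(pred, gold, mode = "micro")
res[, c("metric", "mode", "value", "support")]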

Functions

  • compute_set_retrieval_scores_dplyr(): Variant that uses dplyr internally rather than the collapse library. Tends to be slower, but more stable.

Examples


library(tidyverse)
library(casimir)
library(furrr)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f",
)

pred <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "d",
  "A", "f",
  "B", "a",
  "B", "e",
  "C", "f",
)

plan(sequential) # or whatever resources you have

a <- compute_set_retrieval_scores(
  pred, gold,
  mode = "doc-avg",
  compute_bootstrap_ci = TRUE,
  n_bt = 100L
)

ggplot(a, aes(x = metric, y = value)) +
  geom_col() +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) +
  facet_wrap(vars(metric), scales = "free")


# example with graded relevance
pred_w_relevance <- tibble::tribble(
  ~doc_id, ~label_id, ~relevance,
  "A", "a", 1.0,
  "A", "d", 0.0,
  "A", "f", 0.0,
  "B", "a", 1.0,
  "B", "e", 1 / 3,
  "C", "f", 1.0,
)

b <- compute_set_retrieval_scores(
  pred_w_relevance, gold,
  mode = "doc-avg",
  graded_relevance = TRUE
)
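
# example with propensity scores (a sketch: the label distribution is
# derived from the gold standard, with the columns documented above)
label_distr <- dplyr::count(gold, label_id, name = "label_freq")
label_distr$n_docs <- dplyr::n_distinct(gold$doc_id)

ps <- compute_set_retrieval_scores(
  pred, gold,
  mode = "doc-avg",
  propensity_scored = TRUE,
  label_distribution = label_distr
)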