compute_propensity_scores.Rd

Compute inverse propensity scores based on a label distribution. Propensity scores for extreme multi-label learning were proposed in Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-Aug, 935–944. doi:10.1145/2939672.2939756.
compute_propensity_scores(label_distribution, a = 0.55, b = 1.5)

label_distribution: A data.frame with columns "label_id", "label_freq" and "n_docs". label_freq is the number of occurrences of a label in the gold standard. n_docs is the total number of documents in the gold standard.
a: A numeric parameter for the propensity score calculation, defaults to 0.55.
b: A numeric parameter for the propensity score calculation, defaults to 1.5.
Returns a data.frame with columns "label_id" and "label_weight".
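To make the roles of a and b concrete, here is a minimal sketch of the propensity model from Jain et al. (2016), where the inverse propensity of a label with frequency N_l in N documents is 1 + C * (N_l + b)^(-a), with C = (log(N) - 1) * (b + 1)^a. The helper name inverse_propensity_sketch and the toy data are hypothetical and are not part of casimir. It is an assumption that compute_propensity_scores follows this formula exactly, so its output may differ.

```r
# Hypothetical helper, NOT part of casimir: inverse propensity weights
# following the formula published in Jain et al. (2016).
inverse_propensity_sketch <- function(label_distribution, a = 0.55, b = 1.5) {
  required <- c("label_id", "label_freq", "n_docs")
  stopifnot(all(required %in% names(label_distribution)))

  n_docs <- label_distribution$n_docs      # N: documents in the gold standard
  n_l <- label_distribution$label_freq     # N_l: occurrences of each label

  # C = (log(N) - 1) * (b + 1)^a
  c_const <- (log(n_docs) - 1) * (b + 1)^a

  # Inverse propensity: 1 / p_l = 1 + C * (N_l + b)^(-a)
  data.frame(
    label_id = label_distribution$label_id,
    label_weight = 1 + c_const * (n_l + b)^(-a),
    stringsAsFactors = FALSE
  )
}

# Toy label distribution (invented for illustration only).
toy <- data.frame(
  label_id = c("rare", "common"),
  label_freq = c(1, 500),
  n_docs = 1000,
  stringsAsFactors = FALSE
)
weights <- inverse_propensity_sketch(toy)
```

Under this formula, rare labels receive larger weights than frequent ones, and every weight is greater than 1 as long as the gold standard contains at least three documents, since log(N) must exceed 1.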
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ stringr 1.6.0
#> ✔ forcats 1.0.1 ✔ tibble 3.3.0
#> ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
#> ✔ readr 2.1.5
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(casimir)
label_distribution <- dnb_label_distribution
compute_propensity_scores(label_distribution)
#> # A tibble: 7,772 × 2
#>    label_id  label_weight
#>    <chr>            <dbl>
#>  1 041321634         2.67
#>  2 041321650         1.99
#>  3 041608607         5.66
#>  4 042388120         1.91
#>  5 042718368         3.08
#>  6 043049168         2.87
#>  7 040118827         1.04
#>  8 040320553         2.51
#>  9 040340139         1.76
#> 10 041303059         5.51
#> # ℹ 7,762 more rows