Compute inverse propensity scores — compute_propensity

Compute inverse propensity scores from a label distribution. Propensity scores for extreme multi-label learning were proposed in Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17 August, 935–944. doi:10.1145/2939672.2939756

compute_propensity_scores(label_distribution, a = 0.55, b = 1.5)

Arguments

label_distribution: A data.frame with columns "label_id", "label_freq", and "n_docs". label_freq is the number of occurrences of a label in the gold standard; n_docs is the total number of documents in the gold standard.
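a: Numeric parameter of the propensity score calculation (presumably parameter A of the propensity model in Jain et al., 2016). Defaults to 0.55.
b: Numeric parameter of the propensity score calculation (presumably parameter B of the propensity model in Jain et al., 2016). Defaults to 1.5.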
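
The input format described above can be built from a gold standard with base R. The following is a minimal sketch, not taken from the package: the `gold_standard` data frame and its `doc_id` column are hypothetical, and it assumes that `n_docs` holds the same total on every row.

```r
# Hypothetical gold standard: one row per (document, label) assignment.
gold_standard <- data.frame(
  doc_id   = c("d1", "d1", "d2", "d3", "d3", "d3"),
  label_id = c("A",  "B",  "A",  "A",  "C",  "B"),
  stringsAsFactors = FALSE
)

# Count how often each label occurs in the gold standard.
label_counts <- table(gold_standard$label_id)

label_distribution <- data.frame(
  label_id   = names(label_counts),
  label_freq = as.integer(label_counts),
  # Assumption: the total number of distinct documents is repeated per row.
  n_docs     = length(unique(gold_standard$doc_id)),
  stringsAsFactors = FALSE
)
```

A data frame of this shape should be suitable as the `label_distribution` argument, provided the column names match those listed above.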

Value

A data.frame with columns "label_id" and "label_weight", where label_weight is the inverse propensity score of the label.
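
To illustrate how `a`, `b`, `label_freq`, and `n_docs` could combine into `label_weight`, here is a sketch of the inverse propensity formula from Jain et al. (2016): 1 / p_l = 1 + C * (N_l + b)^(-a), with C = (log(N) - 1) * (b + 1)^a, where N_l is the label frequency and N the number of documents. The helper `sketch_inverse_propensity` is hypothetical and not part of casimir; whether the package follows this formula exactly is an assumption.

```r
# Hypothetical helper (not the casimir implementation): inverse propensity
# following the formula in Jain et al. (2016).
sketch_inverse_propensity <- function(label_distribution, a = 0.55, b = 1.5) {
  n_docs <- label_distribution$n_docs
  # Constant C = (log(N) - 1) * (b + 1)^a from the paper.
  c_const <- (log(n_docs) - 1) * (b + 1)^a
  data.frame(
    label_id     = label_distribution$label_id,
    # Inverse propensity: 1 + C * (N_l + b)^(-a).
    label_weight = 1 + c_const * (label_distribution$label_freq + b)^(-a),
    stringsAsFactors = FALSE
  )
}

# Toy input with one rare and one frequent label (made-up numbers).
toy <- data.frame(
  label_id   = c("rare", "common"),
  label_freq = c(1, 500),
  n_docs     = 1000,
  stringsAsFactors = FALSE
)
weights <- sketch_inverse_propensity(toy)
```

Under this formula every weight is greater than 1 when the gold standard has at least three documents, and rare labels receive larger weights than frequent ones.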

Examples


library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ stringr   1.6.0
#> ✔ forcats   1.0.1     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ readr     2.1.6     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(casimir)

label_distribution <- dnb_label_distribution

compute_propensity_scores(label_distribution)
#> # A tibble: 7,772 × 2
#>    label_id  label_weight
#>    <chr>            <dbl>
#>  1 041321634         2.67
#>  2 041321650         1.99
#>  3 041608607         5.66
#>  4 042388120         1.91
#>  5 042718368         3.08
#>  6 043049168         2.87
#>  7 040118827         1.04
#>  8 040320553         2.51
#>  9 040340139         1.76
#> 10 041303059         5.51
#> # ℹ 7,762 more rows