Compute inverse propensity scores from a label distribution. Propensity scores for extreme multi-label learning were proposed in Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-Aug, 935–944. doi:10.1145/2939672.2939756.

compute_propensity_scores(label_distribution, a = 0.55, b = 1.5)

Arguments

label_distribution

A data.frame with columns "label_id", "label_freq" and "n_docs". label_freq is the number of occurrences of a label in the gold standard. n_docs is the total number of documents in the gold standard.

a

A numeric parameter for the propensity score calculation, defaults to 0.55.

b

A numeric parameter for the propensity score calculation, defaults to 1.5.
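
For orientation, the inverse propensity weight defined in Jain et al. (2016) can be written as below. This assumes that the function follows the paper's formulation, with a and b playing the role of the paper's parameters A and B, N_l corresponding to label_freq and N to n_docs. The symbols w_l, N_l, N and C are notation introduced here, not names used by the package.

```latex
w_l \;=\; \frac{1}{p_l} \;=\; 1 + C\,(N_l + b)^{-a},
\qquad
C \;=\; (\log N - 1)\,(b + 1)^{a}
```

Under this formulation, rare labels (small N_l) receive larger weights, and the weight approaches 1 as a label becomes very frequent.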

Value

A data.frame with columns "label_id" and "label_weight", where label_weight is the inverse propensity score of the label.

Examples


library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#>  dplyr     1.1.4      stringr   1.6.0
#>  forcats   1.0.1      tibble    3.3.0
#>  lubridate 1.9.4      tidyr     1.3.1
#>  readr     2.1.5     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
#>  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(casimir)

label_distribution <- dnb_label_distribution

compute_propensity_scores(label_distribution)
#> # A tibble: 7,772 × 2
#>    label_id  label_weight
#>    <chr>            <dbl>
#>  1 041321634         2.67
#>  2 041321650         1.99
#>  3 041608607         5.66
#>  4 042388120         1.91
#>  5 042718368         3.08
#>  6 043049168         2.87
#>  7 040118827         1.04
#>  8 040320553         2.51
#>  9 040340139         1.76
#> 10 041303059         5.51
#> # ℹ 7,762 more rows
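
To make the calculation more concrete, the following base R sketch applies the inverse propensity formula from Jain et al. (2016), 1 + C * (label_freq + b)^(-a) with C = (log(n_docs) - 1) * (b + 1)^a, to a small made-up label distribution. The helper inverse_propensity_sketch() and the toy data are hypothetical and only illustrate the idea; they are not taken from the casimir source, and the package's implementation may differ in detail.

```r
# Hypothetical helper: inverse propensity weights after Jain et al. (2016).
# This is an illustrative sketch, not the casimir implementation.
inverse_propensity_sketch <- function(label_freq, n_docs, a = 0.55, b = 1.5) {
  # Constant C depends on the collection size and the parameters a and b
  c_const <- (log(n_docs) - 1) * (b + 1)^a
  # Inverse propensity: 1 / p_l = 1 + C * (N_l + b)^(-a)
  1 + c_const * (label_freq + b)^(-a)
}

# Made-up label distribution with the expected input columns
toy <- data.frame(
  label_id   = c("A", "B", "C"),
  label_freq = c(1, 10, 100),
  n_docs     = 1000
)

toy$label_weight <- inverse_propensity_sketch(toy$label_freq, toy$n_docs)
toy[, c("label_id", "label_weight")]
```

In this sketch, the rarest label gets the largest weight and all weights stay above 1, which is the behavior the propensity model is designed to produce.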