Enabling non-discrimination for end-users of recommender systems by introducing consumer fairness is a key problem. Current research has led to a variety of notions, metrics, and unfairness mitigation procedures. Nevertheless, only around half of the published studies are reproducible. When comparing the existing approaches under the same protocol, we observe unexpected outcomes, such as the minority group not always being the disadvantaged one.
In our ECIR 2022 paper, with Gianni Fenu, Mirko Marras, and Giacomo Medda, we analyzed the research landscape on consumer fairness in recommender systems, and benchmarked the existing approaches under a unified evaluation protocol.
Mitigation Procedures Collection
To collect existing mitigation procedures against consumer fairness, we systematically scanned the recent proceedings of top-tier Information Retrieval conferences, workshops, and journals, and retrieved the source code of the relevant papers. We processed the data sets used in our evaluation protocol, formatted them as per each mitigation procedure's requirements, and made the format of the mitigation outputs uniform. We trained the recommendation models included in the original papers, with and without mitigation, and computed fairness and utility metrics for the target recommendation task.
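As a rough illustration of the comparison step in this protocol, the sketch below trains the same model on the original data and on mitigated data and reports a utility and a fairness figure for each run. All callables and names here are illustrative placeholders, not the interface of our released artifacts.

```python
from typing import Any, Callable, Dict

def compare(train_data: Any,
            test_data: Any,
            train_fn: Callable,      # data -> fitted recommendation model
            mitigate_fn: Callable,   # data -> "debiased" data (pre-processing case)
            utility_fn: Callable,    # (model, test) -> float, e.g. NDCG
            fairness_fn: Callable,   # (model, test) -> float, e.g. DP or KS
            ) -> Dict[str, Dict[str, float]]:
    """Train the same model with and without mitigation and score both runs."""
    results = {}
    for label, data in (("baseline", train_data), ("mitigated", mitigate_fn(train_data))):
        model = train_fn(data)
        results[label] = {"utility": utility_fn(model, test_data),
                          "fairness": fairness_fn(model, test_data)}
    return results
```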
Finally, 15 relevant papers were considered in our study, but only 8 of them were deemed reproducible. For each reproducible paper, we identified:
- the recommendation task (RP: Rating Prediction; TR: Top-N Recommendation);
- the notion of consumer fairness (EQ: equity of the error/utility score across demographic groups; IND: independence of the predicted relevance scores or recommendations from the demographic group);
- the consumers’ grouping (G: Gender, A: Age, O: Occupation, B: Behavioral);
- the mitigation type (PRE-, IN-, or POST-Processing);
- the evaluation data sets (ML: MovieLens 1M or 10M, LFM: LastFM 1K or 360K, AM: Amazon, SS: Sushi, SY: Synthetic);
- the utility/accuracy metrics (NDCG: Normalized Discounted Cumulative Gain; F1: F1 Score; AUC: Area Under Curve; MRR: Mean Reciprocal Rank; RMSE: Root Mean-Square Error; MAE: Mean Absolute Error);
- the fairness metrics (EPS: ε-fairness; CHI: Chi-Square Test; KS: Kolmogorov-Smirnov Test; GEI: Generalized Entropy Index; TI: Theil Index; DP: Demographic Parity; EP: Equal Opportunity; CES: Category Equity Score; GLV: Group Loss Variance).
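To make two of these fairness metrics concrete, here is a minimal sketch (not the exact implementation used in the paper) that computes DP as the absolute difference in mean per-user scores between two groups, and KS as the Kolmogorov-Smirnov distance between the per-group score distributions; the random scores and 0/1 group labels are only toy inputs.

```python
import numpy as np
from scipy import stats

def demographic_parity(scores: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference of the mean score between the two groups (lower is fairer)."""
    return abs(scores[group == 0].mean() - scores[group == 1].mean())

def ks_statistic(scores: np.ndarray, group: np.ndarray) -> float:
    """Kolmogorov-Smirnov distance between the per-group score distributions."""
    return stats.ks_2samp(scores[group == 0], scores[group == 1]).statistic

# Toy usage: per-user scores (e.g., NDCG or predicted ratings) and 0/1 group labels.
rng = np.random.default_rng(0)
scores, group = rng.random(1000), rng.integers(0, 2, size=1000)
print(demographic_parity(scores, group), ks_statistic(scores, group))
```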
Mitigation Procedures Reproduction
For each reproducible paper, we delved into its core idea and characteristics:
- Burke et al. proposed to generate recommendations for a user from a neighborhood having an equal number of peers from each group, to reduce unfairness;
- Frisch et al. aimed at producing fair recommendations using a co-clustering of users and items that respects statistical parity w.r.t. some sensitive attributes;
- Li et al. investigated consumer unfairness across user groups based on the level of activity in the platform (more or less active);
- Ekstrand et al. re-sampled user interactions (random sampling without replacement), such that the representation of user interactions across groups in the training set was balanced, and re-trained the recommendation models on the balanced training set (a sketch of this re-sampling step follows this list);
- Kamishima et al. delved into the concept of recommendation independence, achieved when a recommendation outcome (predicted ratings) is statistically independent from a specified sensitive attribute;
- Rastegarpanah et al. investigated whether augmenting the training input with additional data can improve the fairness of the resulting predictions;
- Ashokan & Haas adjusted the relevance scores predicted by the original model such that a given fairness metric increased;
- Wu et al. focused on mitigating unfairness in latent factor models. To this end, their procedure took the user and item embeddings from the original recommendation model as input and learned a filter space where any sensitive information was obfuscated and recommendation utility was preserved.
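As an example of the pre-processing family, the snippet below sketches a balancing step in the spirit of Ekstrand et al.'s re-sampling: each group's interactions are down-sampled, without replacement, to the size of the smallest group before re-training. The column name "group" is illustrative and not taken from the original code.

```python
import pandas as pd

def balance_interactions(train: pd.DataFrame, group_col: str = "group",
                         seed: int = 42) -> pd.DataFrame:
    """Down-sample each group's interactions to the size of the smallest group."""
    n_keep = train.groupby(group_col).size().min()
    return train.groupby(group_col).sample(n=n_keep, random_state=seed)

# Toy usage with a tiny interaction log (user, item, group of the user).
toy = pd.DataFrame({"user_id": [1, 1, 3, 3, 3, 4],
                    "item_id": [10, 11, 12, 13, 14, 15],
                    "group":   ["F", "F", "M", "M", "M", "M"]})
balanced = balance_interactions(toy)  # keeps 2 interactions per group here
```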
Mitigation Procedures Evaluation
We used the source code provided by the original authors to run their models and mitigation procedures, and our own artifacts (data and source code) to (a) pre-process the input data sets as per their requirements and (b) compute evaluation metrics based on the relevance scores or recommendations they returned.
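For step (b), here is a self-contained sketch of how per-group utility can be computed from the returned top-N lists: NDCG@k averaged per demographic group, which is what the EQ notion compares. The data structures are illustrative, not the format used in our actual artifacts.

```python
import numpy as np

def ndcg_at_k(ranked_items: list, relevant_items: set, k: int = 10) -> float:
    """Binary-relevance NDCG@k for one user."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_items), k)))
    return dcg / idcg if idcg > 0 else 0.0

def per_group_ndcg(recs: dict, test: dict, groups: dict, k: int = 10) -> dict:
    """Mean NDCG@k per demographic group; recs/test map each user to their items."""
    out = {}
    for g in set(groups.values()):
        users = [u for u in recs if groups[u] == g]
        out[g] = float(np.mean([ndcg_at_k(recs[u], test[u], k) for u in users]))
    return out
```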
Readers can refer to our paper for the detailed results. Here are the main take-home messages:
- Impact on recommendation utility. In general, the mitigation procedures did not substantially affect recommendation utility, regardless of the sensitive attribute, data set, or task. The impact was larger on LFM 1K than on ML 1M.
- Impact on group unfairness. Unfairness depends on the mitigation procedure, the model, and the fairness notion, and the impact of mitigation is often small. Lowering DP does not imply lowering KS, and vice versa. Unfairness was higher on LFM than on ML.
- Relationships between representation and unfairness. Disparate impact does not always harm the minority group: the latter was advantaged for both attributes on LFM 1K (TR), and, under RP, in both data sets for age and on LFM 1K for gender.