AMBAR: A dataset for Assessing Multiple Beyond-Accuracy Recommenders

Recommender systems are a central tool for personalization in today’s digital age. They help us discover new music, books, or movies by predicting what we might like based on past interactions. But as recommender systems evolve, researchers and practitioners recognize that traditional metrics like accuracy aren’t enough on their own. Factors like fairness, diversity, and user satisfaction are essential to creating equitable and effective systems. In a dataset paper presented at RecSys ’24, written with Elizabeth Gómez, David Contreras, and Maria Salamó, we introduced AMBAR, a new dataset in the music domain designed to evaluate recommender systems from these “beyond-accuracy” perspectives.

Many existing datasets focus on user-item interactions but lack the depth to explore fairness or other nuanced attributes in recommendations. AMBAR addresses these gaps by including demographic attributes like user gender, geographic provenance, and artist information, enabling fairness analysis. Moreover, it allows fine-grained analysis of users, items, and their interactions, paving the way for multi-level and multi-objective evaluations.

Key features of AMBAR

The dataset consists of over 3.3 million ratings, involving:

31,013 users, categorized by gender and geographic provenance.
443,921 tracks, spanning 282 music styles, grouped into 14 broader categories.
30,667 artists, with attributes like gender and origin.

Data is distributed across four CSV files—users, tracks, artists, and ratings—ensuring compatibility with existing machine learning tools.
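
To give a feel for how the separate files fit together, here is a minimal sketch that joins ratings to user attributes and counts ratings per demographic group. It uses tiny in-memory stand-ins instead of the real files, and the column names (user_id, gender, country, track_id, rating) are assumptions made for illustration, not AMBAR's actual schema.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-ins for two of the four CSV files.
# All column names and values here are hypothetical, not AMBAR's schema.
USERS_CSV = """user_id,gender,country
u1,F,ES
u2,M,CL
u3,F,CL
"""

RATINGS_CSV = """user_id,track_id,rating
u1,t1,5
u1,t2,3
u2,t1,4
u3,t3,2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def ratings_per_group(users, ratings, attribute):
    """Count ratings per value of a user attribute (e.g. gender)."""
    group_of = {u["user_id"]: u[attribute] for u in users}
    return Counter(group_of[r["user_id"]] for r in ratings)


users = read_rows(USERS_CSV)
ratings = read_rows(RATINGS_CSV)
by_gender = ratings_per_group(users, ratings, "gender")
by_country = ratings_per_group(users, ratings, "country")
print(dict(by_gender))
print(dict(by_country))
```

With the real files, the same join pattern would apply after swapping the in-memory strings for file handles and the hypothetical column names for the ones in the dataset's documentation.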

Beyond-accuracy applications

AMBAR supports several recommender system applications:

Fairness in recommendations. It facilitates the study of fairness from the perspectives of consumers (users), providers (artists), and subjects (both users and items).
Multi-objective recommendations. Researchers can explore trade-offs between objectives like accuracy, fairness, and calibration.
Calibrated recommendations. AMBAR supports analysis of how well recommendations align with user preferences.
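
As an illustration of the calibration idea, the sketch below compares the distribution of music categories in a user's history with the distribution in a recommended list, using a smoothed KL divergence. This is one common formulation of miscalibration, not necessarily the measure used in the paper; the category labels, the `alpha` smoothing value, and the function names are assumptions.

```python
import math
from collections import Counter


def distribution(categories):
    """Normalize a list of category labels into a probability distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def miscalibration(history, recommended, alpha=0.01):
    """KL divergence between the history distribution p and the
    recommendation distribution q. q is mixed with a small share of p
    (weight alpha, an assumed value) so categories missing from the
    recommendations do not cause a division by zero."""
    p = distribution(history)
    q = distribution(recommended)
    kl = 0.0
    for category, p_c in p.items():
        q_c = (1 - alpha) * q.get(category, 0.0) + alpha * p_c
        kl += p_c * math.log(p_c / q_c)
    return kl


# Hypothetical user: 60% rock, 30% jazz, 10% folk in the listening history.
history = ["rock"] * 6 + ["jazz"] * 3 + ["folk"]
matched = ["rock"] * 6 + ["jazz"] * 3 + ["folk"]  # same proportions
skewed = ["rock"] * 10                            # only the dominant category

print(miscalibration(history, matched))
print(miscalibration(history, skewed))
```

A list that mirrors the user's proportions scores close to zero, while a list that collapses onto the dominant category scores higher. A multi-objective recommender could trade such a score off against accuracy or fairness.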

Benchmark results

We benchmarked AMBAR using state-of-the-art algorithms, including Matrix Factorization (MF), Weighted MF, SVD, and Variational Autoencoder Collaborative Filtering (VAECF). Results showed that AMBAR supports:

Fairness evaluation. Algorithms like CPFair and PFair demonstrated the dataset’s utility in addressing fairness for both binary (e.g., gender) and graded (e.g., geographic provenance) attributes.
Multi-objective trade-offs. The dataset enables cross-analysis of user and item properties, such as balancing exposure for artists while maintaining recommendation quality.
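
To make the notion of artist exposure concrete, here is a minimal sketch that measures how position-discounted exposure in top-k lists is split across artist groups. The logarithmic discount is one common choice and is an assumption here; it is not taken from CPFair, PFair, or the paper's evaluation protocol, and the track IDs and group labels are made up.

```python
import math
from collections import defaultdict


def exposure_share(rec_lists, group_of_item):
    """Share of position-discounted exposure received by each provider group.

    Each recommended item contributes 1 / log2(rank + 1), so items near the
    top of a list count more than items further down.
    """
    exposure = defaultdict(float)
    for items in rec_lists:
        for rank, item in enumerate(items, start=1):
            exposure[group_of_item[item]] += 1.0 / math.log2(rank + 1)
    total = sum(exposure.values())
    return {group: value / total for group, value in exposure.items()}


# Hypothetical mapping from track to the gender of its artist.
group_of_item = {"t1": "F", "t2": "M", "t3": "M", "t4": "F"}

# Hypothetical top-3 lists for two users.
rec_lists = [["t2", "t3", "t1"], ["t2", "t1", "t4"]]

shares = exposure_share(rec_lists, group_of_item)
print(shares)
```

Both groups appear three times across the two lists, but the "M" group holds the top positions and therefore receives the larger share. Comparing such shares before and after a fairness-aware re-ranking, alongside an accuracy metric, is one way to examine the trade-off described above.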