Towards Efficient Conversational Recommendations: Expected Value of Information Meets Bandit Learning

Published: 29 Jan 2025 · Last Modified: 29 Jan 2025 · WWW 2025 Poster · CC BY 4.0
Track: User modeling, personalization and recommendation
Keywords: Conversational Recommendation, Preference Elicitation, Multi-Armed Bandit, Online Learning
TL;DR: We integrate the Expected Value of Information (EVOI) into conversational bandits, proposing two new algorithms that improve query selection and outperform existing methods in both Bayesian and frequentist settings.
Abstract: In conversational recommender systems, interactively presenting queries and leveraging user feedback are crucial for efficiently estimating user preferences and improving recommendation quality. Selecting optimal queries in these systems is a significant challenge that has been extensively studied as a sequential decision problem. The expected value of information (EVOI), which computes the expected reward improvement, provides a principled criterion for query selection. However, it is computationally expensive and lacks theoretical performance guarantees. Conversely, conversational bandits offer provable regret upper bounds, but their query selection strategies yield only marginal regret improvements over non-conversational approaches. To address these limitations, we integrate EVOI into the conversational bandit framework by proposing a new conversational mechanism featuring two key techniques: (1) gradient-based EVOI, which replaces the complex Bayesian updates in conventional EVOI with efficient stochastic gradient descent, significantly reducing computational complexity and facilitating theoretical analysis; and (2) smoothed key term contexts, which enhance exploration by adding random perturbations to uncover more specific user preferences. Our approach applies to both Bayesian (Thompson Sampling) and frequentist (UCB) variants of conversational bandits. We introduce two new algorithms, ConTS-EVOI and ConUCB-EVOI, and rigorously prove that they achieve substantially tighter regret bounds, with both algorithms improving the term that depends on the time horizon $T$ by a factor of $\sqrt{d}$, where $d$ is the dimension of the feature space. Extensive evaluations on synthetic and real-world datasets validate the effectiveness of our methods.
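To make the two techniques in the abstract concrete, here is a minimal Python sketch of query selection in a linear-reward conversational bandit. It is an illustration under stated assumptions, not the paper's ConTS-EVOI/ConUCB-EVOI algorithms: the function names, the logistic answer model, and the squared-error gradient step are all hypothetical choices standing in for the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_key_term(x_key, sigma=0.1):
    """Smoothed key term context: perturb the key-term feature vector
    with Gaussian noise to encourage exploration (sigma is an assumed
    perturbation scale, not a value from the paper)."""
    z = x_key + sigma * rng.standard_normal(x_key.shape)
    return z / np.linalg.norm(z)

def gradient_evoi(theta_hat, x_key, arms, lr=0.5):
    """Gradient-based EVOI sketch for a linear reward model.

    Instead of a full Bayesian posterior update per hypothetical
    answer, take one SGD step of the point estimate toward each
    simulated answer, and score the query by the expected gain in
    the best achievable estimated reward."""
    best_now = np.max(arms @ theta_hat)
    # Assumed answer model: user says "yes" with logistic probability.
    p_yes = 1.0 / (1.0 + np.exp(-(x_key @ theta_hat)))
    evoi = 0.0
    for answer, prob in ((1.0, p_yes), (0.0, 1.0 - p_yes)):
        # One stochastic-gradient step on the squared error of the
        # simulated feedback (replaces the costly Bayesian update).
        grad = (x_key @ theta_hat - answer) * x_key
        theta_new = theta_hat - lr * grad
        evoi += prob * (np.max(arms @ theta_new) - best_now)
    return evoi

# Pick the key term whose smoothed context maximizes the EVOI score.
d, n_arms, n_keys = 8, 20, 5
theta_hat = rng.standard_normal(d)          # current preference estimate
arms = rng.standard_normal((n_arms, d))     # item feature vectors
key_terms = rng.standard_normal((n_keys, d))

scores = [gradient_evoi(theta_hat, smoothed_key_term(x), arms)
          for x in key_terms]
print("query key term:", int(np.argmax(scores)))
```

In this sketch, the cost of scoring one candidate query is two gradient steps and two maximizations over the arm set, which is what makes the gradient-based criterion cheap relative to maintaining and updating a full posterior for every hypothetical answer.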
Submission Number: 2779