TL;DR: Sample efficiency and final performance are two long-standing challenges in RL. GDI tackles both at once by treating the training data distribution as the key lever, unifying a range of RL algorithms under one framework and reaching a 9620% mean human-normalized score on Atari with only 200M frames (500× fewer than Agent57).
💡 Core Insight
GDI decouples the RL problem into two sub-problems and casts both as optimization of the training data distribution:
Data richness → control the capacity and diversity of the behavior policy space
Exploration-exploitation trade-off → fine-grained, adaptive control of the sampling distribution over behavior policies
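The second lever can be made concrete with a minimal sketch: a bandit-style meta-controller that adaptively shifts sampling toward the behavior policies yielding the highest episodic returns. This is an illustrative toy (the class name, UCB rule, and simulated returns are assumptions, not the paper's implementation), in the spirit of the adaptive meta-controllers used by Agent57-style agents.

```python
import math
import random

class UCBMetaController:
    """Adaptively controls which behavior policy collects the next
    episode of data -- a toy sketch of GDI's 'sampling distribution'
    lever. Names and API are illustrative, not from the paper."""

    def __init__(self, n_policies, c=1.0):
        self.n = n_policies
        self.c = c                       # exploration bonus weight
        self.counts = [0] * n_policies   # times each policy was sampled
        self.means = [0.0] * n_policies  # running mean episodic return
        self.t = 0

    def select(self):
        """Pick the index of the behavior policy to run next (UCB1 rule)."""
        self.t += 1
        # Try every policy once before applying the UCB rule.
        for i in range(self.n):
            if self.counts[i] == 0:
                return i
        ucb = [self.means[i] + self.c * math.sqrt(math.log(self.t) / self.counts[i])
               for i in range(self.n)]
        return max(range(self.n), key=lambda i: ucb[i])

    def update(self, i, episodic_return):
        """Fold the observed return of policy i into its running mean."""
        self.counts[i] += 1
        self.means[i] += (episodic_return - self.means[i]) / self.counts[i]

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical: three behavior policies with different exploration
    # rates; policy 2 happens to yield the highest expected return.
    true_returns = [0.2, 0.5, 0.8]
    mc = UCBMetaController(n_policies=3)
    for _ in range(500):
        i = mc.select()
        mc.update(i, true_returns[i] + random.gauss(0, 0.1))
    print(mc.counts)  # sampling mass concentrates on the best policy
```

In a full agent, `true_returns` would be replaced by actual episodic returns from each behavior policy, so the sampling distribution tracks whichever exploration setting is currently paying off.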
GDI builds this into Generalized Policy Iteration (GPI), providing operator-based formulations of well-known RL methods from DQN to Agent57, all of which emerge as special cases of GDI.
The key formula: a Generalized Bellman Operator equipped with a data distribution operator D(·) that is jointly optimized with the value function, turning the data collection strategy itself into a learnable component of the algorithm.
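The alternation can be written schematically as a two-step iteration (the notation below is a simplified sketch, not the paper's exact operator definitions):

```latex
% Schematic GDI iteration (illustrative symbols, hedged):
% D-step: optimize the data distribution given the current policy/value;
% I-step: run a generalized policy-iteration update on data drawn from it.
\begin{align*}
  \rho_{k+1} &= \mathcal{D}\bigl(\rho_k \mid \pi_k, V_k\bigr)
      && \text{(data distribution optimization)} \\
  (\pi_{k+1}, V_{k+1}) &= \mathrm{GPI}\bigl(\pi_k, V_k \mid \rho_{k+1}\bigr)
      && \text{(generalized policy iteration)}
\end{align*}
```

Fixing $\mathcal{D}$ to a static choice (e.g., a single ε-greedy behavior policy) recovers classic methods, which is why DQN through Agent57 appear as special cases.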
@InProceedings{pmlr-v162-fan22c,
title = {Generalized Data Distribution Iteration},
author = {Fan, Jiajun and Xiao, Changnan},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
pages = {6103--6184},
year = {2022},
editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
volume = {162},
series = {Proceedings of Machine Learning Research},
month = {17--23 Jul},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v162/fan22c/fan22c.pdf},
url = {https://proceedings.mlr.press/v162/fan22c.html}
}