LBC: A unified framework for behavior control in deep RL, achieving superhuman performance with a fraction of the data
Population-based methods fix a set of exploratory policies and select between them. LBC instead constructs a continuous, learnable behavior mapping space that blends all policies, then uses a bandit-based meta-controller to learn which behaviors to select at each moment:
- Instead of selecting from a fixed population, LBC parameterizes a convex combination space over all policies, yielding effectively infinite behavioral diversity from a finite set of base agents.
- A lightweight bandit algorithm learns which behavior mapping to activate for each episode, balancing exploration across the behavior space with exploitation of known good behaviors (see the sketch after this list).
- LBC is integrated into distributed off-policy actor-critic methods, making it compatible with existing RL infrastructure without major architectural changes.
- LBC provides a unified view of diverse RL algorithms as special cases of behavior control, opening new directions for understanding exploration in deep RL.
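To make the two ideas above concrete, here is a minimal sketch of how a convex-combination behavior space and a per-episode bandit controller could fit together. This is an illustration only, not the paper's implementation: the names (`BehaviorSpace`, `BanditController`), the Dirichlet discretization of the weight simplex, and the UCB selection rule are all assumptions chosen for brevity.

```python
import numpy as np


class BehaviorSpace:
    """Convex-combination behavior space over a finite set of base policies.

    Each 'behavior mapping' is a weight vector on the probability simplex;
    the induced behavior policy mixes the base policies' action distributions.
    """

    def __init__(self, num_policies: int, num_candidates: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Discretize the simplex by sampling candidate weight vectors (assumption).
        self.candidates = rng.dirichlet(np.ones(num_policies), size=num_candidates)

    def blend(self, weights: np.ndarray, policy_probs: np.ndarray) -> np.ndarray:
        # policy_probs: (num_policies, num_actions) action distributions of the base agents.
        # A convex combination of valid distributions is again a valid distribution.
        return weights @ policy_probs


class BanditController:
    """UCB-style meta-controller that picks which behavior mapping to activate each episode."""

    def __init__(self, num_arms: int, c: float = 1.0):
        self.counts = np.zeros(num_arms)
        self.values = np.zeros(num_arms)
        self.c = c
        self.t = 0

    def select(self) -> int:
        self.t += 1
        untried = np.where(self.counts == 0)[0]
        if untried.size:  # try every behavior mapping at least once
            return int(untried[0])
        ucb = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm: int, episode_return: float) -> None:
        # Incremental mean of episode returns for the chosen behavior mapping.
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]


# Usage: choose a behavior mapping per episode, act with the blended policy,
# and feed the episode return back to the bandit.
space = BehaviorSpace(num_policies=3)
bandit = BanditController(num_arms=len(space.candidates))

for episode in range(10):
    arm = bandit.select()
    weights = space.candidates[arm]
    # Dummy stand-in for the base policies' action distributions at some state.
    policy_probs = np.full((3, 4), 0.25)
    behavior = space.blend(weights, policy_probs)
    action = np.random.default_rng(episode).choice(4, p=behavior)
    episode_return = float(action)  # placeholder return signal
    bandit.update(arm, episode_return)
```

In the actual system the base policies come from the distributed actor-critic learner and the return signal is the real episode score; the sketch only shows how a bandit over a discretized behavior space plugs into that loop.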
LBC broke 24 Atari human world records within just 1 billion training frames:
| Method | Human-Norm. Score | Frames Used | World Records |
|---|---|---|---|
| Agent57 (DeepMind) | ~1,079% | 78 Billion | 0 |
| NGU (DeepMind) | ~1,698% | 10 Billion | 0 |
| R2D2 (DeepMind) | ~4,421% | 10 Billion | ~3 |
| LBC (Ours) | 10,077% (best) | 1 Billion (78× less) | 24 records |
```bibtex
@inproceedings{fan2023learnable,
  title={Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection},
  author={Jiajun Fan and Yuzheng Zhuang and Yuecheng Liu and Jianye HAO and Bin Wang and Jiangcheng Zhu and Hao Wang and Shu-Tao Xia},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=FeWvD0L_a4}
}
```