ICLR 2023 ⭐ Oral · Ranked 5/4176

Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection

LBC: A unified framework for behavior control in deep RL — superhuman performance with a fraction of the data

Jiajun Fan¹, Yuzheng Zhuang¹, Yuecheng Liu¹, Jianye Hao¹, Bin Wang¹, Jiangcheng Zhu¹, Hao Wang², Shu-Tao Xia¹
¹Tsinghua University / Huawei Noah's Ark Lab    ²Rutgers University
24 — Atari Human World Records Broken
10,077% — Mean Human-Normalized Score
500× — More Sample-Efficient than Agent57
1B — Training Frames (vs 78B for Agent57)
TL;DR — Population-based RL methods improve exploration by running diverse policies, but are fundamentally limited by a fixed, predefined population. LBC breaks this limitation by learning a hybrid behavior mapping over all policies, enabling a dramatically enlarged behavior space — and achieves superhuman performance with 500× less data.

🧩 The Core Idea

Population-based methods fix a set of exploratory policies and select between them. LBC instead constructs a continuous, learnable behavior mapping space that blends all policies, then uses a bandit-based meta-controller to learn which behaviors to select at each moment:

📐 Hybrid Behavior Mapping

Instead of selecting from a fixed population, LBC parameterizes a convex combination space over all policies — infinite diversity from a finite set of base agents.
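The convex-combination idea can be sketched in a few lines. This is an illustrative sketch only; the function name and the exact mapping family are assumptions, and the paper's parameterization may be richer:

```python
import numpy as np

def hybrid_behavior(policy_probs, weights):
    """Blend a finite set of base policies into one behavior.

    policy_probs: shape (n_policies, n_actions), each row a valid
    action distribution; weights: convex coefficients (>= 0, sum to 1).
    Hypothetical sketch -- not the paper's exact mapping family.
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    # Convex combination of the rows is itself a valid distribution:
    # entries stay non-negative and sum to 1.
    return weights @ policy_probs

# Two base policies over three actions: one exploitative, one exploratory.
base = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
print(hybrid_behavior(base, [0.5, 0.5]))  # [0.4 0.2 0.4]
```

Because the weight vector lives in a continuous simplex, a finite set of base policies induces infinitely many distinct behaviors.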

🎰 Bandit Meta-Controller

A lightweight bandit algorithm learns which behavior mapping to activate for each episode, balancing exploration across the behavior space with exploitation of known good behaviors.
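A UCB1-style bandit over a discretized set of candidate behavior mappings gives the flavor of this loop. This is an illustrative stand-in under assumed names; the paper's actual bandit algorithm and reward signal may differ:

```python
import math

class BanditMetaController:
    """UCB1-style selector over candidate behavior mappings.

    Illustrative sketch: selects an arm per episode, then updates that
    arm's running mean with the observed episode return.
    """

    def __init__(self, n_arms, c=1.0):
        self.counts = [0] * n_arms   # pulls per arm
        self.means = [0.0] * n_arms  # running mean episode return
        self.t = 0                   # total selections so far
        self.c = c                   # exploration-bonus scale

    def select(self):
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm  # try every candidate once first
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a]
                + self.c * math.sqrt(math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, episode_return):
        self.counts[arm] += 1
        self.means[arm] += (episode_return - self.means[arm]) / self.counts[arm]

# Toy usage: three candidate behaviors with fixed returns 0.1, 0.9, 0.5.
ctrl = BanditMetaController(n_arms=3, c=0.5)
for _ in range(200):
    arm = ctrl.select()
    ctrl.update(arm, [0.1, 0.9, 0.5][arm])
print(ctrl.counts)  # the best behavior (arm 1) dominates
```

The exploration bonus shrinks as an arm is sampled more, so known-good behaviors are exploited while under-tried regions of the behavior space still get revisited.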

🔗 Off-Policy Integration

LBC is integrated into distributed off-policy actor-critic methods — compatible with existing RL infrastructure without major architectural changes.
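In a distributed setup, each actor can ask the meta-controller for a behavior at episode start, roll out the blended policy, and ship the trajectory to the learner's replay buffer. The sketch below is hypothetical glue code: the episode loop, reward function, and data layout are toy placeholders, not the paper's infrastructure:

```python
import random

random.seed(0)

def actor_episode(weights, base_policies, reward_fn, horizon=20):
    """Roll out one episode under a blended behavior.

    Returns the trajectory (what an off-policy learner would consume
    from a replay buffer) and the episode return (what the bandit
    meta-controller would consume as feedback). Placeholder sketch.
    """
    traj, ret = [], 0.0
    n_actions = len(base_policies[0])
    for _ in range(horizon):
        # Convex combination of the base policies' action distributions.
        probs = [sum(w * p[a] for w, p in zip(weights, base_policies))
                 for a in range(n_actions)]
        action = random.choices(range(n_actions), weights=probs)[0]
        r = reward_fn(action)
        traj.append((action, r))  # off-policy data for the learner
        ret += r
    return traj, ret

# Toy setup: action 1 pays off; the two base policies disagree about it.
policies = [[0.9, 0.1], [0.1, 0.9]]
traj, ret = actor_episode([0.2, 0.8], policies,
                          reward_fn=lambda a: float(a == 1))
```

Since the learner already trains from off-policy data, swapping the behavior each episode requires no architectural change, only the extra weight vector chosen by the meta-controller.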

📊 Unified Perspective

Provides a unified view of diverse RL algorithms as special cases of behavior control, opening new directions for understanding exploration in deep RL.

🏆 24 World Records Broken

LBC broke 24 Atari human world records within just 1 billion training frames:

Alien 👾
Amidar
Assault 🔫
Asterix ⭐
Atlantis 🌊
Battle Zone
Beam Rider
Centipede 🐛
Gopher
Kangaroo 🦘
Krull
Ms. Pac-Man
Phoenix
Q*bert
Road Runner 🏎
Seaquest 🐠
Tutankham
Up'n Down
Video Pinball
Wizard of Wor
Yars Revenge
Zaxxon
+ 2 more
| Method | Human-Norm. Score | Frames Used | World Records |
|---|---|---|---|
| Agent57 (DeepMind) | ~1,079% | 78 Billion | 0 |
| NGU (DeepMind) | ~1,698% | 10 Billion | 0 |
| R2D2 (DeepMind) | ~4,421% | 10 Billion | ~3 |
| **LBC (Ours)** 🏆 | **10,077%** (Best) | **1 Billion** (78× less) | **24 Records!** |

📊 Quantitative Comparison

LBC (Ours) — 📈 10,077% mean human-normalized score · 🎯 24 world records · 1B training frames
Agent57 (Prior SOTA) — 📈 4,359% mean human-normalized score · 🎯 0 world records · 500B+ training frames (500× more)
R2D2 — 📈 4,038% mean human-normalized score · 78B training frames

📅 Publication Journey

Sep 2022
Submitted to ICLR 2023 (Submission #219)
Submitted to The Eleventh International Conference on Learning Representations.
Nov 2022
Reviews received
Paper received strong interest from reviewers and area chairs.
Jan 2023
✅ Accepted — Notable Top 5% · Oral · Ranked 5/4,176
Accepted as an oral presentation, ranked 5th out of 4,176 submissions (see OpenReview).
May 2023
Presented at ICLR 2023 · Kigali, Rwanda

📖 BibTeX

@inproceedings{fan2023learnable,
  title={Learnable Behavior Control: Breaking Atari Human World Records
         via Sample-Efficient Behavior Selection},
  author={Jiajun Fan and Yuzheng Zhuang and Yuecheng Liu and Jianye Hao
          and Bin Wang and Jiangcheng Zhu and Hao Wang and Shu-Tao Xia},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=FeWvD0L_a4}
}