ICLR 2023 ⭐ Oral · Ranked 5/4176

Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection

LBC: A unified framework for behavior control in deep RL — superhuman performance with a fraction of the data

Jiajun Fan¹, Yuzheng Zhuang¹, Yuecheng Liu¹, Jianye Hao¹, Bin Wang¹, Jiangcheng Zhu¹, Hao Wang², Shu-Tao Xia¹
¹Tsinghua University / Huawei Noah's Ark Lab    ²Rutgers University
24 — Atari Human World Records Broken
10,077% — Mean Human-Normalized Score
500× — More Sample-Efficient than Agent57
1B — Training Frames (vs 78B for Agent57)
TL;DR — Population-based RL methods improve exploration by running diverse policies, but are fundamentally limited by a fixed, predefined population. LBC breaks this limitation by learning a hybrid behavior mapping over all policies, enabling a dramatically enlarged behavior space — and achieves superhuman performance with 500× less data.

🧩 The Core Idea

Population-based methods fix a set of exploratory policies and select between them. LBC instead constructs a continuous, learnable behavior mapping space that blends all policies, then uses a bandit-based meta-controller to learn which behaviors to select at each moment:

📐 Hybrid Behavior Mapping

Instead of selecting from a fixed population, LBC parameterizes a convex combination space over all policies — infinite diversity from a finite set of base agents.
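The convex-combination idea can be sketched in a few lines. This is an illustrative sketch only; the function name and the exact mapping family are assumptions, and the paper's parameterization may be richer:

```python
import numpy as np

def hybrid_behavior(policy_probs, weights):
    """Blend a finite set of base policies into one behavior.

    policy_probs: shape (n_policies, n_actions), each row a valid
    action distribution; weights: convex coefficients (>= 0, sum to 1).
    Hypothetical sketch -- not the paper's exact mapping family.
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    # Convex combination of the rows is itself a valid distribution:
    # entries stay non-negative and sum to 1.
    return weights @ policy_probs

# Two base policies over three actions: one exploitative, one exploratory.
base = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
print(hybrid_behavior(base, [0.5, 0.5]))  # [0.4 0.2 0.4]
```

Because the weight vector lives in a continuous simplex, a finite set of base policies induces infinitely many distinct behaviors.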

🎰 Bandit Meta-Controller

A lightweight bandit algorithm learns which behavior mapping to activate for each episode, balancing exploration across the behavior space with exploitation of known good behaviors.
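A UCB1-style bandit over a discretized set of candidate behavior mappings gives the flavor of this loop. This is an illustrative stand-in under assumed names; the paper's actual bandit algorithm and reward signal may differ:

```python
import math

class BanditMetaController:
    """UCB1-style selector over candidate behavior mappings.

    Illustrative sketch: selects an arm per episode, then updates that
    arm's running mean with the observed episode return.
    """

    def __init__(self, n_arms, c=1.0):
        self.counts = [0] * n_arms   # pulls per arm
        self.means = [0.0] * n_arms  # running mean episode return
        self.t = 0                   # total selections so far
        self.c = c                   # exploration-bonus scale

    def select(self):
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm  # try every candidate once first
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a]
                + self.c * math.sqrt(math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, episode_return):
        self.counts[arm] += 1
        self.means[arm] += (episode_return - self.means[arm]) / self.counts[arm]

# Toy usage: three candidate behaviors with fixed returns 0.1, 0.9, 0.5.
ctrl = BanditMetaController(n_arms=3, c=0.5)
for _ in range(200):
    arm = ctrl.select()
    ctrl.update(arm, [0.1, 0.9, 0.5][arm])
print(ctrl.counts)  # the best behavior (arm 1) dominates
```

The exploration bonus shrinks as an arm is sampled more, so known-good behaviors are exploited while under-tried regions of the behavior space still get revisited.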

🔗 Off-Policy Integration

LBC is integrated into distributed off-policy actor-critic methods — compatible with existing RL infrastructure without major architectural changes.
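In a distributed setup, each actor can ask the meta-controller for a behavior at episode start, roll out the blended policy, and ship the trajectory to the learner's replay buffer. The sketch below is hypothetical glue code: the episode loop, reward function, and data layout are toy placeholders, not the paper's infrastructure:

```python
import random

random.seed(0)

def actor_episode(weights, base_policies, reward_fn, horizon=20):
    """Roll out one episode under a blended behavior.

    Returns the trajectory (what an off-policy learner would consume
    from a replay buffer) and the episode return (what the bandit
    meta-controller would consume as feedback). Placeholder sketch.
    """
    traj, ret = [], 0.0
    n_actions = len(base_policies[0])
    for _ in range(horizon):
        # Convex combination of the base policies' action distributions.
        probs = [sum(w * p[a] for w, p in zip(weights, base_policies))
                 for a in range(n_actions)]
        action = random.choices(range(n_actions), weights=probs)[0]
        r = reward_fn(action)
        traj.append((action, r))  # off-policy data for the learner
        ret += r
    return traj, ret

# Toy setup: action 1 pays off; the two base policies disagree about it.
policies = [[0.9, 0.1], [0.1, 0.9]]
traj, ret = actor_episode([0.2, 0.8], policies,
                          reward_fn=lambda a: float(a == 1))
```

Since the learner already trains from off-policy data, swapping the behavior each episode requires no architectural change, only the extra weight vector chosen by the meta-controller.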

📊 Unified Perspective

Provides a unified view of diverse RL algorithms as special cases of behavior control, opening new directions for understanding exploration in deep RL.

🏆 24 World Records Broken

LBC broke 24 Atari human world records within just 1 billion training frames:

Alien 👾
Amidar
Assault 🔫
Asterix ⭐
Atlantis 🌊
Battle Zone
Beam Rider
Centipede 🐛
Gopher
Kangaroo 🦘
Krull
Ms. Pac-Man
Phoenix
Q*bert
Road Runner 🏎
Seaquest 🐠
Tutankham
Up'n Down
Video Pinball
Wizard of Wor
Yars Revenge
Zaxxon
+ 2 more
| Method | Human-Norm. Score | Frames Used | World Records |
|---|---|---|---|
| Agent57 (DeepMind) | ~1,079% | 78 Billion | 0 |
| NGU (DeepMind) | ~1,698% | 10 Billion | 0 |
| R2D2 (DeepMind) | ~4,421% | 10 Billion | ~3 |
| **LBC (Ours)** 🏆 | **10,077%** (Best) | **1 Billion** (78× less) | **24 Records!** |

📊 Quantitative Comparison

LBC (Ours) — 📈 10,077% mean human-normalized score · 🎯 24 world records · 1B training frames
Agent57 (Prior SOTA) — 📈 4,359% mean human-normalized score · 🎯 0 world records · 500B+ training frames (500× more)
R2D2 — 📈 4,038% mean human-normalized score · 78B training frames

📅 Publication Journey

Sep 2022
Submitted to ICLR 2023 (Submission #219)
Submitted to The Eleventh International Conference on Learning Representations.
Nov 2022
Reviews received
Paper received strong interest from reviewers and area chairs.
Jan 2023
✅ Accepted — Notable Top 5% · Oral · Ranked 5/4,176
Accepted as an oral presentation, ranked 5th out of 4,176 submissions (see OpenReview).
May 2023
Presented at ICLR 2023 · Kigali, Rwanda

📖 BibTeX

@inproceedings{fan2023learnable,
  title={Learnable Behavior Control: Breaking Atari Human World Records
         via Sample-Efficient Behavior Selection},
  author={Jiajun Fan and Yuzheng Zhuang and Yuecheng Liu and Jianye Hao
          and Bin Wang and Jiangcheng Zhu and Hao Wang and Shu-Tao Xia},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=FeWvD0L_a4}
}