ICLR 2023 ⭐ Oral · Ranked 5/4,176

Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection

LBC: A unified framework for behavior control in deep RL, achieving superhuman performance with a fraction of the data

Jiajun Fan¹, Yuzheng Zhuang¹, Yuecheng Liu¹, Jianye Hao¹, Bin Wang¹, Jiangcheng Zhu¹, Hao Wang², Shu-Tao Xia¹
¹ Tsinghua University / Huawei Noah's Ark Lab    ² Rutgers University
World Records Broken
24
Atari Human World Records
10,077%
Mean Human-Normalized Score
78×
Fewer Training Frames than Agent57
1B
Training Frames (vs 78B for Agent57)
TL;DR: Population-based RL methods improve exploration by running diverse policies, but are fundamentally limited by a fixed, predefined population. LBC removes this limitation by learning a hybrid behavior mapping over all policies, which dramatically enlarges the behavior space, and achieves superhuman performance with 78× fewer training frames than Agent57 (1B vs 78B).

🧩 The Core Idea

Population-based methods fix a set of exploratory policies and select between them. LBC instead constructs a continuous, learnable behavior mapping space that blends all policies, then uses a bandit-based meta-controller to learn which behaviors to select at each moment:

📐 Hybrid Behavior Mapping

Instead of selecting from a fixed population, LBC parameterizes a space of convex combinations over all policies, so a finite set of base agents yields a continuum of distinct behaviors.
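As a rough illustration of the convex-combination idea, the sketch below blends the action distributions of two base policies using weights on the probability simplex. It assumes discrete action distributions; `mix_policies`, the two toy policies, and the weights are hypothetical names and values chosen for illustration, not the paper's implementation.

```python
from typing import List, Sequence


def mix_policies(policy_probs: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Blend several action distributions into one behavior distribution.

    Hypothetical helper: `weights` must be non-negative and sum to 1,
    i.e. the result is a convex combination of the base policies.
    """
    if len(policy_probs) != len(weights):
        raise ValueError("need exactly one weight per policy")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie on the probability simplex")
    n_actions = len(policy_probs[0])
    return [
        sum(w * probs[a] for w, probs in zip(weights, policy_probs))
        for a in range(n_actions)
    ]


# Two toy base policies over three actions (illustrative values only).
greedy = [0.9, 0.05, 0.05]
exploratory = [1 / 3, 1 / 3, 1 / 3]

# An interior weight vector gives a behavior that neither base policy has.
blended = mix_policies([greedy, exploratory], [0.25, 0.75])

# A one-hot weight vector falls back to picking a single population member.
one_hot = mix_policies([greedy, exploratory], [1.0, 0.0])
```

A one-hot weight vector recovers plain selection from a fixed population, while every other point on the simplex adds a behavior that the population alone could not express.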

🎰 Bandit Meta-Controller

A lightweight bandit algorithm learns which behavior mapping to activate for each episode, balancing exploration across the behavior space with exploitation of known good behaviors.
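The per-episode selection loop can be sketched with a standard UCB1 bandit. The page only says "a lightweight bandit algorithm", so UCB1 is a stand-in chosen here for illustration; the candidate weight vectors, the toy return model, and all names below are assumptions rather than details from the paper.

```python
import math
import random


class UCB1:
    """Textbook UCB1 bandit; used here as a stand-in meta-controller."""

    def __init__(self, n_arms: int, c: float = 2.0) -> None:
        self.counts = [0] * n_arms    # times each arm was played
        self.means = [0.0] * n_arms   # running mean return per arm
        self.c = c                    # exploration coefficient

    def select(self) -> int:
        # Play every arm once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            m + math.sqrt(self.c * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical arms: each one is a mixing-weight vector over two base policies.
candidates = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Toy environment: made-up mean episode return for each candidate behavior.
true_mean_return = [0.2, 0.8, 0.4]

rng = random.Random(0)
bandit = UCB1(len(candidates))
for episode in range(2000):
    arm = bandit.select()                                   # pick a behavior
    episode_return = true_mean_return[arm] + rng.gauss(0.0, 0.1)
    bandit.update(arm, episode_return)                      # learn from it

# The most frequently chosen arm is the controller's preferred behavior.
best = max(range(len(candidates)), key=bandit.counts.__getitem__)
```

In this toy setting the controller keeps sampling every candidate occasionally but concentrates most episodes on the behavior with the highest observed return, which is the exploration/exploitation balance described above.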

🔗 Off-Policy Integration

LBC is integrated into distributed off-policy actor-critic methods and is compatible with existing RL infrastructure without major architectural changes.

📊 Unified Perspective

LBC provides a unified view of diverse RL algorithms as special cases of behavior control, opening new directions for understanding exploration in deep RL.

🏆 24 World Records Broken

LBC broke 24 Atari human world records within just 1 billion training frames:

Alien 👾
Amidar
Assault 🔫
Asterix ⭐
Atlantis 🌊
Battle Zone
Beam Rider
Centipede 🐛
Gopher
Kangaroo 🦘
Krull
Ms. Pac-Man
Phoenix
Q*bert
Road Runner 🏎
Seaquest 🐠
Tutankham
Up'n Down
Video Pinball
Wizard of Wor
Yars' Revenge
Zaxxon
+ 2 more
Method | Human-Norm. Score | Frames Used | World Records
Agent57 (DeepMind) | ~1,079% | 78 Billion | 0
NGU (DeepMind) | ~1,698% | 10 Billion | 0
R2D2 (DeepMind) | ~4,421% | 10 Billion | ~3
LBC (Ours) 🏆 | 10,077% (best) | 1 Billion (78× fewer than Agent57) | 24
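For readers unfamiliar with the metric, the sketch below shows how a human-normalized score is conventionally computed in the Atari literature and how the frame ratio in the comparison follows from the frame counts. The page itself does not define the metric, so the formula is the usual convention rather than something stated here, and the per-game scores in the example are made up.

```python
def human_normalized_score(agent: float, random_score: float, human: float) -> float:
    """Conventional Atari normalization: 0% = random play, 100% = human baseline."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Made-up single-game scores, for illustration only.
hns = human_normalized_score(agent=5500.0, random_score=500.0, human=2500.0)

# Frame counts taken from the comparison above.
agent57_frames = 78e9
lbc_frames = 1e9
frame_ratio = agent57_frames / lbc_frames  # how many times fewer frames LBC uses
```

The mean human-normalized score reported for a method is this per-game quantity averaged over all games in the benchmark.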

📅 Publication Journey

Sep 2022
Submitted to ICLR 2023 (Submission #219)
Submitted to The Eleventh International Conference on Learning Representations.
Nov 2022
Reviews received
Paper received strong interest from reviewers and area chairs.
Jan 2023
✅ Accepted: Notable Top 5% · Oral · Ranked 5/4,176
Accepted as an oral presentation, ranked 5th out of 4,176 submissions (OpenReview).
May 2023
Presented at ICLR 2023 · Kigali, Rwanda

📖 BibTeX

@inproceedings{fan2023learnable,
  title={Learnable Behavior Control: Breaking Atari Human World Records
         via Sample-Efficient Behavior Selection},
  author={Jiajun Fan and Yuzheng Zhuang and Yuecheng Liu and Jianye HAO
          and Bin Wang and Jiangcheng Zhu and Hao Wang and Shu-Tao Xia},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=FeWvD0L_a4}
}