2021
AlphaStar
Discussion on AlphaStar, the first agent that achieves Grandmaster level in the full game of StarCraft II
OpenAI Five
Discussion on OpenAI Five, an agent that achieves super-human performance in Dota 2
Go-Explore
Discussion on Go-Explore, a family of algorithms designed for hard-exploration games
GAIL — Generative Adversarial Imitation Learning
A concise theoretical analysis of GAIL
FTW — For The Win
Discussion on an agent, namely For The Win (FTW), that achieves human-level performance in a popular 3D team-based multiplayer first-person video game.
MuZero
Discussion on MuZero, a successor of AlphaZero that not only masters board games such as chess and Go but also achieves state-of-the-art performance on Atari games
AlphaZero
Discussion on AlphaZero, an agent that achieves super-human performance in chess, shogi and Go
MAPPO
Discussion on Multi-Agent PPO, which introduces a few tricks for applying PPO to multi-agent environments
QMIX and Some Tricks
Discussion on QMIX and some tricks for improving it.
NCC — Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning
Discussion on NCC, a cooperative MARL method that takes into account neighborhood cognitive consistency.
RODE — Learning Roles to Decompose Multi-Agent Tasks
Discussion on RODE, a hierarchical MARL method that decomposes the action space into role action subspaces according to their effects on the environment.
PWIL — Primal Wasserstein Imitation Learning
Discussion on Primal Wasserstein Imitation Learning.
Network Regularization in Policy Optimization
Discussion on the effect of network regularization in policy optimization.
HIDIO — Hierarchical RL by Discovering Intrinsic Options
Discussion on HIDIO, a hierarchical RL algorithm that discovers task-agnostic intrinsic options in a self-supervised manner.
IDAAC — Invariant Decoupled Advantage Actor-Critic
Discussion on IDAAC, which identifies and addresses the problem of using a shared representation for learning the policy and the value function.
DTSIL — Diverse Trajectory-conditioned Self-Imitation Learning
Discussion on Diverse Trajectory-conditioned Self-Imitation Learning.
TAC — Tsallis Actor Critic
Discussion on Tsallis Actor Critic
MARL — A Survey and Critique
We present an overview of multi-agent reinforcement learning
C++ Concurrency in Action — Chapter 9
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 8
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 7
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 6
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 5
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 4
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 3
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 2
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 10
Notes from Williams’ C++ Concurrency in Action
C++ Concurrency in Action — Chapter 1
Notes from Williams’ C++ Concurrency in Action
2020
PPG — Phasic Policy Gradient
Discussion on phasic policy gradient, which implements two disjoint networks for the policy and value function and optimizes them in two phases.
Deep Reinforcement Learning and its Neuroscientific Implications
Notes from Deep Reinforcement Learning and Its Neuroscientific Implications
Backward — Learning from a Single Demonstration
Discussion on a curriculum learning algorithm that trains a policy gradient agent on Montezuma’s Revenge by starting episodes progressively further back along a single demonstration
The Mirage of Action-Dependent Baselines
Analysis of action-dependent baselines
Self-Tuning Reinforcement Learning
A self-tuning reinforcement learning algorithm for IMPALA.
V-trace
Theoretical analysis of the V-trace target.
Retrace(𝝀)
A theoretical analysis of the Retrace(𝝀) algorithm.
M-RL — Munchausen Reinforcement Learning
Discussion on Munchausen Reinforcement Learning, which incorporates the log-policy into Bellman updates.
Behavior Priors for KL-Regularized Reinforcement Learning
Discussion on behavior priors for KL-regularized reinforcement learning
A Unified View of KL-Regularized RL
We present a unified view of policy gradient and soft Q-learning.
Hide and Seek
Discussion on an agent developed by OpenAI that exhibits several emergent strategies in a hide-and-seek environment.
P3O — Policy-on Policy-off Policy Optimization
Discussion on P3O, a policy gradient method that utilizes both on-policy and off-policy data.
Reactor — Retrace Actor
Discussion on Reactor and its 𝛽-LOO policy gradient estimator.
What Matters In On-Policy Reinforcement Learning?
Discussion on several design decisions in on-policy reinforcement learning
MPO — Maximum a Posteriori Policy Optimization
Discussion on maximum a posteriori policy optimization, a KL-regularized reinforcement learning method.
MERLIN — Memory, RL, and Inference Network
Discussion on a memory architecture that allows us to do temporal relational reasoning.
Spectral Norm
Discussion on Spectral norm and its usage in deep learning
TransformerXL
Discussion on a successor of Transformer, namely TransformerXL, that can learn from sequences beyond a fixed length
Generalization in RL
Discussion on several recent works trying to improve generalization in deep reinforcement learning.
Network Randomization
Discussion on network randomization, a technique for improving generalization in reinforcement learning.
Efficient Value-Based RL
Discussion on several recent works trying to improve sample efficiency of reinforcement learning algorithms.
The Deadly Triad
We analyze how different components of DQN play a role in the emergence of the deadly triad
TPPO — Truly PPO
We investigate the behavior of PPO and introduce new methods that enforce the trust region constraint.
3rd-place solution to MineRL 2019 Competition
Discussion on the 3rd-place solution to MineRL 2019 Competition.
Anti-Aliasing
Discussion on aliasing in modern convolutional neural networks and how to address it with low-pass filters.
SENet: Squeeze-and-Excitation Network
Discussion on the Squeeze-and-Excitation Network, an architecture that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
EvoNorm
Discussion on EvoNorm, a set of uniform normalization-activation layers found by AutoML.
MobileNet
Discussion on the MobileNet family of architectures
Math
We summarize some mathematical concepts used in deep reinforcement learning
Combining EAs with RL
We summarize several recent works that combine evolutionary algorithms with reinforcement learning.
CLEAR — Continual Learning with Experience And Replay
Discussion on continual learning with experience and replay, a simple method that prevents catastrophic forgetting and improves the stability of learning.
Agent57
Discussion on an agent, called Agent57, that outperforms the standard human benchmark on all Atari games.
NGU — Never Give Up
Discussion on the Never-Give-Up (NGU) agent, which achieves state-of-the-art performance in hard-exploration Atari games without any prior knowledge while maintaining a very high score across the remaining games.
From 1st Wasserstein to Kantorovich-Rubinstein Duality
An introduction to the dual of the 1st Wasserstein distance.
Duality in Linear Programming
An introduction to dual linear programs
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
Discussion on SimCLR and some useful results about how to learn a good representation
GCN, GLU — Gated Convolutional Network
Discussion on Gated Convolutional Network that applies 1D convolution to sequential data.
EC — Episodic Curiosity
Discussion on an exploration method based on episodic memory.
FiLM — Feature-wise Linear Modulation
Discussion on Feature-wise Linear Modulation
DreamerV2
Discussion on DreamerV2, a model-based algorithm that achieves promising results on Atari games
Dreamer
Discussion on a model-based reinforcement learning agent called Dreamer
PlaNet: Deep Planning Network
Discussion on a model-based reinforcement learning agent called PlaNet
SIL - Self-Imitation Learning
Discussion on self-imitation learning, in which the agent exploits past transitions that yield better returns than it expects
AdaNorm
We analyze layer normalization and discuss its improvement AdaNorm.
UNREAL — Unsupervised Reinforcement and Auxiliary Learning
Discussion on UNsupervised REinforcement and Auxiliary Learning (UNREAL), which aims to fully utilize training signals from environments to speed up the learning process and gain better performance.
Time Limits in Reinforcement Learning
Discussion on the impact of time limits in reinforcement learning
PtrNet: Pointer Network
Discussion on Pointer Network.
Ape-X DQfD
Discussion on several enhancements to Ape-X DQN.
2019
Solving Rubik’s Cube with a Robot Hand
Discussion on an agent, trained in simulation, that solves a Rubik’s Cube with a real robot hand.
Challenges of Real-World Reinforcement Learning
Discussion on several challenges of real-world reinforcement learning.
REM - Random Ensemble Mixture
Discussion on an RL algorithm that exploits off-policy data.
BCQ — Batch-Constrained Deep Q-Learning
Discussion on an RL algorithm that exploits off-policy data.
Diagnosing Bottlenecks in DQN
Discussion on several concerns in deep Q-learning.
SEED — Scalable Efficient Deep-RL
Discussion on a scalable reinforcement learning architecture that speeds up both data collection and learning.
R2D2: Recurrent Replay Distributed DQN
Discussion on a distributed reinforcement learning architecture that incorporates a recurrent network into Ape-X.
IMPALA
Discussion on a distributed reinforcement learning architecture for policy gradient methods.
Ape-X
Discussion on a distributed reinforcement learning architecture for Q-learning methods.
DNC — Improving Differentiable Neural Computer
Discussion on several improvements to the differentiable neural computer.
DNC — Differentiable Neural Computer
Discussion on Differentiable Neural Computer.
NTM — Neural Turing Machines
Discussion on Neural Turing Machines, an architecture able to utilize an external memory.
HPG — Hindsight Policy Gradients
Discussion on a policy-gradient method with hindsight experience
PopArt: Preserving Outputs Precisely, while Adaptively Rescaling Targets
Discussion on a method that can learn values across many orders of magnitudes.
SchedNet — Schedule Network
Discussion on a multi-agent reinforcement learning algorithm that schedules communication between cooperative agents.
PR2 — Probabilistic Recursive Reasoning
Discussion on a multi-agent reinforcement learning algorithm that recursively reasons about opponents’ behavior.
MADDPG — Multi-Agent Deep Deterministic Policy Gradient
Discussion on a multi-agent reinforcement learning algorithm that follows the framework of centralized training with decentralized execution.
EMI — Exploration with Mutual Information
Discussion on a novel exploration method based on representation learning
QWeb
Discussion on how to solve the web navigation problem using DQN.
MIRL — Mutual Information Reinforcement Learning
Discussion on a new regularization mechanism that leverages an optimal prior to explicitly penalize the mutual information between states and actions.
SAGAN: Techniques in Self-Attention Generative Adversarial Networks
Discussion on several techniques involved in SAGAN, including self-attention, spectral normalization, conditional batch normalization, etc.
MB-MRL — Model-Based Meta-Reinforcement Learning
Discussion on a model-based meta reinforcement learning algorithm that enables the agent to quickly adapt to changes in the environment.
PEARL — Probabilistic Embedding for Actor-critic RL
Discussion on an off-policy meta reinforcement learning algorithm that achieves state-of-the-art performance and sample efficiency.
MB-MPO — Model-Based Meta-Policy Optimization
Discussion on an algorithm that efficiently learns a robust policy by applying MAML to multiple dynamics models.
ProMP — Proximal Meta-Policy Search
We address the credit assignment problem of two forms of MAML with an RL objective and discuss an efficient and stable meta reinforcement learning algorithm.
Adaptive MAML — Applying MAML-RL to nonstationary environments
Discussion on a variant of MAML-RL for solving tasks that change dynamically due to non-stationarity of the environment.
MAML++: Improvements on MAML
Discussion on a series of improvements on MAML
MAML — Model-Agnostic Meta-Learning
Discussion on an optimization algorithm for meta-learning named Model-Agnostic Meta-Learning (MAML)
SNAIL — Simple Neural AttentIve meta-Learner
Discussion on a meta-learning architecture named Simple Neural AttentIve meta-Learner (SNAIL).
HAC — Learning Multi-Level Hierarchies with Hindsight
A novel hierarchical reinforcement learning framework that can efficiently learn multiple levels of policies in parallel.
NORL-HRL — Near-Optimal Representation Learning for Hierarchical Reinforcement Learning
Near-Optimal Representation Learning for Hierarchical Reinforcement Learning: An improvement to HIRO
HIRO — HIerarchical Reinforcement learning with Off-policy correction
Discussion on a hierarchical reinforcement learning algorithm for goal-directed tasks.
Hierarchical Guidance
Discussion on an algorithmic framework called hierarchical guidance, which leverages hierarchical structure in imitation learning.
SAC-X — Scheduled Auxiliary Control
Discussion on a new learning paradigm in RL that resorts to auxiliary policies to efficiently explore the environment.
TDM — Temporal Difference Models
Discussion on temporal difference models, an algorithm that aims for the sample efficiency of model-based RL while achieving the asymptotic performance of model-free RL
RMC — Relational Memory Core
Discussion on a recurrent architecture that allows us to do temporal relational reasoning.
Exponential Families
Discussion on Exponential Families
FQF — Fully Parameterized Quantile Function
Discussion on fully parameterized quantile function, which improves IQN by further parameterizing the quantile proposal process.
QR-DQN, IQN
Discussion on two distributional deep Q networks, namely Quantile Regression Deep Q Network (QR-DQN) and Implicit Quantile Networks (IQN)
ICM, RND
Discussion on two exploration methods based on curiosity, namely Intrinsic Curiosity Module (ICM) and Random Network Distillation (RND)
Some Exploration Algorithms: EX2, LSH, VIME etc.
Discussion on several exploration algorithms, including count-based methods, Thompson sampling, and information gain exploration.
DIAYN — Diversity Is All You Need
Discussion on an unsupervised learning method for learning useful skills without a reward function.
Transformer
Discussion on a self-attention architecture named Transformer.
AIRL — Adversarial Inverse Reinforcement Learning
We introduce a practical GAN-style IRL algorithm named adversarial inverse reinforcement learning (AIRL)
GAN-GCL
We build a connection between maximum entropy inverse reinforcement learning and generative adversarial networks
GCL — Guided Cost Learning
We introduce a maximum entropy inverse reinforcement learning algorithm, named guided cost learning.
PCL — Path Consistency Learning and More
Discussion on path consistency learning and its derivatives.
SAC — Soft Actor-Critic with Adaptive Temperature
We introduce adaptive temperature to soft actor-critic (SAC).
SAC — Soft Actor-Critic
Discussion on soft actor-critic, a maximum entropy algorithm.
SVI — Soft Value Iteration
We address the optimism problem of the probabilistic graphical model introduced in the previous post via variational inference.
PGM — Probabilistic Graphical Model
Discussion on statistical inference in a temporal probabilistic graphical model.
SL — Statistical Learning: A Connection to Neural Networks
We expand the topic of latent variable models in the sense that the latent variables model the underlying structure of the observed data, whereby the model is able to do statistical inference over these latent variables. We then build a connection between statistical learning and neural networks.
Probabilistic Latent Variable Models
An introduction to probabilistic latent variable models
2018
EM — Expectation-Maximization Algorithm
Discussion on the Expectation-Maximization (EM) algorithm, and its application to GMMs
GPS-iLQR — Guided Policy Search with iLQR
Discussion on iterative Linear Quadratic Regulator with a local linear-Gaussian model
LQR — Linear-Quadratic Regulator
Discussion on the Linear Quadratic Regulator and its derivatives
MB-MF — Model-Based Model-Free
Discussion on the model-based model-free algorithm
SCG — Stochastic Computational Graphs
Discussion on stochastic computational graphs, a type of directed acyclic computational graph that includes both deterministic functions and conditional probability distributions.
GAE — Generalized Advantage Estimation
Discussion on a multi-step advantage estimation for online reinforcement learning
TRPO, PPO
Discussion on two policy-based algorithms that restrict the step size to avoid destructively large policy updates: Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).
CG — Conjugate Gradient Method
Discussion on the conjugate gradient method in chaos :-)
Planning and Learning in Model-Based Reinforcement Learning Methods
Discussion on a series of algorithms in model-based reinforcement learning where planning and learning are intermixed.
GQN — Generative Query Network
Discussion on the generative query network, a brand new unsupervised scene-based generative network.
Rainbow
Discussion on Rainbow, an integration of multiple improvements on DQN.
c51 — Distributional Deep Q Network
Discussion on the distributional deep Q network (a.k.a. c51), an improvement to deep Q network that replaces the action value Q with a value distribution to capture the stochastic nature of the environment.
PER — Prioritized Experience Replay
Discussion on prioritized experience replay, an improvement to the uniform experience replay used in deep Q network.
PG — Stochastic & Deterministic Policy Gradient
Discussion on policy gradient methods and their derivatives
IS — Importance Sampling
Discussion on importance sampling, the cornerstone of off-policy learning.
Basic Policies in Reinforcement Learning
We talk in detail about some widely used policies in reinforcement learning, including the epsilon-greedy policy, stochastic policy with temperature, upper confidence bound (UCB), and the gradient bandit algorithm
DQN — Deep Q Network
Discussion on Deep Q Network (DQN), a successful algorithm that works in discrete-action environments
Contrastive Predictive Coding
Discussion on a sequential representation learning model, contrastive predictive coding.
Beta-VAE and Its Variants
Discussion on beta-VAE and its variants, which attempt to learn disentangled representations by heavily penalizing the corresponding correlation term
DIM — Deep INFOMAX
Discussion on Deep INFOMAX, a representation-learning method maximizing mutual information between the input and its representation based on MINE
MINE — Mutual Information Neural Estimation
Discussion on a neural estimator for mutual information, and some of its applications
R-CNN — Region-based Methods for Object Detection
Discussion on a series of region-based methods for object detection, extending to Mask R-CNN for instance segmentation
GANs — Generative Adversarial Networks
Discussion on the generative adversarial network in two ways: one for data generation, and the other for semi-supervised learning. In the end, we’ll also demonstrate some techniques that help improve GANs
VAE — Variational Autoencoder
Discussion on variational autoencoders, a kind of generative network that allows us to alter data in a desired, specific way
t-SNE
Discussion on t-SNE, an unsupervised learning algorithm commonly used in data visualization.
YOLO — You Only Look Once
Discussion on YOLO, a state-of-the-art real-time object detection algorithm
PCA and Whitening
Discussion on the dimensionality reduction technique PCA, and its derivatives, whitening and ZCA whitening
Optimization
Discussion on first-order optimization algorithms in machine learning, which optimize the objective function based on gradients.
SVM — Support Vector Machines
An introduction to support vector machines