- Value-Based RL 20
- Mathematics 17
- Exploration in RL 14
- Multi-Agent RL 14
- Policy-Gradient RL 13
- Model-Based RL 12
- RL Application 9
- Regularized RL 9
- Sequential Model 9
- Distributed RL 8
- Hierarchical RL 8
- Meta-Learning 8
- Representation Learning 8
- Computer Vision 7
- Generative Network 7
- Inverse RL 4
- Machine Learning 4
- Multitask RL 4
- Tricks 4
- Generalization in RL 3
- Imitation Learning 3
- Memory in RL 3
- Network Layer 3
- Offline RL 3
- Overviews 3
- Curriculum Learning 1
- Evolutionary Algorithms 1
- Lifelong Learning 1
- Meta-Gradient RL 1
- Methodology 1
- Optimizer 1
- Policy-Based RL 1
- Self-Imitation Learning 1
- Transfer Learning in RL 1
- Visualization 1
Value-Based RL
Retrace(λ)
A theoretical analysis of the Retrace(π) algorithm.
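For quick reference, the Retrace(λ) target analyzed there has the form (following Munos et al., 2016; notation may differ from the post):

$$\mathcal{R}Q(x,a) = Q(x,a) + \mathbb{E}_\mu\Big[\sum_{t\ge 0}\gamma^t\Big(\prod_{s=1}^{t}c_s\Big)\big(r_t + \gamma\,\mathbb{E}_\pi Q(x_{t+1},\cdot) - Q(x_t,a_t)\big)\Big],\qquad c_s=\lambda\min\Big(1,\frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\Big)$$

where $\mu$ is the behavior policy; the truncated ratios $c_s$ cut traces safely off-policy.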
M-RL - Munchausen Reinforcement Learning
Discussion on Munchausen Reinforcement Learning, which adds a scaled log-policy term to the reward in Bellman updates.
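As a rough sketch of the idea (the M-DQN form from the paper, target-network details omitted): with $\pi=\operatorname{softmax}(q/\tau)$, the regression target becomes

$$\hat{q}(s_t,a_t) = r_t + \alpha\tau\ln\pi(a_t|s_t) + \gamma\sum_{a'}\pi(a'|s_{t+1})\big(q(s_{t+1},a') - \tau\ln\pi(a'|s_{t+1})\big)$$

i.e., the scaled log-policy augments the immediate reward on top of a soft, entropy-regularized bootstrap.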
A Unified View of KL-Regularized RL
We present a unified view of policy gradient and soft Q-learning.
Reactor - Retrace Actor
Discussion on β-LOO, the leave-one-out policy-gradient estimator used in Reactor.
Efficient Value-Based RL
Discussion on several recent works trying to improve the sample efficiency of reinforcement learning algorithms.
The Deadly Triad
We analyze how different components of DQN play a role in the emergence of the deadly triad.
Agent57
Discussion on an agent, called Agent57, that outperforms the standard human benchmark on all Atari games.
NGU - Never Give Up
Discussion on the Never Give Up (NGU) agent, which achieves state-of-the-art performance in hard-exploration Atari games without any prior knowledge while maintaining a very high score across the remaining games.
Diagnosing Bottlenecks in DQN
Discussion on several concerns in deep Q-learning.
PopArt: Preserving Outputs Precisely, while Adaptively Rescaling Targets
Discussion on a method that can learn values across many orders of magnitude.
FQF - Fully Parameterized Quantile Function
Discussion on fully parameterized quantile function, which improves IQN by further parameterizing the quantile proposal process.
QR-DQN, IQN
Discussion on two distributional deep Q networks, namely the Quantile Regression Deep Q Network (QR-DQN) and Implicit Quantile Networks (IQN).
PCL - Path Consistency Learning and More
Discussion on path consistency learning and its derivatives.
SAC - Soft Actor-Critic with Adaptive Temperature
We introduce adaptive temperature to soft actor-critic (SAC).
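For reference, the temperature objective (as in the second SAC paper; notation may differ from the post) is

$$J(\alpha) = \mathbb{E}_{a_t\sim\pi}\big[-\alpha\log\pi(a_t|s_t) - \alpha\,\bar{\mathcal{H}}\big]$$

where $\bar{\mathcal{H}}$ is a target entropy: $\alpha$ grows when the policy entropy drops below the target and shrinks otherwise.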
SAC - Soft Actor-Critic
Discussion on soft actor-critic, a maximum entropy algorithm.
Rainbow
Discussion on Rainbow, an integration of multiple improvements on DQN.
C51 - Distributional Deep Q Network
Discussion on the distributional deep Q network (a.k.a. C51), an improvement to the deep Q network that replaces the action value Q with a value distribution to capture the stochastic nature of the environment.
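The core idea, roughly: model the return distribution $Z$ instead of its mean via the distributional Bellman equation

$$\mathcal{T}Z(x,a) \overset{D}{=} R(x,a) + \gamma Z(X',A'),\qquad Q(x,a)=\mathbb{E}\big[Z(x,a)\big]$$

C51 represents $Z$ as a categorical distribution over 51 fixed atoms and projects the target back onto that support.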
PER - Prioritized Experience Replay
Discussion on prioritized experience replay, an improvement to the uniform experience replay used in the deep Q network.
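A quick sketch of the proportional variant (per the PER paper; notation may differ from the post): transition $i$ is sampled with probability

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha},\qquad p_i = |\delta_i| + \epsilon,\qquad w_i = \Big(\frac{1}{N}\cdot\frac{1}{P(i)}\Big)^{\beta}$$

where $\delta_i$ is the TD error and the importance weights $w_i$ (normalized by their maximum in practice) correct the bias introduced by non-uniform sampling.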
PG - Stochastic & Deterministic Policy Gradient
Discussion on policy gradient methods and their derivatives.
DQN - Deep Q Network
Discussion on the Deep Q Network (DQN), a successful algorithm that works in discrete-action environments.
Mathematics
GAIL - Generative Adversarial Imitation Learning
A concise theoretical analysis of GAIL
V-trace
Theoretical analysis of the V-trace target.
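For reference, the $n$-step V-trace target (following the IMPALA paper; notation may differ from the post) is

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1}\gamma^{t-s}\Big(\prod_{i=s}^{t-1}c_i\Big)\delta_t V,\qquad \delta_t V=\rho_t\big(r_t+\gamma V(x_{t+1})-V(x_t)\big)$$

with truncated importance ratios $\rho_t=\min\big(\bar\rho,\frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\big)$ and $c_i=\min\big(\bar c,\frac{\pi(a_i|x_i)}{\mu(a_i|x_i)}\big)$.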
Retrace(λ)
A theoretical analysis of the Retrace(π) algorithm.
Spectral Norm
Discussion on the spectral norm and its usage in deep learning.
Math
We summarize some mathematical concepts used in deep reinforcement learning
From 1st Wasserstein to Kantorovich-Rubinstein Duality
An introduction to the dual of the 1st Wasserstein distance.
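Stated compactly, the duality in question is

$$W_1(\mu,\nu) = \sup_{\|f\|_L\le 1}\ \mathbb{E}_{x\sim\mu}\big[f(x)\big] - \mathbb{E}_{y\sim\nu}\big[f(y)\big]$$

i.e., the 1st Wasserstein distance equals a supremum over 1-Lipschitz critics, which is what WGAN-style objectives approximate.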
Duality in Linear Programming
An introduction to dual linear programs
Exponential Families
Discussion on Exponential Families.
SVI - Soft Value Iteration
We address the optimism problem of the probabilistic graphical model introduced in the previous post via variational inference.
PGM - Probabilistic Graphical Model
Discussion on statistical inference in a temporal probabilistic graphical model.
SL - Statistical Learning: A Connection to Neural Networks
We expand on latent variable models, in which the latent variables capture the underlying structure of the observed data, allowing the model to perform statistical inference over them. Then we build a connection between statistical learning and neural networks.
EM - Expectation-Maximization Algorithm
Discussion on the Expectation-Maximization (EM) algorithm and its application to Gaussian mixture models (GMMs).
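For reference, EM alternates the two standard steps

$$\text{E-step: } q^{(t)}(z) = p\big(z\mid x;\theta^{(t)}\big),\qquad \text{M-step: } \theta^{(t+1)} = \arg\max_\theta\ \mathbb{E}_{q^{(t)}}\big[\log p(x,z;\theta)\big]$$

and each iteration monotonically increases a lower bound on the log-likelihood $\log p(x;\theta)$.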
SCG - Stochastic Computational Graphs
Discussion on stochastic computational graphs, a type of directed acyclic computational graph that includes both deterministic functions and conditional probability distributions.
TRPO, PPO
Discussion on two policy-based algorithms that restrict the policy update size: Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).
CG - Conjugate Gradient Method
Discussion on the conjugate gradient method in chaos :-)
PCA and Whitening
Discussion on the dimensionality-reduction technique PCA and its derivatives, whitening and ZCA whitening.
SVM - Support Vector Machines
An introduction to support vector machines
Exploration in RL
Go-Explore
Discussion on Go-Explore, a family of algorithms designed for hard-exploration games
DTSIL - Diverse Trajectory-conditioned Self-Imitation Learning
Discussion on Diverse Trajectory-conditioned Self-Imitation Learning.
Backward - Learning from a Single Demonstration
Discussion on a curriculum learning algorithm that learns a policy on Montezuma's Revenge by gradually moving the starting state backward along a single demonstration.
Behavior Priors for KL-Regularized Reinforcement Learning
Discussion on behavior priors for KL-regularized reinforcement learning.
Agent57
Discussion on an agent, called Agent57, that outperforms the standard human benchmark on all Atari games.
NGU - Never Give Up
Discussion on the Never Give Up (NGU) agent, which achieves state-of-the-art performance in hard-exploration Atari games without any prior knowledge while maintaining a very high score across the remaining games.
EC - Episodic Curiosity
Discussion on an exploration method based on episodic memory.
Ape-X DQfD
Discussion on several enhancements to Ape-X DQN.
EMI - Exploration with Mutual Information
Discussion on a novel exploration method based on representation learning
SAC-X - Scheduled Auxiliary Control
Discussion on a new learning paradigm in RL that resorts to auxiliary policies to efficiently explore the environment.
ICM, RND
Discussion on two exploration methods based on curiosity, namely the Intrinsic Curiosity Module (ICM) and Random Network Distillation (RND).
Some Exploration Algorithms: EX2, LSH, VIME etc.
Discussion on several exploration algorithms, including count-based methods, Thompson sampling, and information gain exploration.
DIAYN - Diversity Is All You Need
Discussion on an unsupervised learning method for learning useful skills without a reward function.
Basic Policies in Reinforcement Learning
We talk in detail about some widely used policies in reinforcement learning, including the epsilon-greedy policy, stochastic policies with temperature, upper confidence bound (UCB), and gradient bandit algorithms.
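A minimal sketch of two of these policies (illustrative code, not taken from the post; the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon explore uniformly; otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def ucb(q_values, counts, t, c=2.0):
    # Upper confidence bound: add an optimism bonus that shrinks
    # as an action is selected more often (counts is a numpy array).
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(q_values + bonus))
```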
Multi-Agent RL
AlphaStar
Discussion on AlphaStar, the first agent that achieves Grandmaster level in the full game of StarCraft II
OpenAI Five
Discussion on OpenAI Five, an agent that achieves super-human performance in Dota 2
FTW - For The Win
Discussion on an agent, namely For The Win (FTW), that achieves human-level performance in a popular 3D team-based multiplayer first-person video game.
MuZero
Discussion on MuZero, a successor of AlphaZero that not only masters board games but also achieves state-of-the-art performance on Atari games.
AlphaZero
Discussion on AlphaZero, an agent that achieves super-human performance in chess, shogi and Go
MAPPO
Discussion on Multi-Agent PPO, which includes a few tricks when applying PPO to multi-agent environments
QMIX and Some Tricks
Discussion on QMIX and some tricks on QMIX.
NCC - Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning
Discussion on NCC, a cooperative MARL method that takes into account neighborhood cognitive consistency.
RODE - Learning Roles to Decompose Multi-Agent Tasks
Discussion on RODE, a hierarchical MARL method that decomposes the action space into role action subspaces according to their effects on the environment.
MARL - A Survey and Critique
We present an overview of multi-agent reinforcement learning
Hide and Seek
Discussion on an agent developed by OpenAI that exhibits several emergent strategies in a hide-and-seek environment.
SchedNet - Schedule Network
Discussion on a multi-agent reinforcement learning algorithm that schedules communication between cooperative agents.
PR2 - Probabilistic Recursive Reasoning
Discussion on a multi-agent reinforcement learning algorithm that recursively reasons about opponents' behavior.
MADDPG - Multi-Agent Deep Deterministic Policy Gradient
Discussion on a multi-agent reinforcement learning algorithm that follows the framework of centralized training with decentralized execution.
Policy-Gradient RL
IDAAC - Invariant Decoupled Advantage Actor-Critic
Discussion on IDAAC, which identifies and addresses the problem of using a shared representation for learning the policy and the value function.
PPG - Phasic Policy Gradient
Discussion on phasic policy gradient, which implements two disjoint networks for the policy and value function and optimizes them in two phases.
The Mirage of Action-Dependent Baselines
An analysis of action-dependent baselines.
Self-Tuning Reinforcement Learning
A self-tuning reinforcement learning algorithm for IMPALA.
A Unified View of KL-Regularized RL
We present a unified view of policy gradient and soft Q-learning.
P3O - Policy-on Policy-off Policy Optimization
Discussion on P3O, a policy gradient method that utilizes both on-policy and off-policy data.
What Matters In On-Policy Reinforcement Learning?
Discussion on several design decisions in on-policy reinforcement learning.
MPO - Maximum a Posteriori Policy Optimization
Discussion on maximum a posteriori policy optimization, a KL-regularized reinforcement learning method.
TPPO - Truly PPO
We investigate the behavior of PPO and introduce new methods that enforce the trust-region constraint.
HPG - Hindsight Policy Gradients
Discussion on a policy-gradient method with hindsight experience
GAE - Generalized Advantage Estimation
Discussion on multi-step advantage estimation for online reinforcement learning.
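For reference, GAE computes (per Schulman et al.; notation may differ from the post)

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

where $\lambda$ trades off bias (small $\lambda$) against variance (large $\lambda$) in the advantage estimate.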
TRPO, PPO
Discussion on two policy-based algorithms that restrict the policy update size: Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).
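As a quick reference, PPO maximizes the clipped surrogate objective (TRPO instead enforces an explicit KL trust region):

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],\qquad r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}$$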
PG - Stochastic & Deterministic Policy Gradient
Discussion on policy gradient methods and their derivatives.
Model-Based RL
MuZero
Discussion on MuZero, a successor of AlphaZero that not only masters board games but also achieves state-of-the-art performance on Atari games.
AlphaZero
Discussion on AlphaZero, an agent that achieves super-human performance in chess, shogi and Go
DreamerV2
Discussion on DreamerV2, a model-based algorithm reaching promising results on Atari games
Dreamer
Discussion on a model-based reinforcement learning agent called Dreamer
PlaNet: Deep Planning Network
Discussion on a model-based reinforcement learning agent called PlaNet
MB-MRL - Model-Based Meta-Reinforcement Learning
Discussion on a model-based meta reinforcement learning algorithm that enables the agent to adapt quickly to changes in the environment.
MB-MPO - Model-Based Meta-Policy Optimization
Discussion on an algorithm that efficiently learns a robust policy by applying MAML to multiple dynamics models.
TDM - Temporal Difference Models
Discussion on temporal difference models, an algorithm that aims for the sample efficiency of model-based RL while achieving the asymptotic performance of model-free RL.
GPS-iLQR - Guided Policy Search with iLQR
Discussion on the iterative Linear Quadratic Regulator (iLQR) with a local linear-Gaussian model.
LQR - Linear-Quadratic Regulator
Discussion on the Linear-Quadratic Regulator and its derivatives.
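For reference, the simplest LQR setting (standard formulation, not specific to the post) has linear dynamics, quadratic cost, and a linear optimal controller:

$$x_{t+1} = A x_t + B u_t,\qquad c(x_t,u_t) = x_t^\top Q x_t + u_t^\top R u_t,\qquad u_t^* = -K_t x_t$$

where the gains $K_t$ are obtained by solving the Riccati recursion backward in time.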
MB-MF - Model-Based Model-Free
Discussion on the model-based model-free algorithm.
Planning and Learning in Model-Based Reinforcement Learning Methods
Discussion on a series of algorithms in model-based reinforcement learning where planning and learning are intermixed.
RL Application
AlphaStar
Discussion on AlphaStar, the first agent that achieves Grandmaster level in the full game of StarCraft II
OpenAI Five
Discussion on OpenAI Five, an agent that achieves super-human performance in Dota 2
FTW - For The Win
Discussion on an agent, namely For The Win (FTW), that achieves human-level performance in a popular 3D team-based multiplayer first-person video game.
MuZero
Discussion on MuZero, a successor of AlphaZero that not only masters board games but also achieves state-of-the-art performance on Atari games.
AlphaZero
Discussion on AlphaZero, an agent that achieves super-human performance in chess, shogi and Go
Hide and Seek
Discussion on an agent developed by OpenAI that exhibits several emergent strategies in a hide-and-seek environment.
3rd-place solution to MineRL 2019 Competition
Discussion on the 3rd-place solution to MineRL 2019 Competition.
Solving Rubik's Cube with a Robot Hand
Discussion on an agent, trained in simulation, that can solve Rubik's Cube with a real robot hand.
QWeb
Discussion on how to solve the web navigation problem using DQN.
Regularized RL
TAC - Tsallis Actor Critic
Discussion on Tsallis Actor Critic
M-RL - Munchausen Reinforcement Learning
Discussion on Munchausen Reinforcement Learning, which adds a scaled log-policy term to the reward in Bellman updates.
Behavior Priors for KL-Regularized Reinforcement Learning
Discussion on behavior priors for KL-regularized reinforcement learning.
A Unified View of KL-Regularized RL
We present a unified view of policy gradient and soft Q-learning.
MPO - Maximum a Posteriori Policy Optimization
Discussion on maximum a posteriori policy optimization, a KL-regularized reinforcement learning method.
MIRL - Mutual Information Reinforcement Learning
Discussion on a new regularization mechanism that leverages an optimal prior to explicitly penalize the mutual information between states and actions.
PCL - Path Consistency Learning and More
Discussion on path consistency learning and its derivatives.
SAC - Soft Actor-Critic with Adaptive Temperature
We introduce adaptive temperature to soft actor-critic (SAC).
SAC - Soft Actor-Critic
Discussion on soft actor-critic, a maximum entropy algorithm.
Sequential Model
MERLIN - Memory, RL, and Inference Network
Discussion on a memory architecture that allows us to do temporal relational reasoning.
Transformer-XL
Discussion on a successor of the Transformer, namely Transformer-XL, that can learn from sequences beyond a fixed length.
GCN, GLU - Gated Convolutional Network
Discussion on Gated Convolutional Network that applies 1D convolution to sequential data.
PtrNet: Pointer Network
Discussion on Pointer Network.
DNC - Improving Differentiable Neural Computer
Discussion on several improvements to the differentiable neural computer.
DNC - Differentiable Neural Computer
Discussion on the Differentiable Neural Computer.
NTM - Neural Turing Machines
Discussion on Neural Turing Machines, an architecture able to utilize an external memory.
RMC - Relational Memory Core
Discussion on a recurrent architecture that allows us to do temporal relational reasoning.
Transformer
Discussion on a self-attention architecture named Transformer.
Distributed RL
AlphaStar
Discussion on AlphaStar, the first agent that achieves Grandmaster level in the full game of StarCraft II
OpenAI Five
Discussion on OpenAI Five, an agent that achieves super-human performance in Dota 2
FTW - For The Win
Discussion on an agent, namely For The Win (FTW), that achieves human-level performance in a popular 3D team-based multiplayer first-person video game.
Ape-X DQfD
Discussion on several enhancements to Ape-X DQN.
SEED - Scalable Efficient Deep-RL
Discussion on a scalable reinforcement learning architecture that speeds up both data collection and learning.
R2D2: Recurrent Replay Distributed DQN
Discussion on a distributed reinforcement learning architecture that incorporates a recurrent network into Ape-X.
IMPALA
Discussion on a distributed reinforcement learning architecture for policy gradient methods.
Ape-X
Discussion on a distributed reinforcement learning architecture for Q-learning methods.
Hierarchical RL
FTW - For The Win
Discussion on an agent, namely For The Win (FTW), that achieves human-level performance in a popular 3D team-based multiplayer first-person video game.
RODE - Learning Roles to Decompose Multi-Agent Tasks
Discussion on RODE, a hierarchical MARL method that decomposes the action space into role action subspaces according to their effects on the environment.
HIDIO - Hierarchical RL by Discovering Intrinsic Options
Discussion on HIDIO, a hierarchical RL method that learns task-agnostic intrinsic options in a self-supervised way.
HAC - Learning Multi-Level Hierarchies with Hindsight
A novel hierarchical reinforcement learning framework that can efficiently learn multiple levels of policies in parallel.
NORL-HRL - Near-Optimal Representation Learning for Hierarchical Reinforcement Learning
Near-Optimal Representation Learning for Hierarchical Reinforcement Learning: An improvement to HIRO
HIRO - HIerarchical Reinforcement learning with Off-policy correction
Discussion on a hierarchical reinforcement learning algorithm for goal-directed tasks.
Hierarchical Guidance
Discussion on an algorithmic framework called hierarchical guidance, which leverages hierarchical structure in imitation learning.
DIAYN - Diversity Is All You Need
Discussion on an unsupervised learning method for learning useful skills without a reward function.
Meta-Learning
MB-MRL - Model-Based Meta-Reinforcement Learning
Discussion on a model-based meta reinforcement learning algorithm that enables the agent to adapt quickly to changes in the environment.
PEARL - Probabilistic Embedding for Actor-critic RL
Discussion on an off-policy meta reinforcement learning algorithm that achieves state-of-the-art performance and sample efficiency.
MB-MPO - Model-Based Meta-Policy Optimization
Discussion on an algorithm that efficiently learns a robust policy by applying MAML to multiple dynamics models.
ProMP - Proximal Meta-Policy Search
We address the credit assignment problem of two forms of MAML with an RL objective and discuss an efficient and stable meta reinforcement learning algorithm.
Adaptive MAML - Applying MAML-RL to Nonstationary Environments
Discussion on a variant of MAML-RL for solving tasks that change dynamically due to non-stationarity of the environment.
MAML++: Improvements on MAML
Discussion on a series of improvements on MAML
MAML - Model-Agnostic Meta-Learning
Discussion on an optimization algorithm for meta-learning named Model-Agnostic Meta-Learning (MAML).
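For reference, the MAML objective (per Finn et al.; one inner gradient step shown) is

$$\min_\theta\ \sum_{\mathcal{T}_i}\mathcal{L}_{\mathcal{T}_i}\Big(\theta - \alpha\,\nabla_\theta\mathcal{L}_{\mathcal{T}_i}(\theta)\Big)$$

i.e., optimize the initialization $\theta$ so that a single task-specific gradient step already performs well.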
SNAIL - Simple Neural AttentIve meta-Learner
Discussion on a meta-learning architecture named Simple Neural AttentIve meta-Learner (SNAIL).
Representation Learning
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
Discussion on some useful results about how to learn a good representation
NORL-HRL - Near-Optimal Representation Learning for Hierarchical Reinforcement Learning
Near-Optimal Representation Learning for Hierarchical Reinforcement Learning: An improvement to HIRO
GQN - Generative Query Network
Discussion on the generative query network, a brand new unsupervised scene-based generative network.
Contrastive Predictive Coding
Discussion on a sequential representation learning model, contrastive predictive coding.
Beta-VAE and Its Variants
Discussion on beta-VAE and its variants, which attempt to learn disentangled representations by heavily penalizing the corresponding correlation term.
DIM - Deep INFOMAX
Discussion on Deep INFOMAX, a representation-learning method maximizing mutual information between the input and its representation based on MINE
MINE - Mutual Information Neural Estimation
Discussion on a neural estimator for mutual information, and some of its applications
VAE - Variational Autoencoder
Discussion on variational autoencoders, a kind of generative network that allows us to alter data in a desired, specific way.
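For reference, VAEs maximize the evidence lower bound (standard form; notation may differ from the post):

$$\log p_\theta(x) \ \ge\ \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)$$

a reconstruction term plus a regularizer that keeps the approximate posterior close to the prior.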
Computer Vision
Anti-Aliasing
Discussion on aliasing in modern convolutional neural networks and how to address it with low-pass filters.
SENet: Squeeze-and-Excitation Network
Discussion on the Squeeze-and-Excitation Network, an architecture that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
MobileNet
Discussion on the MobileNet family.
FiLM - Feature-wise Linear Modulation
Discussion on Feature-wise Linear Modulation
R-CNN - Region-based Methods for Object Detection
Discussion on a series of region-based methods for object detection, extending to Mask R-CNN for instance segmentation.
GANs - Generative Adversarial Networks
Discussion on generative adversarial networks in two settings: data generation and semi-supervised learning. In the end, we also demonstrate some techniques that help improve GANs.
YOLO - You Only Look Once
Discussion on YOLO, a state-of-the-art real-time object detection algorithm
Generative Network
SAGAN: Techniques in Self-Attention Generative Adversarial Networks
Discussion on several techniques involved in SAGAN, including self-attention, spectral normalization, conditional batch normalization, etc
GQN - Generative Query Network
Discussion on the generative query network, a brand new unsupervised scene-based generative network.
Contrastive Predictive Coding
Discussion on a sequential representation learning model, contrastive predictive coding.
Beta-VAE and Its Variants
Discussion on beta-VAE and its variants, which attempt to learn disentangled representations by heavily penalizing the corresponding correlation term.
MINE - Mutual Information Neural Estimation
Discussion on a neural estimator for mutual information, and some of its applications
GANs - Generative Adversarial Networks
Discussion on generative adversarial networks in two settings: data generation and semi-supervised learning. In the end, we also demonstrate some techniques that help improve GANs.
VAE - Variational Autoencoder
Discussion on variational autoencoders, a kind of generative network that allows us to alter data in a desired, specific way.
Inverse RL
GAIL - Generative Adversarial Imitation Learning
A concise theoretical analysis of GAIL
AIRL - Adversarial Inverse Reinforcement Learning
We introduce a practical GAN-style IRL algorithm named adversarial inverse reinforcement learning (AIRL).
GAN-GCL
We build a connection between maximum entropy inverse reinforcement learning and generative adversarial networks
GCL - Guided Cost Learning
We introduce a maximum entropy inverse reinforcement learning algorithm named guided cost learning.
Machine Learning
EM - Expectation-Maximization Algorithm
Discussion on the Expectation-Maximization (EM) algorithm and its application to Gaussian mixture models (GMMs).
t-SNE
Discussion on t-SNE, an unsupervised learning algorithm commonly used in data visualization.
PCA and Whitening
Discussion on the dimensionality-reduction technique PCA and its derivatives, whitening and ZCA whitening.
SVM - Support Vector Machines
An introduction to support vector machines
Multitask RL
Behavior Priors for KL-Regularized Reinforcement Learning
Discussion on behavior priors for KL-regularized reinforcement learning.
UNREAL - Unsupervised Reinforcement and Auxiliary Learning
Discussion on UNsupervised REinforcement and Auxiliary Learning (UNREAL), which aims to fully utilize training signals from environments to speed up the learning process and gain better performance.
SAC-X - Scheduled Auxiliary Control
Discussion on a new learning paradigm in RL that resorts to auxiliary policies to efficiently explore the environment.
DIAYN - Diversity Is All You Need
Discussion on an unsupervised learning method for learning useful skills without a reward function.
Tricks
Network Regularization in Policy Optimization
Discussion on the effect of network regularization in policy optimization.
What Matters In On-Policy Reinforcement Learning?
Discussion on several design decisions in on-policy reinforcement learning.
The Deadly Triad
We analyze how different components of DQN play a role in the emergence of the deadly triad.
Diagnosing Bottlenecks in DQN
Discussion on several concerns in deep Q-learning.
Generalization in RL
IDAAC - Invariant Decoupled Advantage Actor-Critic
Discussion on IDAAC, which identifies and addresses the problem of using a shared representation for learning the policy and the value function.
Generalization in RL
Discussion on several recent works trying to improve generalization in deep reinforcement learning.
Network Randomization
Discussion on network randomization, a technique improving generalization in reinforcement learning.
Imitation Learning
PWIL - Primal Wasserstein Imitation Learning
Discussion on Primal Wasserstein Imitation Learning.
Ape-X DQfD
Discussion on several enhancements to Ape-X DQN.
Hierarchical Guidance
Discussion on an algorithmic framework called hierarchical guidance, which leverages hierarchical structure in imitation learning.
Memory in RL
MERLIN - Memory, RL, and Inference Network
Discussion on a memory architecture that allows us to do temporal relational reasoning.
EC - Episodic Curiosity
Discussion on an exploration method based on episodic memory.
RMC - Relational Memory Core
Discussion on a recurrent architecture that allows us to do temporal relational reasoning.
Network Layer
Spectral Norm
Discussion on the spectral norm and its usage in deep learning.
EvoNorm
Discussion on EvoNorm, a set of uniform normalization-activation layers found by AutoML.
AdaNorm
We analyze layer normalization and discuss its improvement AdaNorm.
Offline RL
3rd-place solution to MineRL 2019 Competition
Discussion on the 3rd-place solution to MineRL 2019 Competition.
REM - Random Ensemble Mixture
Discussion on an RL algorithm that exploits off-policy data.
BCQ - Batch-Constrained Deep Q-Learning
Discussion on an RL algorithm that exploits off-policy data.
Overviews
MARL - A Survey and Critique
We present an overview of multi-agent reinforcement learning
Deep Reinforcement Learning and its Neuroscientific Implications
Notes from Deep Reinforcement Learning and Its Neuroscientific Implications
Challenges of Real-World Reinforcement Learning
Discussion on several challenges of real-world reinforcement learning.
Curriculum Learning
Backward - Learning from a Single Demonstration
Discussion on a curriculum learning algorithm that learns a policy on Montezuma's Revenge by gradually moving the starting state backward along a single demonstration.
Evolutionary Algorithms
Combining EAs with RL
We summarize several recent works that combine evolutionary algorithms with reinforcement learning.
Lifelong Learning
CLEAR - Continual Learning with Experience And Replay
Discussion on continual learning with experience and replay, a simple method that prevents catastrophic forgetting and improves the stability of learning.
Meta-Gradient RL
Self-Tuning Reinforcement Learning
A self-tuning reinforcement learning algorithm for IMPALA.
Methodology
The Mirage of Action-Dependent Baselines
An analysis of action-dependent baselines.
Optimizer
Optimization
Discussion on first-order optimization algorithms in machine learning, which optimize the objective function based on gradients.
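As a representative example of such a first-order method (Adam, in its standard form; not necessarily the notation used in the post):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2,\qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$

with bias corrections $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$.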
Policy-Based RL
V-trace
Theoretical analysis of the V-trace target.
Self-Imitation Learning
SIL - Self-Imitation Learning
Discussion on self-imitation learning, in which the agent exploits past transitions that yielded better returns than expected.
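Roughly, the SIL losses (per Oh et al., 2018; $(\cdot)_+=\max(\cdot,0)$ and $R$ is the stored return) are

$$L^{\mathrm{sil}}_{\mathrm{policy}} = -\log\pi_\theta(a|s)\,\big(R - V_\theta(s)\big)_+,\qquad L^{\mathrm{sil}}_{\mathrm{value}} = \tfrac{1}{2}\,\big(R - V_\theta(s)\big)_+^2$$

so only transitions whose returns exceeded the current value estimate are imitated.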
Transfer Learning in RL
Solving Rubik's Cube with a Robot Hand
Discussion on an agent, trained in simulation, that can solve Rubik's Cube with a real robot hand.
Visualization
t-SNE
Discussion on t-SNE, an unsupervised learning algorithm commonly used in data visualization.