You have an environment, a PyTorch model, and a reinforcement learning library that are designed to work together but don’t. PufferLib provides one-line wrappers that make them play nice.

Play Snake vs RL in your browser!

WIP: Play Puffer MOBA vs RL in your browser!

Click to Demo PufferLib in Colab

Join our community Discord for support and discussion, follow my Twitter for news, and star the repo to feed the puffer. We also have a whitepaper.

Installation

PufferTank is a GPU container with PufferLib and dependencies for all environments in the registry, including some that are slow and tricky to install.

If you have not used containers before and just want everything to work, clone the repository and open it in VSCode. You will need to install the Dev Container plugin as well as Docker Desktop. VSCode will then detect the settings in .devcontainer and set up the container for you.

PufferLib is also available as a standard pip package.

pip install pufferlib

To install additional environments and frameworks:

pip install pufferlib[nmmo,cleanrl]

Note that some environments require additional non-pip dependencies. Follow the setup instructions from the maintainers of that environment, or just use PufferTank.

Contributors

Joseph Suarez: Creator and developer of PufferLib

thatguy: Several performance improvements w/ torch compilation, major pokerl contributor.

Kyoung Whan Choe (최경환): Testing and bug fixes

David Bloomin: 0.4 policy pool/store/selector

Nick Jenkins: Layout for the system architecture diagram. Adversary.design.

Andranik Tigranyan: Streamline and animate the pufferfish. Hire him on UpWork if you like what you see here.

Sara Earle: Original pufferfish model. Hire her on UpWork if you like what you see here.

You can open this guide in a Colab notebook by clicking the demo button at the top of this page.

Emulation

Complex environments may have hierarchical observations and actions, variable numbers of agents, and other quirks that make them difficult to work with and incompatible with standard reinforcement learning libraries. PufferLib’s emulation layer makes every environment look like it has flat observations/actions and a constant number of agents. Here’s how it works with NetHack and Neural MMO, two notoriously complex environments.

import pufferlib.emulation
import pufferlib.wrappers

import nle, nmmo

def nmmo_creator():
    # Wrap a PettingZoo environment by object
    env = nmmo.Env()
    env = pufferlib.wrappers.PettingZooTruncatedWrapper(env)
    return pufferlib.emulation.PettingZooPufferEnv(env=env)

def nethack_creator():
    # Wrap a Gymnasium environment by creator function
    return pufferlib.emulation.GymnasiumPufferEnv(env_creator=nle.env.NLE)

The wrappers give you back a Gymnasium/PettingZoo compliant environment. There is no loss of generality and no change to the underlying environment. You can wrap environments by class, creator function, or object, with or without additional arguments. These wrappers enable us to make some optimizations to vectorization code that would be difficult to implement otherwise. You can choose from a variety of vectorization backends. They all share the same interface with synchronous and asynchronous options.

import pufferlib.vector
backend = pufferlib.vector.Serial  # or Multiprocessing, Ray
envs = pufferlib.vector.make(nmmo_creator, backend=backend, num_envs=4)

# Synchronous API - reset/step
obs, infos = envs.reset()

# Asynchronous API - async_reset, send/recv
envs.async_reset()
obs, rewards, terminals, truncateds, infos, env_id, mask = envs.recv()

Our backends support asynchronous on-policy sampling through a Python implementation of EnvPool. This makes them faster than the implementations that ship with most RL libraries. We suggest Serial for debugging and Multiprocessing for most training runs. Ray is a good option if you need to scale beyond a single machine.
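Continuing from the recv above, a minimal asynchronous collection loop alternates send and recv. This is a sketch only: random actions stand in for a policy, and it assumes send accepts one action per observation returned by the most recent recv.

import numpy as np

for _ in range(10):
    # One random action per received observation; replace with your policy
    actions = np.stack([envs.single_action_space.sample() for _ in range(len(obs))])
    envs.send(actions)
    obs, rewards, terminals, truncateds, infos, env_id, mask = envs.recv()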

PufferLib allows you to write vanilla PyTorch policies and use them with multiple learning libraries. We take care of the details of converting between the different APIs. Here’s a policy that will work with any environment, with a one-line wrapper for CleanRL.

import torch
from torch import nn
import numpy as np

import pufferlib.frameworks.cleanrl

class Policy(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.encoder = nn.Linear(np.prod(
            env.single_observation_space.shape), 128)
        self.decoders = nn.ModuleList([nn.Linear(128, n)
            for n in env.single_action_space.nvec])
        self.value_head = nn.Linear(128, 1)

    def forward(self, env_outputs):
        # Flatten observations to (batch, features) before encoding
        env_outputs = env_outputs.reshape(env_outputs.shape[0], -1)
        hidden = self.encoder(env_outputs)
        actions = [dec(hidden) for dec in self.decoders]
        value = self.value_head(hidden)
        return actions, value

obs = torch.Tensor(obs)
policy = Policy(envs.driver_env)
cleanrl_policy = pufferlib.frameworks.cleanrl.Policy(policy)
actions = cleanrl_policy.get_action_and_value(obs)[0].numpy()
obs, rewards, terminals, truncateds, infos = envs.step(actions)
envs.close()

Optionally, you can break the forward pass into an encode and decode step, which allows us to handle recurrence for you (a sketch of this split appears at the end of this section). So far, the code above is fully general and does not rely on PufferLib support for specific environments. For convenience, we also provide environment hooks with standard wrappers and baseline models. Here’s a complete example.

import torch

import pufferlib.models
import pufferlib.vector
import pufferlib.frameworks.cleanrl
import pufferlib.environments.nmmo

backend = pufferlib.vector.Serial  # or Multiprocessing, Ray
make_env = pufferlib.environments.nmmo.env_creator()
envs = pufferlib.vector.make(make_env, backend=backend, num_envs=4)

policy = pufferlib.environments.nmmo.Policy(envs.driver_env)
cleanrl_policy = pufferlib.frameworks.cleanrl.Policy(policy)

env_outputs = envs.reset()[0]
obs = torch.from_numpy(env_outputs)
actions = cleanrl_policy.get_action_and_value(obs)[0].numpy()
next_obs, rewards, terminals, truncateds, infos = envs.step(actions)
envs.close()

It’s that simple – almost. If you have an environment with structured observations, you’ll have to unpack them in the network forward pass since PufferLib will flatten them in emulation. We provide a utility for this.

import pufferlib.pytorch

dtype = pufferlib.pytorch.nativize_dtype(envs.driver_env.emulated)
env_outputs = pufferlib.pytorch.nativize_tensor(obs, dtype)
print('Packed tensor:', obs.shape)
print('Unpacked:', env_outputs.keys())
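As mentioned earlier, you can optionally split the forward pass into an encode and a decode step so that recurrence can be inserted between them. The sketch below follows the convention used in our default policies, but the exact method names (encode_observations, decode_actions) and signatures are an assumption here; check pufferlib.models for the current interface.

import numpy as np
import torch
from torch import nn

class SplitPolicy(nn.Module):
    def __init__(self, env, hidden_size=128):
        super().__init__()
        self.encoder = nn.Linear(np.prod(
            env.single_observation_space.shape), hidden_size)
        self.decoders = nn.ModuleList([nn.Linear(hidden_size, n)
            for n in env.single_action_space.nvec])
        self.value_head = nn.Linear(hidden_size, 1)

    def encode_observations(self, observations):
        # Flatten and embed observations; recurrence can be slotted in after this step
        batch = observations.reshape(observations.shape[0], -1)
        return self.encoder(batch), None

    def decode_actions(self, hidden, lookup):
        # Map the (possibly recurrent) hidden state to action logits and a value estimate
        actions = [dec(hidden) for dec in self.decoders]
        return actions, self.value_head(hidden)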

That’s all you need to get started. The PufferLib repository contains full-length CleanRL scripts with PufferLib integration. Single-agent environments should work with SB3, and other integrations will be based on demand - so let us know what you want!

Vectorization

Our Multiprocessing backend is fast – much faster than Gymnasium’s in most cases. In our latest benchmark, Atari runs 50-60% faster synchronously and 5x faster asynchronously, and some environments, like NetHack, are up to 10x faster even in synchronous mode, with no API changes. PufferLib implements the following optimizations:

A Python implementation of EnvPool. Simulates more environments than are needed per batch and returns batches of observations as soon as they are ready. Requires using the async send/recv API instead of the sync step API (see the sketch after this list).

Multiple environments per worker. Important for fast environments.

Shared memory. Unlike Gymnasium’s implementation, we use a single buffer that is shared across environments.

Shared flags. Workers busy-wait on an unlocked flag instead of signaling via pipes or queues. This virtually eliminates interprocess communication overhead. Pipes are used once per episode to communicate aggregated infos.

Zero-copy batching. Because we use a single buffer for shared memory, we can return observations from contiguous subsets of workers without ever copying observations. The only exception is full-async mode.

Native multiagent support. It’s not an extra wrapper or slow bolt-on feature. PufferLib treats single-agent and multi-agent environments the same. API differences are handled at the emulation level.

Most of these optimizations are made possible by a hard dependency on PufferLib emulation, which means we do not need to handle structured data within the vectorization layer itself.
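For example, the EnvPool-style optimization above amounts to simulating more environments than you return per batch and collecting with send/recv. The num_workers and batch_size arguments in this sketch are assumptions based on recent PufferLib versions; check pufferlib.vector.make for the exact signature. Reusing the nethack_creator defined earlier:

import pufferlib.vector

# Simulate 12 environments on 6 workers, but only wait for the
# 8 fastest observations per batch (argument names are assumptions)
envs = pufferlib.vector.make(nethack_creator,
    backend=pufferlib.vector.Multiprocessing,
    num_envs=12, num_workers=6, batch_size=8)

envs.async_reset()
obs, rewards, terminals, truncateds, infos, env_id, mask = envs.recv()
# obs contains only the first batch of ready observations
envs.close()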

Libraries

PufferLib’s emulation layer adheres to the Gym and PettingZoo APIs: you can use it with any environment and learning library (subject to Limitations). The libraries and environments below are just the ones we’ve tested. We also provide additional tools to make them easier to work with.

PufferLib provides pufferlib.frameworks for the learning libraries below. These are short wrappers over your vanilla PyTorch policy that handle learning library API details for you. Additionally, if you split your forward function into an encode and decode portion, we can handle recurrence for you. This is the approach we use in our default policies.

CleanRL provides single-file RL implementations suited for 80+% of academic research. It was designed for simple environments like Atari, but with PufferLib, you can use it with just about anything.

Click to Demo PufferLib + CleanRL in Colab

Or view it on GitHub here

PufferLib also includes a heavily customized version of CleanRL PPO with support for recurrent and non-recurrent models, async environment execution, variable agent populations, self-play, and experiment management. This is the version we use for our research and the NeurIPS 2023 Neural MMO Competition. You can try it out here

Ray is a general purpose distributed computing framework that includes RLlib, an industry-grade reinforcement learning library.

We have previously supported RLlib and may again in the future. RLlib has not received updates in a while, and the current release is very buggy. We will update this if the situation improves.

Environments

PufferLib ships with Ocean, our first-party testing suite, which will let you catch 90% of implementation bugs in a 10 second training run. We also provide integrations for many environments out of the box. Non-pip dependencies are already set up for you in PufferTank. Several environments also include reasonable baseline policies. Join our Discord if you would like to add setup and tests for new environments or improvements to any of the baselines.
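A quick way to use the registry as a smoke test is to vectorize an environment and step it with random actions before launching a real training run. The sketch below assumes Ocean follows the same env_creator registry pattern shown earlier; the pufferlib.environments.ocean module path and the 'squared' task name are assumptions, so substitute any environment you have installed.

import numpy as np
import pufferlib.vector
import pufferlib.environments.ocean  # assumed module path; any registry env works

make_env = pufferlib.environments.ocean.env_creator('squared')  # assumed task name
envs = pufferlib.vector.make(make_env, backend=pufferlib.vector.Serial, num_envs=2)

obs, infos = envs.reset()
for _ in range(100):
    # Random actions are enough to surface shape, dtype, and API bugs
    actions = np.stack([envs.single_action_space.sample() for _ in range(len(obs))])
    obs, rewards, terminals, truncateds, infos = envs.step(actions)
envs.close()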

OpenAI Gym is the standard API for single-agent reinforcement learning environments. It also contains some built-in environments. We include Box2D in our registry.

Pokemon Red is one of the original Pokemon games for the Game Boy. This project uses the game as an environment for reinforcement learning. We are actively supporting development on this one!

PettingZoo is the standard API for multi-agent reinforcement learning environments. It also contains some built-in environments. We include Butterfly in our registry.

Arcade Learning Environment provides a Gym interface for classic Atari games. This is the most popular benchmark for reinforcement learning algorithms.

Minigrid is a 2D grid-world environment engine and a collection of built-in environments. It targets flexible and computationally efficient RL research.

MAgent is a platform for large-scale agent simulation.

Neural MMO is a massively multiagent environment for reinforcement learning. It combines large agent populations with high per-agent complexity and is the most actively maintained (by me) project on this list.

Procgen is a suite of arcade games for reinforcement learning with procedurally generated levels. It is one of the most computationally efficient environments on this list.

NetHack Learning Environment is a port of the classic game NetHack to the Gym API. It combines extreme complexity with high simulation efficiency.

MiniHack Learning Environment is a stripped down version of NetHack with support for level editing and custom procedural generation.

Crafter is a top-down 2D Minecraft clone for RL research. It provides pixel observations and relatively long time horizons.

Griddly is an extremely optimized platform for building reinforcement learning environments. It also includes a large suite of built-in environments.

Gym MicroRTS is a real time strategy engine for reinforcement learning research. The Java configuration is a bit finicky -- we're still debugging this.

Current Limitations

  • No continuous action spaces (planned for after 1.0)

  • Each agent must have the same observation and action space. This is true of most RL libraries and is hard to work around without sacrificing performance or simplicity.

License

PufferLib is free and open-source software under the MIT license.