EfficientZero Remastered

This repo contains the pre-trained models for EfficientZero Remastered, a Gigglebit Studios project to stabilize the training process for the state-of-the-art EfficientZero model.

Huge thanks to Stability AI for providing the compute for this project!


How to use these files

Download the model you want to evaluate, then run test.py against it.

Note: We've only productionized the training process. If you want to use these for inference in production, you'll need to write your own inference logic. If you do, send us a PR and we'll add it to the repo!

Files are labeled as follows:

{gym_env}-s{seed}-e{env_steps}-t{train_steps}

Where:

  • gym_env: The string ID of the gym environment this model was trained on. E.g. Breakout-v5
  • seed: The seed that was used to train this model. Usually 0.
  • env_steps: The total number of steps in the environment that this model observed, usually 100k.
  • train_steps: The total number of training epochs the model underwent.

Note that env_steps can differ from train_steps because the model can continue fine-tuning on its replay buffer after environment interaction stops. In the paper, the last 20k training epochs are done this way. This isn't necessary outside of benchmarks, and in theory better performance should be attainable by collecting more samples from the environment instead.
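The naming convention above can be parsed mechanically. Here's a minimal sketch of a parser; the helper name and the assumption that step counts appear either as plain digits or with a `k` suffix (e.g. `100k`) are ours, not part of this repo:

```python
import re

# Matches the {gym_env}-s{seed}-e{env_steps}-t{train_steps} convention.
# Step counts are assumed to be digits with an optional "k" suffix.
PATTERN = re.compile(
    r"^(?P<gym_env>.+)-s(?P<seed>\d+)-e(?P<env_steps>\d+k?)-t(?P<train_steps>\d+k?)$"
)

def parse_checkpoint_name(name: str) -> dict:
    """Split a checkpoint filename into its labeled components."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    return match.groupdict()

# Example: a Breakout model, seed 0, 100k env steps, 120k train steps.
info = parse_checkpoint_name("Breakout-v5-s0-e100k-t120k")
```

Because the regex anchors on the `-s`, `-e`, and `-t` markers, environment IDs that themselves contain hyphens and digits (like `Breakout-v5`) are still split correctly.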


Findings

Our primary goal in this project was to test out EfficientZero and see its capabilities. We were amazed by the model overall, especially on Breakout, where it far outperformed the human baseline. The overall cost was only about $50 per fully trained model, compared to the hundreds of thousands of dollars needed to train MuZero.

Though the trained models achieved impressive scores in Atari, they didn't reach the stellar scores demonstrated in the paper. This could be because we used different hardware and dependencies or because ML research papers tend to cherry-pick models and environments to showcase good results.

Additionally, the models tended to hit a performance wall between 75k and 100k env steps. While we don't have enough data to know why or how often this happens, it's not surprising: the model was tuned specifically for data efficiency and hasn't been tested at larger scales. A model like MuZero might be more appropriate if you have a large budget.

Training times were longer than those reported in the EfficientZero paper. The paper stated that a model could be trained to completion in 7 hours, while in practice we found it takes an A100 with 32 CPU cores one to two days. This is likely because the training process is more CPU-intensive than most models and therefore performs poorly on the low-frequency, many-core CPUs typically found in GPU clusters.
