PRIME

GitHub Repo

Introduction:

  • We present PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards, to advance the reasoning abilities of language models beyond imitation or distillation.
  • With PRIME, starting from Qwen2.5-Math-7B-Base, our trained model Eurus-2-7B-PRIME achieves 26.7% pass@1 on AIME 2024, surpassing GPT-4o and Qwen2.5-Math-7B-Instruct. We achieve this with only 1/10 of the data used by Qwen Math (230K SFT + 150K RL).
  • We also explore inference-time scaling and train EurusPRM, a SOTA-level math PRM, which pushes the boundary even further.
  • Work in Progress. All models and data released. Code coming soon!
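
The key idea behind implicit process rewards, which PRIME builds on, is that a reward model trained only on outcome labels, when parameterized as a log-probability ratio against a reference model, yields dense per-token process rewards for free. A minimal sketch of that computation, assuming the log-ratio parameterization r_t = β·(log π_θ(y_t) − log π_ref(y_t)); the function name, example values, and β setting here are illustrative, not taken from the released code:

```python
def implicit_process_rewards(policy_logprobs, ref_logprobs, beta=0.05):
    """Token-level implicit process rewards: r_t = beta * (log pi_theta - log pi_ref).

    Summing r_t over all tokens recovers beta * log(pi_theta(y) / pi_ref(y))
    for the full response, so a model trained on outcome-level labels alone
    implicitly provides per-token credit assignment.
    """
    return [beta * (lp - rp) for lp, rp in zip(policy_logprobs, ref_logprobs)]

# Illustrative log-probabilities for a 3-token response
policy_lp = [-0.5, -1.2, -0.8]
ref_lp = [-0.9, -1.0, -1.5]
rewards = implicit_process_rewards(policy_lp, ref_lp, beta=0.05)
```

These token-level rewards can then be fed to an online RL algorithm as dense process supervision, without ever collecting step-level human labels.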