
Achieving Full Parity: Fixing Core RL Misalignments Between vLLM V0 and V1

Tags: vLLM, Reinforcement Learning (RL), Inference Engine, Training/Inference Mismatch, logprobs, fp32 lm_head
May 06, 2026
Viqus Verdict: 7
Critical Engineering Protocol
Media Hype 3/10
Real Impact 7/10

Article Summary

This article documents a complex, multi-stage engineering effort to restore complete parity when migrating a critical Reinforcement Learning (RL) pipeline from vLLM V0 to vLLM V1. The discrepancy, dubbed a 'train-inference mismatch,' was found to stem from several sources: semantic differences (e.g., V1 returning raw logprobs instead of processed ones), inference-path mismatches (different runtime defaults for caching and scheduling), and inconsistencies in the weight-update process. The authors restored full parity by explicitly controlling these behaviors, including disabling prefix caching and matching weight-update semantics, and finally by ensuring the final projection used an fp32 lm_head, which brought the RL objective's trajectory back in line with the V0 reference.
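As a rough illustration of what "explicitly controlling" those runtime defaults can look like, the sketch below pins down the relevant vLLM engine options at rollout time. This is not the authors' exact setup: the model path and prompt are placeholders, and while `enable_prefix_caching` is a long-standing engine argument, `logprobs_mode` and `async_scheduling` only exist in recent vLLM releases, so treat their exact names as assumptions to verify against your installed version.

```python
# Minimal sketch: lock down V1 runtime behavior for train/inference parity.
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-policy-checkpoint",         # hypothetical checkpoint path
    enable_prefix_caching=False,          # V1 enables this by default; disable for parity
    logprobs_mode="processed_logprobs",   # return post-processing logprobs, not raw ones
    # async_scheduling=False,             # uncomment if your vLLM version exposes this arg
)

params = SamplingParams(temperature=1.0, max_tokens=128, logprobs=1)
outputs = llm.generate(["<prompt used by the RL rollout>"], params)

# Per-token logprobs for the sampled continuation, as consumed by the RL objective.
for token_logprobs in outputs[0].outputs[0].logprobs:
    print(token_logprobs)
```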

Key Points

  • The core challenge was a 'train-inference mismatch' in RL objectives, requiring the backend behavior to be matched perfectly before optimizing the RL process.
  • Several fixes were required, including changing the logprob output mode to 'processed_logprobs' and explicitly disabling default V1 features like prefix caching and async scheduling for parity.
  • The final and most crucial fix was forcing the use of an fp32 lm_head for the final projection, which closed the remaining numerical gap in the policy ratio and reward curves (see the sketch after this list).
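The idea behind the fp32 lm_head fix can be sketched as follows: keep the final projection and log-softmax in float32 even when the rest of the model runs in bf16, so training-side logprobs line up with an inference engine that does the same. This is a minimal sketch assuming a HuggingFace-style causal LM; the names `model`, `model.model`, and `lm_head` are illustrative, not taken from the article's code.

```python
# Minimal sketch: compute per-token logprobs with the final projection in fp32.
import torch

def fp32_logprobs(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities with the lm_head projection done in float32."""
    with torch.no_grad():
        hidden = model.model(input_ids).last_hidden_state   # bf16 transformer trunk
        lm_head_fp32 = model.lm_head.weight.float()          # upcast projection weights
        logits = hidden.float() @ lm_head_fp32.T              # fp32 matmul
        return torch.log_softmax(logits, dim=-1)              # fp32 log-softmax
```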

Why It Matters

For those building mission-critical, production-grade RLAIF/RLHF pipelines, this write-up is a vital technical blueprint. It serves as a necessary warning that migration between major AI infrastructure versions (like vLLM) is not a simple drop-in replacement. It stresses the extreme sensitivity of online RL systems to minor backend discrepancies—whether they relate to caching, numerical precision (the fp32 lm_head), or logprob handling. Professionals should treat any vLLM or similar inference engine upgrade with extreme caution, verifying every boundary condition and numerical detail, as these subtle bugs can lead to significant, hard-to-detect performance degradation in the trained model.
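One concrete way to "verify every numerical detail" is to compare per-token logprobs produced by the training stack against those returned by the rollout engine for the same tokens before trusting an upgrade. The sketch below assumes `trainer_logprobs` and `engine_logprobs` are aligned 1-D tensors for one sampled sequence; the tolerance is illustrative, not a value from the article.

```python
# Minimal sketch: a train/inference logprob parity check for one rollout sequence.
import torch

def check_parity(trainer_logprobs: torch.Tensor,
                 engine_logprobs: torch.Tensor,
                 atol: float = 1e-3) -> None:
    diff = (trainer_logprobs - engine_logprobs).abs()
    ratio = torch.exp(trainer_logprobs - engine_logprobs)  # per-token policy ratio
    print(f"max |delta logprob| = {diff.max():.3e}, "
          f"policy ratio range = [{ratio.min():.4f}, {ratio.max():.4f}]")
    if diff.max() > atol:
        raise AssertionError("train/inference logprob mismatch exceeds tolerance")
```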
