Co-evolving process-level supervision and policy within a shared in-distribution learning loop.
¹Eindhoven University of Technology • ²University of Liverpool • ³MIT-IBM Watson AI Lab
Large language models have emerged as powerful controllers for interactive agents, yet training them for reliable long-horizon decision making remains hard, largely because credit must be assigned under delayed, episodic rewards. We propose Q-Evolve, a self-evolving framework that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. Each iteration learns an in-distribution critic from a hybrid off-policy dataset that mixes expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, which yields dense, reliable supervision without environment backtracking or human annotation. Using these signals, we perform behavior-proximal policy optimization over the same data used for reward labeling, enabling iterative self-improvement without exacerbating distribution shift. On ALFWorld, WebShop, and ScienceWorld, Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance.
Policy, critic, and dataset co-evolve. Each single-step policy update stays strictly within a fixed hybrid off-policy dataset, while critic updates push the agent toward better long-horizon behavior.
An in-distribution critic is trained on expert demonstrations plus on-train agent trajectories, using a weighted Implicit Q-Learning objective that up-weights successful, near-terminal steps to stabilize Bellman backups.
Rather than training a standalone process reward model (PRM), Q-Evolve uses step-wise advantages from Generalized Advantage Estimation as process rewards, filling in missing intermediate signals with no backtracking or human labels.
Behavior-proximal policy optimization with a more permissive lower clip amplifies positive-advantage tokens and explicitly suppresses harmful ones — in-support updates that avoid distribution shift.
Existing PRM pipelines rely on costly manual labels or search-based rollouts over discretized states, and they break down under the distribution shift between where rewards are learned and where the policy actually operates. Q-Evolve generates and consumes step-wise supervision within the same distribution.
Initialize the policy $\pi_{\text{BC}}$ by imitating expert demonstrations with a negative log-likelihood objective.
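As a hedged illustration, the negative log-likelihood objective mentioned above can be written as follows. The notation is an assumption rather than the paper's own: $\theta$ denotes the policy parameters, $a_t$ the expert action at step $t$, and $h_t$ the interaction history (instruction, past observations, and past actions) that the policy conditions on.

$$
\mathcal{L}_{\text{BC}}(\theta) = -\,\mathbb{E}_{\tau\sim\mathcal{D}_{\text{expert}}}\left[\sum_{t=1}^{T}\log \pi_\theta(a_t \mid h_t)\right]
$$

Minimizing this loss yields the initial policy $\pi_{\text{BC}}$ that seeds the first round of self-generated rollouts.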
Form $\mathcal{D}=\mathcal{D}_{\text{expert}}\cup\mathcal{D}_{\text{self}}$ from expert demos and the agent's own rollouts, then relabel each step with rule-based auxiliary rewards (format errors −0.3, invalid actions −0.2, no-change −0.1) — no environment access needed.
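The relabeling step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the penalty values come from the text, but the `Step` fields, the `well_formed` and `valid` flags, the rule that at most one penalty applies per step, and the choice to store the result in a separate `aux_reward` field are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List

# Penalty magnitudes come from the text; all names below are hypothetical.
FORMAT_ERROR_PENALTY = -0.3
INVALID_ACTION_PENALTY = -0.2
NO_CHANGE_PENALTY = -0.1


@dataclass(frozen=True)
class Step:
    observation: str
    action: str
    next_observation: str
    well_formed: bool = True   # did the action follow the expected output format?
    valid: bool = True         # was the action accepted when it was logged?
    aux_reward: float = 0.0    # filled in by relabeling


def auxiliary_reward(step: Step) -> float:
    """Return the rule-based penalty for one logged step.

    Assumption: at most one penalty applies, checked from most to least severe.
    """
    if not step.well_formed:
        return FORMAT_ERROR_PENALTY
    if not step.valid:
        return INVALID_ACTION_PENALTY
    if step.next_observation == step.observation:
        return NO_CHANGE_PENALTY
    return 0.0


def build_hybrid_dataset(expert: List[List[Step]],
                         self_generated: List[List[Step]]) -> List[List[Step]]:
    """Merge expert and self-generated trajectories, relabeling every step offline."""
    return [
        [replace(step, aux_reward=auxiliary_reward(step)) for step in trajectory]
        for trajectory in expert + self_generated
    ]
```

Because every rule reads only fields already stored in the logged transition, the relabeling runs without stepping the environment again.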
Learn $V$ and $Q$ with a weighted Implicit Q-Learning objective that prioritizes informative supervision: higher weight on successful trajectories and later, near-terminal steps.
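To make the weighted objective concrete, here is a small pure-Python sketch. The expectile loss and the TD target $r + \gamma V(s')$ are the standard Implicit Q-Learning ingredients; the weighting function `step_weight`, its `success_bonus` and `decay` parameters, and the default values are assumptions chosen only to illustrate "more weight on successful trajectories and near-terminal steps".

```python
from typing import List, Tuple


def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2 used by IQL to fit V."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def step_weight(success: bool, t: int, horizon: int,
                success_bonus: float = 1.0, decay: float = 0.9) -> float:
    """Hypothetical weighting: larger for successful trajectories and later steps."""
    trajectory_weight = 1.0 + success_bonus if success else 1.0
    steps_to_end = horizon - 1 - t
    return trajectory_weight * decay ** steps_to_end


def weighted_value_loss(batch: List[Tuple[float, float, float]],
                        tau: float = 0.7) -> float:
    """batch holds (Q(s,a), V(s), weight); returns the weight-normalized expectile loss."""
    total_weight = sum(w for _, _, w in batch)
    return sum(w * expectile_loss(q - v, tau) for q, v, w in batch) / total_weight


def weighted_q_loss(batch: List[Tuple[float, float, float, float]],
                    gamma: float = 0.99) -> float:
    """batch holds (Q(s,a), reward, V(s'), weight); squared error to r + gamma * V(s')."""
    total_weight = sum(w for _, _, _, w in batch)
    return sum(w * (r + gamma * next_v - q) ** 2
               for q, r, next_v, w in batch) / total_weight
```

In a real training loop the scalar values would be network outputs and the two losses would be minimized by gradient descent; the sketch only shows how per-step weights enter each term.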
Compute step-wise advantages $A_t=\delta_t+\lambda\gamma A_{t+1}$ from the critic using only the environment's episodic reward, which keeps process rewards aligned with the true task objective.
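The recursion above can be sketched as a backward pass over one trajectory. The sketch assumes the standard TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, a reward that is zero everywhere except the final step, and a terminal value of zero; the function name and default hyperparameters are hypothetical.

```python
from typing import List


def gae_advantages(values: List[float], final_reward: float,
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Compute A_t = delta_t + lam * gamma * A_{t+1} for one episode.

    values[t] is the critic's estimate V(s_t); the episodic reward is
    credited only at the last step (assumption: V after termination is 0).
    """
    horizon = len(values)
    advantages = [0.0] * horizon
    next_advantage = 0.0
    for t in reversed(range(horizon)):
        reward = final_reward if t == horizon - 1 else 0.0
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        delta = reward + gamma * next_value - values[t]
        next_advantage = delta + lam * gamma * next_advantage
        advantages[t] = next_advantage
    return advantages


# Usage: a successful three-step episode with rising value estimates.
process_rewards = gae_advantages([0.2, 0.5, 0.9], final_reward=1.0)
```

With $\lambda = 0$ each advantage reduces to the one-step TD residual, while $\lambda = 1$ and $\gamma = 1$ recover the episodic return minus $V(s_t)$, so $\lambda$ trades off bias from the critic against variance from the sparse outcome.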
Update the policy with a clipped behavior-proximal objective using asymmetric clipping ($\epsilon_{\text{low}}>\epsilon_{\text{high}}$) to aggressively suppress negatively labeled actions while keeping increases conservative. Then refresh the data and re-enter the loop.
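The effect of asymmetric clipping can be seen in a per-token sketch of a PPO-style clipped surrogate. This is an illustration under assumptions, not the paper's exact loss: the ratio is taken against the behavior policy that produced the data, and the values `eps_low=0.3` and `eps_high=0.2` are hypothetical choices that merely satisfy $\epsilon_{\text{low}}>\epsilon_{\text{high}}$.

```python
import math


def clipped_surrogate(logp_new: float, logp_behavior: float, advantage: float,
                      eps_low: float = 0.3, eps_high: float = 0.2) -> float:
    """Per-token objective min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A).

    r is the probability ratio between the updated policy and the behavior
    policy that generated the token. The objective is maximized.
    """
    ratio = math.exp(logp_new - logp_behavior)
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# Usage: a token with negative advantage whose probability has dropped to 0.75x.
# With eps_low = 0.3 the ratio is still inside the clip range, so the gradient
# keeps pushing the probability down; a symmetric 0.2 clip would already stop.
value = clipped_surrogate(math.log(0.75), math.log(1.0), advantage=-1.0)
```

For positive advantages the objective flattens once the ratio exceeds $1+\epsilon_{\text{high}}$, which caps how far a single update can raise an action's probability, whereas the wider lower bound lets harmful actions be pushed down further before the clip takes effect.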
| Method | WebShop | ScienceWorld Seen | ScienceWorld Unseen | ALFWorld Seen | ALFWorld Unseen | Average |
|---|---|---|---|---|---|---|
| GPT-4 (ReAct) | 63.2 | 64.8 | 64.4 | 42.9 | 38.1 | 54.7 |
| GPT-3.5-Turbo | 62.4 | 16.5 | 13.0 | 7.9 | 10.5 | 22.1 |
| Reflexion | 64.2 | 60.3 | 64.4 | 45.7 | 55.2 | 58.0 |
| Base Agent (Llama-2-7B) | 17.9 | 3.8 | 3.1 | 0.0 | 0.0 | 5.0 |
| SFT | 63.1 | 67.4 | 53.0 | 60.0 | 67.2 | 62.1 |
| RFT | 63.6 | 71.6 | 54.3 | 62.9 | 66.4 | 63.8 |
| PPO | 64.2 | 59.4 | 51.7 | 22.1 | 29.1 | 45.3 |
| Best-of-N | 67.9 | 70.2 | 57.6 | 62.1 | 69.4 | 65.4 |
| ETO | 67.4 | 73.8 | 65.0 | 68.6 | 72.4 | 69.4 |
| DMPO | 70.1 | 72.4 | 61.7 | – | – | – |
| QLASS | 70.3 | 75.3 | 66.4 | 77.9 | 82.8 | 74.5 |
| Q-Evolve (ours) | 70.5 | 76.3 | 69.7 | 90.7 | 89.6 | 79.4 |
Q-Evolve achieves the best score in every column, with the largest gains on ALFWorld (+12.8 on the seen split and +6.8 on the unseen split over QLASS).
| Method | Env. Steps | Seen | Unseen |
|---|---|---|---|
| PPO | 320K | 59.4 | 67.7 |
| RLOO | 320K | 56.4 | 36.6 |
| GRPO | 320K | 39.7 | 32.2 |
| SFT | 0 | 74.9 | 62.3 |
| SFT + PPO | 320K | 72.6 | 77.6 |
| SFT + RLOO | 320K | 75.0 | 51.4 |
| SFT + GRPO | 320K | 66.7 | 74.1 |
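| Q-Evolve (1-iter, ours) | 13K | 88.6 | 87.3 |
Q-Evolve uses roughly 25× fewer environment steps (13K vs. 320K) while beating the strongest online-RL baseline by 13.6 points on the seen split (88.6 vs. 75.0) and 9.7 points on the unseen split (87.3 vs. 77.6).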
| Method | WebShop | ScienceWorld Seen | ScienceWorld Unseen | ALFWorld Seen | ALFWorld Unseen |
|---|---|---|---|---|---|
| SFT | 63.3 | 65.3 | 57.0 | 79.3 | 80.6 |
| ETO | 68.4 | 81.3 | 74.1 | 77.1 | 76.4 |
| KnowAgent | 64.8 | 81.7 | 69.6 | 80.0 | 74.9 |
| WKM | 66.9 | 82.1 | 76.5 | 77.5 | 78.2 |
| SFT + MPO | 65.5 | 70.2 | 65.9 | 80.7 | 81.3 |
| ETO + MPO | 70.2 | 83.4 | 80.8 | 85.0 | 79.1 |
| Q-Evolve (ours) | 71.1 | 86.4 | 82.4 | 89.6 | 90.3 |
Gains transfer across model families and scales — Q-Evolve leads on all tasks and both splits, confirming it is not tied to any particular backbone.
@inproceedings{zhang2026qevolve,
title = {Self-evolving LLM Agents with In-distribution Optimization},
author = {Zhang, Yudi and Fang, Meng and Chen, Zhenfang and Pechenizkiy, Mykola},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
series = {PMLR},
volume = {306},
year = {2026},
address = {Seoul, South Korea}
}