ICML 2026  ·  Seoul, South Korea
Q-Evolve

Self‑evolving LLM Agents with In‑distribution Optimization

Co-evolving process-level supervision and policy within a shared in-distribution learning loop.

¹Eindhoven University of Technology  •  ²University of Liverpool  •  ³MIT-IBM Watson AI Lab

Abstract

What is Q-Evolve?

Large Language Models have emerged as powerful controllers for interactive agents, yet training them for reliable long-horizon decision making remains hard; a core difficulty is credit assignment under delayed, episodic rewards. We propose Q-Evolve, a self-evolving framework that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. Each evolving iteration learns an in-distribution critic from a hybrid off-policy dataset that mixes expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function then derives step-wise process rewards through advantage estimation, yielding dense, reliable supervision without environment backtracking or human annotation. Using these signals, we perform behavior-proximal policy optimization over the same data used for reward labeling, enabling iterative self-improvement without exacerbating distribution shift. On ALFWorld, WebShop, and ScienceWorld, Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance.

Highlights

Key contributions

1. A closed self-evolving loop

Policy, critic, and dataset co-evolve. Each single-step policy update stays strictly within the hybrid off-policy dataset fixed for that iteration, while critic updates push the agent toward better long-horizon behavior.

2. Weighted IQL critic for sparse rewards

An in-distribution critic trained on expert demonstrations plus agent-generated trajectories, with a weighted Implicit Q-Learning objective that up-weights successful trajectories and later, near-terminal steps to stabilize Bellman backups.

3. Process rewards via GAE

Rather than a standalone PRM, step-wise advantages from Generalized Advantage Estimation serve as process rewards — "filling in" missing intermediate signals with no backtracking or human labels.

4. BPPO with asymmetric clipping

Behavior-proximal policy optimization with a more permissive lower clip amplifies positive-advantage tokens and explicitly suppresses harmful ones — in-support updates that avoid distribution shift.

Motivation

Why in-distribution?

Existing PRM pipelines rely on costly manual labels or search-based rollouts over discretized states, and they break down under the distribution shift between where rewards are learned and where the policy actually operates. Q-Evolve generates and consumes step-wise supervision within the same distribution.

Figure 1. Comparison of existing methods. Left: existing PRM methods rely on costly manual labels or search-based rollouts requiring discrete states, often failing under distribution shift between PRM training and policy improvement. Upper-mid: most online RL does not address episodic sparse rewards. Bottom-mid: our framework uses a hybrid off-policy dataset (expert + agent interaction) to derive rewards via Bellman backups, co-evolving reward supervision and policy in a shared in-distribution loop. Right: performance vs. environment steps required; Q-Evolve reaches a score of 90 on ALFWorld with ~20K steps, versus 600K for QLASS.

Method

The Q-Evolve framework

Figure 2. Framework of our self-evolving agent. We warm up the policy via behavior cloning, then iteratively optimize it through in-distribution evolving loops. Each loop builds a hybrid offline buffer (expert + self-collected trajectories), applies rule-based retrospective labeling, propagates rewards via Bellman backups to learn a max-Q surrogate, derives step-level GAE advantages, and updates the policy, with every update constrained to the in-distribution data of that iteration.

0. Warm up: Behavior Cloning

Initialize the policy $\pi_{\text{BC}}$ by imitating expert demonstrations with a negative log-likelihood objective.
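As a reference point, the warm-up objective can be written in the standard negative log-likelihood form below. The notation ($\pi_\theta$ for the trainable policy, $s_t$ and $a_t$ for the state and expert action at step $t$) is assumed here for illustration and is not taken from the page; $\pi_{\text{BC}}$ is then the policy obtained by minimizing it.

$$\mathcal{L}_{\text{BC}}(\theta)=-\,\mathbb{E}_{(s_t,a_t)\sim\mathcal{D}_{\text{expert}}}\big[\log \pi_\theta(a_t\mid s_t)\big]$$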

1. Hybrid data + retrospective labeling

Form $\mathcal{D}=\mathcal{D}_{\text{expert}}\cup\mathcal{D}_{\text{self}}$ from expert demos and the agent's own rollouts, then relabel each step with rule-based auxiliary rewards (format errors −0.3, invalid actions −0.2, no-change −0.1) — no environment access needed.
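The relabeling rule can be sketched as a pure function over logged steps. This is a minimal illustration, not the authors' code: the `Step` fields, the `format_ok` / `action_valid` flags, and the first-match precedence between rules are assumptions; only the three penalty values come from the text above.

```python
from dataclasses import dataclass

# Penalty values quoted in the text; the key names are hypothetical.
PENALTIES = {"format_error": -0.3, "invalid_action": -0.2, "no_change": -0.1}


@dataclass
class Step:
    observation: str
    action: str
    next_observation: str
    format_ok: bool      # assumed flag: did the model output parse correctly?
    action_valid: bool   # assumed flag: was the action accepted as valid?


def auxiliary_reward(step: Step) -> float:
    """Return the rule-based auxiliary reward for one logged step.

    Rules are checked in a fixed order and the first match wins
    (an assumption; the page does not say how rules combine).
    """
    if not step.format_ok:
        return PENALTIES["format_error"]
    if not step.action_valid:
        return PENALTIES["invalid_action"]
    if step.next_observation == step.observation:
        return PENALTIES["no_change"]
    return 0.0


def relabel(trajectory: list[Step]) -> list[float]:
    """Relabel a stored trajectory without touching the environment."""
    return [auxiliary_reward(step) for step in trajectory]
```

Because the function reads only logged fields, it can be applied to expert and self-collected trajectories alike, which is what makes the labeling retrospective.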

2. In-distribution critic: Weighted IQL

Learn $V$ and $Q$ with a weighted Implicit Q-Learning objective that prioritizes informative supervision: higher weight on successful trajectories and later, near-terminal steps.
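To make the weighting idea concrete, here is a small sketch of per-trajectory IQL losses with a sample weight. The expectile value loss and the one-step Q target follow the standard Implicit Q-Learning formulation; the weight function (`success_bonus`, `decay`) and all default values are hypothetical stand-ins, since the page does not give the exact weighting scheme.

```python
def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared error |tau - 1(diff < 0)| * diff^2 used by IQL."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def step_weight(t: int, horizon: int, success: bool,
                success_bonus: float = 2.0, decay: float = 0.9) -> float:
    """Hypothetical weight: larger for successful trajectories and for
    steps closer to the end of the episode."""
    recency = decay ** (horizon - 1 - t)
    return (success_bonus if success else 1.0) * recency


def weighted_iql_losses(q, v, v_next, rewards, success,
                        gamma: float = 0.99, tau: float = 0.7):
    """Return (value_loss, q_loss) for one trajectory.

    q[t]      : Q(s_t, a_t)
    v[t]      : V(s_t)
    v_next[t] : V(s_{t+1}), with 0.0 at the terminal step
    """
    horizon = len(rewards)
    v_loss = q_loss = total_w = 0.0
    for t in range(horizon):
        w = step_weight(t, horizon, success)
        # V regresses toward an upper expectile of Q over dataset actions.
        v_loss += w * expectile_loss(q[t] - v[t], tau)
        # Q regresses toward the one-step Bellman target built from V.
        target = rewards[t] + gamma * v_next[t]
        q_loss += w * (target - q[t]) ** 2
        total_w += w
    return v_loss / total_w, q_loss / total_w
```

Both losses only ever evaluate actions that appear in the dataset, which is the sense in which the critic stays in-distribution.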

3. Process rewards via GAE

Compute step-wise advantages $A_t=\delta_t+\lambda\gamma A_{t+1}$ from the critic, using environmental episodic reward only — keeping process rewards aligned with the true task objective.
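The recursion above can be unrolled backward over a trajectory as in the sketch below. It assumes the usual GAE definition of the TD residual, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which the page does not spell out, and the default `gamma` and `lam` values are placeholders rather than the paper's settings.

```python
def gae_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """Compute A_t = delta_t + gamma * lam * A_{t+1} for one trajectory.

    rewards : environment rewards r_0 .. r_{T-1}
    values  : critic values V(s_0) .. V(s_T); use 0.0 for a terminal s_T
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0  # A_{t+1}, zero beyond the final step
    for t in reversed(range(horizon)):
        # Assumed standard TD residual from the learned value function.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With a sparse terminal reward, the backward pass is what spreads the final outcome to earlier steps: each intermediate step receives a non-zero signal through the critic's values even though its own environment reward is zero.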

4. In-distribution policy learning: BPPO

Update the policy with a clipped behavior-proximal objective using asymmetric clipping ($\epsilon_{\text{low}}>\epsilon_{\text{high}}$) to aggressively suppress negatively-labeled actions while keeping increases conservative. Then refresh the data and re-enter the loop.

$$\mathcal{L}_\pi(\theta)=\mathbb{E}_{\mathcal{D}}\Big[\min\big(\eta_t A_t,\ \mathrm{clip}(\eta_t,1-\epsilon_{\text{low}},1+\epsilon_{\text{high}})A_t\big)\Big]+\alpha\,\mathrm{KL}(\pi_\theta\,\|\,\pi_{\text{ref}}),\qquad \epsilon_{\text{low}}>\epsilon_{\text{high}}$$
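A per-token sketch of the asymmetric clip may help show why a wider lower bound matters. Several things here are assumptions rather than details from the page: $\eta_t$ is taken to be the probability ratio between the current policy and the behavior policy that produced the data, the clip widths 0.3 and 0.2 are placeholders chosen only to satisfy $\epsilon_{\text{low}}>\epsilon_{\text{high}}$, the KL term is passed in as a precomputed number, and the combined quantity is written as a loss to minimize (negated surrogate plus penalty).

```python
import math


def clipped_surrogate(logp_new: float, logp_behavior: float, advantage: float,
                      eps_low: float = 0.3, eps_high: float = 0.2) -> float:
    """min(eta * A, clip(eta, 1 - eps_low, 1 + eps_high) * A) for one token."""
    eta = math.exp(logp_new - logp_behavior)  # assumed importance ratio
    clipped = min(max(eta, 1.0 - eps_low), 1.0 + eps_high)
    return min(eta * advantage, clipped * advantage)


def policy_loss(surrogates, kl_to_ref: float, alpha: float = 0.01) -> float:
    """Loss to minimize: negated mean surrogate plus a KL penalty.

    kl_to_ref is assumed to be an externally computed estimate of
    KL(pi_theta || pi_ref); the sign convention is this sketch's choice.
    """
    mean_surrogate = sum(surrogates) / len(surrogates)
    return -mean_surrogate + alpha * kl_to_ref
```

For a negative advantage the surrogate stops changing once the ratio drops below $1-\epsilon_{\text{low}}$, so a larger $\epsilon_{\text{low}}$ lets the probability of a harmful token be pushed further down before the gradient vanishes. For a positive advantage the gain is capped at $1+\epsilon_{\text{high}}$, which keeps increases conservative.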
Results

State-of-the-art across three benchmarks

Table 2. Performance on WebShop, ScienceWorld, and ALFWorld (seen / unseen splits), base model Llama-2-7B-Chat. Bold = best.
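| Method | WebShop | SciW Seen | SciW Unseen | ALFW Seen | ALFW Unseen | Average |
|---|---|---|---|---|---|---|
| GPT-4 (ReAct) | 63.2 | 64.8 | 64.4 | 42.9 | 38.1 | 54.7 |
| GPT-3.5-Turbo | 62.4 | 16.5 | 13.0 | 7.9 | 10.5 | 22.1 |
| Reflexion | 64.2 | 60.3 | 64.4 | 45.7 | 55.2 | 58.0 |
| Base Agent (Llama-2-7B) | 17.9 | 3.8 | 3.1 | 0.0 | 0.0 | 5.0 |
| SFT | 63.1 | 67.4 | 53.0 | 60.0 | 67.2 | 62.1 |
| RFT | 63.6 | 71.6 | 54.3 | 62.9 | 66.4 | 63.8 |
| PPO | 64.2 | 59.4 | 51.7 | 22.1 | 29.1 | 45.3 |
| Best-of-N | 67.9 | 70.2 | 57.6 | 62.1 | 69.4 | 65.4 |
| ETO | 67.4 | 73.8 | 65.0 | 68.6 | 72.4 | 69.4 |
| DMPO | 70.1 | 72.4 | 61.7 | | | |
| QLASS | 70.3 | 75.3 | 66.4 | 77.9 | 82.8 | 74.5 |
| Q-Evolve (ours) | **70.5** | **76.3** | **69.7** | **90.7** | **89.6** | **79.4** |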

Q-Evolve achieves the best score on every column — with the largest gains on the hardest ALFWorld splits (+12.8 / +6.8 over QLASS).

Table 5. Sample efficiency on ALFWorld (Qwen2.5-7B-Instruct). All online-RL baselines use 320K env steps. Bold = best.
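| Method | Env. Steps | Seen | Unseen |
|---|---|---|---|
| PPO | 320K | 59.4 | 67.7 |
| RLOO | 320K | 56.4 | 36.6 |
| GRPO | 320K | 39.7 | 32.2 |
| SFT | 0 | 74.9 | 62.3 |
| SFT + PPO | 320K | 72.6 | 77.6 |
| SFT + RLOO | 320K | 75.0 | 51.4 |
| SFT + GRPO | 320K | 66.7 | 74.1 |
| Q-Evolve (1-iter, ours) | 13K | **88.6** | **87.3** |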

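~25× fewer environment steps (13K vs. 320K), while beating the strongest baseline on each split: +13.6 on seen tasks (88.6 vs. 75.0 for SFT + RLOO) and +9.7 on unseen tasks (87.3 vs. 77.6 for SFT + PPO).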

Table 6. Generalization across model families — Llama-3-8B-Instruct. Bold = best.
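| Method | WebShop | SciW Seen | SciW Unseen | ALFW Seen | ALFW Unseen |
|---|---|---|---|---|---|
| SFT | 63.3 | 65.3 | 57.0 | 79.3 | 80.6 |
| ETO | 68.4 | 81.3 | 74.1 | 77.1 | 76.4 |
| KnowAgent | 64.8 | 81.7 | 69.6 | 80.0 | 74.9 |
| WKM | 66.9 | 82.1 | 76.5 | 77.5 | 78.2 |
| SFT + MPO | 65.5 | 70.2 | 65.9 | 80.7 | 81.3 |
| ETO + MPO | 70.2 | 83.4 | 80.8 | 85.0 | 79.1 |
| Q-Evolve (ours) | **71.1** | **86.4** | **82.4** | **89.6** | **90.3** |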

Gains transfer across model families and scales — Q-Evolve leads on all tasks and both splits, confirming it is not tied to any particular backbone.

Cite

BibTeX

@inproceedings{zhang2026qevolve,
  title     = {Self-evolving LLM Agents with In-distribution Optimization},
  author    = {Zhang, Yudi and Fang, Meng and Chen, Zhenfang and Pechenizkiy, Mykola},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  series    = {PMLR},
  volume    = {306},
  year      = {2026},
  address   = {Seoul, South Korea}
}