Q-Evolve: Self-evolving LLM Agents with In-distribution Optimization

Abstract

What is Q-Evolve?

Large Language Models have emerged as powerful controllers for interactive agents, yet training them for reliable long-horizon decision making remains hard; a core difficulty is credit assignment under delayed, episodic rewards. We propose Q-Evolve, a self-evolving framework that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. Each evolving iteration learns an in-distribution critic from a hybrid off-policy dataset that mixes expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function then derives step-wise process rewards through advantage estimation, providing dense, reliable supervision without environment backtracking or human annotation. Using these signals, we perform behavior-proximal policy optimization over the same data used for reward labeling, enabling iterative self-improvement without exacerbating distribution shift. On ALFWorld, WebShop, and ScienceWorld, Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance.

Highlights

Key contributions

A closed self-evolving loop

Policy, critic, and dataset co-evolve. Each policy update step stays strictly within the current iteration's hybrid off-policy dataset, while critic updates push the agent toward better long-horizon behavior.

Weighted IQL critic for sparse rewards

An in-distribution critic trained on expert demonstrations plus the agent's own rollouts, with a weighted Implicit Q-Learning objective that up-weights successful, near-terminal steps to stabilize Bellman backups.

Process rewards via GAE

Rather than a standalone PRM, step-wise advantages from Generalized Advantage Estimation serve as process rewards — "filling in" missing intermediate signals with no backtracking or human labels.

BPPO with asymmetric clipping

Behavior-proximal policy optimization with a more permissive lower clip amplifies positive-advantage tokens and explicitly suppresses harmful ones — in-support updates that avoid distribution shift.

Motivation

Why in-distribution?

Existing PRM pipelines rely on costly manual labels or search-based rollouts over discretized states, and they break down under the distribution shift between where rewards are learned and where the policy actually operates. Q-Evolve generates and consumes step-wise supervision within the same distribution.

Method

The Q-Evolve framework

**Figure 2. Framework of our self-evolving agent.** We warm up the policy via behavior cloning, then iteratively optimize it through in-distribution evolving loops. Each loop builds a hybrid offline buffer (expert + self-collected trajectories), applies rule-based retrospective labeling, propagates rewards via Bellman backups to learn a max-Q surrogate, derives step-level GAE advantages, and updates the policy. Every update stays constrained to the in-distribution data of that iteration.

Warm up — Behavior Cloning

Initialize the policy $\pi_{\text{BC}}$ by imitating expert demonstrations with a negative log-likelihood objective.
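As a reference point, the negative log-likelihood objective is usually written as below. This is the standard behavior-cloning form rather than a formula quoted from the paper; the history symbol $h_t$ (the interaction context up to step $t$) and the parameter symbol $\theta$ are notational assumptions. For an LLM policy, $\log\pi_\theta(a_t\mid h_t)$ decomposes into a sum of token log-probabilities over the action text.

$$\mathcal{L}_{\text{BC}}(\theta)=-\,\mathbb{E}_{(h_t,a_t)\sim\mathcal{D}_{\text{expert}}}\big[\log\pi_\theta(a_t\mid h_t)\big]$$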

Hybrid data + retrospective labeling

Form $\mathcal{D}=\mathcal{D}_{\text{expert}}\cup\mathcal{D}_{\text{self}}$ from expert demos and the agent's own rollouts, then relabel each step with rule-based auxiliary rewards (format errors −0.3, invalid actions −0.2, no-change −0.1) — no environment access needed.
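The relabeling rule above can be sketched as a pure function over logged steps. This is an illustrative sketch, not the authors' code: the `Step` fields, the way each failure type is detected, and the "first matching rule wins" ordering are all assumptions. Only the three penalty values come from the text.

```python
from dataclasses import dataclass

# Penalty values quoted in the text above.
FORMAT_ERROR_PENALTY = -0.3
INVALID_ACTION_PENALTY = -0.2
NO_CHANGE_PENALTY = -0.1


@dataclass
class Step:
    """One logged step. All field names are hypothetical."""
    observation: str       # observation before acting
    action: str            # raw model output
    next_observation: str  # observation logged after acting
    well_formed: bool      # did the output parse into an action?
    valid: bool            # was the action accepted when it was executed?


def auxiliary_reward(step: Step) -> float:
    """Return the rule-based penalty for a logged step, or 0.0 if no rule fires.

    Assumption: rules are checked in order and only the first match counts.
    """
    if not step.well_formed:
        return FORMAT_ERROR_PENALTY
    if not step.valid:
        return INVALID_ACTION_PENALTY
    if step.next_observation == step.observation:
        return NO_CHANGE_PENALTY
    return 0.0


def relabel(trajectory: list[Step]) -> list[float]:
    """Label every step of a stored trajectory without touching the environment."""
    return [auxiliary_reward(step) for step in trajectory]
```

Because the function reads only logged fields, it can be applied to expert and self-collected trajectories alike after collection.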

In-distribution critic — Weighted IQL

Learn $V$ and $Q$ with a weighted Implicit Q-Learning objective that prioritizes informative supervision: higher weight on successful trajectories and later, near-terminal steps.
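A minimal sketch of what such a weighted objective might look like, assuming the standard IQL recipe: an expectile loss that regresses $V$ toward $Q$, and a one-step TD target for $Q$. The weighting function `sample_weight`, its two bonus parameters, the default `tau` and `gamma`, and the batch dictionary keys are hypothetical; the text states only that successful trajectories and later steps receive higher weight. Target networks and stop-gradients are omitted.

```python
def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2 used by IQL."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def sample_weight(t: int, horizon: int, success: bool,
                  success_bonus: float = 1.0, position_bonus: float = 1.0) -> float:
    """Hypothetical weighting: larger for successful episodes and for later steps."""
    position = (t + 1) / horizon  # in (0, 1]; equals 1 at the final step
    return 1.0 + success_bonus * float(success) + position_bonus * position


def weighted_iql_losses(batch, tau: float = 0.7, gamma: float = 0.99):
    """Return (value_loss, q_loss) as weight-normalized averages over the batch.

    Each item is a dict with keys: q, v, v_next, reward, done, weight.
    """
    total_weight = sum(item["weight"] for item in batch)
    v_loss = 0.0
    q_loss = 0.0
    for item in batch:
        # V is pushed toward an upper expectile of Q over in-dataset actions.
        v_loss += item["weight"] * expectile_loss(item["q"] - item["v"], tau)
        # Q is regressed onto a one-step target that bootstraps from V.
        bootstrap = 0.0 if item["done"] else item["v_next"]
        target = item["reward"] + gamma * bootstrap
        q_loss += item["weight"] * (target - item["q"]) ** 2
    return v_loss / total_weight, q_loss / total_weight
```

With `tau` above 0.5, positive differences `q - v` are penalized more than negative ones, so $V$ tracks the better in-dataset actions without querying actions outside the data.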

Process rewards via GAE

Compute step-wise advantages $A_t=\delta_t+\lambda\gamma A_{t+1}$ from the critic, using environmental episodic reward only — keeping process rewards aligned with the true task objective.
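The recursion above can be sketched as a single backward pass over a trajectory. This sketch assumes the standard GAE temporal-difference residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which the text does not spell out; the function name, argument layout, and default `gamma` and `lam` values are illustrative.

```python
def gae_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """Compute A_t = delta_t + gamma * lam * A_{t+1} by iterating backward.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the final state (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # Assumed standard TD residual from the critic's state values.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages
```

For an episodic task, `rewards` would be zero everywhere except the final step, so every intermediate advantage is driven by how the critic's value estimate changes from one step to the next.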

In-distribution policy learning — BPPO

Update the policy with a clipped behavior-proximal objective using asymmetric clipping ($\epsilon_{\text{low}}>\epsilon_{\text{high}}$) to aggressively suppress negatively-labeled actions while keeping increases conservative. Then refresh the data and re-enter the loop.

$$\mathcal{L}_\pi(\theta)=-\,\mathbb{E}_{\mathcal{D}}\Big[\min\big(\eta_t A_t,\ \mathrm{clip}(\eta_t,1-\epsilon_{\text{low}},1+\epsilon_{\text{high}})A_t\big)\Big]+\alpha\,\mathrm{KL}(\pi_\theta\,\|\,\pi_{\text{ref}}),\qquad \epsilon_{\text{low}}>\epsilon_{\text{high}}$$
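To make the effect of asymmetric clipping concrete, the sketch below evaluates only the clipped surrogate term, treated as a quantity to maximize. It assumes $\eta_t$ is the probability ratio between the updated policy and the behavior policy that produced the data. The clip values 0.3 and 0.2 are placeholders chosen to satisfy $\epsilon_{\text{low}}>\epsilon_{\text{high}}$, not values from the paper, and the KL term is omitted.

```python
import math


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def bppo_surrogate(logp_new, logp_behavior, advantages,
                   eps_low: float = 0.3, eps_high: float = 0.2) -> float:
    """Mean clipped surrogate over a batch (higher is better).

    eps_low > eps_high lets the ratio fall further before clipping, so
    negative-advantage actions can be pushed down more than positive ones
    can be pushed up.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)  # assumed meaning of eta_t
        clipped = clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With these placeholder values, a negative-advantage action keeps contributing gradient until its ratio falls to 0.7, whereas a positive-advantage action stops contributing once its ratio reaches 1.2.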

Results

State-of-the-art across three benchmarks

**Table 2.** Performance on WebShop, ScienceWorld, and ALFWorld (seen / unseen splits), base model Llama-2-7B-Chat. **Bold** = best.
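| Method | WebShop | SciW Seen | SciW Unseen | ALFW Seen | ALFW Unseen | Average |
|---|---|---|---|---|---|---|
| GPT-4 (ReAct) | 63.2 | 64.8 | 64.4 | 42.9 | 38.1 | 54.7 |
| GPT-3.5-Turbo | 62.4 | 16.5 | 13.0 | 7.9 | 10.5 | 22.1 |
| Reflexion | 64.2 | 60.3 | 64.4 | 45.7 | 55.2 | 58.0 |
| Base Agent (Llama-2-7B) | 17.9 | 3.8 | 3.1 | 0.0 | 0.0 | 5.0 |
| SFT | 63.1 | 67.4 | 53.0 | 60.0 | 67.2 | 62.1 |
| RFT | 63.6 | 71.6 | 54.3 | 62.9 | 66.4 | 63.8 |
| PPO | 64.2 | 59.4 | 51.7 | 22.1 | 29.1 | 45.3 |
| Best-of-N | 67.9 | 70.2 | 57.6 | 62.1 | 69.4 | 65.4 |
| ETO | 67.4 | 73.8 | 65.0 | 68.6 | 72.4 | 69.4 |
| DMPO | 70.1 | 72.4 | 61.7 | – | – | – |
| QLASS | 70.3 | 75.3 | 66.4 | 77.9 | 82.8 | 74.5 |
| Q-Evolve (ours) | **70.5** | **76.3** | **69.7** | **90.7** | **89.6** | **79.4** |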

Q-Evolve achieves the best score on every column — with the largest gains on the hardest ALFWorld splits (+12.8 / +6.8 over QLASS).

**Table 5.** Sample efficiency on ALFWorld (Qwen2.5-7B-Instruct). All online-RL baselines use 320K env steps. **Bold** = best.
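| Method | Env. Steps | Seen | Unseen |
|---|---|---|---|
| PPO | 320K | 59.4 | 67.7 |
| RLOO | 320K | 56.4 | 36.6 |
| GRPO | 320K | 39.7 | 32.2 |
| SFT | 0 | 74.9 | 62.3 |
| SFT + PPO | 320K | 72.6 | 77.6 |
| SFT + RLOO | 320K | 75.0 | 51.4 |
| SFT + GRPO | 320K | 66.7 | 74.1 |
| Q-Evolve (1-iter) (ours) | 13K | **88.6** | **87.3** |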

~25× fewer environment steps (13K vs. 320K) while beating every online-RL baseline by a wide margin.

**Table 6.** Generalization across model families — Llama-3-8B-Instruct. **Bold** = best.
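| Method | WebShop | SciW Seen | SciW Unseen | ALFW Seen | ALFW Unseen |
|---|---|---|---|---|---|
| SFT | 63.3 | 65.3 | 57.0 | 79.3 | 80.6 |
| ETO | 68.4 | 81.3 | 74.1 | 77.1 | 76.4 |
| KnowAgent | 64.8 | 81.7 | 69.6 | 80.0 | 74.9 |
| WKM | 66.9 | 82.1 | 76.5 | 77.5 | 78.2 |
| SFT + MPO | 65.5 | 70.2 | 65.9 | 80.7 | 81.3 |
| ETO + MPO | 70.2 | 83.4 | 80.8 | 85.0 | 79.1 |
| Q-Evolve (ours) | **71.1** | **86.4** | **82.4** | **89.6** | **90.3** |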

Gains transfer across model families: Q-Evolve leads on all tasks and both splits with Llama-3-8B-Instruct as well, indicating it is not tied to any particular backbone.

Cite

BibTeX

@inproceedings{zhang2026qevolve,
  title     = {Self-evolving LLM Agents with In-distribution Optimization},
  author    = {Zhang, Yudi and Fang, Meng and Chen, Zhenfang and Pechenizkiy, Mykola},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  series    = {PMLR},
  volume    = {306},
  year      = {2026},
  address   = {Seoul, South Korea}
}