AgentsMeetRL is an awesome list that summarizes open-source repositories for training LLM Agents using reinforcement learning:
- 🤖 Criterion for inclusion: an agent project must feature at least one of multi-turn interaction or tool use (so Tool-Integrated Reasoning, TIR, projects are in scope).
- ⚠️ This project is built on code analysis of open-source repositories by LLM coding agents, which may produce unfaithful results. Although manually reviewed, omissions may remain. If you find any errors, please let us know through issues or PRs - we warmly welcome them!
- 🚀 We particularly focus on the reinforcement learning frameworks, RL algorithms, rewards, and environments each project depends on, as a reference for how these excellent open-source projects make their technical choices. See [Click to view technical details] under each table.
- 📅 Last updated: 2026-03-24
- 🤗 Feel free to submit your own projects anytime - we welcome contributions!
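The inclusion criterion above (multi-turn interaction or tool use) boils down to a rollout loop in which the policy LLM alternates between generating text and invoking tools until it commits to an answer. A minimal, hypothetical sketch of that loop (none of the names below come from any listed repo; a real system would replace the toy policy with an LLM call):

```python
def run_calculator(expression: str) -> str:
    """Hypothetical tool: evaluate an arithmetic expression (no builtins exposed)."""
    return str(eval(expression, {"__builtins__": {}}))

def rollout(policy, question: str, max_turns: int = 4):
    """Collect one trajectory of (observation, action) pairs plus the final answer."""
    history = [question]
    trajectory = []
    for _ in range(max_turns):
        action = policy("\n".join(history))      # LLM generation in a real system
        trajectory.append((list(history), action))
        if action.startswith("<tool>"):          # tool call -> execute and observe
            result = run_calculator(action[len("<tool>"):])
            history += [action, f"<obs>{result}"]
        else:                                    # plain text -> final answer
            return trajectory, action
    return trajectory, None

# Toy deterministic "policy" standing in for the LLM:
def toy_policy(context: str) -> str:
    return "<tool>2*21" if "<obs>" not in context else "The answer is 42."

traj, answer = rollout(toy_policy, "What is 2*21?")  # two turns: tool call, then answer
```

RL training then scores the trajectory with a reward (see the reward-type enumeration below) and updates the policy.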
Taxonomy:
- Base Framework: General-purpose RL training frameworks for LLM agents (e.g., veRL, OpenRLHF, trl)
- General/MultiTask: Agent systems trained/evaluated across multiple tasks or environments
- Search & RAG: Search-augmented reasoning agents that use retrieval tools to enhance LLM reasoning
- Web & GUI: Agents that interact with web browsers, mobile/desktop GUIs, or operating systems
- Tool-Use: Agents trained to invoke external tools (APIs, code executors, MCP, etc.)
- Code & SWE: Software engineering and code generation agents
- Reasoning: Reasoning agents with tool-integrated or multi-turn reasoning (math, QA, visual)
- Multi-Agent RL: Multi-agent collaboration, negotiation, or credit assignment via RL
- Memory: Agents that learn to manage, retrieve, or evolve memory
- Embodied: Agents operating in embodied/physical simulation environments
- Domain-Specific: RL agents for specialized domains (medical, OS tuning, etc.)
- Reward & Training: Process/outcome reward models and training methodologies for agents
- Safety: RL for agent safety alignment, adversarial red-teaming, and jailbreak defense/attack
- VLM Agent: Vision-language model agents trained with RL for multimodal interaction
- Self-Evolution: Agents that self-evolve via RL feedback loops (⚠️ definition still evolving in the community)
- Environment: Benchmarks, gyms, and sandbox environments for agent training/evaluation
Enumerations:
- Reward Type:
- External Verifier: e.g., a compiler or math solver
- Rule-Based: e.g., a LaTeX parser with exact match scoring
- Model-Based: e.g., a trained verifier LLM or reward LLM
- Custom
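A hedged sketch of what these reward types look like in code (function names and signatures are illustrative, not taken from any listed project):

```python
import re
import subprocess
import sys

def rule_based_reward(completion: str, gold: str) -> float:
    """Rule-Based: parse a \\boxed{...} answer and score exact match."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

def external_verifier_reward(code: str, test: str) -> float:
    """External Verifier: run candidate code against a test in a subprocess."""
    proc = subprocess.run([sys.executable, "-c", code + "\n" + test],
                          capture_output=True, timeout=10)
    return 1.0 if proc.returncode == 0 else 0.0

def model_based_reward(completion: str, judge) -> float:
    """Model-Based: ask a judge LLM (stubbed as a callable here) for a scalar score."""
    return judge(completion)
```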
- 📢 2026-03 Update: Restructured the taxonomy from 12 to 16 categories and added ~70 new repositories covering Sep 2025 to Mar 2026. New categories: Multi-Agent RL, Reward & Training, Safety, VLM Agent, Self-Evolution, and Domain-Specific. The old GUI and Web categories were merged into Web & GUI, and TextGame and Biomedical were retired as standalone categories. Total repos grew from ~134 to 205.
## Base Framework
| Github Repo | Date | Org | Paper Link |
|---|---|---|---|
| Open-AgentRL | 2026.2 | Gen-Verse | Paper | |
| OpenClaw-RL | 2026.3 | Gen-Verse | Paper | |
| Claw-R1 | 2026.3 | USTC | -- | |
| prime-rl | 2025.2 | Prime Intellect | -- | |
| NeMo-RL | 2026.1 | NVIDIA | -- | |
| RLinf | 2025.8 | Tsinghua/Infinigence AI/PKU | Paper | |
| siiRL | 2025.7 | Shanghai Innovation Institute | Paper | |
| slime | 2025.6 | Tsinghua University (THUDM) | blog | |
| agent-lightning | 2025.6 | Microsoft Research | Paper | |
| AReaL | 2025.6 | AntGroup/Tsinghua | Paper | |
| ROLL | 2025.6 | Alibaba | Paper | |
| MARTI | 2025.5 | Tsinghua | -- | |
| RL2 | 2025.4 | Accio | -- | |
| verifiers | 2025.3 | Individual | -- | |
| oat | 2024.11 | NUS/Sea AI | Paper | |
| veRL | 2024.10 | ByteDance | Paper | |
| OpenRLHF | 2023.7 | OpenRLHF | Paper | |
| trl | 2019.11 | HuggingFace | -- |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Open-AgentRL | GRPO-TCR | Single | Both | Multi | Reasoning/GUI/Coding | Model (PRM) | Yes (SandboxFusion) |
| OpenClaw-RL | GRPO/OPD | Both | Both | Multi | Terminal/GUI/SWE/Tool-call | Model/External | Yes |
| Claw-R1 | Generic RL Framework | Multi | Both | Multi | General Agent | All | Yes (Framework-agnostic) |
| prime-rl | GRPO/PPO | Multi | Outcome | Multi | Math/Code/Search | Model/External | Yes |
| NeMo-RL | GRPO/DAPO/GDPO/DPO | Single | Outcome | Multi | Math/Reasoning/Code | Rule/External | No |
| RLinf | PPO/GRPO/DAPO/SAC/REINFORCE++/CrossQ/RLPD | Both | Both | Multi | Robotics/Math/Code/QA/VQA | All (Rule/Model/External) | Yes |
| siiRL | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS PostTraining | Model/Rule | Planned |
| slime | GRPO/GSPO/REINFORCE++ | Single | Both | Both | Math/Code | External Verifier | Yes |
| agent-lightning | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes |
| AReaL | PPO | Both | Outcome | Both | Math/Code | External | Yes |
| ROLL | PPO/GRPO/Reinforce++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes |
| MARTI | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes |
| RL2 | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes |
| verifiers | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code |
| oat | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No |
| veRL | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes |
| OpenRLHF | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes |
| trl | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |
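Nearly every framework above supports GRPO. For reference, a minimal sketch of its group-relative advantage computation, assuming the standard formulation (per-group mean/std normalization of rewards; individual frameworks differ in details such as whether to divide by the std):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward in a group of G sampled completions by the
    group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

The resulting advantages replace a learned value baseline: completions better than the group average get positive advantage, worse ones negative.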
## General/MultiTask
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MetaClaw | 2026.3 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| SkillRL | 2026.2 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| youtu-agent | 2025.12 | Tencent Youtu Lab | Paper | Custom | |
| DEPO | 2025.11 | HKUST/SJTU | Paper | LLaMA-Factory | |
| SPEAR | 2025.10 | Tencent Youtu Lab | Paper | veRL/verl-agent | |
| DeepAgent | 2025.10 | RUC/Xiaohongshu | Paper | Custom | |
| AgentRL | 2025.9 | Tsinghua | Paper | veRL | |
| AgentGym-RL | 2025.9 | Fudan University | Paper | veRL | |
| Agent_Foundation_Models | 2025.8 | OPPO Personal AI Lab | Paper | veRL | |
| Trinity-RFT | 2025.5 | Alibaba | Paper | veRL | |
| SPA-RL-Agent | 2025.5 | PolyU | Paper | TRL | |
| verl-agent | 2025.5 | NTU/Skywork | Paper | veRL | |
| VAGEN | 2025.3 | RAGEN-AI | Paper | veRL | |
| ART | 2025.3 | OpenPipe | Paper | TRL | |
| OpenManus-RL | 2025.3 | UIUC/MetaGPT | -- | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MetaClaw | GRPO (LoRA) | Single | Process | Multi | General Agentic | Model (PRM) | Yes (Skill-augmented) |
| SkillRL | GRPO | Single | Outcome | Multi | ALFWorld/WebShop/Search | Rule | Yes (Web search, actions) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Math/Physics/Chemistry/Biomedicine/Long-context/IF/SWE | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| youtu-agent | Training-Free GRPO | Single | Outcome | Multi | Deep Research/Data Analysis/Tool-use | Model/External | Yes (Web search, code, file) |
| DEPO | KTO + Efficiency Loss | Single | Both | Multi | Agent (BabyAI/WebShop) | Rule | Yes |
| SPEAR | GRPO/GiGPO + SIL | Single | Both | Multi | Math/Agent | Rule/External | Yes (Search, Sandbox, Browser) |
| DeepAgent | ToolPO | Single | Outcome | Multi | ToolBench/ALFWorld/WebShop/GAIA/HLE | Model | Yes (16,000+ RapidAPIs) |
| AgentRL | GRPO/REINFORCE++/RLOO/ReMax/GAE | Single | Outcome | Multi | Agent Tasks | External | Yes |
| AgentGym-RL | PPO/GRPO/RLOO/REINFORCE++ | Single | Outcome | Multi | Web/Search/Game/Embodied/Science | Rule/Model/External | Yes (Web, Search, Env APIs) |
| Agent_Foundation_Models | DAPO/PPO | Single | Outcome | Single | QA/Code/Math | Rule/External | Yes |
| Trinity-RFT | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes |
| SPA-RL-Agent | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No |
| verl-agent | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes |
| VAGEN | PPO/GRPO | Single | Both | Multi | TextGame/Navigation | All | Yes |
| ART | GRPO | Multi | Both | Multi | TextGame | All | Yes |
| OpenManus-RL | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes |
## Search & RAG
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| ProRAG | 2026.1 | RUC | Paper | Custom | |
| MemSearcher | 2025.11 | CAS | Paper | Custom | |
| ReSeek | 2025.10 | Tencent PCG BAC/Tsinghua University | Paper | veRL | |
| AutoGraph-R1 | 2025.10 | HKUST KnowComp | Paper | Custom | |
| Tree-GRPO | 2025.9 | AMAP | Paper | veRL | |
| ASearcher | 2025.8 | Ant Research RL Lab/Tsinghua University/UW | Paper | RealHF/AReaL | |
| Graph-R1 | 2025.7 | BUPT/NTU/NUS | Paper | veRL | |
| Kimi-Researcher | 2025.6 | Moonshot AI | blog | Custom | |
| R-Search | 2025.6 | Individual | -- | veRL | |
| R1-Searcher-plus | 2025.5 | RUC | Paper | Custom | |
| StepSearch | 2025.5 | SenseTime | Paper | veRL | |
| AutoRefine | 2025.5 | USTC | Paper | veRL | |
| ZeroSearch | 2025.5 | Alibaba | Paper | veRL | |
| ReasonRAG | 2025.5 | CityU HK / Huawei | Paper | Custom | |
| Agentic-RAG-R1 | 2025.12 | PKU | -- | Custom | |
| WebThinker | 2025.4 | RUC | Paper | Custom | |
| DeepResearcher | 2025.4 | SJTU | Paper | veRL | |
| Search-R1 | 2025.3 | UIUC/Google | paper1, paper2 | veRL | |
| R1-Searcher | 2025.3 | RUC | Paper | OpenRLHF | |
| C-3PO | 2025.2 | Alibaba | Paper | OpenRLHF | |
| DeepRetrieval | 2025.2 | UIUC | Paper | veRL | |
| SSRL | 2025.8 | Tsinghua | Paper | Custom | |
| Research-Venus | 2025.8 | Ant Group | Paper | Custom | |
| DeepResearch | 2025.9 | Alibaba/Tongyi Lab | Paper | Custom | |
| DeepDive | 2025.9 | Tsinghua/THUDM | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ProRAG | GRPO + DGA (dual-granularity advantage) | Single | Both | Multi | Multi-hop RAG | Model (PRM via MCTS) | Yes (Retrieval) |
| MemSearcher | Multi-context GRPO | Single | Outcome | Multi | Search/QA + Memory | Rule/Model | Yes (Web search + Memory) |
| ReSeek | GRPO/PPO | Single | Both | Multi | QA/Search | Rule | Search/JUDGE |
| AutoGraph-R1 | GRPO (via VeRL) | Single | Outcome | Multi | KG Construction for QA | Rule | Yes (Graph retrieval) |
| Tree-GRPO | GRPO/Tree-GRPO | Single | Outcome | Multi | Search | Rule | Search |
| ASearcher | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes |
| Graph-R1 | GRPO/REINFORCE++/PPO | Single | Outcome | Multi | KGQA | Rule (EM/F1) | Yes (Graph retrieval) |
| Kimi-Researcher | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding |
| R-Search | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes |
| R1-Searcher-plus | Custom | Single | Outcome | Multi | Search | Model | Search |
| StepSearch | PPO | Single | Process | Multi | QA | Model | Search |
| AutoRefine | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search |
| ZeroSearch | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes |
| ReasonRAG | DPO + MCTS-based PRM | Single | Process | Multi | Multi-hop QA | Model (PRM) | Yes (Wikipedia search) |
| Agentic-RAG-R1 | GRPO | Single | Outcome | Multi | Knowledge-intensive QA | Rule/Model | Yes (Wiki/Doc search) |
| WebThinker | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing |
| DeepResearcher | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes |
| Search-R1 | PPO/GRPO | Single | Outcome | Multi | Search | All | Search |
| R1-Searcher | PPO/DPO | Single | Both | Multi | Search | All | Yes |
| C-3PO | PPO | Multi | Outcome | Multi | Search | Model | Yes |
| DeepRetrieval | GRPO | Single | Outcome | Multi | Query Generation/IR | Rule | Yes (Search) |
| SSRL | GRPO | Single | Outcome | Multi | Self-Search | Rule | Yes (Self-search) |
| Research-Venus | GRPO | Single | Both | Multi | Deep Research | Model (atomic thought) | Yes (Search) |
| DeepResearch | RL-based | Single | Outcome | Multi | Deep Research | Model | Yes (Search, Browse) |
| DeepDive | GRPO | Single | Outcome | Multi | KG-augmented Search | Rule | Yes (KG + Search) |
## Web & GUI
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MobileAgent | 2025.9 | X-PLUG (TongyiQwen) | paper | veRL | |
| InfiGUI-G1 | 2025.8 | InfiX AI | Paper | veRL | |
| UI-AGILE | 2025.7 | Xiamen University | Paper | Custom | |
| gui-rcpo | 2025.8 | Zhejiang University | Paper | Custom | |
| Grounding-R1 | 2025.6 | Salesforce | blog | trl | |
| AgentCPM-GUI | 2025.6 | OpenBMB/Tsinghua/RUC | Paper | Huggingface | |
| TTI | 2025.6 | CMU | Paper | Custom | |
| SE-GUI | 2025.5 | Nankai University/vivo | Paper | trl | |
| ARPO | 2025.5 | CUHK/HKUST | Paper | veRL | |
| GUI-G1 | 2025.5 | RUC | Paper | TRL | |
| WebAgent-R1 | 2025.5 | Amazon/UVA | Paper | Custom | |
| GUI-R1 | 2025.4 | CAS/NUS | Paper | veRL | |
| UI-R1 | 2025.3 | vivo/CUHK | Paper | TRL | |
| CollabUIAgents | 2025.2 | Tsinghua/Alibaba/HKUST | Paper | Custom | |
| WebAgent | 2025.1 | Alibaba | paper1, paper2 | LLaMA-Factory | |
| UI-TARS | 2025.9 | ByteDance Seed | Paper | Custom | |
| DigiQ | 2025.2 | UC Berkeley/CMU/Amazon | Paper | Custom | |
| ZeroGUI | 2025.5 | Shanghai AI Lab | Paper | Custom | |
| InfiGUI-R1 | 2025.4 | Zhejiang University | Paper | Custom | |
| GUI-Agent-RL | 2025.2 | Microsoft | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MobileAgent | semi-online RL | Single | Both | Multi | MobileGUI/Automation | Rule | Yes |
| InfiGUI-G1 | AEPO | Single | Outcome | Single | GUI/Grounding | Rule | No |
| UI-AGILE | GRPO | Single | Outcome | Single | GUI Grounding | Rule (continuous) | No |
| gui-rcpo | RCPO | Single | Outcome | Single | GUI Grounding | Rule (self-supervised) | No |
| Grounding-R1 | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes |
| AgentCPM-GUI | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes |
| TTI | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing |
| SE-GUI | GRPO | Single | Both | Single | GUI Grounding | Rule | Yes |
| ARPO | GRPO | Single | Outcome | Multi | GUI | External | Computer Use |
| GUI-G1 | GRPO | Single | Outcome | Single | GUI | Rule/External | No |
| WebAgent-R1 | M-GRPO | Single | Outcome | Multi | Web Navigation (WebArena-Lite) | Rule (task success) | Yes (Web browsing) |
| GUI-R1 | GRPO | Single | Outcome | Multi | GUI | Rule | No |
| UI-R1 | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |
| CollabUIAgents | DPO (credit re-assignment) | Multi | Process | Multi | GUI (Mobile + Web) | Model (LLM) | Yes (GUI interaction) |
| WebAgent | DAPO | Multi | Process | Multi | Web | Model | Yes |
| UI-TARS | Multi-turn RL | Single | Both | Multi | GUI (Cross-platform) | Model | Yes (GUI actions) |
| DigiQ | Value-based offline RL | Single | Outcome | Multi | Android Device Control | Model (Q-function) | Yes |
| ZeroGUI | Online RL | Single | Outcome | Multi | GUI Agent | Rule | Yes (GUI actions) |
| InfiGUI-R1 | RL + sub-goal guidance | Single | Both | Multi | GUI Reasoning | Rule | Yes |
| GUI-Agent-RL | Value-based RL (VEM) | Single | Outcome | Multi | GUI (Web Shopping) | Model | Yes |
## Tool-Use
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MATPO | 2025.10 | MiroMind AI | Paper | Custom | |
| MiroRL | 2025.8 | MiroMindAI | HF Repo | veRL | |
| verl-tool | 2025.6 | TIGER-Lab | X | veRL | |
| Multi-Turn-RL-Agent | 2025.5 | University of Minnesota | Paper | Custom | |
| Tool-N1 | 2025.5 | NVIDIA | Paper | veRL | |
| Tool-Star | 2025.5 | RUC | Paper | LLaMA-Factory | |
| RL-Factory | 2025.5 | Simple-Efficient | model | veRL | |
| ReTool | 2025.4 | ByteDance | Paper | veRL | |
| AWorld | 2025.3 | Ant Group (inclusionAI) | Paper | veRL | |
| Agent-R1 | 2025.3 | USTC | Paper | veRL | |
| ReCall | 2025.3 | BaiChuan | Paper | veRL | |
| ToolRL | 2025.4 | UIUC | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MATPO | GRPO (multi-agent) | Multi | Outcome | Multi | Tool-use/Search | Rule | Yes (MCP: Serper, Web scraping) |
| MiroRL | GRPO | Single | Both | Multi | Reasoning/Planning/ToolUse | Rule-based | MCP |
| verl-tool | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes |
| Multi-Turn-RL-Agent | GRPO | Single | Both | Multi | Tool-use/Math | Rule/External | Yes |
| Tool-N1 | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes |
| Tool-Star | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes |
| RL-Factory | GRPO | Multi | Both | Multi | Tool-use/NL2SQL | All | MCP |
| ReTool | PPO | Single | Outcome | Multi | Math | External | Code |
| AWorld | GRPO | Both | Outcome | Multi | Search/Web/Code | External/Rule | Yes |
| Agent-R1 | PPO/GRPO | Single | Both | Multi | Tool-use/QA | Model | Yes |
| ReCall | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool-use/Math/QA | All | Yes |
| ToolRL | GRPO/PPO | Single | Outcome | Multi | Tool Learning | Rule/External | Yes |
## Code & SWE
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| CUDA-Agent | 2026.2 | ByteDance/Tsinghua | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| PPP-Agent | 2025.11 | CMU/OpenHands | Paper | veRL | |
| RepoDeepSearch | 2025.8 | PKU, Bytedance, BIT | Paper | veRL | |
| CUDA-L1 | 2025.7 | DeepReinforce AI | Paper | Custom | |
| MedAgentGym | 2025.6 | Emory/Georgia Tech | Paper | Huggingface | |
| CURE | 2025.6 | University of Chicago/Princeton/ByteDance | Paper | Huggingface | |
| Time-R1 | 2025.5 | UIUC | Paper | veRL | |
| ML-Agent | 2025.5 | MASWorks | Paper | Custom | |
| SkyRL | 2025.4 | NovaSky | Paper | veRL | |
| digitalhuman | 2025.4 | Tencent | Paper | veRL | |
| sweet_rl | 2025.3 | Meta/UCB | Paper | OpenRLHF | |
| swe-rl | 2025.2 | Meta/UIUC/CMU | Paper | Custom | |
| rllm | 2025.1 | Berkeley Sky Computing Lab BAIR/Together AI | Notion Blog | veRL | |
| open-r1 | 2025.1 | HuggingFace | -- | TRL | |
| R1-Code-Interpreter | 2025.5 | MIT | Paper | Custom | |
| CTRL | 2025.2 | HKU/ByteDance | Paper | Custom | |
| DeepAnalyze | 2025.10 | RUC/Tsinghua | Paper | Custom | |
| AceCoder | 2025.2 | Waterloo (TIGER-Lab) | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| CUDA-Agent | Agentic RL (staged) | Single | Outcome | Multi | CUDA Kernel Generation | Rule (correctness + performance) | Yes (compile/verify/profile) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Code/SWE + General (Math/Sci/Bio) | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| PPP-Agent | PPP-RL | Single | Both | Multi | SWE/Research | Rule+Model | Search, Ask, Browse |
| RepoDeepSearch | GRPO | Single | Both | Multi | Search/Repair | Rule/External | Yes |
| CUDA-L1 | Contrastive RL | Single | Outcome | Single | CUDA Optimization | Rule (performance) | No |
| MedAgentGym | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes |
| CURE | PPO | Single | Outcome | Single | Code | External | No |
| Time-R1 | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code |
| ML-Agent | Custom | Single | Process | Multi | Code | All | Yes |
| SkyRL | PPO/GRPO | Single | Outcome | Multi | Math/Code | All | Code |
| digitalhuman | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/MultimodalQA | Rule/Model/External | Yes |
| sweet_rl | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing |
| swe-rl | RL-based | Single | Outcome | Single | SWE (SWE-bench) | Rule (similarity) | No |
| rllm | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes |
| open-r1 | GRPO | Single | Outcome | Single | Math/Code | All | Yes |
| R1-Code-Interpreter | GRPO | Single | Outcome | Multi | Code Interpretation | Rule/External | Yes (Code exec) |
| CTRL | RL (critique-revision) | Single | Process | Multi | Code Refinement | Model | Yes (Code exec) |
| DeepAnalyze | Curriculum RL | Single | Outcome | Multi | Data Science | Rule/External | Yes (Code exec) |
| AceCoder | GRPO | Single | Outcome | Single | Code Generation | External (test cases) | Yes |
## Reasoning
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| Agent0 | 2025.10 | UNC‑Chapel Hill / Salesforce Research / Stanford University | Paper | veRL | |
| KG-R1 | 2025.9 | UIUC/Google | Paper1, Paper2 | veRL | |
| AgentFlow | 2025.9 | Stanford University | arXiv | veRL | |
| ARPO | 2025.7 | RUC, Kuaishou | Paper | veRL | |
| terminal-bench-rl | 2025.7 | Individual (Danau5tin) | N/A | rLLM | |
| MOTIF | 2025.6 | University of Maryland | Paper | trl | |
| cmriat/l0 | 2025.6 | CMRIAT | Paper | veRL | |
| agent-distillation | 2025.5 | KAIST | Paper | Custom | |
| EasyR1 | 2025.4 | Individual | repo1/paper2 | veRL | |
| AutoCoA | 2025.3 | BJTU | Paper | veRL | |
| ToRL | 2025.3 | SJTU | Paper | veRL | |
| ReMA | 2025.3 | SJTU, UCL | Paper | veRL | |
| Agentic-Reasoning | 2025.2 | Oxford | Paper | Custom | |
| SimpleTIR | 2025.2 | NTU, Bytedance | Notion Blog | veRL | |
| openrlhf_async_pipline | 2024.5 | OpenRLHF | Paper | OpenRLHF |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Agent0 | ADPO | Multi | Process | Multi | Math/Visual | Model/Verifier | Yes |
| KG-R1 | GRPO/PPO | Single | Both | Multi | KGQA | Rule/Model | KG Retrieval |
| AgentFlow | Flow-GRPO | Single | Outcome | Multi | Search/Math/QA | Model/External | Yes |
| ARPO | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes |
| terminal-bench-rl | GRPO | Single | Outcome | Multi | Coding/Terminal | Model+External Verifier | Yes |
| MOTIF | GRPO | Single | Outcome | Multi | QA | Rule | No |
| cmriat/l0 | PPO | Multi | Process | Multi | QA | All | Yes |
| agent-distillation | PPO | Single | Process | Multi | QA/Math | External | Yes |
| EasyR1 | GRPO | Single | Process | Multi | Vision-Language | Model | Yes |
| AutoCoA | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes |
| ToRL | GRPO | Single | Outcome | Single | Math | Rule/External | Yes |
| ReMA | PPO | Multi | Outcome | Multi | Math | Rule | No |
| Agentic-Reasoning | Custom | Single | Process | Multi | QA/Math | External | Web Browsing |
| SimpleTIR | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math, Coding | All | Yes |
| openrlhf_async_pipline | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |
## Multi-Agent RL
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| PettingLLMs | 2025.10 | Intel / UCSD | Paper | Custom | |
| MASPRM | 2025.10 | UBC / Huawei | Paper | Custom | |
| ARIA | 2025.6 | Fudan University | Paper | Custom | |
| AMPO | 2025.5 | Tongyi Lab, Alibaba | Paper | veRL | |
| MAPoRL | 2025.8 | Academic | -- | Custom | |
| FlowReasoner | 2025.4 | Sea AI Lab / NUS | Paper | Custom | |
| DrMAS | 2026.2 | NTU | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| PettingLLMs | AT-GRPO | Multi | Both | Multi | Game/Code/Math/Planning | Rule (verifiable) | No |
| MASPRM | PRM (trained from MCTS rollouts) | Multi | Process | Multi | Reasoning (GSM8K/MATH/MMLU) | Learned PRM | No |
| ARIA | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No |
| AMPO | BC/AMPO(GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model-based | No |
| MAPoRL | PPO | Multi | Outcome | Multi | Collaborative LLM Tasks | Rule | No |
| FlowReasoner | GRPO | Multi | Outcome | Multi | Multi-agent Workflow Design | Rule | Yes |
| DrMAS | GRPO (agent-wise) | Multi | Outcome | Multi | Multi-agent LLM Systems | Rule | No |
## Memory
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MEM1 | 2025.7 | MIT | Paper | veRL (based on Search-R1) | |
| Memento | 2025.6 | UCL, Huawei | Paper | Custom | |
| MemAgent | 2025.6 | Bytedance, Tsinghua-SIA | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MEM1 | PPO/GRPO | Single | Outcome | Multi | WebShop/GSM8K/QA | Rule/Model | Yes |
| Memento | soft Q-Learning | Single | Outcome | Multi | Research/QA/Code/Web | External/Rule | Yes |
| MemAgent | PPO, GRPO, DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes |
## Embodied
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| Embodied-R1 | 2025.6 | Tianjin University | Paper | veRL | |
| STeCa | 2025.2 | The Hong Kong Polytechnic University | Paper | FastChat/TRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Embodied-R1 | GRPO | Single | Outcome | Single | Grounding/Waypoint | Rule | No |
| STeCa | DPO (RFT) | Single | Both | Multi | Embodied/Household | Rule/MC | Environment Actions |
## Domain-Specific
| Github Repo | Date | Org | Paper Link | RL Framework | Domain |
|---|---|---|---|---|---|
| MedSAM-Agent | 2026.2 | CUHK/Tencent | Paper | Custom | Medical | |
| OS-R1 | 2025.8 | ISCAS | Paper | Custom | OS/Systems | |
| MMedAgent-RL | 2025.8 | Unknown | paper | Unknown | Medical | |
| DoctorAgent-RL | 2025.5 | UCAS/CAS/USTC | Paper | RAGEN | Medical | |
| Biomni | 2025.3 | Stanford University (SNAP) | Paper | Custom | Biomedical |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MedSAM-Agent | GRPO (via veRL) | Single | Both | Multi | Medical Image Segmentation | Model (clinical fidelity) | Yes (SAM/MedSAM2) |
| OS-R1 | GRPO (via veRL) | Single | Outcome | Multi | Linux Kernel Tuning | Rule | Yes (LightRAG, kernel config) |
| MMedAgent-RL | Unknown | Multi | Unknown | Unknown | Unknown | Unknown | Unknown |
| DoctorAgent-RL | GRPO | Multi | Both | Multi | Consultation/Diagnosis | Model/Rule | No |
| Biomni | TBD | Single | TBD | Single | scRNAseq/CRISPR/ADMET/Knowledge | TBD | Yes |
## Reward & Training
| Github Repo | Date | Org | Paper Link | Focus |
|---|---|---|---|---|
| ToolPRMBench | 2026.1 | Arizona State University | Paper | PRM Benchmark for Tool-Use | |
| RLVR-World | 2025.5 | THU ML Group | Paper | RLVR for World Models | |
| AgentPRM | 2025.2 | Cornell | Paper | Process Reward for Agents | |
| Agentic-Reward-Modeling | 2025.2 | THU-KEG | Paper | Agentic Reward Agent | |
| AgentRM | 2025.2 | THUNLP/Tsinghua | Paper | Generalizable Agent RM |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ToolPRMBench | N/A (Benchmark) | Single | Process | Multi | Tool-Use | Rule/Model | Yes |
| RLVR-World | RLVR | Single | Outcome | Multi | World Modeling (Language/Video) | Model (verifiable) | No |
| AgentPRM | PPO/DPO + PRM | Single | Process | Multi | ALFWorld/General | Model (PRM) | Yes |
| Agentic-Reward-Modeling | DPO/Best-of-N | Single | Outcome | Single | General Instruction | Model (Reward Agent) | Yes (Verification) |
| AgentRM | MCTS/RM-guided | Single | Outcome | Multi | 9 Agent Tasks | Model (regression PRM) | Yes |
## Safety
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| SafeSearch | 2025.11 | Amazon Science | Paper | veRL | |
| curiosity_redteam | 2024.2 | MIT | Paper | Custom | |
| RLbreaker | 2024.6 | Purdue | Paper | Custom | |
| xJailbreak | 2025.1 | Academic | Paper | Custom | |
| Auto-RT | 2025.1 | ICIP-CAS | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| SafeSearch | PPO (GAE/GRPO) | Single | Both | Multi | Safe QA/Search | Rule + Model | Search |
| curiosity_redteam | RL + Curiosity | Single | Outcome | Multi | Red Teaming | Model | Yes (iterative query) |
| RLbreaker | Custom PPO | Single | Outcome | Multi | Jailbreaking | Model | Yes (mutator selection) |
| xJailbreak | RL | Single | Outcome | Multi | Jailbreaking | Model (embedding) | Yes (iterative) |
| Auto-RT | PPO | Single | Outcome | Multi | Red Teaming | Model | Yes (strategy exploration) |
## VLM Agent
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| multimodal-search-r1 | 2025.6 | ByteDance/NTU | Paper | Custom | |
| DeepEyesV2 | 2025.11 | Xiaohongshu | Paper | Custom | |
| VDeepEyes | 2025.5 | Xiaohongshu/XJTU | Paper | veRL | |
| CoSo | 2025.5 | NTU/Alibaba | Paper | Custom | |
| RL4VLM | 2024.5 | UC Berkeley | Paper | Custom | |
| VSC-RL | 2025.2 | Liverpool/Huawei/Tianjin/UCL | Paper | Custom | |
| AlphaDrive | 2025.3 | HUST/Horizon Robotics | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| multimodal-search-r1 | GRPO | Single | Outcome | Multi | Multimodal Search | Rule | Yes (Search) |
| DeepEyesV2 | Outcome RL | Single | Outcome | Multi | Multimodal Reasoning | Rule | Yes (Code exec, Web search) |
| VDeepEyes | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes |
| CoSo | Soft RL (counterfactual) | Single | Outcome | Multi | Android/Card/Embodied | Rule | Yes |
| RL4VLM | PPO | Single | Outcome | Multi | GymCards/ALFWorld | Rule | Yes |
| VSC-RL | Variational RL | Single | Outcome | Multi | Mobile Device Control | Rule | Yes |
| AlphaDrive | GRPO | Single | Outcome | Multi | Autonomous Driving | Rule (4 planning rewards) | No |
## Self-Evolution
⚠️ Note: The definition of "Self-Evolution" in the context of RL for LLM agents is still evolving and not yet well-established. This category currently collects works whose paper titles explicitly contain "self-evolving" or "self-evolution", where the agent improves itself through RL-driven feedback loops.
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| AgentEvolver | 2025.11 | Alibaba/Tongyi Lab | Paper | Custom | |
| SEAgent | 2025.8 | Shanghai AI Lab / CUHK | Paper | Custom | |
| MemSkill | 2026.2 | NTU/UIUC/UIC/Tsinghua | Paper | Custom | |
| MemRL | 2026.1 | SJTU/Xidian/NUS/USTC/MemTensor | Paper | Custom | |
| RAGEN | 2025.1 | RAGEN-AI | Paper | veRL | |
| WebRL | 2024.11 | Tsinghua/Zhipu AI | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| AgentEvolver | ADCA-GRPO | Single | Outcome | Multi | Social Game/Tool-use | Rule | Yes |
| SEAgent | GRPO | Single | Outcome | Multi | Computer Use (OSWorld) | Model | Yes (Screenshot-based) |
| MemSkill | PPO | Single | Process | Multi | QA/ALFWorld | Model (learned skills) | Yes |
| MemRL | RL-based (Q-value) | Single | Process | Multi | HLE/BigCodeBench/ALFWorld | Model (retrieval) | Yes |
| RAGEN | PPO/GRPO (StarPO) | Single | Both | Multi | TextGame | All | Yes |
| WebRL | Actor-Critic RL + ORM | Single | Outcome | Multi | Web Navigation (WebArena) | Model (ORM) | Yes (Web browsing) |
| Github Repo | Date | Org | Task |
|---|---|---|---|
| OpenSandbox | 2026.3 | Alibaba | Code/GUI/Agent Eval |
| OpenEnv | 2026.3 | Meta (PyTorch) | Chess/Arcade/Finance |
| NeMo-Gym | 2026.1 | NVIDIA | Multi-step/Multi-turn |
| open-trajectory-gym | 2026.3 | Individual | CTF/Security |
| R2E-Gym | 2025.4 | UC Berkeley/ANU | SWE |
| LoCoBench-Agent | 2025.11 | Salesforce AI Research | SWE |
| Simia-Agent-Training | 2025.10 | Microsoft | ToolUse/API |
| PaperArena | 2025.9 | University of Science and Technology of China | ScientificLiteratureQA |
| enterprise-deep-research | 2025.9 | Salesforce AI Research | DeepResearch |
| CompassVerifier | 2025.7 | Shanghai AI Lab | Reasoning |
| SWE-smith | 2025.4 | Princeton/Stanford/SWE-bench | SWE |
| SWE-Gym | 2024.12 | UC Berkeley/UIUC/CMU/Apple | SWE |
| Mind2Web-2 | 2025.6 | Ohio State University | Web |
| gem | 2025.5 | Sea AI Lab | Math/Code/Game/QA |
| MLE-Dojo | 2025.5 | GIT, Stanford | MLE |
| atropos | 2025.4 | Nous Research | Game/Code/Tool |
| InternBootcamp | 2025.4 | InternBootcamp | Coding/QA/Game |
| loong | 2025.3 | CAMEL-AI.org | RLVR |
| DataSciBench | 2025.2 | Tsinghua | Data Analysis |
| reasoning-gym | 2025.1 | open-thought | Math/Game |
| llmgym | 2025.1 | tensorzero | TextGame/Tool |
| debug-gym | 2024.11 | Microsoft Research | Debugging/Game/Code |
| gym-llm | 2024.8 | Rodrigo Sánchez Molina | Control/Game |
| AgentGym | 2024.6 | Fudan | Web/Game |
| tau-bench | 2024.6 | Sierra | Tool |
| appworld | 2024.6 | Stony Brook University | Phone Use |
| android_world | 2024.5 | Google Research | Phone Use |
| TheAgentCompany | 2024.3 | CMU, Duke | Coding |
| LlamaGym | 2024.3 | Rohan Pandey | Game |
| visualwebarena | 2024.1 | CMU | Web |
| LMRL-Gym | 2023.12 | UC Berkeley | Game |
| OSWorld | 2023.10 | HKU, CMU, Salesforce, Waterloo | Computer Use |
| webarena | 2023.7 | CMU | Web |
| AgentBench | 2023.7 | Tsinghua University | Game/Web/QA/Tool |
| WebShop | 2022.7 | Princeton-NLP | Web |
| ScienceWorld | 2022.3 | AllenAI | TextGame/ScienceQA |
| alfworld | 2020.10 | Microsoft, CMU, UW | Embodied |
| factorio-learning-environment | 2021.6 | JackHopkins | Game |
| jericho | 2018.10 | Microsoft, GIT | TextGame |
| TextWorld | 2018.6 | Microsoft Research | TextGame |
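Most of the environments above expose a gym-style `reset`/`step` interface over which RL rollouts are collected. A toy sketch of that interaction loop (the environment and policy here are hypothetical, not any listed project):

```python
# Illustrative gym-style text environment and rollout loop
# (method names follow the Gym convention; everything else is made up).
class EchoTextEnv:
    """Toy text environment: reward 1.0 when the agent replies 'done'."""

    def reset(self) -> str:
        self.turns = 0
        return "Task: reply with 'done' to finish."

    def step(self, action: str):
        self.turns += 1
        success = action.strip().lower() == "done"
        done = success or self.turns >= 5  # episode cap at 5 turns
        reward = 1.0 if success else 0.0
        return f"turn {self.turns}", reward, done, {}

def rollout(env, policy):
    """Collect one trajectory of (observation, action, reward) tuples."""
    obs, done, traj = env.reset(), False, []
    while not done:
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
    return traj

traj = rollout(EchoTextEnv(), lambda obs: "done")
```

Trajectories collected this way are what the RL frameworks in the earlier tables consume; newer Gymnasium-style environments return a 5-tuple from `step`, but the loop is otherwise the same.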
- JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning
- Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
- Acting Less is Reasoning More! Teaching Model to Act Efficiently
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- MUA-RL: Multi-Turn User-Interacting Agent Reinforcement Learning for Agentic Tool Use
- Understanding Tool-Integrated Reasoning
- Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
- Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
- EnvX: Agentize Everything with Agentic AI
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
- UI-Venus Technical Report: Building High-performance UI Agents with RFT
- Agent2: An Agent-Generates-Agent Framework for Reinforcement Learning Automation
- Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use
- Adversarial Reinforcement Learning for Large Language Model Agent Safety
- Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction
- InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
If you find this repository useful, please consider citing it:
@misc{agentsMeetRL,
  title={When LLM Agents Meet Reinforcement Learning: A Comprehensive Survey},
  author={AgentsMeetRL Contributors},
  year={2025},
  url={https://github.com/thinkwee/agentsMeetRL}
}

Made with ❤️ by the AgentsMeetRL community
