AgentsMeetRL is an awesome list that summarizes open-source repositories for training LLM Agents using reinforcement learning:
- 🤖 Criterion for inclusion: an agent project must feature at least one of multi-turn interaction or tool use (so Tool-Integrated Reasoning, TIR, projects are in scope).
- ⚠️ This project is built on code analysis of open-source repositories by LLM coding agents, which may produce unfaithful results. Although manually reviewed, omissions may remain. If you find any errors, please let us know through issues or PRs - we warmly welcome them!
- 🚀 We particularly focus on the reinforcement learning frameworks, RL algorithms, rewards, and environments each project depends on, as a reference for how these excellent open-source projects make their technical choices. See [Click to view technical details] under each table.
- 📅 Last updated: 2026-03-24
- 🤗 Feel free to submit your own projects anytime - we welcome contributions!
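The inclusion criterion above (multi-turn interaction or tool use) boils down to a rollout loop in which the policy LLM alternates between generating text and invoking tools until it commits to an answer. A minimal, hypothetical sketch of that loop (none of the names below come from any listed repo; a real system would replace the toy policy with an LLM call):

```python
def run_calculator(expression: str) -> str:
    """Hypothetical tool: evaluate an arithmetic expression (no builtins exposed)."""
    return str(eval(expression, {"__builtins__": {}}))

def rollout(policy, question: str, max_turns: int = 4):
    """Collect one trajectory of (observation, action) pairs plus the final answer."""
    history = [question]
    trajectory = []
    for _ in range(max_turns):
        action = policy("\n".join(history))      # LLM generation in a real system
        trajectory.append((list(history), action))
        if action.startswith("<tool>"):          # tool call -> execute and observe
            result = run_calculator(action[len("<tool>"):])
            history += [action, f"<obs>{result}"]
        else:                                    # plain text -> final answer
            return trajectory, action
    return trajectory, None

# Toy deterministic "policy" standing in for the LLM:
def toy_policy(context: str) -> str:
    return "<tool>2*21" if "<obs>" not in context else "The answer is 42."

traj, answer = rollout(toy_policy, "What is 2*21?")  # two turns: tool call, then answer
```

RL training then scores the trajectory with a reward (see the reward-type enumeration below) and updates the policy.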
Taxonomy:
- Base Framework: General-purpose RL training frameworks for LLM agents (e.g., veRL, OpenRLHF, trl)
- General/MultiTask: Agent systems trained/evaluated across multiple tasks or environments
- Search & RAG: Search-augmented reasoning agents that use retrieval tools to enhance LLM reasoning
- Web & GUI: Agents that interact with web browsers, mobile/desktop GUIs, or operating systems
- Tool-Use: Agents trained to invoke external tools (APIs, code executors, MCP, etc.)
- Code & SWE: Software engineering and code generation agents
- Reasoning: Reasoning agents with tool-integrated or multi-turn reasoning (math, QA, visual)
- Multi-Agent RL: Multi-agent collaboration, negotiation, or credit assignment via RL
- Memory: Agents that learn to manage, retrieve, or evolve memory
- Embodied: Agents operating in embodied/physical simulation environments
- Domain-Specific: RL agents for specialized domains (medical, OS tuning, etc.)
- Reward & Training: Process/outcome reward models and training methodologies for agents
- Safety: RL for agent safety alignment, adversarial red-teaming, and jailbreak defense/attack
- VLM Agent: Vision-language model agents trained with RL for multimodal interaction
- Self-Evolution: Agents that self-evolve via RL feedback loops (⚠️ definition still evolving in the community)
- Environment: Benchmarks, gyms, and sandbox environments for agent training/evaluation
Enumerations:
- Reward Type:
- External Verifier: e.g., a compiler or math solver
- Rule-Based: e.g., a LaTeX parser with exact match scoring
- Model-Based: e.g., a trained verifier LLM or reward LLM
- Custom
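A hedged sketch of what these reward types look like in code (function names and signatures are illustrative, not taken from any listed project):

```python
import re
import subprocess
import sys

def rule_based_reward(completion: str, gold: str) -> float:
    """Rule-Based: parse a \\boxed{...} answer and score exact match."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

def external_verifier_reward(code: str, test: str) -> float:
    """External Verifier: run candidate code against a test in a subprocess."""
    proc = subprocess.run([sys.executable, "-c", code + "\n" + test],
                          capture_output=True, timeout=10)
    return 1.0 if proc.returncode == 0 else 0.0

def model_based_reward(completion: str, judge) -> float:
    """Model-Based: ask a judge LLM (stubbed as a callable here) for a scalar score."""
    return judge(completion)
```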
- 📢 2026-03 Update: Restructured the taxonomy from 12 to 16 categories and added ~70 new repositories covering Sep 2025 to Mar 2026. New categories: Multi-Agent RL, Reward & Training, Safety, VLM Agent, Self-Evolution, and Domain-Specific. The old GUI and Web categories were merged into Web & GUI, and TextGame and Biomedical were retired as standalone categories. Total repos grew from ~134 to 205.
## Base Framework
| Github Repo | Date | Org | Paper Link |
|---|---|---|---|
| Open-AgentRL | 2026.2 | Gen-Verse | Paper | |
| OpenClaw-RL | 2026.3 | Gen-Verse | Paper | |
| Claw-R1 | 2026.3 | USTC | -- | |
| prime-rl | 2025.2 | Prime Intellect | -- | |
| NeMo-RL | 2026.1 | NVIDIA | -- | |
| RLinf | 2025.8 | Tsinghua/Infinigence AI/PKU | Paper | |
| siiRL | 2025.7 | Shanghai Innovation Institute | Paper | |
| slime | 2025.6 | Tsinghua University (THUDM) | blog | |
| agent-lightning | 2025.6 | Microsoft Research | Paper | |
| AReaL | 2025.6 | AntGroup/Tsinghua | Paper | |
| ROLL | 2025.6 | Alibaba | Paper | |
| MARTI | 2025.5 | Tsinghua | -- | |
| RL2 | 2025.4 | Accio | -- | |
| verifiers | 2025.3 | Individual | -- | |
| oat | 2024.11 | NUS/Sea AI | Paper | |
| veRL | 2024.10 | ByteDance | Paper | |
| OpenRLHF | 2023.7 | OpenRLHF | Paper | |
| trl | 2019.11 | HuggingFace | -- |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Open-AgentRL | GRPO-TCR | Single | Both | Multi | Reasoning/GUI/Coding | Model (PRM) | Yes (SandboxFusion) |
| OpenClaw-RL | GRPO/OPD | Both | Both | Multi | Terminal/GUI/SWE/Tool-call | Model/External | Yes |
| Claw-R1 | Generic RL Framework | Multi | Both | Multi | General Agent | All | Yes (Framework-agnostic) |
| prime-rl | GRPO/PPO | Multi | Outcome | Multi | Math/Code/Search | Model/External | Yes |
| NeMo-RL | GRPO/DAPO/GDPO/DPO | Single | Outcome | Multi | Math/Reasoning/Code | Rule/External | No |
| RLinf | PPO/GRPO/DAPO/SAC/REINFORCE++/CrossQ/RLPD | Both | Both | Multi | Robotics/Math/Code/QA/VQA | All (Rule/Model/External) | Yes |
| siiRL | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS PostTraining | Model/Rule | Planned |
| slime | GRPO/GSPO/REINFORCE++ | Single | Both | Both | Math/Code | External Verifier | Yes |
| agent-lightning | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes |
| AReaL | PPO | Both | Outcome | Both | Math/Code | External | Yes |
| ROLL | PPO/GRPO/Reinforce++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes |
| MARTI | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes |
| RL2 | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes |
| verifiers | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code |
| oat | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No |
| veRL | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes |
| OpenRLHF | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes |
| trl | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |
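Nearly every framework above supports GRPO. For reference, a minimal sketch of its group-relative advantage computation, assuming the standard formulation (per-group mean/std normalization of rewards; individual frameworks differ in details such as whether to divide by the std):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward in a group of G sampled completions by the
    group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

The resulting advantages replace a learned value baseline: completions better than the group average get positive advantage, worse ones negative.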
## General/MultiTask
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MetaClaw | 2026.3 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| SkillRL | 2026.2 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| youtu-agent | 2025.12 | Tencent Youtu Lab | Paper | Custom | |
| DEPO | 2025.11 | HKUST/SJTU | Paper | LLaMA-Factory | |
| SPEAR | 2025.10 | Tencent Youtu Lab | Paper | veRL/verl-agent | |
| DeepAgent | 2025.10 | RUC/Xiaohongshu | Paper | Custom | |
| AgentRL | 2025.9 | Tsinghua | Paper | veRL | |
| AgentGym-RL | 2025.9 | Fudan University | Paper | veRL | |
| Agent_Foundation_Models | 2025.8 | OPPO Personal AI Lab | Paper | veRL | |
| Trinity-RFT | 2025.5 | Alibaba | Paper | veRL | |
| SPA-RL-Agent | 2025.5 | PolyU | Paper | TRL | |
| verl-agent | 2025.5 | NTU/Skywork | Paper | veRL | |
| VAGEN | 2025.3 | RAGEN-AI | Paper | veRL | |
| ART | 2025.3 | OpenPipe | Paper | TRL | |
| OpenManus-RL | 2025.3 | UIUC/MetaGPT | -- | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MetaClaw | GRPO (LoRA) | Single | Process | Multi | General Agentic | Model (PRM) | Yes (Skill-augmented) |
| SkillRL | GRPO | Single | Outcome | Multi | ALFWorld/WebShop/Search | Rule | Yes (Web search, actions) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Math/Physics/Chemistry/Biomedicine/Long-context/IF/SWE | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| youtu-agent | Training-Free GRPO | Single | Outcome | Multi | Deep Research/Data Analysis/Tool-use | Model/External | Yes (Web search, code, file) |
| DEPO | KTO + Efficiency Loss | Single | Both | Multi | Agent (BabyAI/WebShop) | Rule | Yes |
| SPEAR | GRPO/GiGPO + SIL | Single | Both | Multi | Math/Agent | Rule/External | Yes (Search, Sandbox, Browser) |
| DeepAgent | ToolPO | Single | Outcome | Multi | ToolBench/ALFWorld/WebShop/GAIA/HLE | Model | Yes (16,000+ RapidAPIs) |
| AgentRL | GRPO/REINFORCE++/RLOO/ReMax/GAE | Single | Outcome | Multi | Agent Tasks | External | Yes |
| AgentGym-RL | PPO/GRPO/RLOO/REINFORCE++ | Single | Outcome | Multi | Web/Search/Game/Embodied/Science | Rule/Model/External | Yes (Web, Search, Env APIs) |
| Agent_Foundation_Models | DAPO/PPO | Single | Outcome | Single | QA/Code/Math | Rule/External | Yes |
| Trinity-RFT | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes |
| SPA-RL-Agent | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No |
| verl-agent | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes |
| VAGEN | PPO/GRPO | Single | Both | Multi | TextGame/Navigation | All | Yes |
| ART | GRPO | Multi | Both | Multi | TextGame | All | Yes |
| OpenManus-RL | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes |
## Search & RAG
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| ProRAG | 2026.1 | RUC | Paper | Custom | |
| MemSearcher | 2025.11 | CAS | Paper | Custom | |
| ReSeek | 2025.10 | Tencent PCG BAC/Tsinghua University | Paper | veRL | |
| AutoGraph-R1 | 2025.10 | HKUST KnowComp | Paper | Custom | |
| Tree-GRPO | 2025.9 | AMAP | Paper | veRL | |
| ASearcher | 2025.8 | Ant Research RL Lab/Tsinghua University/UW | Paper | RealHF/AReaL | |
| Graph-R1 | 2025.7 | BUPT/NTU/NUS | Paper | veRL | |
| Kimi-Researcher | 2025.6 | Moonshot AI | blog | Custom | |
| R-Search | 2025.6 | Individual | -- | veRL | |
| R1-Searcher-plus | 2025.5 | RUC | Paper | Custom | |
| StepSearch | 2025.5 | SenseTime | Paper | veRL | |
| AutoRefine | 2025.5 | USTC | Paper | veRL | |
| ZeroSearch | 2025.5 | Alibaba | Paper | veRL | |
| ReasonRAG | 2025.5 | CityU HK / Huawei | Paper | Custom | |
| Agentic-RAG-R1 | 2025.12 | PKU | -- | Custom | |
| WebThinker | 2025.4 | RUC | Paper | Custom | |
| DeepResearcher | 2025.4 | SJTU | Paper | veRL | |
| Search-R1 | 2025.3 | UIUC/Google | paper1, paper2 | veRL | |
| R1-Searcher | 2025.3 | RUC | Paper | OpenRLHF | |
| C-3PO | 2025.2 | Alibaba | Paper | OpenRLHF | |
| DeepRetrieval | 2025.2 | UIUC | Paper | veRL | |
| SSRL | 2025.8 | Tsinghua | Paper | Custom | |
| Research-Venus | 2025.8 | Ant Group | Paper | Custom | |
| DeepResearch | 2025.9 | Alibaba/Tongyi Lab | Paper | Custom | |
| DeepDive | 2025.9 | Tsinghua/THUDM | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ProRAG | GRPO + DGA (dual-granularity advantage) | Single | Both | Multi | Multi-hop RAG | Model (PRM via MCTS) | Yes (Retrieval) |
| MemSearcher | Multi-context GRPO | Single | Outcome | Multi | Search/QA + Memory | Rule/Model | Yes (Web search + Memory) |
| ReSeek | GRPO/PPO | Single | Both | Multi | QA/Search | Rule | Search/JUDGE |
| AutoGraph-R1 | GRPO (via VeRL) | Single | Outcome | Multi | KG Construction for QA | Rule | Yes (Graph retrieval) |
| Tree-GRPO | GRPO/Tree-GRPO | Single | Outcome | Multi | Search | Rule | Search |
| ASearcher | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes |
| Graph-R1 | GRPO/REINFORCE++/PPO | Single | Outcome | Multi | KGQA | Rule (EM/F1) | Yes (Graph retrieval) |
| Kimi-Researcher | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding |
| R-Search | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes |
| R1-Searcher-plus | Custom | Single | Outcome | Multi | Search | Model | Search |
| StepSearch | PPO | Single | Process | Multi | QA | Model | Search |
| AutoRefine | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search |
| ZeroSearch | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes |
| ReasonRAG | DPO + MCTS-based PRM | Single | Process | Multi | Multi-hop QA | Model (PRM) | Yes (Wikipedia search) |
| Agentic-RAG-R1 | GRPO | Single | Outcome | Multi | Knowledge-intensive QA | Rule/Model | Yes (Wiki/Doc search) |
| WebThinker | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing |
| DeepResearcher | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes |
| Search-R1 | PPO/GRPO | Single | Outcome | Multi | Search | All | Search |
| R1-Searcher | PPO/DPO | Single | Both | Multi | Search | All | Yes |
| C-3PO | PPO | Multi | Outcome | Multi | Search | Model | Yes |
| DeepRetrieval | GRPO | Single | Outcome | Multi | Query Generation/IR | Rule | Yes (Search) |
| SSRL | GRPO | Single | Outcome | Multi | Self-Search | Rule | Yes (Self-search) |
| Research-Venus | GRPO | Single | Both | Multi | Deep Research | Model (atomic thought) | Yes (Search) |
| DeepResearch | RL-based | Single | Outcome | Multi | Deep Research | Model | Yes (Search, Browse) |
| DeepDive | GRPO | Single | Outcome | Multi | KG-augmented Search | Rule | Yes (KG + Search) |
## Web & GUI
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MobileAgent | 2025.9 | X-PLUG (TongyiQwen) | paper | veRL | |
| InfiGUI-G1 | 2025.8 | InfiX AI | Paper | veRL | |
| UI-AGILE | 2025.7 | Xiamen University | Paper | Custom | |
| gui-rcpo | 2025.8 | Zhejiang University | Paper | Custom | |
| Grounding-R1 | 2025.6 | Salesforce | blog | trl | |
| AgentCPM-GUI | 2025.6 | OpenBMB/Tsinghua/RUC | Paper | Huggingface | |
| TTI | 2025.6 | CMU | Paper | Custom | |
| SE-GUI | 2025.5 | Nankai University/vivo | Paper | trl | |
| ARPO | 2025.5 | CUHK/HKUST | Paper | veRL | |
| GUI-G1 | 2025.5 | RUC | Paper | TRL | |
| WebAgent-R1 | 2025.5 | Amazon/UVA | Paper | Custom | |
| GUI-R1 | 2025.4 | CAS/NUS | Paper | veRL | |
| UI-R1 | 2025.3 | vivo/CUHK | Paper | TRL | |
| CollabUIAgents | 2025.2 | Tsinghua/Alibaba/HKUST | Paper | Custom | |
| WebAgent | 2025.1 | Alibaba | paper1, paper2 | LLaMA-Factory | |
| UI-TARS | 2025.9 | ByteDance Seed | Paper | Custom | |
| DigiQ | 2025.2 | UC Berkeley/CMU/Amazon | Paper | Custom | |
| ZeroGUI | 2025.5 | Shanghai AI Lab | Paper | Custom | |
| InfiGUI-R1 | 2025.4 | Zhejiang University | Paper | Custom | |
| GUI-Agent-RL | 2025.2 | Microsoft | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MobileAgent | semi-online RL | Single | Both | Multi | MobileGUI/Automation | Rule | Yes |
| InfiGUI-G1 | AEPO | Single | Outcome | Single | GUI/Grounding | Rule | No |
| UI-AGILE | GRPO | Single | Outcome | Single | GUI Grounding | Rule (continuous) | No |
| gui-rcpo | RCPO | Single | Outcome | Single | GUI Grounding | Rule (self-supervised) | No |
| Grounding-R1 | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes |
| AgentCPM-GUI | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes |
| TTI | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing |
| SE-GUI | GRPO | Single | Both | Single | GUI Grounding | Rule | Yes |
| ARPO | GRPO | Single | Outcome | Multi | GUI | External | Computer Use |
| GUI-G1 | GRPO | Single | Outcome | Single | GUI | Rule/External | No |
| WebAgent-R1 | M-GRPO | Single | Outcome | Multi | Web Navigation (WebArena-Lite) | Rule (task success) | Yes (Web browsing) |
| GUI-R1 | GRPO | Single | Outcome | Multi | GUI | Rule | No |
| UI-R1 | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |
| CollabUIAgents | DPO (credit re-assignment) | Multi | Process | Multi | GUI (Mobile + Web) | Model (LLM) | Yes (GUI interaction) |
| WebAgent | DAPO | Multi | Process | Multi | Web | Model | Yes |
| UI-TARS | Multi-turn RL | Single | Both | Multi | GUI (Cross-platform) | Model | Yes (GUI actions) |
| DigiQ | Value-based offline RL | Single | Outcome | Multi | Android Device Control | Model (Q-function) | Yes |
| ZeroGUI | Online RL | Single | Outcome | Multi | GUI Agent | Rule | Yes (GUI actions) |
| InfiGUI-R1 | RL + sub-goal guidance | Single | Both | Multi | GUI Reasoning | Rule | Yes |
| GUI-Agent-RL | Value-based RL (VEM) | Single | Outcome | Multi | GUI (Web Shopping) | Model | Yes |
## Tool-Use
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MATPO | 2025.10 | MiroMind AI | Paper | Custom | |
| MiroRL | 2025.8 | MiroMindAI | HF Repo | veRL | |
| verl-tool | 2025.6 | TIGER-Lab | X | veRL | |
| Multi-Turn-RL-Agent | 2025.5 | University of Minnesota | Paper | Custom | |
| Tool-N1 | 2025.5 | NVIDIA | Paper | veRL | |
| Tool-Star | 2025.5 | RUC | Paper | LLaMA-Factory | |
| RL-Factory | 2025.5 | Simple-Efficient | model | veRL | |
| ReTool | 2025.4 | ByteDance | Paper | veRL | |
| AWorld | 2025.3 | Ant Group (inclusionAI) | Paper | veRL | |
| Agent-R1 | 2025.3 | USTC | Paper | veRL | |
| ReCall | 2025.3 | BaiChuan | Paper | veRL | |
| ToolRL | 2025.4 | UIUC | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MATPO | GRPO (multi-agent) | Multi | Outcome | Multi | Tool-use/Search | Rule | Yes (MCP: Serper, Web scraping) |
| MiroRL | GRPO | Single | Both | Multi | Reasoning/Planning/ToolUse | Rule-based | MCP |
| verl-tool | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes |
| Multi-Turn-RL-Agent | GRPO | Single | Both | Multi | Tool-use/Math | Rule/External | Yes |
| Tool-N1 | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes |
| Tool-Star | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes |
| RL-Factory | GRPO | Multi | Both | Multi | Tool-use/NL2SQL | All | MCP |
| ReTool | PPO | Single | Outcome | Multi | Math | External | Code |
| AWorld | GRPO | Both | Outcome | Multi | Search/Web/Code | External/Rule | Yes |
| Agent-R1 | PPO/GRPO | Single | Both | Multi | Tool-use/QA | Model | Yes |
| ReCall | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool-use/Math/QA | All | Yes |
| ToolRL | GRPO/PPO | Single | Outcome | Multi | Tool Learning | Rule/External | Yes |
## Code & SWE
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| CUDA-Agent | 2026.2 | ByteDance/Tsinghua | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| PPP-Agent | 2025.11 | CMU/OpenHands | Paper | veRL | |
| RepoDeepSearch | 2025.8 | PKU, Bytedance, BIT | Paper | veRL | |
| CUDA-L1 | 2025.7 | DeepReinforce AI | Paper | Custom | |
| MedAgentGym | 2025.6 | Emory/Georgia Tech | Paper | Huggingface | |
| CURE | 2025.6 | University of Chicago/Princeton/ByteDance | Paper | Huggingface | |
| Time-R1 | 2025.5 | UIUC | Paper | veRL | |
| ML-Agent | 2025.5 | MASWorks | Paper | Custom | |
| SkyRL | 2025.4 | NovaSky | Paper | veRL | |
| digitalhuman | 2025.4 | Tencent | Paper | veRL | |
| sweet_rl | 2025.3 | Meta/UCB | Paper | OpenRLHF | |
| swe-rl | 2025.2 | Meta/UIUC/CMU | Paper | Custom | |
| rllm | 2025.1 | Berkeley Sky Computing Lab BAIR/Together AI | Notion Blog | veRL | |
| open-r1 | 2025.1 | HuggingFace | -- | TRL | |
| R1-Code-Interpreter | 2025.5 | MIT | Paper | Custom | |
| CTRL | 2025.2 | HKU/ByteDance | Paper | Custom | |
| DeepAnalyze | 2025.10 | RUC/Tsinghua | Paper | Custom | |
| AceCoder | 2025.2 | Waterloo (TIGER-Lab) | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| CUDA-Agent | Agentic RL (staged) | Single | Outcome | Multi | CUDA Kernel Generation | Rule (correctness + performance) | Yes (compile/verify/profile) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Code/SWE + General (Math/Sci/Bio) | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| PPP-Agent | PPP-RL | Single | Both | Multi | SWE/Research | Rule+Model | Search, Ask, Browse |
| RepoDeepSearch | GRPO | Single | Both | Multi | Search/Repair | Rule/External | Yes |
| CUDA-L1 | Contrastive RL | Single | Outcome | Single | CUDA Optimization | Rule (performance) | No |
| MedAgentGym | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes |
| CURE | PPO | Single | Outcome | Single | Code | External | No |
| Time-R1 | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code |
| ML-Agent | Custom | Single | Process | Multi | Code | All | Yes |
| SkyRL | PPO/GRPO | Single | Outcome | Multi | Math/Code | All | Code |
| digitalhuman | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/MultimodalQA | Rule/Model/External | Yes |
| sweet_rl | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing |
| swe-rl | RL-based | Single | Outcome | Single | SWE (SWE-bench) | Rule (similarity) | No |
| rllm | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes |
| open-r1 | GRPO | Single | Outcome | Single | Math/Code | All | Yes |
| R1-Code-Interpreter | GRPO | Single | Outcome | Multi | Code Interpretation | Rule/External | Yes (Code exec) |
| CTRL | RL (critique-revision) | Single | Process | Multi | Code Refinement | Model | Yes (Code exec) |
| DeepAnalyze | Curriculum RL | Single | Outcome | Multi | Data Science | Rule/External | Yes (Code exec) |
| AceCoder | GRPO | Single | Outcome | Single | Code Generation | External (test cases) | Yes |
## Reasoning
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| Agent0 | 2025.10 | UNC‑Chapel Hill / Salesforce Research / Stanford University | Paper | veRL | |
| KG-R1 | 2025.9 | UIUC/Google | Paper1, Paper2 | veRL | |
| AgentFlow | 2025.9 | Stanford University | arXiv | veRL | |
| ARPO | 2025.7 | RUC, Kuaishou | Paper | veRL | |
| terminal-bench-rl | 2025.7 | Individual (Danau5tin) | N/A | rLLM | |
| MOTIF | 2025.6 | University of Maryland | Paper | trl | |
| cmriat/l0 | 2025.6 | CMRIAT | Paper | veRL | |
| agent-distillation | 2025.5 | KAIST | Paper | Custom | |
| EasyR1 | 2025.4 | Individual | repo1/paper2 | veRL | |
| AutoCoA | 2025.3 | BJTU | Paper | veRL | |
| ToRL | 2025.3 | SJTU | Paper | veRL | |
| ReMA | 2025.3 | SJTU, UCL | Paper | veRL | |
| Agentic-Reasoning | 2025.2 | Oxford | Paper | Custom | |
| SimpleTIR | 2025.2 | NTU, Bytedance | Notion Blog | veRL | |
| openrlhf_async_pipline | 2024.5 | OpenRLHF | Paper | OpenRLHF |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Agent0 | ADPO | Multi | Process | Multi | Math/Visual | Model/Verifier | Yes |
| KG-R1 | GRPO/PPO | Single | Both | Multi | KGQA | Rule/Model | KG Retrieval |
| AgentFlow | Flow-GRPO | Single | Outcome | Multi | Search/Math/QA | Model/External | Yes |
| ARPO | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes |
| terminal-bench-rl | GRPO | Single | Outcome | Multi | Coding/Terminal | Model+External Verifier | Yes |
| MOTIF | GRPO | Single | Outcome | Multi | QA | Rule | No |
| cmriat/l0 | PPO | Multi | Process | Multi | QA | All | Yes |
| agent-distillation | PPO | Single | Process | Multi | QA/Math | External | Yes |
| EasyR1 | GRPO | Single | Process | Multi | Vision-Language | Model | Yes |
| AutoCoA | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes |
| ToRL | GRPO | Single | Outcome | Single | Math | Rule/External | Yes |
| ReMA | PPO | Multi | Outcome | Multi | Math | Rule | No |
| Agentic-Reasoning | Custom | Single | Process | Multi | QA/Math | External | Web Browsing |
| SimpleTIR | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math, Coding | All | Yes |
| openrlhf_async_pipline | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |
## Multi-Agent RL
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| PettingLLMs | 2025.10 | Intel / UCSD | Paper | Custom | |
| MASPRM | 2025.10 | UBC / Huawei | Paper | Custom | |
| ARIA | 2025.6 | Fudan University | Paper | Custom | |
| AMPO | 2025.5 | Tongyi Lab, Alibaba | Paper | veRL | |
| MAPoRL | 2025.8 | Academic | -- | Custom | |
| FlowReasoner | 2025.4 | Sea AI Lab / NUS | Paper | Custom | |
| DrMAS | 2026.2 | NTU | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| PettingLLMs | AT-GRPO | Multi | Both | Multi | Game/Code/Math/Planning | Rule (verifiable) | No |
| MASPRM | PRM (trained from MCTS rollouts) | Multi | Process | Multi | Reasoning (GSM8K/MATH/MMLU) | Learned PRM | No |
| ARIA | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No |
| AMPO | BC/AMPO(GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model-based | No |
| MAPoRL | PPO | Multi | Outcome | Multi | Collaborative LLM Tasks | Rule | No |
| FlowReasoner | GRPO | Multi | Outcome | Multi | Multi-agent Workflow Design | Rule | Yes |
| DrMAS | GRPO (agent-wise) | Multi | Outcome | Multi | Multi-agent LLM Systems | Rule | No |
## Memory
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MEM1 | 2025.7 | MIT | Paper | veRL (based on Search-R1) | |
| Memento | 2025.6 | UCL, Huawei | Paper | Custom | |
| MemAgent | 2025.6 | Bytedance, Tsinghua-SIA | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MEM1 | PPO/GRPO | Single | Outcome | Multi | WebShop/GSM8K/QA | Rule/Model | Yes |
| Memento | soft Q-Learning | Single | Outcome | Multi | Research/QA/Code/Web | External/Rule | Yes |
| MemAgent | PPO, GRPO, DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes |
## Embodied
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| Embodied-R1 | 2025.6 | Tianjin University | Paper | veRL | |
| STeCa | 2025.2 | The Hong Kong Polytechnic University | Paper | FastChat/TRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Embodied-R1 | GRPO | Single | Outcome | Single | Grounding/Waypoint | Rule | No |
| STeCa | DPO (RFT) | Single | Both | Multi | Embodied/Household | Rule/MC | Environment Actions |
## Domain-Specific
| Github Repo | Date | Org | Paper Link | RL Framework | Domain |
|---|---|---|---|---|---|
| MedSAM-Agent | 2026.2 | CUHK/Tencent | Paper | Custom | Medical | |
| OS-R1 | 2025.8 | ISCAS | Paper | Custom | OS/Systems | |
| MMedAgent-RL | 2025.8 | Unknown | paper | Unknown | Medical | |
| DoctorAgent-RL | 2025.5 | UCAS/CAS/USTC | Paper | RAGEN | Medical | |
| Biomni | 2025.3 | Stanford University (SNAP) | Paper | Custom | Biomedical |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MedSAM-Agent | GRPO (via veRL) | Single | Both | Multi | Medical Image Segmentation | Model (clinical fidelity) | Yes (SAM/MedSAM2) |
| OS-R1 | GRPO (via veRL) | Single | Outcome | Multi | Linux Kernel Tuning | Rule | Yes (LightRAG, kernel config) |
| MMedAgent-RL | Unknown | Multi | Unknown | Unknown | Unknown | Unknown | Unknown |
| DoctorAgent-RL | GRPO | Multi | Both | Multi | Consultation/Diagnosis | Model/Rule | No |
| Biomni | TBD | Single | TBD | Single | scRNAseq/CRISPR/ADMET/Knowledge | TBD | Yes |
## Reward & Training
| Github Repo | Date | Org | Paper Link | Focus |
|---|---|---|---|---|
| ToolPRMBench | 2026.1 | Arizona State University | Paper | PRM Benchmark for Tool-Use | |
| RLVR-World | 2025.5 | THU ML Group | Paper | RLVR for World Models | |
| AgentPRM | 2025.2 | Cornell | Paper | Process Reward for Agents | |
| Agentic-Reward-Modeling | 2025.2 | THU-KEG | Paper | Agentic Reward Agent | |
| AgentRM | 2025.2 | THUNLP/Tsinghua | Paper | Generalizable Agent RM |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ToolPRMBench | N/A (Benchmark) | Single | Process | Multi | Tool-Use | Rule/Model | Yes |
| RLVR-World | RLVR | Single | Outcome | Multi | World Modeling (Language/Video) | Model (verifiable) | No |
| AgentPRM | PPO/DPO + PRM | Single | Process | Multi | ALFWorld/General | Model (PRM) | Yes |
| Agentic-Reward-Modeling | DPO/Best-of-N | Single | Outcome | Single | General Instruction | Model (Reward Agent) | Yes (Verification) |
| AgentRM | MCTS/RM-guided | Single | Outcome | Multi | 9 Agent Tasks | Model (regression PRM) | Yes |
## Safety
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| SafeSearch | 2025.11 | Amazon Science | Paper | veRL | |
| curiosity_redteam | 2024.2 | MIT | Paper | Custom | |
| RLbreaker | 2024.6 | Purdue | Paper | Custom | |
| xJailbreak | 2025.1 | Academic | Paper | Custom | |
| Auto-RT | 2025.1 | ICIP-CAS | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| SafeSearch | PPO (GAE/GRPO) | Single | Both | Multi | Safe QA/Search | Rule + Model | Search |
| curiosity_redteam | RL + Curiosity | Single | Outcome | Multi | Red Teaming | Model | Yes (iterative query) |
| RLbreaker | Custom PPO | Single | Outcome | Multi | Jailbreaking | Model | Yes (mutator selection) |
| xJailbreak | RL | Single | Outcome | Multi | Jailbreaking | Model (embedding) | Yes (iterative) |
| Auto-RT | PPO | Single | Outcome | Multi | Red Teaming | Model | Yes (strategy exploration) |
## VLM Agent
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| multimodal-search-r1 | 2025.6 | ByteDance/NTU | Paper | Custom | |
| DeepEyesV2 | 2025.11 | Xiaohongshu | Paper | Custom | |
| VDeepEyes | 2025.5 | Xiaohongshu/XJTU | Paper | veRL | |
| CoSo | 2025.5 | NTU/Alibaba | Paper | Custom | |
| RL4VLM | 2024.5 | UC Berkeley | Paper | Custom | |
| VSC-RL | 2025.2 | Liverpool/Huawei/Tianjin/UCL | Paper | Custom | |
| AlphaDrive | 2025.3 | HUST/Horizon Robotics | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| multimodal-search-r1 | GRPO | Single | Outcome | Multi | Multimodal Search | Rule | Yes (Search) |
| DeepEyesV2 | Outcome RL | Single | Outcome | Multi | Multimodal Reasoning | Rule | Yes (Code exec, Web search) |
| VDeepEyes | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes |
| CoSo | Soft RL (counterfactual) | Single | Outcome | Multi | Android/Card/Embodied | Rule | Yes |
| RL4VLM | PPO | Single | Outcome | Multi | GymCards/ALFWorld | Rule | Yes |
| VSC-RL | Variational RL | Single | Outcome | Multi | Mobile Device Control | Rule | Yes |
| AlphaDrive | GRPO | Single | Outcome | Multi | Autonomous Driving | Rule (4 planning rewards) | No |
## Self-Evolution
⚠️ Note: The definition of "Self-Evolution" in the context of RL for LLM agents is still evolving and not yet well-established. This category currently collects works whose paper titles explicitly contain "self-evolving" or "self-evolution", where the agent improves itself through RL-driven feedback loops.
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| AgentEvolver | 2025.11 | Alibaba/Tongyi Lab | Paper | Custom | |
| SEAgent | 2025.8 | Shanghai AI Lab / CUHK | Paper | Custom | |
| MemSkill | 2026.2 | NTU/UIUC/UIC/Tsinghua | Paper | Custom | |
| MemRL | 2026.1 | SJTU/Xidian/NUS/USTC/MemTensor | Paper | Custom | |
| RAGEN | 2025.1 | RAGEN-AI | Paper | veRL | |
| WebRL | 2024.11 | Tsinghua/Zhipu AI | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| AgentEvolver | ADCA-GRPO | Single | Outcome | Multi | Social Game/Tool-use | Rule | Yes |
| SEAgent | GRPO | Single | Outcome | Multi | Computer Use (OSWorld) | Model | Yes (Screenshot-based) |
| MemSkill | PPO | Single | Process | Multi | QA/ALFWorld | Model (learned skills) | Yes |
| MemRL | RL-based (Q-value) | Single | Process | Multi | HLE/BigCodeBench/ALFWorld | Model (retrieval) | Yes |
| RAGEN | PPO/GRPO (StarPO) | Single | Both | Multi | TextGame | All | Yes |
| WebRL | Actor-Critic RL + ORM | Single | Outcome | Multi | Web Navigation (WebArena) | Model (ORM) | Yes (Web browsing) |
| Github Repo | Date | Org | Task |
|---|---|---|---|
| OpenSandbox | 2026.3 | Alibaba | Code/GUI/Agent Eval |
| OpenEnv | 2026.3 | Meta (PyTorch) | Chess/Arcade/Finance |
| NeMo-Gym | 2026.1 | NVIDIA | Multi-step/Multi-turn |
| open-trajectory-gym | 2026.3 | Individual | CTF/Security |
| R2E-Gym | 2025.4 | UC Berkeley/ANU | SWE |
| LoCoBench-Agent | 2025.11 | Salesforce AI Research | SWE |
| Simia-Agent-Training | 2025.10 | Microsoft | ToolUse/API |
| PaperArena | 2025.9 | University of Science and Technology of China | ScientificLiteratureQA |
| enterprise-deep-research | 2025.9 | Salesforce AI Research | DeepResearch |
| CompassVerifier | 2025.7 | Shanghai AI Lab | Reasoning |
| SWE-smith | 2025.4 | Princeton/Stanford/SWE-bench | SWE |
| SWE-Gym | 2024.12 | UC Berkeley/UIUC/CMU/Apple | SWE |
| Mind2Web-2 | 2025.6 | Ohio State University | Web |
| gem | 2025.5 | Sea AI Lab | Math/Code/Game/QA |
| MLE-Dojo | 2025.5 | GIT, Stanford | MLE |
| atropos | 2025.4 | Nous Research | Game/Code/Tool |
| InternBootcamp | 2025.4 | InternBootcamp | Coding/QA/Game |
| loong | 2025.3 | CAMEL-AI.org | RLVR |
| DataSciBench | 2025.2 | Tsinghua | Data Analysis |
| reasoning-gym | 2025.1 | open-thought | Math/Game |
| llmgym | 2025.1 | tensorzero | TextGame/Tool |
| debug-gym | 2024.11 | Microsoft Research | Debugging/Game/Code |
| gym-llm | 2024.8 | Rodrigo Sánchez Molina | Control/Game |
| AgentGym | 2024.6 | Fudan | Web/Game |
| tau-bench | 2024.6 | Sierra | Tool |
| appworld | 2024.6 | Stony Brook University | Phone Use |
| android_world | 2024.5 | Google Research | Phone Use |
| TheAgentCompany | 2024.3 | CMU, Duke | Coding |
| LlamaGym | 2024.3 | Rohan Pandey | Game |
| visualwebarena | 2024.1 | CMU | Web |
| LMRL-Gym | 2023.12 | UC Berkeley | Game |
| OSWorld | 2023.10 | HKU, CMU, Salesforce, Waterloo | Computer Use |
| webarena | 2023.7 | CMU | Web |
| AgentBench | 2023.7 | Tsinghua University | Game/Web/QA/Tool |
| WebShop | 2022.7 | Princeton-NLP | Web |
| ScienceWorld | 2022.3 | AllenAI | TextGame/ScienceQA |
| alfworld | 2020.10 | Microsoft, CMU, UW | Embodied |
| factorio-learning-environment | 2021.6 | JackHopkins | Game |
| jericho | 2018.10 | Microsoft, GIT | TextGame |
| TextWorld | 2018.6 | Microsoft Research | TextGame |
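Most of the environments above expose a gym-style `reset`/`step` interface over which RL rollouts are collected. A toy sketch of that interaction loop (the environment and policy here are hypothetical, not any listed project):

```python
# Illustrative gym-style text environment and rollout loop
# (method names follow the Gym convention; everything else is made up).
class EchoTextEnv:
    """Toy text environment: reward 1.0 when the agent replies 'done'."""

    def reset(self) -> str:
        self.turns = 0
        return "Task: reply with 'done' to finish."

    def step(self, action: str):
        self.turns += 1
        success = action.strip().lower() == "done"
        done = success or self.turns >= 5  # episode cap at 5 turns
        reward = 1.0 if success else 0.0
        return f"turn {self.turns}", reward, done, {}

def rollout(env, policy):
    """Collect one trajectory of (observation, action, reward) tuples."""
    obs, done, traj = env.reset(), False, []
    while not done:
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
    return traj

traj = rollout(EchoTextEnv(), lambda obs: "done")
```

Trajectories collected this way are what the RL frameworks in the earlier tables consume; newer Gymnasium-style environments return a 5-tuple from `step`, but the loop is otherwise the same.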
- JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning
- Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
- Acting Less is Reasoning More! Teaching Model to Act Efficiently
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- MUA-RL: Multi-Turn User-Interacting Agent Reinforcement Learning for Agentic Tool Use
- Understanding Tool-Integrated Reasoning
- Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
- Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
- EnvX: Agentize Everything with Agentic AI
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
- UI-Venus Technical Report: Building High-performance UI Agents with RFT
- Agent2: An Agent-Generates-Agent Framework for Reinforcement Learning Automation
- Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use
- Adversarial Reinforcement Learning for Large Language Model Agent Safety
- Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction
- InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
If you find this repository useful, please consider citing it:
@misc{agentsMeetRL,
  title={When LLM Agents Meet Reinforcement Learning: A Comprehensive Survey},
  author={AgentsMeetRL Contributors},
  year={2025},
  url={https://github.com/thinkwee/agentsMeetRL}
}

Made with ❤️ by the AgentsMeetRL community
