Here are 10 public repositories matching this topic.
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, MoE expert parallelism, OpenAI-compatible serving
Updated Mar 28, 2026 · Python
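Several of the techniques in that feature list recur across this topic. As a hedged illustration of the first one: a paged KV cache maps each sequence's logical token positions onto fixed-size physical blocks through a per-sequence block table, so memory is allocated on demand instead of reserved at maximum length up front. A minimal Python sketch of that bookkeeping (all names here are hypothetical, not taken from the repo):

```python
# Minimal paged KV cache sketch: logical token positions map to fixed-size
# physical blocks, so sequences grow without reserving max-length memory.
# All names are illustrative, not taken from any listed repo.

BLOCK_SIZE = 16  # tokens per physical block


class BlockAllocator:
    """Hands out physical block ids from a fixed pool and recycles freed ones."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)     # three physical block ids, not one contiguous slab
seq.release()              # blocks return to the pool for other requests
```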
A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching
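Continuous batching, the technique this and several other entries are built around, means the scheduler re-forms the batch at every decode iteration: finished requests exit immediately and waiting requests join mid-flight, rather than the whole batch finishing together. A minimal sketch of that loop, with `model_step` and the request fields as hypothetical stand-ins:

```python
# Continuous batching in a nutshell: the batch is rebuilt every iteration,
# so finished requests leave and waiting requests join mid-flight.
# `model_step` and the request attributes are hypothetical stand-ins.
from collections import deque


def serve(model_step, waiting: deque, max_batch: int = 8):
    running = []
    while waiting or running:
        # Admit new requests up to the batch budget.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One decode step for every running request; assume model_step
        # returns one (token, finished) pair per request.
        results = model_step(running)

        still_running = []
        for req, (token, finished) in zip(running, results):
            req.output.append(token)
            if finished:
                req.complete()        # stream/return the result immediately
            else:
                still_running.append(req)
        running = still_running       # freed slots are refilled next iteration
```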
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
Updated Mar 25, 2026 · Python
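Token throttling, as the title suggests, is an admission-control idea: cap the number of in-flight tokens so no pipeline stage is oversubscribed. The sketch below shows only that generic gate and makes no claim about gLLM's actual API:

```python
# Illustrative token-throttling gate: admit a request only while the total
# number of in-flight tokens stays under a global budget. Not gLLM's actual
# API; just the admission-control idea its title describes.
class TokenThrottle:
    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.in_flight = 0

    def try_admit(self, request_tokens: int) -> bool:
        if self.in_flight + request_tokens > self.token_budget:
            return False          # hold the request in the waiting queue
        self.in_flight += request_tokens
        return True

    def release(self, request_tokens: int) -> None:
        self.in_flight -= request_tokens


gate = TokenThrottle(token_budget=4096)
assert gate.try_admit(1024)       # fits within the budget
assert not gate.try_admit(3584)   # would exceed the budget; throttled
gate.release(1024)
```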
Continuous batching for TTS — like vLLM, but for voice. Serve 10+ simultaneous text-to-speech requests on a single GPU.
Updated Mar 15, 2026 · Python
Fork of an OpenAI- and Anthropic-compatible server for Apple Silicon. Native MLX backend, 500+ tok/s. Run LLMs and vision-language models with continuous batching, MCP tool calling, and multimodal support.
Updated Mar 20, 2026 · Python
OpenAI-compatible server with continuous batching for MLX on Apple Silicon
Updated Dec 4, 2025 · Python
Updated Jun 19, 2024 · Jupyter Notebook
Adaptive LLM inference scheduler simulation — continuous batching, priority preemption, KV-cache routing, and speculative decoding in Python/asyncio.
Updated Mar 10, 2026 · Python
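The combination of asyncio with priority preemption suggests a scheduler that re-sorts runnable requests at every step, so an urgent arrival can jump ahead of in-progress low-priority work at step granularity. A toy, self-contained version of that idea (not this repo's code):

```python
# Toy asyncio scheduler with priority preemption: each "decode step", the
# highest-priority runnable request is served, and running requests are
# requeued after every step so a newer, more urgent arrival can overtake
# them. Purely illustrative; not taken from the listed repo.
import asyncio
import heapq


async def scheduler(queue: asyncio.Queue, steps_per_request: int = 3):
    runnable = []    # heap of (priority, arrival_order, name); lower runs first
    progress = {}
    order = 0
    while True:
        # Drain newly arrived requests into the priority heap.
        while not queue.empty():
            prio, name = queue.get_nowait()
            heapq.heappush(runnable, (prio, order, name))
            progress[name] = 0
            order += 1
        if not runnable:
            if queue.empty():
                return            # nothing left to do in this toy run
            continue
        prio, arrived, name = heapq.heappop(runnable)
        progress[name] += 1       # one "decode step" of work
        print(f"step {name} (prio {prio}): {progress[name]}/{steps_per_request}")
        if progress[name] < steps_per_request:
            heapq.heappush(runnable, (prio, arrived, name))  # requeue: preemptible
        await asyncio.sleep(0)    # yield so new arrivals can preempt next step


async def main():
    q = asyncio.Queue()
    q.put_nowait((1, "background"))
    q.put_nowait((0, "interactive"))   # lower number = higher priority
    await scheduler(q)                 # serves "interactive" to completion first


asyncio.run(main())
```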
PagedAttention + continuous batching inference engine prototype (Rust): paged KV cache management and dynamic scheduling.
Updated Mar 24, 2026 · Rust
Process batches of large language model tasks efficiently using multithreading in C++ for faster, more scalable LLM workflows.
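The same pattern in Python terms: worker threads drain a pool of independent tasks and results are collected as they complete. `run_llm_task` below is a hypothetical stand-in for whatever per-task call the repo's C++ workers would make:

```python
# Batch processing with a worker pool: submit many independent LLM tasks
# and collect results as they complete. `run_llm_task` is a hypothetical
# placeholder, not this repo's API.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_llm_task(prompt: str) -> str:
    # Placeholder for an actual model or API call.
    return prompt.upper()


prompts = [f"task {i}" for i in range(32)]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(run_llm_task, p): p for p in prompts}
    for fut in as_completed(futures):
        print(futures[fut], "->", fut.result())
```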
Improve this page
Add a description, image, and links to the continuous-batching topic page so that developers can more easily learn about it.
Add this topic to your repo
To associate your repository with the continuous-batching topic, visit your repo's landing page and select "manage topics."