Loading…
Inference · vLLM Project
High-throughput, memory-efficient inference engine for LLMs.
Open-source (Apache-2.0) serving engine for large language and vision-language models, originally from UC Berkeley's Sky Computing Lab. Its PagedAttention KV-cache management and continuous batching deliver high throughput on commodity GPUs. Now a community project with 1000s of contributors and an OpenAI-compatible server.
Model support
Serves most Hugging Face transformer LLMs and vision-language models.
Where it runs
Tags
Related in Inference
Baseten
Inference cloud for serving any AI model in production.
Production inference platform offering both pre-optimized Model APIs (Llama, DeepSeek, and more, billed per token) and dedicated GPU/CPU deployments for custom models, billed per minute with no charge for idle time. Custom models are packaged with its open-source Truss format and autoscale, including scale-to-zero. Aimed at low-latency, high-throughput serving.
AI insight: Models use its open-source 'Truss' packaging and scale to zero, so you pay per minute of active compute, not for idle GPUs.
Cerebras Systems
Wafer-scale inference cloud for open models.
Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.
AI insight: Runs models on a single dinner-plate-sized wafer instead of GPU clusters, hitting ~2,000 tokens/sec where GPU clouds plateau far lower.
BerriAI
AI gateway: call 100+ LLMs in one OpenAI-format interface.
Open-source Python SDK and proxy server (AI gateway) that exposes 100+ LLM providers through a single OpenAI-compatible API, with cost tracking, load balancing, fallbacks, caching, and guardrails. Self-host the proxy or use the managed cloud; a paid Enterprise tier adds SSO, audit logs, and support.
AI insight: Translates 100+ providers into one OpenAI-format call, so many other AI tools quietly embed it as their model-routing layer.
Morph
Fast models that apply AI code edits to files in milliseconds.
Infrastructure for coding agents centered on Fast Apply, a specialized model that merges AI-generated edits into files at ~10,500 tokens/sec instead of full-file rewrites or brittle search-and-replace. Also serves WarpGrep code search, context compaction, and a model router via an OpenAI-compatible API. Used in production by JetBrains, Vercel, and Webflow.
AI insight: Its Fast Apply model merges LLM code edits at ~10,500 tok/s — the dedicated write layer agents use instead of slow full-file rewrites.
SambaNova Systems
Fast inference for open models on custom RDU chips.
Inference cloud running open-weight models — Llama, DeepSeek, Qwen, gpt-oss — on SambaNova's RDU hardware at hundreds of tokens per second, including full-precision Llama 405B. Provides an OpenAI-compatible API with a free tier and pay-per-token pricing.
AI insight: One of the few clouds serving Llama 405B in native 16-bit precision at 100+ tokens/sec, not a quantized copy.
fal
Serverless inference API for image, video, audio, and 3D models.
A generative-media inference platform exposing FLUX, Kling, Veo, Wan, Stable Diffusion, and 600+ image/video/audio/3D models through one fast, serverless API — no GPUs to manage and near-zero cold starts. Pay per output or per GPU-second; free starter credits to test. Popular as the production backend for AI media features.
AI insight: Specializes in generative-media latency — FLUX, Kling, Veo and 600+ media models — where general inference hosts focus on text.
Groq
Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.
GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.
AI insight: Speed comes from custom LPU silicon, not GPUs — which is why it serves open models at hundreds of tokens/sec on an OpenAI-compatible API.
LM Studio
Desktop app to discover, download, and run local LLMs privately.
A GUI for running open-weight models on your own hardware — browse and download GGUF/MLX models, chat offline, and expose an OpenAI- and Anthropic-compatible local server for your apps. Includes RAG over local files, MCP tool-use support, and dual llama.cpp + Apple MLX runtimes. Free for personal and commercial use; the app itself is proprietary.
AI insight: Free even for commercial use, though the app itself is closed-source — and it serves both OpenAI- and Anthropic-compatible local APIs.
Ollama
Run open-weight LLMs locally with one command. OpenAI-compatible API.
The de-facto way to pull and run open-weight models (Llama, Qwen, Gemma, DeepSeek, gpt-oss) on your own machine — no API key, no data leaving the device. Ships native macOS/Windows/Linux apps, an OpenAI-compatible server, and official Python/JS libraries. MIT-licensed and free locally; an optional paid Ollama Cloud runs larger models.
AI insight: Its OpenAI-compatible local server makes it a drop-in backend — point any app at localhost and swap the cloud for your own GPU.
OpenRouter
One OpenAI-compatible API in front of 300+ models from every provider.
A unified gateway that routes a single endpoint and API key to models from Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and more — swap models by changing one parameter, with automatic fallbacks and one consolidated bill. Pass-through token pricing plus dozens of free models.
AI insight: Swap among 300+ models by changing one string, with automatic fallback if a provider is down — and one consolidated bill.