Inference · vLLM Project

vLLM

High-throughput, memory-efficient inference engine for LLMs.

FREEOpen sourceSelf-hostLinuxCLIAPI

Open-source (Apache-2.0) serving engine for large language and vision-language models, originally from UC Berkeley's Sky Computing Lab. Its PagedAttention KV-cache management and continuous batching deliver high throughput on commodity GPUs. Now a community project with 1000s of contributors and an OpenAI-compatible server.

Model support

Multi-model

Llama
Qwen
Mistral
DeepSeek

Serves most Hugging Face transformer LLMs and vision-language models.

Where it runs

Linux
CLI
API

vLLM

High-throughput, memory-efficient inference engine for LLMs.

FREEOpen sourceSelf-hostLinuxCLIAPI

Model support

Multi-model

Llama
Qwen
Mistral
DeepSeek

Serves most Hugging Face transformer LLMs and vision-language models.

Where it runs

Linux
CLI
API

vLLM

Multi-model

Baseten

Cerebras

LiteLLM

Morph

SambaNova Cloud

fal

Groq

LM Studio

Ollama

OpenRouter

vLLM

Multi-model

Baseten

Cerebras

LiteLLM

Morph

SambaNova Cloud

fal

Groq

LM Studio

Ollama

OpenRouter