Production inference platform offering both pre-optimized Model APIs (Llama, DeepSeek, and more, billed per token) and dedicated GPU/CPU deployments for custom models, billed per minute with no charge for idle time. Custom models are packaged with its open-source Truss format and autoscale, including scale-to-zero. Aimed at low-latency, high-throughput serving.
Inference · Baseten
Baseten
Inference cloud for serving any AI model in production.
Model support
Multi-model
- Llama
- DeepSeek
- Custom
Where it runs
- Web
- API
Tags
- #inference
- #model-serving
- #gpu
- #autoscaling
Related in Inference
View Cerebras details InferenceFREEMIUMCECerebras
Cerebras Systems
Wafer-scale inference cloud for open models.
Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.
AI insight: Runs models on a single dinner-plate-sized wafer instead of GPU clusters, hitting ~2,000 tokens/sec where GPU clouds plateau far lower.
- inference
- fast-inference
- wafer-scale
- open-models
View SambaNova Cloud details InferenceFREEMIUMSASambaNova Cloud
SambaNova Systems
Fast inference for open models on custom RDU chips.
Inference cloud running open-weight models — Llama, DeepSeek, Qwen, gpt-oss — on SambaNova's RDU hardware at hundreds of tokens per second, including full-precision Llama 405B. Provides an OpenAI-compatible API with a free tier and pay-per-token pricing.
AI insight: One of the few clouds serving Llama 405B in native 16-bit precision at 100+ tokens/sec, not a quantized copy.
- inference
- fast-inference
- open-models
- rdu
View fal details InferenceFREEMIUMFAfal
fal
Serverless inference API for image, video, audio, and 3D models.
A generative-media inference platform exposing FLUX, Kling, Veo, Wan, Stable Diffusion, and 600+ image/video/audio/3D models through one fast, serverless API — no GPUs to manage and near-zero cold starts. Pay per output or per GPU-second; free starter credits to test. Popular as the production backend for AI media features.
AI insight: Specializes in generative-media latency — FLUX, Kling, Veo and 600+ media models — where general inference hosts focus on text.
- generative-media
- image-gen
- video-gen
- serverless
View Groq details InferenceFREEMIUMGRGroq
Groq
Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.
GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.
AI insight: Speed comes from custom LPU silicon, not GPUs — which is why it serves open models at hundreds of tokens/sec on an OpenAI-compatible API.
- inference
- low-latency
- lpu
- open-weights
View LM Studio details InferenceFREELMLM Studio
LM Studio
Desktop app to discover, download, and run local LLMs privately.
A GUI for running open-weight models on your own hardware — browse and download GGUF/MLX models, chat offline, and expose an OpenAI- and Anthropic-compatible local server for your apps. Includes RAG over local files, MCP tool-use support, and dual llama.cpp + Apple MLX runtimes. Free for personal and commercial use; the app itself is proprietary.
AI insight: Free even for commercial use, though the app itself is closed-source — and it serves both OpenAI- and Anthropic-compatible local APIs.
- local
- llm-runner
- gui
- privacy
View Ollama details InferenceFREEMIUMOpen coreOLOllama
Ollama
Run open-weight LLMs locally with one command. OpenAI-compatible API.
The de-facto way to pull and run open-weight models (Llama, Qwen, Gemma, DeepSeek, gpt-oss) on your own machine — no API key, no data leaving the device. Ships native macOS/Windows/Linux apps, an OpenAI-compatible server, and official Python/JS libraries. MIT-licensed and free locally; an optional paid Ollama Cloud runs larger models.
AI insight: Its OpenAI-compatible local server makes it a drop-in backend — point any app at localhost and swap the cloud for your own GPU.
- local
- open-source
- llm-runner
- self-hosted
View OpenRouter details InferenceFREEMIUMOPOpenRouter
OpenRouter
One OpenAI-compatible API in front of 300+ models from every provider.
A unified gateway that routes a single endpoint and API key to models from Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and more — swap models by changing one parameter, with automatic fallbacks and one consolidated bill. Pass-through token pricing plus dozens of free models.
AI insight: Swap among 300+ models by changing one string, with automatic fallback if a provider is down — and one consolidated bill.
- gateway
- routing
- multi-model
- fallbacks
View Replicate details InferenceFREEMIUMREReplicate
Replicate
Run, fine-tune, and deploy thousands of open models via one API.
A platform to run open-source models with one API call — image, video, audio, and language — plus fine-tuning and custom deploys with pay-per-second billing. No infra to manage.
AI insight: Any model is a 'Cog' container behind one API, billed per second — the low-commitment way to ship a model you didn't train.
- model-hosting
- fine-tuning
- api
- open-source
View Fireworks AI details InferenceFREEMIUMFIFireworks AI
Fireworks AI
Fast inference + fine-tuning. Production deployments at scale.
Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.
AI insight: Runs open models on its own FireAttention serving stack for low latency, and covers vision and audio models, not just text.
- inference
- fine-tuning
- low-latency
- production
View Modal details InferenceFREEMIUMMOModal
Modal Labs
Serverless GPUs. Run training, inference, batch jobs from Python.
Define cloud workloads in Python, deploy with one command — GPU access on demand, fast cold starts, fair-share pricing. The default 'I need to fine-tune a model from a Jupyter cell' platform.
AI insight: You define GPU infra in Python decorators, not YAML or Dockerfiles — its fast cold starts make per-job GPU billing practical.
- gpu
- serverless
- python
- training