Inference · SambaNova Systems

SambaNova Cloud

Fast inference for open models on custom RDU chips.

FREEMIUMCloudWebAPI

Inference cloud running open-weight models — Llama, DeepSeek, Qwen, gpt-oss — on SambaNova's RDU hardware at hundreds of tokens per second, including full-precision Llama 405B. Provides an OpenAI-compatible API with a free tier and pay-per-token pricing.

Model support

Multi-model

Llama
DeepSeek
Qwen
gpt-oss

Serves open-weight models on RDU hardware via an OpenAI-compatible API.

Where it runs

Tags

#inference
#fast-inference
#open-models
#rdu

Open SambaNova Cloud Docs Pricing

View Baseten details
InferenceFREEMIUM
Baseten
Baseten
Inference cloud for serving any AI model in production.
Production inference platform offering both pre-optimized Model APIs (Llama, DeepSeek, and more, billed per token) and dedicated GPU/CPU deployments for custom models, billed per minute with no charge for idle time. Custom models are packaged with its open-source Truss format and autoscale, including scale-to-zero. Aimed at low-latency, high-throughput serving.
AI insight: Models use its open-source 'Truss' packaging and scale to zero, so you pay per minute of active compute, not for idle GPUs.
- inference
- model-serving
- gpu
- autoscaling
Open
View Cerebras details
InferenceFREEMIUM
Cerebras
Cerebras Systems
Wafer-scale inference cloud for open models.
Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.
AI insight: Runs models on a single dinner-plate-sized wafer instead of GPU clusters, hitting ~2,000 tokens/sec where GPU clouds plateau far lower.
- inference
- fast-inference
- wafer-scale
- open-models
Open
View Morph details
InferenceFREEMIUM
Morph
Morph
Fast models that apply AI code edits to files in milliseconds.
Infrastructure for coding agents centered on Fast Apply, a specialized model that merges AI-generated edits into files at ~10,500 tokens/sec instead of full-file rewrites or brittle search-and-replace. Also serves WarpGrep code search, context compaction, and a model router via an OpenAI-compatible API. Used in production by JetBrains, Vercel, and Webflow.
AI insight: Its Fast Apply model merges LLM code edits at ~10,500 tok/s — the dedicated write layer agents use instead of slow full-file rewrites.
- code-editing
- fast-apply
- coding-agents
- api
Open
View fal details
InferenceFREEMIUM
fal
fal
Serverless inference API for image, video, audio, and 3D models.
A generative-media inference platform exposing FLUX, Kling, Veo, Wan, Stable Diffusion, and 600+ image/video/audio/3D models through one fast, serverless API — no GPUs to manage and near-zero cold starts. Pay per output or per GPU-second; free starter credits to test. Popular as the production backend for AI media features.
AI insight: Specializes in generative-media latency — FLUX, Kling, Veo and 600+ media models — where general inference hosts focus on text.
- generative-media
- image-gen
- video-gen
- serverless
Open
View Groq details
InferenceFREEMIUM
Groq
Groq
Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.
GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.
AI insight: Speed comes from custom LPU silicon, not GPUs — which is why it serves open models at hundreds of tokens/sec on an OpenAI-compatible API.
- inference
- low-latency
- lpu
- open-weights
Open
View LM Studio details
InferenceFREE
LM Studio
LM Studio
Desktop app to discover, download, and run local LLMs privately.
A GUI for running open-weight models on your own hardware — browse and download GGUF/MLX models, chat offline, and expose an OpenAI- and Anthropic-compatible local server for your apps. Includes RAG over local files, MCP tool-use support, and dual llama.cpp + Apple MLX runtimes. Free for personal and commercial use; the app itself is proprietary.
AI insight: Free even for commercial use, though the app itself is closed-source — and it serves both OpenAI- and Anthropic-compatible local APIs.
- local
- llm-runner
- gui
- privacy
Open
View Ollama details
InferenceFREEMIUMOpen core
Ollama
Ollama
Run open-weight LLMs locally with one command. OpenAI-compatible API.
The de-facto way to pull and run open-weight models (Llama, Qwen, Gemma, DeepSeek, gpt-oss) on your own machine — no API key, no data leaving the device. Ships native macOS/Windows/Linux apps, an OpenAI-compatible server, and official Python/JS libraries. MIT-licensed and free locally; an optional paid Ollama Cloud runs larger models.
AI insight: Its OpenAI-compatible local server makes it a drop-in backend — point any app at localhost and swap the cloud for your own GPU.
- local
- open-source
- llm-runner
- self-hosted
Open
View OpenRouter details
InferenceFREEMIUM
OpenRouter
OpenRouter
One OpenAI-compatible API in front of 300+ models from every provider.
A unified gateway that routes a single endpoint and API key to models from Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and more — swap models by changing one parameter, with automatic fallbacks and one consolidated bill. Pass-through token pricing plus dozens of free models.
AI insight: Swap among 300+ models by changing one string, with automatic fallback if a provider is down — and one consolidated bill.
- gateway
- routing
- multi-model
- fallbacks
Open
View Replicate details
InferenceFREEMIUM
Replicate
Replicate
Run, fine-tune, and deploy thousands of open models via one API.
A platform to run open-source models with one API call — image, video, audio, and language — plus fine-tuning and custom deploys with pay-per-second billing. No infra to manage.
AI insight: Any model is a 'Cog' container behind one API, billed per second — the low-commitment way to ship a model you didn't train.
- model-hosting
- fine-tuning
- api
- open-source
Open
View Fireworks AI details
InferenceFREEMIUM
Fireworks AI
Fireworks AI
Fast inference + fine-tuning. Production deployments at scale.
Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.
AI insight: Runs open models on its own FireAttention serving stack for low latency, and covers vision and audio models, not just text.
- inference
- fine-tuning
- low-latency
- production
Open

Open SambaNova Cloud

Multi-model

Baseten

Cerebras

Morph

fal

Groq

LM Studio

Ollama

OpenRouter

Replicate

Fireworks AI