Skip to content

Inference · SambaNova Systems

SambaNova Cloud

Fast inference for open models on custom RDU chips.

FREEMIUMCloudWebAPI

Inference cloud running open-weight models — Llama, DeepSeek, Qwen, gpt-oss — on SambaNova's RDU hardware at hundreds of tokens per second, including full-precision Llama 405B. Provides an OpenAI-compatible API with a free tier and pay-per-token pricing.

Model support

Multi-model

  • Llama
  • DeepSeek
  • Qwen
  • gpt-oss

Serves open-weight models on RDU hardware via an OpenAI-compatible API.

Where it runs

  • Web
  • API

Tags

  • #inference
  • #fast-inference
  • #open-models
  • #rdu
Open SambaNova CloudDocsPricing
  • View Baseten details
    InferenceFREEMIUM

    Baseten

    Baseten

    Inference cloud for serving any AI model in production.

    Production inference platform offering both pre-optimized Model APIs (Llama, DeepSeek, and more, billed per token) and dedicated GPU/CPU deployments for custom models, billed per minute with no charge for idle time. Custom models are packaged with its open-source Truss format and autoscale, including scale-to-zero. Aimed at low-latency, high-throughput serving.

    AI insight: Models use its open-source 'Truss' packaging and scale to zero, so you pay per minute of active compute, not for idle GPUs.

    • inference
    • model-serving
    • gpu
    • autoscaling
  • View Cerebras details
    InferenceFREEMIUM

    Cerebras

    Cerebras Systems

    Wafer-scale inference cloud for open models.

    Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.

    AI insight: Runs models on a single dinner-plate-sized wafer instead of GPU clusters, hitting ~2,000 tokens/sec where GPU clouds plateau far lower.

    • inference
    • fast-inference
    • wafer-scale
    • open-models
  • View Morph details
    InferenceFREEMIUM

    Morph

    Morph

    Fast models that apply AI code edits to files in milliseconds.

    Infrastructure for coding agents centered on Fast Apply, a specialized model that merges AI-generated edits into files at ~10,500 tokens/sec instead of full-file rewrites or brittle search-and-replace. Also serves WarpGrep code search, context compaction, and a model router via an OpenAI-compatible API. Used in production by JetBrains, Vercel, and Webflow.

    AI insight: Its Fast Apply model merges LLM code edits at ~10,500 tok/s — the dedicated write layer agents use instead of slow full-file rewrites.

    • code-editing
    • fast-apply
    • coding-agents
    • api
  • View fal details
    InferenceFREEMIUM

    fal

    fal

    Serverless inference API for image, video, audio, and 3D models.

    A generative-media inference platform exposing FLUX, Kling, Veo, Wan, Stable Diffusion, and 600+ image/video/audio/3D models through one fast, serverless API — no GPUs to manage and near-zero cold starts. Pay per output or per GPU-second; free starter credits to test. Popular as the production backend for AI media features.

    AI insight: Specializes in generative-media latency — FLUX, Kling, Veo and 600+ media models — where general inference hosts focus on text.

    • generative-media
    • image-gen
    • video-gen
    • serverless
  • View Groq details
    InferenceFREEMIUM

    Groq

    Groq

    Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.

    GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.

    AI insight: Speed comes from custom LPU silicon, not GPUs — which is why it serves open models at hundreds of tokens/sec on an OpenAI-compatible API.

    • inference
    • low-latency
    • lpu
    • open-weights
  • View LM Studio details
    InferenceFREE

    LM Studio

    LM Studio

    Desktop app to discover, download, and run local LLMs privately.

    A GUI for running open-weight models on your own hardware — browse and download GGUF/MLX models, chat offline, and expose an OpenAI- and Anthropic-compatible local server for your apps. Includes RAG over local files, MCP tool-use support, and dual llama.cpp + Apple MLX runtimes. Free for personal and commercial use; the app itself is proprietary.

    AI insight: Free even for commercial use, though the app itself is closed-source — and it serves both OpenAI- and Anthropic-compatible local APIs.

    • local
    • llm-runner
    • gui
    • privacy
  • View Ollama details
    InferenceFREEMIUMOpen core

    Ollama

    Ollama

    Run open-weight LLMs locally with one command. OpenAI-compatible API.

    The de-facto way to pull and run open-weight models (Llama, Qwen, Gemma, DeepSeek, gpt-oss) on your own machine — no API key, no data leaving the device. Ships native macOS/Windows/Linux apps, an OpenAI-compatible server, and official Python/JS libraries. MIT-licensed and free locally; an optional paid Ollama Cloud runs larger models.

    AI insight: Its OpenAI-compatible local server makes it a drop-in backend — point any app at localhost and swap the cloud for your own GPU.

    • local
    • open-source
    • llm-runner
    • self-hosted
  • View OpenRouter details
    InferenceFREEMIUM

    OpenRouter

    OpenRouter

    One OpenAI-compatible API in front of 300+ models from every provider.

    A unified gateway that routes a single endpoint and API key to models from Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and more — swap models by changing one parameter, with automatic fallbacks and one consolidated bill. Pass-through token pricing plus dozens of free models.

    AI insight: Swap among 300+ models by changing one string, with automatic fallback if a provider is down — and one consolidated bill.

    • gateway
    • routing
    • multi-model
    • fallbacks
  • View Replicate details
    InferenceFREEMIUM

    Replicate

    Replicate

    Run, fine-tune, and deploy thousands of open models via one API.

    A platform to run open-source models with one API call — image, video, audio, and language — plus fine-tuning and custom deploys with pay-per-second billing. No infra to manage.

    AI insight: Any model is a 'Cog' container behind one API, billed per second — the low-commitment way to ship a model you didn't train.

    • model-hosting
    • fine-tuning
    • api
    • open-source
  • View Fireworks AI details
    InferenceFREEMIUM

    Fireworks AI

    Fireworks AI

    Fast inference + fine-tuning. Production deployments at scale.

    Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.

    AI insight: Runs open models on its own FireAttention serving stack for low latency, and covers vision and audio models, not just text.

    • inference
    • fine-tuning
    • low-latency
    • production