Skip to content

Inference · Groq

Groq

Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.

FREEMIUMCloudAPIWeb

GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.

Model support

Multi-model

  • Llama
  • DeepSeek
  • Qwen
  • Kimi K2
  • GPT-OSS

Open-weights catalog on Groq LPUs; OpenAI-compatible API.

Where it runs

  • API
  • Web

Tags

  • #inference
  • #low-latency
  • #lpu
  • #open-weights
Open GroqDocsPricing

Related in Inference

  • View Together AI details
    InferenceFREEMIUMVetted

    Together AI

    Together

    Fine-tuning + inference for open-weights models. Broad coverage.

    Hosted inference and fine-tuning across hundreds of open-weights models (Llama, Mistral, DeepSeek, Qwen, etc.). Strong pricing for inference-at-scale; LoRA + full fine-tuning supported.

    • inference
    • fine-tuning
    • open-weights
    • lora
  • View OpenRouter details
    InferenceFREEMIUM

    OpenRouter

    OpenRouter

    One OpenAI-compatible API in front of 300+ models from every provider.

    A unified gateway that routes a single endpoint and API key to models from Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and more — swap models by changing one parameter, with automatic fallbacks and one consolidated bill. Pass-through token pricing plus dozens of free models.

    • gateway
    • routing
    • multi-model
    • fallbacks
  • View Replicate details
    InferenceFREEMIUM

    Replicate

    Replicate

    Run, fine-tune, and deploy thousands of open models via one API.

    A platform to run open-source models with one API call — image, video, audio, and language — plus fine-tuning and custom deploys with pay-per-second billing. No infra to manage.

    • model-hosting
    • fine-tuning
    • api
    • open-source
  • View Fireworks AI details
    InferenceFREEMIUM

    Fireworks AI

    Fireworks AI

    Fast inference + fine-tuning. Production deployments at scale.

    Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.

    • inference
    • fine-tuning
    • low-latency
    • production