Agents AI

llama.cpp logo

llama.cpp

The engine behind local LLMs

Local LLM Tools
Visit website
Free tierFrom FreeGeorgi Gerganov / ggml-orgFounded 2023Reviewed Jun 2026
Read our hands-on review
Best Apps to Run Local LLMs (2026)

Our take

Our verdict

8.1/10

Open-source C/C++ inference engine with best-in-class quantization and the GGUF format that powers most local LLM apps across nearly any hardware.

Best for: Developers and power users who want maximum control and performance running open-weight models directly on their own hardware.

Overall score8.1/10
Capability9.0
Ease of use4.0
Value for money10.0
Reliability9.0
Support & docs8.0

Pros

  • Best-in-class quantization (from ~1.5-bit to 8-bit) lets large models run on limited RAM/VRAM
  • Runs on almost any hardware — CPU, Apple Metal, CUDA, ROCm, Vulkan, SYCL and more
  • MIT-licensed and battle-tested; it underpins Ollama, LM Studio, GPT4All and much of the ecosystem
  • Built-in llama-server provides a drop-in OpenAI-compatible API and a minimal web UI

Cons

  • Not beginner-friendly: getting full features often means building from source with a C++ toolchain
  • No real GUI — the bundled web UI is minimal; most end users rely on third-party front-ends
  • Requires GGUF models; using Hugging Face checkpoints means a conversion step
  • Performance tuning (quant level, GPU offload, context size) demands technical knowledge

Overview

llama.cpp is the open-source (MIT) C/C++ inference engine created by Georgi Gerganov in 2023 and now maintained under the ggml-org organization. It is the foundation of the local-LLM world: its quantization techniques and GGUF model format are what make it possible to run capable models on consumer CPUs, Apple Silicon and a huge range of GPUs — and tools like Ollama, LM Studio and GPT4All are built on it or its ecosystem. With roughly 118,000 GitHub stars, it is among the most-starred AI projects anywhere.

It is also the most technical option in this category. Full feature support generally means building from source, models must be in GGUF (often via a conversion step), and getting the best performance requires understanding quantization levels and GPU offloading. The included llama-server exposes an OpenAI-compatible API and a minimal web UI, but most people interact with llama.cpp indirectly, through a friendlier front-end. For developers who want maximum control and efficiency, nothing else matches its reach.

Key Benefits

  • Runs anywhere: One codebase targets CPUs, Apple Metal, CUDA, ROCm, Vulkan and more.
  • Extreme efficiency: Aggressive quantization fits large models into modest memory budgets.
  • Foundational and reliable: Powers most of the ecosystem and is heavily exercised in production.
  • Open and free: MIT-licensed with no restrictions on commercial use.

Use Cases

  1. Squeezing models onto limited hardware — Use low-bit quantization to run bigger models than VRAM would normally allow.
  2. Embedding inference in software — Integrate the engine or llama-server directly into an application.
  3. Custom or cutting-edge models — Convert and run new architectures before GUI tools support them.
  4. Benchmarking and tuning — Experiment with quantization and offload settings for best throughput.
Local LLM
Inference Engine
Open Source
GGUF
Quantization

Features

  • GGUF single-file model format designed for fast loading and broad compatibility
  • Integer quantization from ~1.5-bit to 8-bit plus float16/bfloat16
  • llama-server: lightweight HTTP server with an OpenAI-compatible API and web UI
  • Broad hardware backends: CUDA, Metal, HIP/ROCm, Vulkan, SYCL and more
  • Hybrid CPU+GPU inference, splitting layers when VRAM is insufficient
  • Wide architecture support: Llama, Mistral, Mixtral, Gemma, DeepSeek, Qwen, Phi and others
  • Multimodal support (image+text) via libmtmd
  • Conversion scripts to build GGUF files from Hugging Face checkpoints

Pricing

Free (Open Source)
$0
  • Full engine under MIT, including commercial use
  • llama-server, quantization tools and conversion scripts
  • All hardware backends included

Alternatives to llama.cpp