llama.cpp is the open-source (MIT) C/C++ inference engine created by Georgi Gerganov in 2023 and now maintained under the ggml-org organization. It is the foundation of the local-LLM world: its quantization techniques and GGUF model format are what make it possible to run capable models on consumer CPUs, Apple Silicon and a huge range of GPUs — and tools like Ollama, LM Studio and GPT4All are built on it or its ecosystem. With roughly 118,000 GitHub stars, it is among the most-starred AI projects anywhere.
It is also the most technical option in this category. Full feature support generally means building from source, models must be in GGUF (often via a conversion step), and getting the best performance requires understanding quantization levels and GPU offloading. The included llama-server exposes an OpenAI-compatible API and a minimal web UI, but most people interact with llama.cpp indirectly, through a friendlier front-end. For developers who want maximum control and efficiency, nothing else matches its reach.
Key Benefits
- Runs anywhere: One codebase targets CPUs, Apple Metal, CUDA, ROCm, Vulkan and more.
- Extreme efficiency: Aggressive quantization fits large models into modest memory budgets.
- Foundational and reliable: Powers most of the ecosystem and is heavily exercised in production.
- Open and free: MIT-licensed with no restrictions on commercial use.
Use Cases
- Squeezing models onto limited hardware — Use low-bit quantization to run bigger models than VRAM would normally allow.
- Embedding inference in software — Integrate the engine or llama-server directly into an application.
- Custom or cutting-edge models — Convert and run new architectures before GUI tools support them.
- Benchmarking and tuning — Experiment with quantization and offload settings for best throughput.