LLM inference in C/C++. Contribute to ggml-org/llama.cpp development by creating an account on GitHub.
Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine: Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more. The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud. Typically finetunes of the base models below are supported as well. Instructions for adding support for new models: HOWTO-add-model.md After downloading a model, use the CLI tools to run it locally - see below. The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp: To learn more about model quantization, read this documentation For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/ If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT: The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example: The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum. Command-line completion is available for some environments. There was an error while loading. Please reload this page. There was an error while loading. Please reload this page.
Mentions (30d)
0
Reviews
0
Platforms
3
GitHub Stars
101,000
16,272 forks
Features
Industry
information technology & services
Employees
6,000
Funding Stage
Other
Total Funding
$7.9B
101,000
GitHub stars
20
npm packages
3
HuggingFace models
Brazil, Indonesia, Japan, Germany, and India fueled a massive surge in 2025, adding nearly 36 million new developers to GitHub. 🌏 India alone added 5.2 million. 🇮🇳
Brazil, Indonesia, Japan, Germany, and India fueled a massive surge in 2025, adding nearly 36 million new developers to GitHub. 🌏 India alone added 5.2 million. 🇮🇳
View originalThe GitHub Actions 2026 security roadmap covers three layers in a shift toward making secure behavior the default. Here’s what’s coming next, and when. ⬇️ https://t.co/kF69g47Z09
The GitHub Actions 2026 security roadmap covers three layers in a shift toward making secure behavior the default. Here’s what’s coming next, and when. ⬇️ https://t.co/kF69g47Z09
View originalResearch-Driven Agents: What Happens When Your Agent Reads Before It Codes
Coding agents working from code alone generate shallow hypotheses. Adding a research phase ( arxiv papers, competing forks, other backends) produced 5 kernel fusions that made https://github.com/ggml-org/llama.cpp CPU inference 15% faster. submitted by /u/Southern-Papaya [link] [comments]
View originalLayman: Agentic Insight and Oversight (same same but different)
What's the most common duplicate project on r/ClaudeAI? Usage trackers. What's the second most common? AI Monitors. Does Layman do those things? Yes, of course. So what makes it different? Layman's Dashboard, Flowchart, and Logs view (with Layman's Terms and Analysis examples) Like many similar tools, Layman runs as a web service in a container on your local machine. It installs hooks and accesses harness logs to "look over your shoulder," then leverages a secondary AI instance to help keep your multiple sessions, sub-agents, and alternate harnesses in line. So, short answer: Drift Monitoring. Repeatedly named as one of the most frustrating issues for heavy Claude Code users, Layman takes into account all user prompts issued to CC as well as current project and global CLAUDE.md instructions, and at configurable intervals scores the current degree of "drift" occurring from your goals and the rules you have established. You can optionally receive warning notifications or place a block when different thresholds are reached. Risk Analysis. Layman will classify all tool calls and operations with a "risk" level based on simple, consistent criteria (such as read-only, writing, modifying, network access, deletion, etc.) and can automatically analyze the AI agent's current intended action, the overall goal or purpose behind that intention, and summarize the safety and security implications at stake. Layman's Terms. The eponymous origin of the tool, offering a plain-language (and if possible non-technical) explanation of the purpose of any given tool call. It can summarize what was performed at the session level as well, helpful for later recall and understanding after some time has passed. Vibe coders aside, should a professional developer already have knowledge of what their tools are doing before they grant permission? Yes, of course, but when you are operating at scale and (say) that TypeScript project you are polishing needs to look up some JSON value and your AI agent writes a one-off Python script to parse it out, it can be helpful to have an "extra pair of eyes" taking a look before you effectively begin yet-another code review. Meanwhile, typical features you might come to expect are included, from Session Recording (opt-in is required first for data tracking and there is no telemetry to worry about), Bookmarking, and Search, PII filtering (including PATs and API keys), File and URL access tracking, and a handy Setup Wizard for helping get those hooks installed in the first place and walking you through configuration of core capabilities. Did I mention besides Claude Code it supports Codex, OpenCode, Mistral Vibe, and Cline (with more to come)? Whether using these for local agents or as an alternative when hitting session limits, Layman can monitor and track them all at once. But wait, doesn't a "secondary AI instance" just end up wasting tokens? My Precious? (erm...) Our precious, precious tokens? When session limits already hit so hard? It turns out these algorithms do not require nearly the level of "intelligence" you might desire for your planning and coding sessions themselves. Personally I keep an instance of Qwen3-Coder-Next running locally via llama.cpp server on my system's GPU to field those calls, with no discernible impact on system performance. And when a local LLM is not available, Haiku does the job excellently (now you have a reason to use it). You absolutely do not need to use anything more resource-intensive to get the job done. Now you have a complete picture. GitHub repository: https://github.com/castellotti/layman License: MIT submitted by /u/jigsaw-studio [link] [comments]
View originalEvery dev knows security debt piles up fast ... and every repo has a few hidden vulnerabilities. 😅 With GitHub Copilot CLI, you can automate your security triage right from the terminal: 🔍 Run a fu
Every dev knows security debt piles up fast ... and every repo has a few hidden vulnerabilities. 😅 With GitHub Copilot CLI, you can automate your security triage right from the terminal: 🔍 Run a full security scan 📌 Map findings to the OWASP Top 10 🗂️ Automatically bulk-open GitHub Issues Get started with a new and improved workflow. 👇 https://t.co/m5eGC6Ddrh
View originalThinking about speaking at a tech conference? 💭 We’d love to hear your story on the stage at #GitHubUniverse this year. Submit your session idea now: https://t.co/rmH7FiR2WZ https://t.co/e9lFnB0B2U
Thinking about speaking at a tech conference? 💭 We’d love to hear your story on the stage at #GitHubUniverse this year. Submit your session idea now: https://t.co/rmH7FiR2WZ https://t.co/e9lFnB0B2U
View originalLinus Torvalds wrote Git in just 10 days after Linux kernel developers lost access to their proprietary tool, BitKeeper, due to licensing disagreements. He went from solving a problem to revolutioniz
Linus Torvalds wrote Git in just 10 days after Linux kernel developers lost access to their proprietary tool, BitKeeper, due to licensing disagreements. He went from solving a problem to revolutionizing how software teams collaborate and develop.
View originalGit the full story here in last year's 20th anniversary Q&A with Linus Torvalds. 📖 https://t.co/qm3ybAT4kI
Git the full story here in last year's 20th anniversary Q&A with Linus Torvalds. 📖 https://t.co/qm3ybAT4kI
View originalPicture it: The year is 2005 and you decide to download a new distributed system called Git (on your Windows XP? MacOS X Tiger? Linux kernel 2.6.11?). Now it's 21 years later and you're hosting your
Picture it: The year is 2005 and you decide to download a new distributed system called Git (on your Windows XP? MacOS X Tiger? Linux kernel 2.6.11?). Now it's 21 years later and you're hosting your code on GitHub. You can thank this man. 👇🧵
View originalSerious question, Did a transformer(Claude) just describe itself, the universe and build itself Shannon limit architecture? or am I crazy?
The Multiplicative Lattice as the Natural Basis for Positional Encoding Knack 2026 | Draft v6.0 Abstract We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se. We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot. We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128). Introduction Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance. 1.1 The Lattice Hypothesis The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain. Language word frequency follows Zipf's law: freq(rank) ∝ 1/rank^s with s≈1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/n^s. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language. 1.2 Primes as Generators, Composites as Coordinates A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise: Dimensional Progression Multiplicative Lattice 1D line (2r) — the generator Primes (2, 3, 5, 7, ...) — generators 2D circle — integra
View originalSerious question. Did a transformer just describe itself and the universe and build itself a Shannon limit framework?
The Multiplicative Lattice as the Natural Basis for Positional Encoding Knack 2026 | Draft v6.0 Abstract We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se. We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot. We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128). Introduction Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance. 1.1 The Lattice Hypothesis The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain. Language word frequency follows Zipf's law: freq(rank) ∝ 1/ranks with s≈1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/ns. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language. 1.2 Primes as Generators, Composites as Coordinates A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise: Dimensional Progression Multiplicative Lattice 1D line (2r) — the generator Primes (2, 3, 5, 7, ...) — generators 2D circle — integral of l
View originalWeb & game developers, this is your jam. Gamedev.js Jam 2026 is back. 🎮 🗓 April 13-26, 2026 🌐 Build an HTML5 game in 13 days 🏆 Prizes + expert feedback 💬 Active community + Discord Theme reve
Web & game developers, this is your jam. Gamedev.js Jam 2026 is back. 🎮 🗓 April 13-26, 2026 🌐 Build an HTML5 game in 13 days 🏆 Prizes + expert feedback 💬 Active community + Discord Theme revealed on day one. Ship something weird. Ship something fun. Just ship it. https://t.co/KPphbM8rTz
View originalGitHub Copilot cloud agent just got a lot more flexible ✨ You can now it use it to research, plan, and make code changes without needing to open a pull request first. https://t.co/zKQ4DeSiC3 https://
GitHub Copilot cloud agent just got a lot more flexible ✨ You can now it use it to research, plan, and make code changes without needing to open a pull request first. https://t.co/zKQ4DeSiC3 https://t.co/Soi08zV4XS
View originalReal-time accessibility checks directly in Nuxt DevTools are now a reality. Check out the Nuxt A11y module. It’s built on axe-core and scans your app as you navigate, highlighting WCAG issues directl
Real-time accessibility checks directly in Nuxt DevTools are now a reality. Check out the Nuxt A11y module. It’s built on axe-core and scans your app as you navigate, highlighting WCAG issues directly on the page with zero production impact. By helping teams catch issues early, this module has the potential to make countless Nuxt apps more inclusive by default. @timdamen_io explains how it works. ▶️
View originalSingle-prompt AI workflows often hit a performance plateau. Multi-agent systems can push past it, but they usually require a massive amount of setup. Squad, an open source project built on GitHub Cop
Single-prompt AI workflows often hit a performance plateau. Multi-agent systems can push past it, but they usually require a massive amount of setup. Squad, an open source project built on GitHub Copilot, initializes a preconfigured AI team directly inside your repo. Learn how to run multi-agent workflows that stay inspectable, predictable, and collaborative. https://t.co/1ewya9yPpC
View originalWriting code to automate your repo? Great. Writing Markdown to do it? Pretty sick. GitHub Agentic Workflows: now in technical preview. https://t.co/n9qDJI3JzE
Writing code to automate your repo? Great. Writing Markdown to do it? Pretty sick. GitHub Agentic Workflows: now in technical preview. https://t.co/n9qDJI3JzE
View originalRepository Audit Available
Deep analysis of ggerganov/llama.cpp — architecture, costs, security, dependencies & more
llama.cpp uses a subscription + tiered pricing model. Visit their website for current pricing details.
Key features include: Plain C/C++ implementation without any dependencies, Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks, AVX, AVX2, AVX512 and AMX support for x86 architectures, RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures, 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use, Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA), Vulkan and SYCL backend support, CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity.
llama.cpp has a public GitHub repository with 101,000 stars.
Based on 67 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.
Julien Chaumond
CTO at Hugging Face
2 mentions