A fast inference library for running LLMs locally on modern consumer-class GPUs - turboderp-org/exllamav2
ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs. The official and recommended backend server for ExLlamaV2 is TabbyAPI, which provides an OpenAI-compatible API for local or remote inference, with extended features like HF model downloading, embedding model support and support for HF Jinja2 chat templates. See the wiki for help getting started.

The dynamic generator supports all inference, sampling and speculative decoding features of the previous two generators, consolidated into one API. The one exception is the FP8 cache, though the Q4 cache mode is supported and performs better anyway. The repository links to a detailed explanation of the generator and to the full, updated examples.

The repository also includes some quick tests comparing performance with ExLlama V1. There may be more performance optimizations in the future, and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck.

To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio on Windows. Also make sure you have an appropriate version of PyTorch before installing. A simple console chatbot is included.

Method 1: Install from source. To install the current dev version, clone the repo and run the setup script. This will install the "JIT version" of the package, i.e. it will install the Python components without building the C++ extension in the process. Instead, the extension will be built the first time the library is used, then cached in ~/.cache/torch_extensions for subsequent use.

Method 2: Install from release (with prebuilt extension). Either download an appropriate wheel or install directly from the appropriate URL.

Method 3: Install from PyPI. A PyPI package is available as well. This is the same as the JIT version (see above). It can be installed with pip.

ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format. EXL2 is based on the same optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization.
The format allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight. Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse quantization wherein more important weights (columns) are quantized with more bits. The same remapping trick that lets ExLlama work efficiently with act-order models allows this mixing of formats to happen with little to no impact on performance.

Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target average bitrate.

In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a 2048-token context, producing coherent and mostly stable output with 2.55 bits per weight.
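The selection step described above can be illustrated with a small, self-contained sketch. This is not ExLlamaV2's converter code: the `Option` class, the function names and the threshold search are hypothetical, chosen only to show one way of minimizing the maximum per-layer error under an average-bitrate budget. The `weight_bytes` helper is likewise a back-of-the-envelope estimate that assumes a nominal 70e9 parameters and ignores the K/V cache, activations and any format overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One candidate setting for a layer (hypothetical structure)."""
    bits: float   # average bits per weight under this setting
    error: float  # quantization error measured on calibration data


def average_bpw(sizes, choice):
    """Size-weighted average bits per weight over all layers."""
    return sum(n * opt.bits for n, opt in zip(sizes, choice)) / sum(sizes)


def select_options(sizes, options, target_bpw):
    """Pick one Option per layer, minimizing the maximum error.

    sizes[i] is the weight count of layer i and options[i] its candidates.
    Each distinct error value is tried as a cap, smallest first. Under a
    given cap every layer takes its cheapest setting that respects it, so
    the first cap that fits the bitrate budget is the smallest feasible one.
    """
    caps = sorted({opt.error for layer in options for opt in layer})
    for cap in caps:
        choice = []
        for layer in options:
            allowed = [opt for opt in layer if opt.error <= cap]
            if not allowed:
                break  # this cap is too strict for at least one layer
            choice.append(min(allowed, key=lambda opt: opt.bits))
        else:
            if average_bpw(sizes, choice) <= target_bpw:
                return choice
    raise ValueError("target bitrate is below the cheapest available settings")


def weight_bytes(n_params, bpw):
    """Approximate storage for the quantized weights alone, in bytes."""
    return n_params * bpw / 8


# Toy example: two layers, three measured settings each (made-up numbers).
sizes = [100, 300]
options = [
    [Option(2.0, 0.10), Option(4.0, 0.03), Option(8.0, 0.005)],
    [Option(2.0, 0.04), Option(4.0, 0.01), Option(8.0, 0.002)],
]
choice = select_options(sizes, options, target_bpw=3.0)
print([opt.bits for opt in choice], average_bpw(sizes, choice))

# Rough weight footprint of a nominal 70e9-parameter model at 2.55 bpw.
print(round(weight_bytes(70e9, 2.55) / 1e9, 1), "GB")
```

In the toy example the smaller layer ends up with more bits than the larger one, because spending bits there lowers the worst-case error at the least cost in average bitrate. The footprint estimate comes out at roughly 22 GB for the weights alone, which is consistent with the 24 GB figure quoted above only if the cache and activations fit in the remainder.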
Key features include a new generator with dynamic batching, smart prompt caching, K/V cache deduplication and a simplified API. The README also covers three installation methods (from source, from a release with a prebuilt extension, and from PyPI), as well as conversion, evaluation and community resources.