High-throughput and memory-efficient inference and serving engine for Large Language Models. Deploy AI faster with state-of-the-art performance.
Users of vLLM appreciate its integration support, such as the recent compatibility with Intel’s Arc Pro B70, indicating robust flexibility in use across hardware. However, detailed user reviews providing personal experiences or explicit details on the software's strengths or complaints were not prevalent. Pricing sentiments or discussions appear to be absent from social mentions, leaving the cost aspect unclear. Overall, the mentions suggest that vLLM is recognized within niche communities for specific functionalities, but its broader reputation and reception are not extensively covered in the available discussions.
Mentions (30d)
14
4 this week
Reviews
0
Platforms
2
GitHub Stars
74,806
14,991 forks
Users of vLLM appreciate its integration support, such as the recent compatibility with Intel’s Arc Pro B70, indicating robust flexibility in use across hardware. However, detailed user reviews providing personal experiences or explicit details on the software's strengths or complaints were not prevalent. Pricing sentiments or discussions appear to be absent from social mentions, leaving the cost aspect unclear. Overall, the mentions suggest that vLLM is recognized within niche communities for specific functionalities, but its broader reputation and reception are not extensively covered in the available discussions.
Features
Use Cases
Industry
information technology & services
Employees
32
2,937
GitHub followers
36
GitHub repos
74,806
GitHub stars
20
npm packages
4
HuggingFace models
Most of my Claude usage was on work that didn't need Claude. Cut my bill 60x on bulk tasks with a tiny side model.
I looked at what was actually eating my Claude usage and it was embarrassing. Classifying files. Reformatting json. Pulling fields out of text. Summarizing docs I was going to skim anyway. None of that needed Sonnet. All of it cost the same as the work that did. Tried the obvious fixes first. Switching to Haiku for simple stuff (still wasteful at volume). Tighter prompts (helps a little). /compact (delays the problem). None of it changed the shape of the spend. What actually worked: a small cheap model running as a side worker, with one rule in CLAUDE.md telling Claude not to do the mechanical stuff itself. The setup is one tool. Send it text, get text back. Claude calls it for the bounded mechanical work I'd review anyway. Default model is DeepSeek V4 Flash because it's cheap and has 1M context, but the endpoint is one config line and works with anything openai-compatible (local ollama, vllm, lm studio). **3 weeks of real usage:** - 217 mechanical calls offloaded - DeepSeek total spend: $0.41 - Same workload on Sonnet would have been roughly $7 The CLAUDE.md rule that actually works is negative framing. Not "use deepseek for X" but "do NOT use Claude for: json formatting, field extraction, file classification, summarization you will review anyway." Positive framing got ignored maybe 30% of the time. Deny list catches it. It's a supervised worker, not an agent. No tool calls, no file access, no chains. Latency 3-25s. You review the output. That's the whole shape. Repo with setup steps: https://github.com/arizen-dev/deepseek-mcp (MIT, Python 3.10+) Happy to answer questions about the routing rules or the model choice.
View originalMemory
Your explanation is largely correct. The reason “memory” has become the dominant systems problem for LLMs is that modern transformers are increasingly **memory-bandwidth bound**, not compute-bound. The key shift is this: Training large models was mostly about FLOPs. Serving large models at scale is increasingly about **moving KV cache data around fast enough**. A single token generation step only performs a relatively modest amount of math compared to the amount of KV data that must be fetched from memory every step. **Why this happens** During inference, every new token attends to all prior tokens. So for token t, the model needs access to all prior K/V tensors: \\text{KV Cache Size} \\propto 2 \\times L \\times S \\times H \\times d Where: L = layers S = sequence length H = attention heads d = head dimension The killer is the S term. As context grows: 8K → manageable 128K → huge 1M → infrastructure problem A 70B model with long context can require **hundreds of GBs** of KV cache across concurrent users. **Why bandwidth matters more than raw compute** Modern GPUs like the NVIDIA H100 or NVIDIA Blackwell can perform enormous amounts of compute. But every generated token requires: Loading KV cache from memory Running attention Writing updated KV back That means inference speed often depends more on: HBM bandwidth memory locality cache management than tensor core throughput. This is why: HBM3E NVLink unified memory memory compression have become strategic bottlenecks. **Why the KV cache can exceed model weights** Model weights are static. KV cache is dynamic and scales with: users context length output length batch size Example intuition: 70B model weights might occupy \~140 GB FP16 But serving thousands of users with long contexts can require **multiple TBs of KV cache** So operators increasingly optimize: cache reuse eviction paging quantization instead of just model size. **Why vLLM and PagedAttention mattered so much** Before systems like vLLM, memory fragmentation was catastrophic. PagedAttention essentially borrowed ideas from operating systems: divide KV into pages allocate dynamically avoid contiguous memory assumptions That dramatically improved: utilization batching throughput This was one of the biggest inference infrastructure breakthroughs of the last few years because it improved economics without changing the model itself. **The deeper issue: transformers scale poorly with context** Standard attention fundamentally has a retrieval problem: Each token potentially references every prior token. Even though compute optimizations exist, the architecture still requires huge memory movement. That’s why researchers are exploring: Grouped Query Attention (GQA) Multi-Query Attention (MQA) sliding window attention recurrent memory state-space models hybrid retrieval systems The industry increasingly believes: infinite-context transformers using naive KV scaling are economically unsustainable. **Why inference economics are now the focus** Training frontier models is expensive. But operating them continuously at global scale is potentially even larger economically. For many providers: inference cost dominates memory dominates inference cost That’s why companies across the stack are racing on memory: NVIDIA → HBM + NVLink + Grace AMD → MI300 unified memory Cerebras → wafer-scale SRAM Groq → deterministic low-latency SRAM-heavy architecture Marvell Technology → custom memory fabrics The bottleneck has shifted from: “Can we train bigger models?” to: “Can we serve them cheaply and fast enough?”
View originalLLMs are just giant probability machines pretending to think
It’s fascinating that simple mathematics between tokens can eventually become a machine that writes essays, code, poetry, and even reasoning. We usually think probability means uncertainty. But LLMs show something strange: If probability + context + mathematical matching are scaled enough, uncertainty itself starts producing intelligent looking outputs. To understand this better, I tried breaking down an LLM from first principles using only 4 tiny training sentences. Example: The boat floated down to the bank. The investor walked into the bank to open a new account. The fisherman walked along the bank to cast his net. The bank has a vault. Then I asked: “The investor walked to the bank to lock his money in …” Why does the model predict “vault” instead of river-related words? That single question reveals almost the entire architecture of modern LLMs. The most underrated concept here is the LM Head. Most explanations immediately jump into transformers and attention, but almost nobody explains that the LM Head is essentially a gigantic token vocabulary containing all possible next token candidates the model can output. So internally the model is basically solving: “Out of all known tokens, which one best matches this context mathematically?” Then different layers help solve that problem: Embeddings: convert words into mathematical vectors Positional encoding: preserves word order Attention layer: figures out which words are related to each other in context (“investor”, “money”, “bank” become strongly connected) https://preview.redd.it/wxmpf00g7t2h1.jpg?width=2299&format=pjpg&auto=webp&s=a214113263cf008a759740474fbda4e0b8394ba5 Feed forward neural networks: act somewhat like massive learned if/else decision systems refining patterns internally And finally the LM Head converts all of that into probabilities for the next token. What surprised me most is: There is no hidden magic moment where the AI “becomes conscious”. It’s an enormous probability engine continuously finding the best contextual token match from its vocabulary. I made a beginner-friendly walkthrough explaining this visually without unnecessary jargon. [https://www.youtube.com/watch?v=YTV5qUCpu2c](https://www.youtube.com/watch?v=YTV5qUCpu2c) Would genuinely love feedback from people learning transformers/LLMs from scratch.
View originalNuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable) [P]
Disclaimer: I work for Numind, the company behind this open-weight model We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs. Try it, we have a huggingface space that is completely free (you don't even have to sign-up): [https://huggingface.co/spaces/numind/NuExtract3](https://huggingface.co/spaces/numind/NuExtract3) If you ever used [NuMarkdown](https://huggingface.co/numind/NuMarkdown-8B-Thinking), NuExtract3 is the successor. There are some examples to guide you. Feel free to re-use this model for any task. https://preview.redd.it/pm2xbooyxn2h1.png?width=1672&format=png&auto=webp&s=1a8a7b262190c8325159496dae98c3d2dfab493c https://preview.redd.it/b5z7ylfzxn2h1.png?width=1758&format=png&auto=webp&s=a07b3abd6e5065c2635de047bdf154357f903e4c [](https://preview.redd.it/nuextract3-released-open-weight-4b-vlm-for-markdown-ocr-and-v0-cdflrhrexn2h1.png?width=1672&format=png&auto=webp&s=f5590cf684a45e4cf2fcd9b1e2929cba7146634e) [](https://preview.redd.it/nuextract3-released-open-weight-4b-vlm-for-markdown-ocr-and-v0-q3dn99ufxn2h1.png?width=1758&format=png&auto=webp&s=3c987fda617d23a6e51ea69c2f3746fff1a7e2a2) A few things it is designed for: * converting document images to Markdown * extracting structured data from documents using a target json template * handling tables, forms, and layout-heavy pages * working with both text and visual document inputs * serving as a local/open-weight alternative for document extraction pipelines It was trained on a node of 8xH100 for 3 days to train on as much context as we could, so it should perform fairly well even on long document. For Markdown, we'd still recommend going page by page for the best results and inference speed, since you can parallelize better this way. It's very easy to self-host, since we provide fairly extensive documentation, Safetensors, GGUF and MLX weights. With as little as 4GB of VRAM, you should be good to go. We provide multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6...) so you should be able to run it anywhere. We mostly tried vLLM, SGLang, llama.cpp. We have a blog post and a pretty decent model card: * [https://about.nuextract.ai/blog/nuextract-3-release](https://about.nuextract.ai/blog/nuextract-3-release) * [https://huggingface.co/numind/NuExtract3](https://huggingface.co/numind/NuExtract3) * [https://huggingface.co/collections/numind/nuextract3](https://huggingface.co/collections/numind/nuextract3) I'm currently writing a paper on this model so I'll post it as soon as it's accepted. It's not yet on Arxiv yet as it has been submitted in a peer-review journal/conference. I'll try to answer as many questions as possible if you have any. We would really appreciate feedback from the community. We also have a discord if you're interested [https://discord.com/invite/3tsEtJNCDe](https://discord.com/invite/3tsEtJNCDe)
View originalSo, what is Yann LeCun's "World Models" and JEPA and is it Really a Replacement for LLMs?
A bit late to this as [the white paper hit arXiv](https://arxiv.org/abs/2603.19312) a little less than two months ago, but nobody else here mentioned it so I thought I might. A little background. Yann LeCun is a pioneer of deep learning and convolutional neural networks, LeCun served as Director of AI Research at Meta (formerly Facebook) and Chief AI Scientist, before leaving Meta ([under "interesting" ](https://www.businessinsider.com/yann-lecun-alexandr-wang-criticism-inexperienced-meta-ai-future-2026-1)[circumstances](https://www.businessinsider.com/yann-lecun-alexandr-wang-criticism-inexperienced-meta-ai-future-2026-1)) and becoming Executive Chairman of Advanced Machine Intelligence (AMI Labs) in 2025. He shared the 2018 ACM Turing Award for his foundational contributions to artificial intelligence. The "LeWorldModel," as described in the arXiv paper, doesn't appear to be [a "replacement" for LLMs](https://www.youtube.com/watch?v=6uW_GZdX1rU&t=67s). There's a lot of confusion about that in the AI field. [In interviews](https://www.youtube.com/watch?v=ngBraLDqzdI&t=357s) Yann made it very clear that he believes LLMs still serve a valuable function. It's not a binary choice. Anyways, from what I am seeing, the JEPA model is not optimized for language, but for [AI needing visual processing](https://arxiv.org/abs/2506.09985) such as robotics, self driving, and industrial controls. JEPA isn't processing language like an LLM. It's processing pixels. Anyways, wondering if anyone else had thoughts here and/or disagree.
View originalBuild agentic orchestrators in minutes NOT months.
Some of you might remember BoneScript, my LLM friendly declarative backend compiler. MarrowScript is the next version and the big addition is a full LLM harness built into the language itself. The problem I kept running into: every project that calls an LLM ends up with the same pile of glue code. Retry logic, response validation, caching, cost tracking, provider switching, confidence routing. You write it once, copy it to the next project, tweak it, and it slowly rots. None of it is your actual product logic but it takes up half your backend. So I made it declarative. In MarrowScript you declare your models, prompts, and routers as first-class concepts in the spec file. The compiler generates all the infrastructure around them. What that looks like in practice: You declare a model. Provider, endpoint, context window, cost class. Works with any OpenAI-compatible endpoint. LM Studio, Ollama, vLLM, OpenRouter, whatever you're running locally. You declare a prompt. Input types, output type, which model to use, validation mode, what to do when validation fails, retry policy, cache TTL. The compiler generates a typed function you call from your routes. Under the hood it handles retries, caches responses in Postgres, validates the output against your schema, and if validation fails it can automatically fire a repair prompt to fix the response. You declare a router. It picks which model to use based on input characteristics. Short simple inputs go to your tiny local model. Complex inputs escalate to something bigger. Confidence thresholds control when to retry or escalate. ***All deterministic at compile time.*** Some examples of what it generates: * Provider adapters for openai\_compat, ollama, llamacpp, koboldcpp, and raw http * SSRF protection on all outbound LLM calls (allowlist-based, blocks private ranges by default) * Prompt cache backed by Postgres with configurable TTL * Per-trace and per-tenant token/cost budgets with hard cutoffs * Cognition traces stored in Postgres (or in-memory for dev) with OTLP export * Response validation (schema check or full AST compilation check for code generation) * Repair prompts that fire automatically when validation fails * Confidence scoring from logprobs (on providers that support it) * A CLI command to convert recorded traces into regression tests The part I'm most interested in feedback on is the router concept. Right now it's a static decision tree. You set thresholds at compile time based on an input metric. There's a `marrowc tune-router` command that reads recorded traces and tells you if your thresholds are wrong, but it doesn't auto-rewrite them yet. The whole thing is designed around local-first inference. The default setup in the examples uses LM Studio on the LAN as the primary model and OpenRouter as the escalation tier. Most requests stay local and free. Only the ones that fail confidence checks hit the paid API. It's on GitHub and npm. The compiler is TypeScript, runs on Node 18+. There's a VS Code extension you can compile and edit to your needs. What I want to know: for those of you running local models in production or semi-production, what's the infrastructure pain that eats the most time? Is it the retry/validation loop? Cost tracking? Provider switching? Something else entirely?
View originall9gpu - open-source GPU observability with workload-level attribution [P]
GPU monitoring tools like DCGM give you hardware-level metrics but no workload context. When a node is saturated, you can't tell which experiment, team, or job is responsible without digging through logs. We built l9gpu to close that gap. It's a node-level agent that exports GPU metrics via OTLP with workload attribution embedded: - Kubernetes: correlates GPU metrics with pod, namespace, and deployment - Slurm: correlates with job ID, user, and partition - LLM inference: native metrics for vLLM, SGLang, and TGI - Hardware: NVIDIA, AMD MI300X, Intel Gaudi - 17 pre-built Prometheus alert rules + Grafana dashboards Derived from Meta's gcm project, extended with K8s attribution, multi-vendor GPU support, and OTLP export. MIT licensed. https://github.com/last9/gpu-telemetry Happy to discuss design decisions around the attribution mapping. What is the ML infra community using for GPU cost visibility in shared research clusters? submitted by /u/bakibab [link] [comments]
View originalWe built a tool that installs frameworks like ComfyUI, Ollama, OpenWebUI etc on any cloud GPU in one command and saves your whole setup between sessions [R]
We kept running into the same problem every time we rented a GPU to run Ollama + OpenWebUI or ComfyUI, we'd spend the first 45 minutes reinstalling everything. Custom nodes, models, configs, all of it. Docker images went stale fast, different providers had different base images, and nothing was truly portable. We got sick of it and built swm. Here's what it does for ComfyUI users specifically: swm gpus -g a100 --max-price 2.00 --sort price shows you the cheapest available GPU across RunPod, Vast ai, Lambda, and 7 other providers in one view swm pod create — spins up an instance on whatever provider you pick swm setup install comfyui — installs ComfyUI on the pod From there the main thing is the workspace sync. Your entire setup custom nodes, models, outputs, configs lives in S3-compatible object storage (I use B2). When you're done you run swm pod down and it pushes everything, kills the instance, and next time you spin up on any provider you just pull and everything is exactly where you left it. No more reinstalling 15 custom nodes and redownloading checkpoints every session. We also built a lifecycle guard because we kept falling asleep mid-session and waking up to dumb bills. It watches GPU utilization and if nothing's happening for 30 minutes (configurable), it saves your workspace and terminates automatically. Has saved us more money than we want to admit lol. A few other things: Background auto-sync daemon pushes changes every 60 seconds so you don't have to remember to save Tar mode for huge workspaces with tons of small files packs everything into one S3 object instead of 600k individual uploads Also supports vLLM, Ollama, Open WebUI, SwarmUI, and Axolotl if you do more than SD Works with Cursor, Claude Code, Codex, Windsurf if you want your AI agent to manage GPU instances for you Free, open source, Apache 2.0. pipx install swm-gpu Site: https://swmgpu.com GitHub: https://github.com/swm-gpu/swm Would love feedback from anyone who rents GPUs. What's the most annoying part of your current workflow? We are also looking for contributors to the open source repo and suggestions on new frameworks/extensions to be included. Please share your thoughts submitted by /u/Tkpf18 [link] [comments]
View originalHow I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway
Long-time lurker first time posting. Hey everyone! So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever. This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process. That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one. I went and read the OpenClaw source code, which I should have done to begin with. What I found was a pattern I think a lot of agent frameworks have fallen into: \- Tool names sit in the model context, so the model can guess or forge them \- "Dangerous mode" is one config flag away from default \- Memory management has no concept of instruction priority \- The audit story is mostly "the model thought it should" I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one. So I made it myself. CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis: The LLM never holds the security boundary. **What that means in code:** **Capability ID indirection.** The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap\_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names. **Effect classes.** Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass. **IRONCLAD\_CONTEXT.** Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them. **Tamper-evident audit chain.** Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too. **Streaming output leak filter.** Secrets are caught mid-stream across token boundaries, capability IDs, API keys, JWTs, PEM blocks redacted before they reach the client. **No YOLO mode.** There is no global "trust the LLM with everything" switch. There never will be. Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded. The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is. I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries. I added in 30+ built-in t
View originalLLM-Rosetta — format conversion library across LLM API standards, doubles as a proxy
This started because we had a proprietary internal LLM API that spoke none of the standard formats. Built an internal conversion layer to bridge it, maintained that for over a year. As colleagues started adopting more and more coding tools — Claude Code, opencode, Codex, VS Code plugins, Goose, and whatever came out that week — each with its own API format expectations, maintaining separate adapters for each became the actual problem. That's what pushed the internal conversion layer into a proper generalized design, and llm-rosetta is the result. It's a Python library that converts between LLM API formats — OpenAI Chat, Responses/Open Responses, Anthropic, and Google GenAI. The idea is you convert through a shared IR so you don't end up writing N² adapters. The key difference from LiteLLM: LiteLLM is a unified calling layer that takes OpenAI-style input and transforms it into provider-native requests — one direction. llm-rosetta uses a hub-and-spoke IR, so each provider only needs one converter, and you get any-to-any conversion for free. Anthropic → Google, OpenAI Chat → Anthropic, whatever direction you need. Use it as a library — `pip install` and call `convert()` directly, no server needed. Or run the gateway if you want a proxy that handles the format translation for you. Zero required runtime dependencies either way. The HTTP server, client, and persistence layer are vendored from zerodep ([https://github.com/Oaklight/zerodep](https://github.com/Oaklight/zerodep)), another project of mine — stdlib-only single-file modules, not someone else's library repackaged. The gateway ships with a Docker image if you'd rather not deal with Python env setup. You can also deploy it on HuggingFace Spaces or anything similar — admin panel, dashboard, request log, config management all included. Screenshots: [https://llm-rosetta.readthedocs.io/en/latest/gateway/admin-panel/](https://llm-rosetta.readthedocs.io/en/latest/gateway/admin-panel/) We've been running it in production for about 5 months as the conversion layer for an internal multi-model access platform — needed to support various API standards and coding tool integrations before the upstream APIs were fully standardized. The Responses converter passes all 6 official Open Responses compliance tests (schema + semantic) from the spec repo. So if you're running Ollama, vLLM, or LM Studio with Responses endpoints, it should just work as one side of the conversion. There's a shim layer for provider-specific quirks — built-in shims for OpenRouter, DeepSeek, Qwen, xAI, Volcengine, etc. Converters stay generic per API standard, shims handle the edge cases declaratively. 24 cross-provider examples in the repo covering all provider pairs, SDK + REST, streaming, tool calls, image inputs, multi-turn with provider switching mid-conversation. * GitHub: [https://github.com/Oaklight/llm-rosetta](https://github.com/Oaklight/llm-rosetta) * Docs: [https://llm-rosetta.readthedocs.io](https://llm-rosetta.readthedocs.io) * arXiv: [https://arxiv.org/abs/2604.09360](https://arxiv.org/abs/2604.09360) * Gateway screenshot: https://preview.redd.it/qzzjr2dcdw1h1.png?width=949&format=png&auto=webp&s=bce4293aae81059f794909fc37f85071cee34378
View originalSharing all KGC 2026 decks. More production-grade KG systems than I've seen at any conference. [D]
Didn't make it to New York for the Knowledge Graph Conference this year, but caught some talks virtually and managed to download all the decks. Sharing them below because some of what was shown is worth knowing about. Majority of the presentations described live production systems. Enterprises showing up with real engineers delivering real compliance requirements. That's not usual for most ai eventss. Most talks are proofs of concept with a "coming soon to prod" slide at the end. For eg - Bloomberg showed a formal dependency model for ontology governance. AbbVie walked through ARCH, their internal KG for drug and disease-area intelligence, connected to a scoring engine, a researcher dashboard, and an LLM companion for plain-language queries. The KG is the source of truth. The LLM is the interface. Even Morgan Stanley showed continuous SHACL drift detection on risk reporting data - automated weekly checks that alert when the semantic layer deviates from what's governed. Crux: knowledge graphs are being actively used as infrastructure, not a retrieval layer on top of vectors. The graph is doing reasoning work, not lookup work. We've been skeptical of the "only using vector dbs" framing for a while. These production systems are the clearest evidence I've seen of where that breaks down - and what the alternative actually looks like when it's running. Link to the all the decks in the comment. All decks here: [https://drive.google.com/drive/folders/1Csdv4hZePrBMJGggsisPXYBueTRCK1kV?usp=sharing](https://drive.google.com/drive/folders/1Csdv4hZePrBMJGggsisPXYBueTRCK1kV?usp=sharing)
View originalA year of using LLMs for DSP/algorithms research: Techniques I've landed on, curious what others are doing
I've spent the last year using coding LLMs daily for DSP and algorithms research, and the workflow that's emerged is meaningfully different from regular software development. Sharing what's worked and hoping to hear what others are doing. I'm sure people have approaches I haven't thought of. Let me run down my high-level categories and then I'll focus on one of them here: * Maintain a problem_description.md file * Write regular reports in both .md and .pdf, about 2-5 per day * Create a Human -> LLM Coding App -> Human -> LLM Chat App Loop * Increase your report quality with exec summary, plot interpretation descriptions, etc. * Develop an Ongoing GUI * Don't let the LLM be dramatic (this one might save your sanity after long sessions) * Share reports with co-workers Here I'm going to focus on "Developing an Ongoing GUI." The rest of the topics are in a video I recorded, listed at the end: In a nutshell, start by telling your app to make a simple GUI for you that lets you browse your data folders and make plots that are generic at first, but then get highly customized over time. This is high value for researchers because good GUI programming takes a long time to learn and execute. Instead, coding LLMs can do that stuff very quickly without taking your mind of your main topic. Basically, as you're doing your work, examining data, etc., you'll want a quick way to view/visualize and analyze it. The easiest thing is for your coding LLM to make a program for you that browses folders and makes plots...and then to build on it day-by-day from there! For example, beyond basic plots, you may routinely do spectrograms and FFTs. Or you might convert data into the theta/angle domain. Each time you have your coding LLM do an action like that and it seems like something you'll want to do again in the near future, just tell it, "Please add a tab to my GUI that does it." It's that simple! And here are some tips to make good graphs. Tell your coding LLM to make your app: * Sync all X and Y axes * Start all plots zoomed in so that it fills 85% of the vertical space * Make all plots with similar units share the same range These make it much easier to make comparisons when all of your axes are the same scale and you can pan and browse them together. Once you've got your GUI going, you can also tell your coding LLMto improve it with a prompt like, "Remember that plot we added to the "MCAP Analyzer" tab that performs the full analysis? Please make a second button below it named "Extract" that only extracts the load cell values." Or "When you plot the load cell signal, highlight the 2-4 Hz range." You will be nicely pleased on how the benefits of making a bespoke app compound. Something you did 2 weeks ago or even a month ago will quickly be at your fingertips, without having to interrupt your sessions, start a new session, or pay for your coding LLM to re-compute it! One more tip: In addition to plotting the data on the screen, ask your LLM to make your app write the key values from plot into a .csv or .json file or even "make a textual description of each step of the analysis." That will make it easy to paste into other programs/software to analyze. After a few months, you will have quite the Swiss Army Knife of analysis tools! Hell, you can just paste this whole entire post into your LLM coding tool and it will know what to do. One last tip on the nuts and bolts: I recommend using python and the vispy library with TKinter widgets. This gives a cross-platform combo that uses the GPU for fast graphics updates. Matplotlib is okay, too; it's slower but has better zoom tools. Even if you don't have any idea what that means, just paste it into your coding LLM and it will know what to do! Lastly, I put together a 27-minute talk on this topic with *7 more sections*. As i mentioned, I made this post and video to share and to learn from other people what kinds of techniques I'm missing? I am especially interested in: * How to share LLM coded program with other people in my group (without tons of code reviews, etc.) * How to use databases on large shared drives (My drive is a CIFS NAS which is terrible for DBs) * How to get the LLMs to think out of the box...I 've found sometimes I can spend days (or longer!) figuring out some technique only to realize I've been re-inventing the wheel :( * What other tools to connect to my main LLM coding app to multiply its power My full vid: https://www.youtube.com/watch?v=nOU9nOZ_res
View originalIs this as unnerving as it sounds?
I was watching Andrej Karpathy's excellent "Intro to Large Language Models" just now, and in the "how do they work" section, he explains that while we know exactly how the LLM is trained by iterative updates, we don't understand why certain circuits emerge or why the parameter structures end up the way they do. i.e. there is highly complex emergent learning going on by this optimization of parameter relationships but we don't know how the LLM does it or why. This is apparently a well known problem in the AI space. To my untrained ear, this sounds like a red flag. It should be fully understood before we go any further. Here's the video: [https://www.youtube.com/watch?v=zjkBMFhNj\_g](https://www.youtube.com/watch?v=zjkBMFhNj_g)
View originalMost of my Claude usage was on work that didn't need Claude. Cut my bill 60x on bulk tasks with a tiny side model.
I looked at what was actually eating my Claude usage and it was embarrassing. Classifying files. Reformatting json. Pulling fields out of text. Summarizing docs I was going to skim anyway. None of that needed Sonnet. All of it cost the same as the work that did. Tried the obvious fixes first. Switching to Haiku for simple stuff (still wasteful at volume). Tighter prompts (helps a little). /compact (delays the problem). None of it changed the shape of the spend. What actually worked: a small cheap model running as a side worker, with one rule in CLAUDE.md telling Claude not to do the mechanical stuff itself. The setup is one tool. Send it text, get text back. Claude calls it for the bounded mechanical work I'd review anyway. Default model is DeepSeek V4 Flash because it's cheap and has 1M context, but the endpoint is one config line and works with anything openai-compatible (local ollama, vllm, lm studio). **3 weeks of real usage:** - 217 mechanical calls offloaded - DeepSeek total spend: $0.41 - Same workload on Sonnet would have been roughly $7 The CLAUDE.md rule that actually works is negative framing. Not "use deepseek for X" but "do NOT use Claude for: json formatting, field extraction, file classification, summarization you will review anyway." Positive framing got ignored maybe 30% of the time. Deny list catches it. It's a supervised worker, not an agent. No tool calls, no file access, no chains. Latency 3-25s. You review the output. That's the whole shape. Repo with setup steps: https://github.com/arizen-dev/deepseek-mcp (MIT, Python 3.10+) Happy to answer questions about the routing rules or the model choice.
View originalclaudely: launch Claude Code against Local LLM provider like LM Studio / Ollama / llama.cpp without trashing your real claude config
Plenty of CLI coding agents will talk to a local LLM, but the catch is the ecosystem. Skills, slash commands, MCP servers, plugins, hooks: all the interesting tooling has been built specifically for Claude Code, and parity on every other agent is patchy at best. Trying to reuse a Claude-shaped workflow on a different agent quickly turns into "rewrite all the plugins" or "do without." claudely skips that fight. You keep Claude Code as the client (and its whole plugin / skill / MCP ecosystem with it), and just point it at a model running on your own hardware. Pick a provider, claudely spawns \`claude\` with the right base URL, auth, and cache fix wired up for that one session. Your shell and the regular \`claude\` command stay untouched, so you can flip between local and the real Anthropic API without thinking about it. It also quietly fixes a prompt-cache bug that otherwise tanks local-model speed by \~90%, and handles the per-provider env-var differences for you. Works with LM Studio, Ollama, llama.cpp, or any Anthropic-compatible endpoint (point it at a litellm or claude-code-router proxy for OpenAI-protocol backends like vLLM). npm i -g claudely claudely # LM Studio, picker over your downloaded models claudely -p ollama -m gpt-oss:20b # Ollama, skip the picker claudely -p llamacpp # whichever GGUF llama-server is serving MIT, Node 20+, unaffiliated community helper. Built with Claude Code's help, fittingly. Feedback welcome. Repo: [https://github.com/mforce/claudely](https://github.com/mforce/claudely) NPM: [https://www.npmjs.com/package/claudely](https://www.npmjs.com/package/claudely)
View originalLLM proxy that lets Claude Code talk to any model
I built **rosetta-llm** — an open-source multi-format LLM proxy that acts as a drop-in Claude Code gateway. * Works as a Claude Code LLM gateway — set \`ANTHROPIC\_BASE\_URL\` and all configured models appear in \`/model\` picker * Translates between formats — Anthropic Messages ↔ OpenAI Chat ↔ OpenAI Responses at the wire level * Thinking blocks round-trip correctly — this is the hard part and why I built this * Provider routing — \`openai/gpt-5.4\`, \`anthropic/claude-opus-4-7\`, \`groq/llama-4\` all through one endpoint * Streaming on everything — passthrough fast path + cross-format translation with proper SSE handling # The thinking-block problem Most proxies lose reasoning continuity. LiteLLM has had open PRs for thinking block handling for a long time — some dating back months — and they're still not merged. Without proper round-tripping, prompt caching breaks across turns and Claude Code loses context. Rosetta encodes encrypted reasoning into Anthropic's \`signature\` field and decodes it back — so multi-turn agentic workflows keep their prompt-cache hits. # Zero-setup Hugging Face Space Literally a two-line Dockerfile: FROM ghcr.io/lokesh-chimakurthi/rosetta-llm:latest COPY --chown=app:app config.json /app/config.json Add config.json file and above Dockerfile into a HF Space (Docker SDK) and it's running. No clone, no build, no venv. The GHCR image has everything baked in. Make your HF space private and add api keys in hf space secrets. Check readme in github # Also works with # No install — ephemeral uvx rosetta-llm # Persistent install uv tool install rosetta-llm rosetta-llm --config ~/.rosetta-llm/config.json # Docker docker run -p 7860:7860 \ -v ~/.rosetta-llm/config.json:/app/config.json \ ghcr.io/lokesh-chimakurthi/rosetta-llm:main # Why another proxy? I looked at existing solutions: * LiteLLM — thinking block round-trip PRs going nowhere, too many abstractions * OpenRouter — great but closed-source, no self-hosting * Direct passthrough proxies — don't translate between formats Nothing gave me lossless cross-format translation with proper reasoning fidelity. # Links * GitHub: [https://github.com/Lokesh-Chimakurthi/rosetta-llm](https://github.com/Lokesh-Chimakurthi/rosetta-llm) * PyPI: [https://pypi.org/project/rosetta-llm/](https://pypi.org/project/rosetta-llm/) # Contributions welcome I built this for myself and it works for my use cases. But there's a lot more it could do — better multimodal handling, embeddings support, rate limiting, an admin UI. If any of this sounds interesting, PRs are absolutely welcome. Happy to answer questions in the comments.
View originalRepository Audit Available
Deep analysis of vllm-project/vllm — architecture, costs, security, dependencies & more
vLLM uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Cash Donations, Compute Resources, Slack Sponsor, Hardware, Open Models, Recipes, Performance, Roadmap.
vLLM is commonly used for: Real-time text generation for chatbots, Content creation for marketing, Automated customer support responses, Code generation and debugging assistance, Data analysis and report generation, Personalized recommendations in e-commerce.
vLLM integrates with: Slack, Discord, Microsoft Teams, Zapier, AWS Lambda, Google Cloud Functions, Kubernetes, Docker, Jupyter Notebooks, FastAPI.
vLLM has a public GitHub repository with 74,806 stars.
Ollama
Project at Ollama
3 mentions
Based on user reviews and social mentions, the most common pain points are: cost tracking, cost visibility, token cost.
Based on 29 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.