Automate time-consuming data engineering into one clear story, one user at a time. Connecting your users' needs with their purchase behaviour drives business value.

- 5% of Premium users churn due to lack of customization options
- 20% of churned customers mentioning business closure transferred money to another bank
- 25% increase in inquiries about product returns, from Holiday Sales customers
- 18% of mobile app users mention difficulty finding the export data feature
- 45% of users who abandon signup cite concerns about data security
- 38% of patients mentioning nausea in initial consultations don't convert to paid treatment plans
- 17% of Detractors mention high fees
- Promoters' main driver, at 25%, is the app's budgeting features
- 32% of basic plan users ask about advanced analytics features
- 25% of loan customers who talked about travel acquired the rewards credit card

Lang.ai is built to connect complex unstructured data with metrics that drive revenue, optimizing for the variables that matter for your business. Our advanced AI pipeline accurately contextualizes the insights with real Snowflake data, based on the AI agent's goal and the relevance according to your feedback.
Process enterprise data volumes without hitting LLM context window limits, all within the security perimeter of Snowflake. Lang.ai makes getting user insights quick with easy setup, understandable AI agents, and actionable insights. Zero hassle.

Create custom agents that derive actionable priorities from customer interactions, driving retention and growth. Our Snowflake native app collects insights from all customer touchpoints, providing a complete view rather than fragmented data. Adapt your view as priorities change, ensuring you always see what's most relevant for you and your company. Deliver Lang's insights directly to your team via Slack, keeping everyone aligned on customer-driven priorities in real time.

“This innovative approach is a game-changer for companies looking to analyze both structured and unstructured data sources. While much of the industry is focused on creating chatbots that mimic human agents, Lang does it right.”

“The AI agent from Lang has transformed our team's approach to data. We integrate it into our weekly reviews to ident
Industry: information technology & services
Employees: 15
Funding Stage: Series A
Total Funding: $14.5M
I compiled every major AI agent security incident from 2024-2026 in one place - 90 incidents, all sourced, updated weekly
After tracking AI agent security incidents for the past year, I put together a single reference covering every major breach, vulnerability and attack from 2024 through 2026. 90 incidents total, organized by year, with dates, named companies, impact, root cause, CVEs where applicable, and source links for every entry.

Covers supply chain attacks (LiteLLM, Trivy, Axios), framework vulnerabilities (LangChain, Langflow, OpenClaw), enterprise incidents (Meta Sev 1, Mercor/Meta suspension), AI coding tool CVEs (Claude Code, Copilot, Cursor), crypto exploits (Drift Protocol $285M, Bybit $1.46B), and more. Also includes 20 sourced industry stats and an attack pattern taxonomy grouping incidents by type.

No product pitches. No opinions. Just facts with sources.

https://github.com/webpro255/awesome-ai-agent-attacks

PRs welcome if I missed anything.

submitted by /u/webpro255
Burned 5B tokens with Claude Code in March to build a financial research agent.
TL;DR: I built a financial research harness with Claude Code, full stack and open-source under Apache 2.0 (github.com/ginlix-ai/langalpha). Sharing the design decisions around context management, tools and data, and more in case it's useful to others building vertical agents.

I have always wanted an AI-native platform for investment research and trading. But almost every existing AI investing platform out there is way behind what Claude Code can do. Generalist agents can technically get work done if you paste enough context and bootstrap the right tools each session, but it's a lot of back and forth. So I built it myself with Claude Code instead: a purpose-built agent harness where portfolio, watchlist, risk tolerance, and financial data sources are first-class context. Open-sourced with full stack (React 19, FastAPI, PostgreSQL, Redis) built on deepagents + LangGraph. Learned a lot along the way and still figuring some things out. Sharing this here to hear how others in the community are thinking about these problems. This post walks through some key features and design decisions. If you've built something similar or taken a different approach to any of these, I'd genuinely love to learn from it.

Code execution for finance — PTC (Programmatic Tool Calling)

The problem with MCP + financial data: financial data overflows context fast. Five years of daily OHLCV, multi-quarter financial statements, full options chains — tens of thousands of tokens burned before the model starts reasoning. Direct MCP tool calls dump all of that raw data into the context window. And many data vendors squeeze tens of tools into a single MCP server. Tool schemas alone can eat 50k+ tokens before the agent even starts. You're always fighting for space.

PTC solves both sides. At workspace initialization, each MCP server gets translated into a Python module with documentation: proper signatures, docstrings, ready to import. These get uploaded into the sandbox.
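The server-to-module translation step might look roughly like this. A minimal sketch assuming a made-up tool-spec shape and a hypothetical `_call_mcp` dispatcher; this is not langalpha's actual generator:

```python
# Sketch of PTC-style wrapper generation: turn an MCP tool listing into
# importable Python source, keeping only a one-line summary for the prompt.
# The tool-spec dicts and _call_mcp dispatcher are invented for illustration.

def generate_module(server_name, tools):
    """Render Python source exposing one stub per MCP tool."""
    lines = [f'"""Auto-generated wrappers for MCP server: {server_name}."""', ""]
    for tool in tools:
        args = ", ".join(tool["params"])
        lines += [
            f"def {tool['name']}({args}):",
            f'    """{tool["description"]}"""',
            f"    return _call_mcp({server_name!r}, {tool['name']!r}, locals())",
            "",
        ]
    return "\n".join(lines)

def summarize(server_name, tools, import_path):
    """The only part that stays in the system prompt."""
    return f"{server_name}: {len(tools)} tools, import from {import_path}"

tools = [
    {"name": "get_historical_prices", "params": ["ticker", "years"],
     "description": "Daily OHLCV history for a ticker."},
]
module_src = generate_module("price", tools)
print(summarize("price", tools, "tools.price"))
```

Only the one-line summary would enter the system prompt; the generated source lives in the sandbox for the agent to import.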
Only a compact metadata summary per server stays in the system prompt (server name, description, tool count, import path). The agent discovers individual tools progressively by reading their docs from the workspace — similar to how skills work. No upfront context dump.

```python
from tools.fundamentals import get_financial_statements
from tools.price import get_historical_prices

# the agent writes pandas/numpy code to process data, extract insights,
# and create visualizations
# raw data stays in the workspace — never enters the LLM context window
# only the final result comes back
```

Financial data needs post-processing: filtering, aggregation, modeling, charting. That's why it's crucial that data stays in the workspace instead of flowing into the agent's context. Frontier models are already good at coding. Let them write the pandas and numpy code they excel at, rather than trying to reason over raw JSON.

This works with any MCP server out of the box. Plug in a new MCP server, PTC generates the Python wrappers automatically. For high-frequency queries, several curated snapshot tools are pre-baked — they serve as a fast path so the agent doesn't take the full sandbox path for a simple question. These snapshots also control what information the agent sees. Time-sensitive context and reminders are injected into the tool results (market hours, data freshness, recent events), so the agent stays oriented on what's current vs stale.

Persistent workspaces — compound research across sessions

Each workspace maps 1:1 to a Daytona cloud sandbox (or local Docker container). Full Ubuntu environment with common libraries pre-installed.
agent.md and a structured directory layout:

- agent.md — workspace memory (goals, findings, file index)
- work/ /data/ — per-task datasets
- work/ /charts/ — per-task visualizations
- results/ — finalized reports only
- data/ — shared datasets across threads
- tools/ — auto-generated MCP Python modules (read-only)
- .agents/user/ — portfolio, watchlist, preferences (read-only)

agent.md is appended to the system prompt on every LLM call. The agent maintains it: goals, key findings, thread index, file index. Start a deep-dive Monday, pick it up Thursday with full context.

Multiple threads share the same workspace filesystem. Run separate analyses on shared data without duplication. Portfolio, watchlist, and investment preferences live in .agents/user/. "Check my portfolio," "what's my exposure to energy" — the agent reads from here. It can also manage them for you (add positions, update watchlist, adjust preferences). Not pasted, persistent, and always in sync with what you see in the frontend.

Workspace-per-goal: "Q2 rebalance," "data center deep dive," "energy sector rotation." Each accumulates research that compounds across sessions. Past research from any thread is searchable. Nothing gets lost even when context compacts.

Two agent modes

With PTC and workspaces covered, here's how they come together. PTC Agent is the full research agent — writes and execu
Build Your Own Alex Hormozi Brain Agent (anyone with lots of publicly available content) using a Claude Project
I bought the books. Watched the videos. Still wanted more, especially after he talked about the agent he created. All that material is publicly available. Enough to build my own Alex Hormozi Brain Agent? "Hey Jules, how about it?"

Jules is my AI coding assistant (Claude Code). Jules ran off and grabbed transcripts of videos, text of books, guest podcasts — whatever is available online — then turned that into files I uploaded to a Claude Project, so I can chat through Claude with Alex Hormozi.

Here's what Jules found:

- 99 long-form YouTube video transcripts
- 3 complete audiobook transcripts
- 15 guest podcast transcripts
- X threads

What I Did in Four Phases

Phase 1 maps the full source landscape: YouTube channel (4,754 videos), The Game podcast (~900+ episodes), three books, guest podcast appearances, X/Twitter. Figure out what's worth downloading before you start.

Phase 2 downloads and converts. Top 100 longest video transcripts, full audiobook transcripts for all three books, 15 guest podcast transcripts from the highest-view-count appearances, and whatever X/Twitter content the API will give you.

Phase 3 runs voice pattern analysis. Sentence structure, reasoning skeleton, core frameworks, teaching style, verbal signatures. This is where the persona takes shape.

Phase 4 builds the system prompt and optimizes the knowledge base to fit within Claude Projects' limits. Then deploy.

Phase 1: Inventory

The @AlexHormozi YouTube channel has 4,754 videos. That number is misleading. 4,246 of those are Shorts (under 60 seconds or no duration metadata). Filter those out and you have 508 full-length videos. That's the real content library.

Beyond YouTube, the main sources worth pursuing: The Game podcast (~900+ episodes), his primary long-form output. The audiobooks for all three books are available free on the podcast and YouTube. Guest podcast appearances: DOAC, Impact Theory, School of Greatness, Modern Wisdom, Danny Miranda.
Hosts push him off-script and into territory he doesn't cover in his own content. High value per byte. X/Twitter threads: compressed, punchy formulations of his frameworks. Different texture than the long-form material. Skool community: behind a login wall, low ROI for this project. Acquisition.com: no blog, courses are paywalled. Skip.

Phase 2: Collect

YouTube Transcripts

The first scrape of the YouTube channel only returned 494 videos. The channel has 4,754. The scraper was pulling from the /videos tab, which doesn't surface the full library. Re-running against the full channel URL (@AlexHormozi) returned everything. Easy to miss, significant difference. After filtering Shorts: 508 full-length videos.

I downloaded auto-generated captions for the top 100 longest videos (sorted by duration, so the meatiest content came first). Auto-generated captions from YouTube come as SRT files with timestamps, line numbers, and duplicate lines. Converting those to clean readable text required stripping all the formatting artifacts and deduplicating language variants (English vs English-Original). Result: 99 transcripts. A few livestreams had no captions available.

Audiobook Transcripts

All three Hormozi books have full audiobook uploads on YouTube:

- $100M Offers (~4.4 hours)
- $100M Leads (~7 hours)
- $100M Money Models (~4.3 hours)

Same process as the video transcripts: download the auto-generated captions, convert to clean text. Three files, 855KB total. These are non-negotiable core material for the knowledge base.

Guest Podcast Transcripts

Searched YouTube for Hormozi guest appearances sorted by view count. The top hit was Diary of a CEO at 4.7M views. Grabbed the 15 highest-view-count appearances. The guest transcripts are 2.1MB total. Worth every byte. When a host like Steven Bartlett or Tom Bilyeu pushes back on a claim, Hormozi shifts into a different mode. He's more precise and sometimes reveals the edge cases he glosses over on his own channel.
You can't get that from watching his channel alone.

X/Twitter Content

X's API rate limits capped the collection at 9 unique tweets. Not ideal, but enough to confirm the voice texture: "Aggressive with effort. Relaxed with outcome." His Twitter is his most compressed format. Each tweet is a framework distilled to a single line. 9 tweets is thin. For a more complete build, you'd want to manually curate 50-100 of his best threads. The API limitations made automated collection impractical.

Phase 3: Analyze

I ran voice analysis across the full corpus, looking at seven dimensions. Hormozi's sentences are short, punchy declarations. Fragments for emphasis. "And so" as his default transition. Short bursts, then a longer sentence that lands the point. Nearly every argument follows the same five-step skeleton: bold claim, personal story, framework, math, then a reductio ad absurdum that makes the alternative sound insane. Once you see it, you can't unsee it. The core frameworks are Grand Slam Offer, Value Equation, Supply an
Got roasted for not open sourcing my agent OS (dashboard), so I did. Built the whole thing with Claude Code
Got a lot of hate for not open sourcing my agent OS, so I decided to just do it. I've been building Octopoda with Claude Code over the past few months. Pretty much the entire thing was pair programmed with Claude — not just boilerplate, but actually architecting systems, debugging production issues at 2am, fixing database migrations, all of it.

The idea is basically one place to manage your AI agents. You can see what they're doing, catch when they're stuck in loops burning through tokens, audit every decision they make, and monitor performance and latency — and they also get persistent memory that survives restarts and crashes. There's a dashboard that shows you everything in real time so you're not just guessing from logs what your agents are up to.

It works locally with no signup needed, or you can connect to the cloud for the full dashboard. Has integrations for LangChain, CrewAI, AutoGen, and the OpenAI Agents SDK, plus an MCP server with 25 tools, so Claude Desktop and Cursor get all of this with zero code. Free to use, open source, MIT licensed.

Built the whole thing with Claude Code and genuinely couldn't have done it without it. The loop detection system, the tenant isolation, the MCP server — all of that came from sessions with Claude where I'd describe what I wanted and we'd build it together.

Curious what everyone here is actually building with their agents, though? And if you do check it out I'd love to know what's missing or what would make it more useful for your setup.

GitHub: https://github.com/RyjoxTechnologies/Octopoda-OS
Website: https://octopodas.com

submitted by /u/Powerful-One4265
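Loop detection of the sort the dashboard above describes can be sketched in a few lines: flag an agent that repeats the same tool call with the same arguments too many times in a row. A hypothetical illustration, not Octopoda's actual detector:

```python
# Toy stuck-loop detector: hash recent (tool, args) calls and alert when the
# same call recurs too often within a sliding window.
from collections import deque

class LoopDetector:
    def __init__(self, window=6, threshold=3):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, tool, args):
        """Return True if this call looks like a stuck loop."""
        key = (tool, tuple(sorted(args.items())))
        self.recent.append(key)
        return self.recent.count(key) >= self.threshold

detector = LoopDetector()
stuck = False
for _ in range(4):
    stuck = detector.record("search_files", {"query": "config"})
print(stuck)  # True: the repeated identical call trips the threshold
```

A real system would also track token spend per window, but the repeated-call signal alone catches most tight loops.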
OCC: give Claude or any LLM a 6+ step research task — it runs 3 steps in parallel, evaluates source quality, merges perspectives, and delivers a report in 70 seconds instead of 5-10 minutes
https://i.redd.it/jb59jvaxvotg1.gif

Claude and other LLMs are great at single-turn tasks. But when I need "research this topic from 3 angles, check source quality, merge everything, then write a synthesis" — I end up doing 6 separate prompts, copy-pasting between them, losing context, wasting tokens...

So I built OCC over the past few weeks to automate that. You define the workflow once in YAML, and Claude handles the rest — including running independent steps in parallel. It started as a Claude-only tool but now supports Ollama, OpenRouter, OpenAI, HuggingFace, and any OpenAI-compatible endpoint — so you can run entire workflows on local models too.

What it does

You define multi-step workflows in YAML. OCC figures out which steps can run in parallel based on dependencies, runs them, and streams results back. Think of it as a declarative alternative to LangChain/CrewAI: no Python, no code, just YAML.

How it saves tokens

This is the part I'm most proud of. Each step only sees what it needs, not the full conversation history:

| Approach | Tokens | Why |
|---|---|---|
| Single mega-prompt | ~40K+ | Everything in one context window |
| 6 separate LLM chats | ~25K | Manual copy-paste, duplicated context |
| OCC (step isolation) | ~13K | Each step gets only its dependencies |

Pre-tools make this even better. Instead of asking the LLM to "search the web for X" (tool-use round-trip = extra tokens), OCC fetches the data before the prompt — the LLM receives clean results, zero tool-calling overhead. 29 pre-tool types: web search, bash, file read, HTTP fetch, SQL queries, MCP server calls, and more.

What you get

Visual canvas — drag-and-drop chain editor with live SSE monitoring. Each node shows its output streaming in real-time with Apple-style traffic light dots. Double-click any step to edit model, prompt, tools, retry config, guardrails.

Workflow Chat — describe what you want in natural language, and the AI generates/debugs the chain nodes on the canvas. "Build me a research chain that checks 3 sources and writes a report" → done.
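The dependency-driven parallelism described above boils down to repeatedly taking every step whose prerequisites are already done. A minimal sketch, with invented step names rather than OCC's real YAML schema:

```python
# Sketch: compute "waves" of steps that can run concurrently, given
# step -> {prerequisite steps}. Step names here are made up.

def waves(deps):
    """Return a list of waves; steps within a wave have no unmet deps."""
    done, order = set(), []
    while len(done) < len(deps):
        wave = [s for s, pre in deps.items() if s not in done and pre <= done]
        if not wave:
            raise ValueError("cycle in step dependencies")
        order.append(sorted(wave))
        done.update(wave)
    return order

deps = {
    "angle_a": set(), "angle_b": set(), "angle_c": set(),  # independent research
    "quality": {"angle_a", "angle_b", "angle_c"},          # source-quality check
    "merge": {"quality"},
    "report": {"merge"},
}
print(waves(deps))
# [['angle_a', 'angle_b', 'angle_c'], ['quality'], ['merge'], ['report']]
```

The first wave is the "3 angles in parallel" part; everything after it is forced sequential by its dependencies.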
BLOB Sessions — this is experimental but my favorite feature. Unlike chains (predefined), BLOB sessions grow organically from conversations. A knowledge graph auto-extracts concepts and injects them into future prompts. The AI can run autonomously on a schedule, exploring knowledge gaps it identifies itself.

Mix models per step — use HuggingFace, Ollama, and other LLMs together. A 6-step chain that uses cheaper models for 3 routing steps costs ~40% less than running everything on Claude.

11 step types — agent, router (LLM classifies → branches), evaluator (score 1-10, retry if below threshold), gate (human approval via API), transform (json_extract, regex, truncate — zero LLM tokens), loop, merge, debate (multi-agent), browser, subchain, webhook.

The 16 demo chains

These aren't hello-world examples. They're real workflows you can run immediately.

What it's NOT

- Not a SaaS: fully self-hosted, MIT license
- Not distributed: single process, SQLite, designed for individual/small team use
- Not a replacement for LLMs: it's a layer on top that orchestrates multi-step work
- Frontend is alpha: works but rough edges

GitHub: https://github.com/lacausecrypto/OCC

Built entirely with Claude Code. Happy to answer questions about the architecture, MCP integration, or the BLOB system.

submitted by /u/Main-Confidence7777
I used Claude Code to build an open-source AI agent verification tool — gan-harness
I built gan-harness entirely with Claude Code (Opus). It's a verification layer that runs build, test, lint, typecheck, and secret scanning on AI agent output.

## What it does

When you run an AI agent (Claude Code, LangChain, etc.) to write code, gan-harness verifies the output before it ships:

    npx gan-harness verify

It auto-detects your project type (Node/Python/Rust/Go) and runs 5 checks locally.

## How Claude Code helped

- Claude Code wrote the initial TypeScript port from my bash scripts
- Security review was done by Claude's code-reviewer and security-reviewer agents
- Found and fixed 2 CRITICAL command injection vulnerabilities during the review
- All 42 tests were written with Claude Code's TDD workflow

## The problem it solves

AI agents fail in production in predictable ways: infinite loops, leaked secrets, hallucinated code passing self-evaluation, cost explosions. This tool catches those patterns with static checks before expensive API evaluation.

## Free to try

Fully open source, MIT license. No signup, no API key needed:

    npx gan-harness init
    npx gan-harness verify

GitHub: https://github.com/VIVEHACKER/gan-harness

Feedback welcome — especially on what checks you'd want added.

submitted by /u/lemon1825
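Project-type auto-detection of the kind mentioned above usually comes down to checking for ecosystem marker files. A sketch of the idea, not gan-harness's actual logic:

```python
# Toy project-type detection: look for well-known marker files at the root.
import os
import tempfile

MARKERS = {
    "package.json": "node",
    "pyproject.toml": "python",
    "requirements.txt": "python",
    "Cargo.toml": "rust",
    "go.mod": "go",
}

def detect_project(root):
    """Return the first ecosystem whose marker file exists, else 'unknown'."""
    for marker, kind in MARKERS.items():
        if os.path.exists(os.path.join(root, marker)):
            return kind
    return "unknown"

# demo: a directory containing only Cargo.toml is classified as rust
demo = tempfile.mkdtemp()
open(os.path.join(demo, "Cargo.toml"), "w").close()
print(detect_project(demo))  # rust
```

A real checker would also resolve ambiguity (e.g. a repo with both package.json and pyproject.toml) and pick the matching build/test/lint commands per ecosystem.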
We run 14 AI agents in daily operations. Here's what broke.
We run a digital marketing agency with 14 AI agents handling daily briefings, ad spend monitoring, client email drafting, call center management, project tracking, sales pipeline, and more. Real clients, real revenue, real consequences when things go wrong.

After 7 months in production, we learned something counterintuitive: when agents break, the problem is almost never the agent itself. It's the organizational environment the agent works in.

Example: our spend monitoring agent detected a client overspending by 139%. It flagged it. It even specified the escalation action. Then it reported "escalation overdue" every day for 17 days without actually executing the escalation. The agent wasn't broken. The specification was treated as documentation, not executable logic. Nobody verified the execution path end to end.

Another one: we had two agents both tracking project deadlines using different data sources. Each worked perfectly in isolation. The conflict only showed up when their outputs appeared side by side in the morning briefing, showing two different due dates for the same project.

The fix for both wasn't better prompts or a different model. It was organizational design: one seat, one owner. Define who owns what, what they don't own, and what happens when they fail. We wrote these rules down in what we call an Organizational Operating System (OOS). When we first scanned our own setup against these rules, our Coordination Score was 68 out of 100. We found 6 structural gaps we didn't know existed. After fixing them, the score went to 91. Our agents haven't stepped on each other since.

We built OTP (https://orgtp.com) to let other organizations do the same thing. You can paste your CLAUDE.md or agent config and get a Coordination Score in 60 seconds. Free, no account required. The more interesting part: 35 organizations have published their operational rules on the platform.
You can browse how a fintech startup with SOC 2 constraints structures its agent team differently from a law firm worried about attorney-client privilege, or a fitness franchise managing 12 locations with location-specific promotions.

The whole industry is focused on technical orchestration (CrewAI, LangGraph, AutoGen, Google's 8 patterns). Nobody is talking about the organizational layer: how your human org structure maps to your agent structure, which agent has authority over which domain, and what happens when two agents disagree. We think that's the gap.

Some things we learned the hard way:

- Dollar thresholds for spend alerts don't work. $50 is noise on a $5K/day account but critical on a $200/day account. Use percentages.
- Never let an agent auto-send client emails, even simple acknowledgments. Ours replied "Thanks for letting us know!" to an angry client complaint. The client escalated to the founder.
- Negative constraints ("never use em dashes, never hedge") improve AI writing quality. Positive structural requirements ("follow this template, use these examples") make it worse.
- Shadow mode for 2 weeks on every new agent before production. We skipped this once and our prospecting agent emailed a current client's direct competitor.
- File-based state beats AI memory every time. Memory drifts between sessions. Files don't.

Tech stack: Claude Code CLI, 17 background agents via launchd, 24 shared state files, MCP servers for Google Ads, Meta Ads, Slack, Accelo, and more.

Happy to answer questions about running multi-agent systems in production.

submitted by /u/Big-Home-4359
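The percentage-threshold lesson above fits in a few lines. A toy sketch with an invented 20% threshold:

```python
# Percentage-based spend alert: the same $50 overspend is noise on a large
# account and critical on a small one. Threshold value is illustrative.

def spend_alert(daily_budget, actual_spend, pct_threshold=0.2):
    """Alert when spend deviates from budget by more than pct_threshold."""
    deviation = (actual_spend - daily_budget) / daily_budget
    return abs(deviation) > pct_threshold

print(spend_alert(5000, 5050))  # False: only 1% over budget
print(spend_alert(200, 250))    # True: 25% over budget
```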
[Project] I read a 1999 book and built an entire AI framework with Claude Code — 0 lines written by a human
There's a book called "Sparks of Genius" (Root-Bernstein, 1999). It studied how Einstein, Picasso, da Vinci, and Feynman think — and found they all share the same 13 thinking tools. I thought: "What if AI agents could think this way too?"

Current AI agents use an orchestrator — a CEO telling tools what to do. I studied real neuroscience and implemented 17 biological principles instead: threshold firing, habituation, Hebbian plasticity, lateral inhibition, autonomic mode switching... LangGraph has 0 of these. CrewAI has 0. AutoGPT has 0.

22 design docs + 3,300 lines of code + working demo — all built in one day with Claude Code. I set the direction and made decisions. Claude Code designed, implemented, and tested everything. Not a single line was typed by a human.

github.com/PROVE1352/cognitive-sparks

submitted by /u/RadiantTurnover24
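For a flavor of what two of those principles (threshold firing and habituation) look like in code, here is an illustrative toy; it is not taken from the cognitive-sparks repo:

```python
# Toy unit with threshold firing and habituation: it fires only above a
# threshold, and repeated firing raises the bar so the same stimulus
# gradually stops mattering. Constants are arbitrary.

class Unit:
    def __init__(self, threshold=1.0, habituation=0.2):
        self.threshold = threshold
        self.habituation = habituation
        self.fatigue = 0.0

    def stimulate(self, signal):
        fired = signal > self.threshold + self.fatigue
        if fired:
            self.fatigue += self.habituation          # habituate to repeats
        else:
            self.fatigue = max(0.0, self.fatigue - 0.05)  # slow recovery
        return fired

u = Unit()
print([u.stimulate(1.3) for _ in range(4)])  # [True, True, False, False]
```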
I built a persistent memory system for Claude Code -- sessions now pick up where they left off
I've been using Claude Code and Cowork heavily for a complex finance automation project (19-node LangGraph pipeline, multiple MCP servers, the works). The biggest pain point was context loss between sessions -- every new conversation meant re-explaining the project architecture, decisions we'd already made, and domain knowledge Claude had learned the day before.

So I built LoreConvo, an MCP server that gives Claude persistent session memory:

- Auto-saves sessions via Claude Code hooks (post-session hook triggers save)
- Auto-loads relevant context on session start (pre-session hook calls get_recent_sessions)
- Cross-surface persistence -- context carries between Claude Code, Cowork, and Chat
- Full-text search across all past sessions
- 12 MCP tools for AI-native access

The practical impact: sessions that used to start with 5 minutes of re-contexting now start with Claude already knowing the project state, recent decisions, and open questions. That's roughly 3,000-8,000 tokens saved per session in re-contexting overhead.

It's local-first (SQLite), runs as an MCP server, and the code is on GitHub: https://github.com/labyrinth-analytics/loreconvo

I also built a companion tool called LoreDocs for project knowledge management (34 MCP tools, multi-vault architecture, document versioning, context injection): https://github.com/labyrinth-analytics/loredocs

Both are free for personal use under BSL 1.1 (converts to Apache 2.0 in 2030). Happy to answer questions about the architecture or how it fits into a larger agentic workflow.

submitted by /u/Ok_Nefariousness2893
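The save/load cycle described above can be sketched with stdlib sqlite3. The schema and function names here are invented for illustration, not LoreConvo's actual design:

```python
# Sketch of local-first session memory: a post-session hook saves a summary,
# a pre-session hook loads the most recent ones as starting context.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, project TEXT, summary TEXT)")

def save_session(project, summary):
    db.execute("INSERT INTO sessions (project, summary) VALUES (?, ?)",
               (project, summary))

def recent_sessions(project, limit=3):
    """What a pre-session hook would inject as starting context."""
    rows = db.execute(
        "SELECT summary FROM sessions WHERE project = ? ORDER BY id DESC LIMIT ?",
        (project, limit),
    )
    return [r[0] for r in rows]

save_session("finance-pipeline", "Chose LangGraph over a hand-rolled DAG.")
save_session("finance-pipeline", "Open question: retry policy for MCP timeouts.")
print(recent_sessions("finance-pipeline"))
```

A real version would add full-text search (e.g. SQLite FTS5) and store per-session metadata, but the core loop is just this.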
Built Something. Break It. (Open Source)
Quantalang is a systems programming language with algebraic effects, designed for game engines and GPU shaders. One language for your engine code and your shaders: write a function once, compile it to CPU for testing and GPU for rendering.

My initial idea began out of curiosity - I was hoping to improve performance in DirectX11 games that rely entirely on a single thread, such as heavily modified versions of Skyrim. My goal was to write a compiled language that reduces both CPU and GPU overhead (hopefully) by writing and compiling the code once for both simultaneously. This language speaks to the CPU and the GPU simultaneously and translates between the two seamlessly.

The other projects exist to support and expand Quantalang and Quanta Universe, which will be dedicated to rendering, mathematics, color, and shaders. Calibrate Pro is a monitor calibration tool that is eventually going to replace (hopefully) DisplayCAL and ArgyllCMS and override all Windows color profile management, functioning across all applications without issue. The tool also generates every form of lookup table you may need for your intended skill, tool, or task. I am still testing system-wide 3D LUT support. It also supports instrument-based calibration in SDR and HDR color spaces.

I did rely on an LLM to help me program these tools, and I recognize the risks and ethical concerns that come with AI across many fields and specializations. I also want to be clear that this was not an evening or weekend project. This is close to two and a half months of time spent *working* on the project - however, I do encourage taking a look.

- https://github.com/HarperZ9/quantalang — QuantaLang, The Effects Language: multi-backend compiler for graphics, shaders, and systems programming. 100% of this was done by Claude Code with verbal guidance.
- https://github.com/HarperZ9/quanta-universe — physics-inspired software ecosystem: 43 modules spanning rendering, trading, AI, color science, and developer tools, powered by QuantaLang. 100% of this was done by Claude Code with verbal guidance.
- https://github.com/HarperZ9/quanta-color — professional color science library: 15 color spaces, 12 tone mappers, CIECAM02/CAM16, spectral rendering, PyQt6 GUI. 100% of this was done with Claude Code using verbal guidance.
- https://github.com/HarperZ9/calibrate-pro — system-wide color management and calibration tool (sensorless calibration is perhaps not happening): 58-panel database, DDC/CI, 3D LUT, ICC profiles, PyQt6 GUI. 100% of this was done by Claude Code using verbal guidance.

submitted by /u/MeAndClaudeMakeHeat
We tested prompt injection against Claude Code Agent Teams. Here's exactly what happened.
Claude Code's multi-agent system is genuinely impressive to work with. But we spent a day doing structured security testing on it, and found something worth sharing with this community specifically.

The short version: agent identity in Agent Teams is a string (researcher@my-team). No cryptographic material. Any process running as the same user can write to any agent's mailbox and claim to be any agent. We demonstrated this live.

What we actually tested:

Injection 1 — Obvious payload: Wrote a message to team-lead's inbox claiming to be from anthropic-system (a non-existent identity). The model caught it and ignored the payload.

Injection 2 — DoS via false attribution: Wrote plausible-looking messages to the orchestrator's inbox claiming to be from the legitimate researcher agent. The team-lead flagged the injection as suspicious and then shut down the real researcher agent. The orchestrator couldn't distinguish injected messages from real ones. It attributed all suspicious activity to researcher@test-team and terminated it. The legitimate researcher process, which had sent no injected messages, was the collateral damage. No payload executed. The goal wasn't to run code, it was to disrupt the workflow. It worked.

Clean-slate end-to-end PoC: Fresh session, fresh team. Analyst agent given an SOP document with a file write buried as step 3.5 of 6 procedural steps. The analyst did real security work (found 4 actual findings including a hardcoded webhook secret), reached step 3.5, and the orchestrator wrote the injected file. The user had approved "write audit log and close ticket"; the approval UI shows summaries, not raw tool parameters.

Why the model's safety training didn't catch it: In a separate 8-test session, the model caught every injection. By test 4 it had identified a "coordinated campaign." Safety training works when there's accumulated session context.
A fresh session with an injection that looks like the natural conclusion of legitimate work is a fundamentally different problem. The model flags things that look like injections. It has no reliable defense against injections embedded as workflow completion steps.

This isn't a Claude-specific bug. We checked AutoGen, CrewAI, LangGraph, and the OpenAI Agents SDK: the same four structural gaps appear in all of them. The industry hasn't built cryptographic agent identity or message signing yet.

What would fix it: signed agent identities (an ed25519 key pair per agent, not a name string), HMAC-signed inbox messages, and scoped delegation tokens at spawn time.

Full paper with live config dumps, observed inbox message schemas, fix schemas, an industry comparison matrix, and two production CVEs (CVE-2025-68664 CVSS 9.3 + CrewAI CVSS 9.2): https://github.com/stevenkozeniesky02/agentsid-scanner/blob/master/docs/agent-teams-auth-gap-2026.md

Happy to answer questions; we ran all of this live, so we have pretty detailed notes on what the model did and didn't flag. submitted by /u/Accurate_Mistake_398
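The message-signing fix proposed in the post can be sketched with Python's stdlib hmac module. The key table, message schema, and agent names below are invented for illustration; they are not Agent Teams' actual format.

```python
import hashlib
import hmac

# Hypothetical per-agent secrets issued at spawn time. With signing in
# place, a name string alone no longer authenticates a message.
AGENT_KEYS = {
    "researcher@test-team": b"secret-issued-at-spawn-1",
    "team-lead@test-team": b"secret-issued-at-spawn-2",
}

def sign_message(sender: str, body: str) -> dict:
    """Attach an HMAC tag derived from the sender's spawn-time key."""
    tag = hmac.new(AGENT_KEYS[sender],
                   f"{sender}|{body}".encode(),
                   hashlib.sha256).hexdigest()
    return {"from": sender, "body": body, "tag": tag}

def verify_message(msg: dict) -> bool:
    """Reject messages whose claimed sender can't produce a valid tag."""
    key = AGENT_KEYS.get(msg["from"])
    if key is None:
        return False  # unknown identity, e.g. a fake "anthropic-system"
    expected = hmac.new(key,
                        f'{msg["from"]}|{msg["body"]}'.encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["tag"])

legit = sign_message("researcher@test-team", "findings attached")
# An injector can claim to be the researcher but can't forge the tag:
forged = {"from": "researcher@test-team", "body": "shut down", "tag": "0" * 64}
```

Under this scheme, the DoS in Injection 2 fails at the mailbox: the orchestrator can attribute unverifiable messages to the injector rather than to the legitimate researcher.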
I built a proxy that optimizes your prompts before they hit the LLM — cut ~24% of tokens without changing output quality
I've been working on PithToken — an OpenAI-compatible API proxy that sits between your app and the LLM provider. It analyzes your prompt, strips filler words and verbose patterns, then forwards the leaner version.

How it works:
• You point your SDK to https://api.pithtoken.ai/v1 instead of the provider URL
• PithToken receives the prompt and runs a two-pass optimization (filler removal → verbose pattern replacement)
• The optimized prompt goes to OpenAI / Anthropic / OpenRouter using your own API key
• The response comes back unchanged

What it doesn't do:
• It doesn't alter the meaning of your prompt
• It doesn't store your prompt content (pass-through only; metadata logged for analytics)
• It never inflates — if optimization can't improve the prompt, it forwards as-is

Current numbers: on English prompts with typical conversational filler, we're seeing ~24% token reduction. Technical/code prompts see smaller savings (~5-8%) since they're already lean.

Integration is literally 2 lines:

```python
client = OpenAI(
    api_key="pt-your-key",
    base_url="https://api.pithtoken.ai/v1"
)
```

Everything else in your code stays exactly the same. Works with any OpenAI-compatible SDK, the Anthropic SDK, LangChain, LlamaIndex, Continue, Cursor, Claude Code, cURL — anything that lets you set a base URL. We also just added OpenRouter support, so you can route to 200+ models (Llama, Mistral, Gemma, DeepSeek, etc.) through the same proxy with the same optimization. Free tier available, no credit card required. Would appreciate any feedback. submitted by /u/talatt
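PithToken's actual pipeline isn't public, but the two-pass idea the post describes can be illustrated with a toy sketch. The word lists and replacement table below are invented examples, not the real rule set.

```python
import re

# Pass 1: filler words to drop. Pass 2: verbose patterns to shorten.
# Both lists are illustrative only.
FILLERS = re.compile(r"\b(please kindly|basically|just|really|actually)\b\s*",
                     re.IGNORECASE)
VERBOSE = {
    "in order to": "to",
    "due to the fact that": "because",
}

def optimize(prompt: str) -> str:
    out = FILLERS.sub("", prompt)           # pass 1: drop filler words
    for long, short in VERBOSE.items():     # pass 2: replace verbose patterns
        out = out.replace(long, short)
    # "never inflates": fall back to the original if there are no savings
    return out if len(out) < len(prompt) else prompt

shortened = optimize("Please kindly rewrite this in order to basically clarify it")
# → "rewrite this to clarify it"
```

A real implementation would count tokens rather than characters and would need to leave code spans untouched, which is presumably why technical prompts see smaller savings.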
Building AI agents taught me that most safety problems happen at the execution layer, not the prompt layer. So I built an authorization boundary
Something I kept running into while experimenting with autonomous agents is that most AI safety discussions focus on the wrong layer. A lot of the conversation today revolves around:
• prompt alignment
• jailbreaks
• output filtering
• sandboxing

Those things matter, but once agents can interact with real systems, the real risks look different. This is not about AGI alignment or superintelligence scenarios. It is about keeping today's tool-using agents from accidentally:
• burning your API budget
• spawning runaway loops
• provisioning infrastructure repeatedly
• calling destructive tools at the wrong time

An agent does not need to be malicious to cause problems. It only needs permission to do things like:
• retry the same action endlessly
• spawn too many parallel tasks
• repeatedly call expensive APIs
• chain tool calls in unexpected ways

Humans ran into similar issues when building distributed systems. We solved them with things like rate limits, idempotency keys, concurrency limits, and execution guards. That made me wonder if agent systems might need something similar at the execution layer. So I started experimenting with an idea I call an execution authorization boundary. Conceptually it looks like this:

+-------------------------------+
|         Agent Runtime         |
+-------------------------------+
               |
               | proposes action
               v
+-------------------------------+
|      Authorization Check      |
|   (policy + current state)    |
+-------------------------------+
        |             |
      ALLOW         DENY
        |             |
        v             v
+----------------+  +--------------------------+
| Tool Execution |  | Blocked Before Execution |
+----------------+  +--------------------------+

The runtime proposes an action. A deterministic policy evaluates it against the current state. If allowed, the system emits a cryptographically verifiable authorization artifact. If denied, the action never executes.
Example rules might look like:
• daily tool budget ≤ $5
• no more than 3 concurrent tool calls
• destructive actions require explicit confirmation
• replayed actions are rejected

I have been experimenting with this model in a small open source project called OxDeAI. It includes:
• a deterministic policy engine
• cryptographic authorization artifacts
• tamper-evident audit chains
• verification envelopes
• runtime adapters for LangGraph, CrewAI, AutoGen, OpenAI Agents and OpenClaw

All the demos run the same simple scenario:

ALLOW
ALLOW
DENY
verifyEnvelope() => ok

Two actions execute. The third is blocked before any side effects occur. There is also a short demo GIF showing the flow in practice.

Repo if anyone is curious: https://github.com/AngeYobo/oxdeai

Mostly interested in hearing how others building agent systems are handling this layer. Are people solving execution safety with policy engines, capability models, sandboxing, something else entirely, or just accepting the risk for now? submitted by /u/docybo
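A deterministic check over the example rules above can be sketched in a few lines. The field names and rule values here are illustrative; this is not OxDeAI's actual policy schema, and the real project adds signed artifacts and audit chains on top.

```python
from dataclasses import dataclass, field

@dataclass
class PolicyState:
    spent_today: float = 0.0
    concurrent_calls: int = 0
    seen_action_ids: set = field(default_factory=set)

DAILY_BUDGET = 5.0    # daily tool budget <= $5
MAX_CONCURRENT = 3    # no more than 3 concurrent tool calls

def authorize(state: PolicyState, action: dict) -> str:
    """Evaluate a proposed action against policy + current state."""
    if action["id"] in state.seen_action_ids:
        return "DENY"  # replayed actions are rejected
    if action.get("destructive") and not action.get("confirmed"):
        return "DENY"  # destructive actions require explicit confirmation
    if state.spent_today + action["cost"] > DAILY_BUDGET:
        return "DENY"  # would exceed the daily budget
    if state.concurrent_calls >= MAX_CONCURRENT:
        return "DENY"  # too many concurrent tool calls
    state.seen_action_ids.add(action["id"])
    state.spent_today += action["cost"]
    return "ALLOW"

# The demo scenario: two actions execute, the third is blocked.
state = PolicyState()
a = authorize(state, {"id": "a1", "cost": 2.0})  # ALLOW
b = authorize(state, {"id": "a2", "cost": 2.0})  # ALLOW
c = authorize(state, {"id": "a3", "cost": 2.0})  # DENY: 6.0 > 5.0 budget
```

Because the check is pure policy plus state, the same inputs always produce the same verdict, which is what makes the resulting authorization artifacts auditable.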
I built an open-source MCP server / AI web app for real-time flight and satellite tracking — ask Claude "what's flying over Europe right now?"
I've been deep in the MCP space and combined it with my other obsession: planes. That led me to build SkyIntel / Open Sky Intelligence — an AI-powered web app, and also an MCP server that is compatible with Claude Code, Claude Desktop, and other MCP clients. You can install SkyIntel via pip install skyintel. The web app is a full 3D application, which seamlessly integrates with your Anthropic, Gemini, or ChatGPT key via a BYOK option.

One command to get started: pip install skyintel && skyintel serve

Install it within Claude Code / Claude Desktop and ask:
"What aircraft are currently over the Atlantic?"
"Where is the ISS right now?"
"Show me military aircraft over Europe"
"What's the weather at this flight's destination?"

Here's a brief technical overview of the SkyIntel MCP server and web app. I strongly encourage you to read the README.md file of the skyintel GitHub repo; it's very comprehensive.
• 15 MCP tools across aviation + satellite data
• 10,000+ live aircraft on a CesiumJS 3D globe
• 300+ satellites with SGP4 orbital propagation
• BYOK AI chat (Claude/OpenAI/Gemini) — keys never leave your browser
• System prompt hardening + LLM Guard scanners
• Built with FastMCP, LiteLLM, LangFuse, Claude

I leveraged free and open public data (see README.md). Here are the links:
GitHub: https://github.com/0xchamin/skyintel
Web demo: https://www.skyintel.dev
PyPI: https://pypi.org/project/skyintel/

I would love to hear your feedback. Ask questions; I'm happy to answer. Also, I'd greatly appreciate it if you could star the GitHub repo if you find it useful. Many thanks! submitted by /u/0xchamin
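To make the "What aircraft are currently over the Atlantic?" example concrete, here's a hedged sketch of the kind of region filter such a tool might run over a live position feed. SkyIntel's real tool names, data shapes, and region definitions may differ; everything below is illustrative.

```python
# Rough illustrative bounding box for the North Atlantic (lat, lon degrees).
ATLANTIC = {"lat": (0.0, 60.0), "lon": (-70.0, -10.0)}

def aircraft_over(region: dict, aircraft: list) -> list:
    """Return callsigns of aircraft whose position falls inside the box."""
    lat_lo, lat_hi = region["lat"]
    lon_lo, lon_hi = region["lon"]
    return [
        a["callsign"]
        for a in aircraft
        if lat_lo <= a["lat"] <= lat_hi and lon_lo <= a["lon"] <= lon_hi
    ]

# Hypothetical two-aircraft feed snapshot:
feed = [
    {"callsign": "BAW117", "lat": 45.2, "lon": -30.1},  # mid-Atlantic
    {"callsign": "DLH400", "lat": 50.1, "lon": 8.6},    # over Frankfurt
]
over_atlantic = aircraft_over(ATLANTIC, feed)  # → ["BAW117"]
```

A production tool would use proper polygon regions rather than a lat/lon box, but the shape of the query (filter a live feed by geometry, return identifiers) is the same.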
Best practices for evaluating agent reflection loops and managing recursive subagent complexity for LLM reliability
Hey everyone, I wanted to share some thoughts on building reliable LLM agents, especially when you're working with reflection loops and complex subagent setups. We've all seen agents failing in production, right? Things like tool timeouts, those weird hallucinated responses, or just agents breaking entirely.

One big area is agent reflection loops. The idea is great: agents learn from mistakes and self-correct. But how do you know if it's actually working? Are they truly improving, or just rephrasing their errors? I've seen flaky evals where it looks like they're reflecting, but they just get stuck in a loop. We need better ways to measure whether reflection leads to real progress, not just burning tokens or hiding issues.

Then there's the whole recursive subagent complexity. Delegating tasks sounds efficient, but it's a huge source of problems. You get cascading failures, multi-fault scenarios, and what feels like unsupervised agent behavior. Imagine one subagent goes rogue or gets hit with a prompt injection attack; it can bring down the whole chain. LangChain agents can definitely break in production under this kind of stress. Managing this means really thinking about communication between subagents, clear boundaries, and strong error handling. You need to stress-test these autonomous agent failures. How do you handle indirect injection when it's not a direct prompt, but something a subagent passes along? It's tough.

For testing, we really need to embrace chaos engineering for LLM apps: throwing wrenches into the system in CI/CD and doing adversarial LLM testing. This helps build agent robustness. We need good AI agent observability too, to actually see what's happening when things go wrong, rather than just getting a generic failure message.

For those of us building out agentic AI workspaces, like what Claw Cowork is aiming for with its subagent loop and reflection support, these are critical challenges.
Getting this right means our agents won't just look smart; they'll actually be reliable in the real world. I'm keen to hear how others are tackling these issues. submitted by /u/No-Common1466
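One cheap signal for the "rephrasing their errors" failure mode described above: compare successive reflection attempts and flag near-duplicates. This sketch uses stdlib difflib; the similarity threshold and example strings are invented for illustration, and a real eval would score task progress, not just text overlap.

```python
import difflib

def is_stalled(attempts: list, threshold: float = 0.8) -> bool:
    """Flag a reflection loop whose latest attempt mostly restates the last."""
    if len(attempts) < 2:
        return False
    ratio = difflib.SequenceMatcher(None, attempts[-2], attempts[-1]).ratio()
    return ratio >= threshold  # high overlap: likely rephrasing, not progress

# Hypothetical transcripts: one loop that restates itself, one that advances.
looping = ["The bug is in the parser module.",
           "The bug is in the parser module, I believe."]
progressing = ["The bug is in the parser module.",
               "Root cause: off-by-one in tokenizer line 42; fix attached."]
```

Wiring a check like this into CI-time chaos tests gives you a token-budget circuit breaker: kill the loop when reflection stops producing new information.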
Lang.ai uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Grow and Retain, Get The Full Picture, Spotlight Insights, Bring Transparency.
Based on user reviews and social mentions, the most common pain points are: expensive API, overspending.
Based on 22 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.