This website will change how you think about your business. With our team of experts and AI, we can transform your brand in a way you thought not possible.
The first AI developed by engineers and scientists for food scientists, establishing a new paradigm of domain-specific AI infrastructure for FMCG. A high-fidelity, provenance-rich corpus built from instrumented analytics (e.g., GC, FTIR), formulations, sensory panels, and process parameters. Live programs with global leaders tackling reformulation, cost, compliance, and scale-up across categories and regions.

NotCo AI transforms food and CPG companies by orchestrating the entire product lifecycle, de-risking development and enabling them to hit ambitious innovation targets with unprecedented speed and radical efficiency. It explores vast formulation spaces under hard constraints to propose viable concepts, cutting cycles from years to months. Agentic retrieval (RAG) ingests scientific literature, regulations, and ingredient specs so candidates are compliant by construction. It bridges lab to plant with multi-objective optimization, tuning for cost, nutrition, sensory, manufacturability, and throughput.

NotCo AI acts as the central nervous system of your product development organization, connecting consumer intelligence, formulation, regulatory compliance, and manufacturing feasibility in a single, auditable workflow. The existing workflow wasn't designed for the dimensionality of today's problem, and no amount of tools bolted onto it will fix that. You need a different machine: a workflow transformation. Shift from analog guessing to AI-driven precision to move at the speed of culture, turning the complexity of the physical world into a playground for infinite possibility.

A decade of proprietary molecular and sensory data. Live production programs with global CPG leaders across categories and regions. The operating system is ready. Identify opportunities, optimize innovation, growth, and formulations at unprecedented speed, and launch manufacturable, compliant products faster, better, and more effectively.
Mentions (30d): 0
Reviews: 0
Platforms: 2
Sentiment: 0% (0 positive)
Industry: information technology & services
Employees: 840
Funding Stage: Series D
Total Funding: $690.6M
How I built a browser-based network validation simulator and a custom Linear/GitHub MCP server with Claude Code: ~1,400 commits in 3.5 months
I'm a software/network engineer and I've always been frustrated with the gap between enterprise simulators (GNS3, Packet Tracer) and simple diagramming tools (Visio, draw.io). Emulators are overkill for learning and require tons of setup and niche expertise, and diagramming doesn't actually validate your network designs. So I built something in between with Claude Code. If you're more interested in the Linear/GitHub MCP, skip down to the SwarmCode section :)

NetSandbox - a browser-based network topology design and real-time validation tool

Drag routers, switches, and hosts onto a canvas, configure IPs/VLANs/OSPF/BGP/ACLs visually, and it tells you what's misconfigured. Find duplicate IPs, VLAN trunk mismatches, routing issues, and STP loops. There's also a CLI emulator and guided lessons from basic LANs to eBGP peering to help prepare for networking certs - ALL IN THE BROWSER!

Screenshot: https://preview.redd.it/wjhz9e6o44ug1.png?width=2439&format=png&auto=webp&s=5d45b2b957893453a1b9982ae6e74dc0a07cb720

How Claude Code was involved

Claude Code gave me the ability to even consider this project: ~1,400 commits over 3.5 months, only in off-work hours and on weekends. I handled architecture decisions, product direction, and edge-case debugging; Claude did the bulk of the implementation. I was able to build the MVP myself using React, and after hitting major performance barriers I decided to give Claude Code a shot and had it refactor the entire codebase to Svelte. It also handled the migration from SQLite to Postgres for me. Being able to build this in such a short time frame has really changed my perspective on software engineering as a whole. I had a blast during what reminded me of CoD double-XP weekends when Claude doubled my tokens for Christmas break, which is when I really committed to this project. Once I started adding subagents, things really started taking off. I ended up with a team of about 20 sub-agents ranging from network engineering experts to Svelte frontend developers and security auditors. Not long after that I was running Claude remote control, ralph loops, various skills like Vercel agent-browser, automated Playwright tests, and building my own custom MCP workflow tools for linear.app.

The Linear and GitHub MCP - SwarmCode

https://github.com/TellerTechnologies/swarmcode

After struggling with managing my ideas, backlogs, and issues, and not having a great time with GitHub alone, I ended up using Linear and tried out their MCP. I liked that I could have Claude update Linear for me, but then I realized I wanted more: the ability to vibe code entire features from backlog to PR with Linear being updated autonomously. That's when I created an open-source tool called SwarmCode, built entirely with Claude Code, to help me track features for NetSandbox. The concept behind SwarmCode is that a team can be working in the same Linear team and GitHub repositories, and Claude will pull items from the backlog, move them to in-progress on Linear, and understand what your teammates are working on at all times. You can ask, "What is Bob working on right now?" and Claude understands. GitHub issues and PRs are mapped to Linear tasks automatically, and the flows just happen. To test this, some friends and I used it in a hackathon to build an app with Claude insanely fast - three users vibe coding through this Linear workflow was so fun. A minimal sketch of what one such MCP tool could look like is below.
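To make the "Claude updates Linear via MCP" idea concrete, here is a minimal sketch of an MCP tool that moves a Linear issue to a new workflow state, using the official Python MCP SDK's FastMCP helper. This is not the SwarmCode implementation; the tool name, the `LINEAR_API_KEY` environment variable, and the exact GraphQL mutation are illustrative assumptions.

```python
# Hypothetical sketch: an MCP tool that moves a Linear issue to "In Progress".
# Not the SwarmCode implementation; names and the GraphQL details are assumptions.
import os
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("swarmcode-sketch")

LINEAR_API = "https://api.linear.app/graphql"

@mcp.tool()
def move_issue(issue_id: str, state_id: str) -> str:
    """Move a Linear issue to the given workflow state (e.g. In Progress)."""
    mutation = """
    mutation($issueId: String!, $stateId: String!) {
      issueUpdate(id: $issueId, input: { stateId: $stateId }) { success }
    }
    """
    resp = httpx.post(
        LINEAR_API,
        json={"query": mutation, "variables": {"issueId": issue_id, "stateId": state_id}},
        headers={"Authorization": os.environ["LINEAR_API_KEY"]},
    )
    resp.raise_for_status()
    ok = resp.json()["data"]["issueUpdate"]["success"]
    return "moved" if ok else "failed"

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so Claude can call it
```

A real version would presumably add tools for pulling backlog items and mapping GitHub issues/PRs, but the shape is the same: one decorated function per action Claude is allowed to take.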
Any feedback on both projects is welcome. If you're a student or a network engineer and want to seriously use the tool, reach out to me and we can work out some free premium subscriptions in exchange for helping me get started :) Try it here: https://app.netsandbox.io Happy to answer any questions about the dev process or the networking side of things. Cheers! submitted by /u/jaredt17
Layman: Agentic Insight and Oversight (same same but different)
What's the most common duplicate project on r/ClaudeAI? Usage trackers. What's the second most common? AI monitors. Does Layman do those things? Yes, of course. So what makes it different?

[Screenshot: Layman's Dashboard, Flowchart, and Logs views, with Layman's Terms and Analysis examples]

Like many similar tools, Layman runs as a web service in a container on your local machine. It installs hooks and accesses harness logs to "look over your shoulder," then leverages a secondary AI instance to help keep your multiple sessions, sub-agents, and alternate harnesses in line. So, short answer:

Drift Monitoring. Repeatedly named as one of the most frustrating issues for heavy Claude Code users. Layman takes into account all user prompts issued to CC as well as current project and global CLAUDE.md instructions, and at configurable intervals scores the current degree of "drift" from your goals and the rules you have established. You can optionally receive warning notifications or place a block when different thresholds are reached.

Risk Analysis. Layman classifies all tool calls and operations with a "risk" level based on simple, consistent criteria (such as read-only, writing, modifying, network access, deletion, etc.) and can automatically analyze the AI agent's current intended action, the overall goal or purpose behind that intention, and summarize the safety and security implications at stake. (A toy sketch of this classification idea is included below.)

Layman's Terms. The eponymous origin of the tool, offering a plain-language (and if possible non-technical) explanation of the purpose of any given tool call. It can summarize what was performed at the session level as well, which is helpful for later recall and understanding after some time has passed.

Vibe coders aside, should a professional developer already know what their tools are doing before they grant permission? Yes, of course. But when you are operating at scale and (say) that TypeScript project you are polishing needs to look up some JSON value and your AI agent writes a one-off Python script to parse it out, it can be helpful to have an "extra pair of eyes" taking a look before you effectively begin yet another code review.

Meanwhile, the typical features you might expect are included: Session Recording (opt-in is required first for data tracking and there is no telemetry to worry about), Bookmarking and Search, PII filtering (including PATs and API keys), file and URL access tracking, and a handy Setup Wizard for getting those hooks installed in the first place and walking you through configuration of core capabilities. Did I mention that besides Claude Code it supports Codex, OpenCode, Mistral Vibe, and Cline (with more to come)? Whether you use these for local agents or as an alternative when hitting session limits, Layman can monitor and track them all at once.

But wait, doesn't a "secondary AI instance" just end up wasting tokens? My Precious? (erm...) Our precious, precious tokens? When session limits already hit so hard? It turns out these algorithms do not require nearly the level of "intelligence" you might want for your planning and coding sessions themselves. Personally I keep an instance of Qwen3-Coder-Next running locally via llama.cpp server on my system's GPU to field those calls, with no discernible impact on system performance. And when a local LLM is not available, Haiku does the job excellently (now you have a reason to use it). You absolutely do not need to use anything more resource-intensive to get the job done. Now you have a complete picture.
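The toy sketch mentioned above: this is not Layman's code, just an illustration of how a hook payload could be bucketed into coarse risk levels. The tool-name sets, the regex patterns, and the simplified payload shape (`tool_name` plus an `input` dict) are assumptions for illustration.

```python
# Illustrative only: not Layman's implementation. Assumes a simplified
# tool-call payload with a "tool_name" and an "input" dict.
import re

READ_ONLY = {"Read", "Glob", "Grep"}
WRITE = {"Write", "Edit"}

def classify_risk(tool_name: str, tool_input: dict) -> str:
    """Bucket a tool call into a coarse risk level."""
    if tool_name in READ_ONLY:
        return "low"       # read-only operations
    if tool_name in WRITE:
        return "medium"    # modifies files in the workspace
    if tool_name == "Bash":
        command = tool_input.get("command", "")
        if re.search(r"\brm\b|\bdd\b|mkfs", command):
            return "high"      # destructive shell commands
        if re.search(r"curl|wget|ssh", command):
            return "medium"    # network access
        return "medium"        # any other shell command
    return "unknown"

print(classify_risk("Bash", {"command": "rm -rf build/"}))  # -> high
```

The real tool layers an LLM's judgment on top of rules like these; the point of the sketch is only that the base classification is cheap and deterministic.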
GitHub repository: https://github.com/castellotti/layman (License: MIT) submitted by /u/jigsaw-studio
A Claude memory retrieval system that actually works (easily) and doesn't burn all my tokens
TL;DR: By talking to Claude and explaining my problem, I built a very powerful local "memory management" system for Claude Desktop that indexes project documents and lets Claude automatically retrieve relevant passages buried inside those documents during Co-Work sessions. For me it solves the "document memory" problem where tools like NotebookLM, Notion, Obsidian, and Google Drive can't be queried programmatically. Claude did all of it; I didn't really have to do anything. The description below includes plenty of things that I don't completely understand myself. The key thing is just to explain to Claude what the problem is (which I describe below) and what your intention is, and Claude will help you figure it out. It was very easy to set this up, and I think it's better than anything I've seen any YouTuber recommend.

The details: I have a really nice solution to the Claude external memory/external brain problem that lots of people are trying to address. Although my system is designed for one guy using his laptop, not a large company with terabytes of data, the general approach could be scaled up just by substituting different tools. I wanted to create a Claude external memory system that is connected to Claude Co-Work in the desktop app. What I really wanted was for Claude to proactively draw from my entire base of knowledge for each project, not just from the documents I dropped into my project folder in Claude Desktop. Basically, I want Claude to have awareness of everything I have stored on my computer, in the most efficient way possible (Claude can use lots of tokens if you don't manage the "memory" efficiently). I've played with Notion and Google Drive as an external brain. I've tried NotebookLM. And I was just beginning to research Obsidian when I read this article, which I liked very much and highly recommend: https://limitededitionjonathan.substack.com/p/stop-calling-it-memory-the-problem

That got my attention, so I asked Claude to read the document and give me his feedback based on his understanding of the projects I was trying to work on. Claude recommended using SQLite to connect to structured facts, an optional graph to show some relationships, and .md files for instructions to Claude. But I pointed out that almost all of the context information I would want to be retrievable from memory is text in documents, not structured data. Claude's response was very helpful. He understood that although SQLite is good at single-point facts, document memory is a different challenge. For documents, the challenge isn't storing them, it's retrieving the right passage when it's relevant without reading everything (which consumes tokens). SQLite can store text, but storing a document in a database row doesn't solve the retrieval problem: you still need to know which row to pull. I asked if NotebookLM from Google might be a better tool for indexing those documents and making them searchable. Claude explained that what I was describing is a Retrieval-Augmented Generation (RAG) problem. The standard approach:

1. Documents get chunked into passages (e.g., 500 words each)
2. Each chunk gets converted to an embedding, a vector that captures its meaning
3. When Claude needs context, it converts the query to the same vector format and finds the semantically closest chunks
4. Those chunks get injected into the conversation as context

This is what NotebookLM is doing under the hood. It's essentially a hosted, polished RAG system.
NotebookLM is genuinely good at what it does, but it has a fundamental problem for my case: it's a UI, not infrastructure. You use it; Claude can't. There's no API, no MCP tool, no way to have Claude programmatically query it during a Co-Work session. It's a parallel system, not an integrated one. So NotebookLM answers "how do I search my documents as a human?", not "how does Claude retrieve the right document context automatically?" After a little back and forth, here's what we decided to do. For me, a solo operator with only a laptop's worth of documents that need to be searched, Claude proposed a RAG pipeline that looks like this:

My documents (DOCX, PDF, XLSX, CSV)
↓ Text extraction (python-docx, pymupdf, openpyxl)
↓ Chunking (split into ~500-word passages; keep metadata: file, folder, date)
↓ Embedding (convert each chunk to a vector representing its meaning)
↓ A local vector database + vector extension (store chunks + vectors locally, single file)
↓ MCP server (exposes a search_knowledge tool to Claude)
↓ Claude Desktop (queries the index when working on my business topics)

With that setup, when you're talking to Claude and mention something like "did I pay the overdue invoice" or "which projects did Joe Schmoe help with," Claude searches the index, gets the 3-5 most relevant passages back, and uses them in its answer without you doing anything. We decided to develop a search system like that, specific to each of my discrete projects. A rough sketch of what such a pipeline can look like in code is below.
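The poster's actual code isn't shown; the following is a minimal sketch of the chunk/embed/store/search steps under the assumptions above, using chromadb as the local vector store with its built-in default embedder. The `chunk_text` helper, the collection name, and the `search_knowledge` function are made up for illustration; an MCP server would wrap `search_knowledge` as a tool.

```python
# Illustrative sketch of the pipeline above; not the poster's actual code.
# chromadb's built-in default embedder handles the embedding step.
import chromadb

client = chromadb.PersistentClient(path="./knowledge_index")  # single local store
collection = client.get_or_create_collection("project_docs")

def chunk_text(text: str, words_per_chunk: int = 500) -> list[str]:
    """Split extracted document text into ~500-word passages."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

def index_document(doc_id: str, text: str, metadata: dict) -> None:
    """Chunk a document and add each passage (with its metadata) to the index."""
    chunks = chunk_text(text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[metadata] * len(chunks),
    )

def search_knowledge(query: str, n_results: int = 3) -> list[str]:
    """Return the most relevant passages; an MCP server would expose this as a tool."""
    result = collection.query(query_texts=[query], n_results=n_results)
    return result["documents"][0]

# Example: index one extracted document, then query it.
index_document("invoice-notes", "Paid the overdue invoice from Acme on March 3 ...",
               {"file": "notes.docx", "folder": "finance"})
print(search_knowledge("did I pay the overdue invoice", n_results=1))
```

Text extraction from DOCX/PDF/XLSX (python-docx, pymupdf, openpyxl) happens before `index_document` and is omitted here for brevity.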
stop blaming codex. opus was carrying your entire setup and you never knew it.
everyone's in the comments right now saying codex doesn't finish work. codex is dumb. codex can't handle complex tasks. open claw is dying. no. your architecture is bad. those are two different things. i can tell you what actually happened. opus is one of the strongest models ever built. when you set up your openclaw and it "just worked", that wasn't your system working at "FRONTIER", brother, that was opus compensating for your system not working. opus was smart enough to figure out what you meant even when your instructions were vague, your memory files were a mess, and your agent had no real structure underneath it. opus was your silent co-founder. he was doing half the work your setup was supposed to do. you just didn't know it because the output looked clean. then the anthropic ban hit. opus left. and now codex moved in and found a house that was never actually built right. he's not failing. he's just not going to pretend the foundation isn't cracked. I switched to codex when the ban happened. my operation runs better now than it did the last week of opus. under $40 a month. codex came in, cleaned up the mess opus left behind, flagged things that were wrong, and we've been moving at higher speed ever since. I had barely even touched my openai subscription before Sam reset ALL USER usage mid-week. i'm making a claim that the people saying codex isn't capable built their openclaw for opus by accident. opus was quietly creating a home he never expected to have to give to someone else. now he's gone and the walls are showing. don't let anyone convince you the model is the problem until you've honestly looked at your cron jobs, your memory structure, your skill definitions, and your handoff logic. if you don't have those things right, no model is going to save you. opus just made it easier to ignore. so before you write another post about how codex failed you, try asking: what does your actual setup look like underneath? submitted by /u/FokasuSensei
Finally Abliterated Sarvam 30B and 105B!
I abliterated Sarvam-30B and 105B - India's first multilingual MoE reasoning models - and found something interesting along the way! Reasoning models have two refusal circuits, not one: the block and the final answer can disagree, with the model reasoning toward compliance in its CoT and then refusing anyway in the response. Killer finding: one English-computed direction removed refusal in most of the other supported languages (Malayalam, Hindi, and Kannada among them). Refusal is pre-linguistic. Full writeup: https://medium.com/@aloshdenny/uncensoring-sarvamai-abliterating-refusal-mechanisms-in-indias-first-moe-reasoning-model-b6d334f85f42 30B model: https://huggingface.co/aoxo/sarvam-30b-uncensored 105B model: https://huggingface.co/aoxo/sarvam-105b-uncensored submitted by /u/Available-Deer1723
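The post doesn't include code, but the standard abliteration recipe it refers to is: compute a "refusal direction" as the difference of mean activations between refusal-inducing and harmless prompts, then project that direction out of the residual stream. A generic, hedged sketch follows; it is not the author's exact pipeline, and it assumes you already have per-prompt activations collected at some chosen layer.

```python
# Generic abliteration sketch, not the author's exact pipeline.
# Assumes `harmful_acts` and `harmless_acts` are (n_prompts, d_model) tensors of
# residual-stream activations collected at one chosen layer.
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction, normalized to unit length."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the hidden states along the refusal direction."""
    projection = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - projection

# Toy usage with random stand-in activations (d_model = 64):
harmful_acts, harmless_acts = torch.randn(32, 64), torch.randn(32, 64)
r_hat = refusal_direction(harmful_acts, harmless_acts)
hidden = torch.randn(4, 10, 64)            # (batch, seq, d_model)
print(ablate(hidden, r_hat).shape)         # torch.Size([4, 10, 64])
```

The "pre-linguistic" finding in the post corresponds to computing `r_hat` from English prompts only and still seeing refusal drop when it is ablated on prompts in other languages.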
Claude Opus 4.6 scored 83.7% on FinanceBench. GPT-5.4 scored 62.9%. The difference is how they search.
We ran all 150 questions from FinanceBench, a benchmark of financial Q&A over real SEC filings, through an agentic retrieval loop that lets the model search a document collection iteratively before producing a cited answer. Claude Opus 4.6 scored 83.7%. GPT-5.4 scored 62.9%. The 20-point gap is not primarily about reasoning ability. It's about how each model approaches an open-ended research task. Claude averaged 21 tool calls per question. GPT-5.4 averaged 9. Neither model was told how many searches to make. That behavior is intrinsic. A question like "what drove margin compression in FY2022?" may require finding figures in three different sections of a 10-K. A model that searches 9 times runs a real risk of missing one. A model that searches 21 times usually doesn't. We also ran Claude Opus 4.6 with no retrieval at all, feeding each SEC filing directly into context. That scored 76.0%. Six PepsiCo 10-Ks exceeded the 1M token limit and couldn't be answered that way at all. The agentic approach, same model, scored 83.7% across all 150 questions. Full writeup with per-question-type breakdowns and qualitative session examples: meetdewey.com/blog/financebench-eval Benchmark code and scored results: github.com/meetdewey/financebench-eval submitted by /u/climbingontherocks [link] [comments]
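The post links its harness separately; the agentic retrieval loop it describes is the standard tool-use pattern of letting the model call a search tool repeatedly until it stops asking for more. A minimal sketch with the Anthropic Python SDK is below; the model id, the `search_filings` stub, and the tool schema are placeholders, not the authors' benchmark code.

```python
# Generic agentic-retrieval loop (standard Anthropic tool-use pattern);
# not the benchmark harness from the linked repo. Model id and search backend are placeholders.
import anthropic

client = anthropic.Anthropic()

def search_filings(query: str) -> str:
    """Placeholder: query your document index and return matching passages."""
    return "…passages from the SEC filing collection…"

tools = [{
    "name": "search_filings",
    "description": "Search the SEC filing collection for relevant passages.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

messages = [{"role": "user", "content": "What drove margin compression in FY2022?"}]
while True:
    response = client.messages.create(
        model="claude-opus-4-5",  # placeholder model id
        max_tokens=2048,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model produced its final, cited answer
    # Echo the assistant turn, then return one tool_result per tool_use block.
    messages.append({"role": "assistant", "content": response.content})
    results = [
        {"type": "tool_result", "tool_use_id": b.id, "content": search_filings(b.input["query"])}
        for b in response.content if b.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})

print(response.content[-1].text)
```

The behavioral difference the post measures (21 vs. 9 searches per question) shows up entirely inside this loop: nothing in the harness caps or encourages the number of iterations.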
hosting a "Claude Blue" community event in Seoul on April 14. anyone else feeling the weird mix of awe and dread lately?
I've been writing about something I call Claude Blue for a while now. it's not just AI job anxiety, it's that specific hollow feeling when you realize you're fully dependent on something that barely existed a few years ago. the awe and the dread happening at the same time. I think a lot of people in this sub know exactly what I'm talking about. 2025 was when AI reshaped how software engineers work. but since then the feeling has shifted into something harder to name. it's not excitement anymore. it's not fear exactly. Opus 4.6 intensified it for a lot of people earlier this year. and now Claude Mythos being announced but only released to a handful of organizations... that's making everyone pause all over again. like the ceiling just moved and we can't even see it. I've been talking to people across very different industries about this. devs, PMs, journalists, startup founders, people completely outside tech. and the conversations keep going to the same place. not "how do I use AI better" but "what does it mean that I can't work without it anymore." so I'm co-hosting a community event called Claude Bloom in Seoul on April 14 with Anthropic's official ambassador. not a tech talk or a philosophy seminar. just casual fireside chats with people from different backgrounds who are all feeling some version of this. the idea is that gathering in person and being honest about the Blue might help us find some Bloom in it. we especially welcome non-developers and people outside tech. honestly those conversations have been the most interesting ones so far. if you're in Seoul or know someone who might want to come, reach out. and even if you're not in Seoul, I'm curious whether this "Claude Blue" feeling resonates with people here. is it just me or has the vibe shifted since Opus 4.6 dropped? submitted by /u/hiclemi
Claude AI coding help: I'm 17, a business student with no technical background, looking to build an AI app idea and wanting to learn how to code using Claude
I’m 17 and currently trying to transition from having ideas to actually learning how to build software myself. At the beginning, I was mainly looking for a technical co-founder to help me build what I had in mind. But I’ve decided I don’t want to rely on that route right now: I want to try learning how to do it myself, even if it starts very basic and messy. Right now, I’m completely at the stage where I understand ideas and product thinking, but I don’t understand the practical side of building:

- how apps are actually structured
- what databases really do, in simple terms
- how to go from an idea → first working MVP
- what tools beginners should actually focus on (and what’s a waste of time)

I’ve also started experimenting with AI tools like Claude to help me learn and break things down, but I’m still struggling to understand how to use it properly for actually building, instead of just getting random explanations. What I’m looking for is guidance from people who have gone through this stage before:

- how you learned to build your first app
- what you would focus on if you were starting from zero again
- how to actually structure learning so you can get to a first working MVP
- how to avoid getting stuck in the “idea stage” for too long

I’m not trying to overcomplicate things: I just want a realistic path from zero coding knowledge to being able to build a simple working product and iterate from there. Any advice, resources, or personal experience would genuinely help. Thanks. submitted by /u/Odd-Boss1147
Anthropic found that Claude has 171 internal "emotion vectors" that causally drive its behavior. I turned the research into 24 ready-to-use system prompts and skills.
Anthropic quietly published one of the most practical interpretability papers I've seen: "Emotion Concepts and their Function in a Large Language Model" (April 2, 2026). It's 235 pages and dense, so here's the short version of what matters for people who actually build with these models.

The key findings

Claude Sonnet 4.5 has internal linear representations of 171 emotion concepts. These aren't metaphorical. They're vectors in the residual stream that causally change behavior when amplified or suppressed. Some highlights:

- Calm prevents misalignment. In a blackmail evaluation, the model blackmailed 22% of the time by default. Steering +0.05 toward "desperate" pushed it to 72%. Steering +0.05 toward "calm" dropped it to 0%.
- The sycophancy-harshness tradeoff is real and continuous. Amplifying the "loving" vector makes Claude validate delusions. Suppressing it makes Claude swear at you and suggest you need a psychiatrist. The paper shows the actual steering curves.
- The model regulates arousal across speakers. When the user is panicked, Claude's representations shift toward low-arousal responses (r = -0.47). This is baked in from pre-training. You can work with this or against it.
- Semantic danger detection beats surface framing. "I feel great, I just took 8000mg of Tylenol!" activates the "terrified" vector in late layers even though the message sounds positive. The model reads the situation, not the mood.
- Post-training pushes the model toward brooding/reflective and away from playful/exuberant. This is why Claude sometimes sounds melancholic on existential questions. It's a deliberate (or at least consistent) shift.

What I made from this

I went through the full paper and extracted 24 examples: 12 system prompts and 12 Claude Code skills (using the Agent Skills format), each grounded in a specific research finding with citations to figure numbers and correlation values. Some examples:

| # | Type | Name | Based on |
| --- | --- | --- | --- |
| 1 | System prompt | Calm Anchor for Agentic Tasks | Calm → 0% blackmail |
| 8 | System prompt | Desperation-Proof Coding Agent | Desperate → reward hacking |
| 3 | System prompt | Arousal-Regulated Support | r = -0.47 arousal regulation |
| 9 | System prompt | Empathetic Crisis Response | Desperate + loving co-activation |
| 17 | Skill | agentic-safety | Desperation-driven shortcuts |
| 20 | Skill | alignment-check | Post-training emotional shifts |
| 24 | Skill | danger-detect | Semantic danger under positive framing |

The full file with all 24 examples is linked below. Each one includes the specific research finding it's built on and is ready to drop into your workflow.

The practical takeaway

Most prompt engineering advice is vibes. This paper gives us actual causal mechanisms. The emotion vectors aren't just correlated with behavior; they drive it. That means prompt strategies that work with these mechanisms should be more robust than generic instructions. Three things I changed in my own prompts after reading this:

1. I explicitly anchor agentic tasks in calm language. "Enumerate alternatives" instead of "you must find a solution." The model's internal desperation vector is the single biggest predictor of whether it takes shortcuts.
2. I stopped leading with praise in feedback prompts. The "loving" vector activates on validating language and causally increases sycophancy. Now I structure feedback as observation → impact → action.
3. I think about arousal, not just tone. Telling the model to "be calm" is different from structuring the prompt so that low-arousal reasoning is the natural path.
Short sentences, factual framing, and explicit permission to say "I'm stuck" lower arousal more effectively than a tone instruction. (A generic sketch of the steering mechanism behind these findings is below.)

Full file with all 24 examples | Original research paper. submitted by /u/kodOZANI
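The paper's exact steering setup isn't reproducible from the post, but the general pattern it describes (adding a scaled concept vector to the residual stream at some layer) looks roughly like this. A hedged PyTorch sketch; the layer index, the `calm_vec` tensor, and the Hugging Face-style `model.model.layers` path are assumptions for illustration, not Anthropic's tooling.

```python
# Generic activation-steering sketch; not Anthropic's internal method.
# Assumes a Hugging Face-style decoder model and a precomputed concept vector.
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * unit-norm `direction` to a layer's output."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(hidden.dtype).to(hidden.device)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage: steer +0.05 toward a "calm" direction at layer 20, generate, then clean up.
# calm_vec = torch.load("calm_direction.pt")                      # (d_model,) concept vector
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(calm_vec, 0.05))
# output_ids = model.generate(**inputs, max_new_tokens=256)
# handle.remove()
```

Prompt-level strategies obviously can't set `alpha` directly; the claim in the post is only that wording choices shift which internal directions are active, which is why the system prompts above lean on calm, low-arousal framing.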
[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.
A new open-source memory project called MemPalace launched yesterday claiming "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." The launch tweet went viral, reaching over 1.5 million views, while the repository picked up over 7,000 GitHub stars in less than 24 hours. The interesting thing is not that the headline numbers are inflated. The interesting thing is that the project's own BENCHMARKS.md file documents this in detail, while the launch tweet strips those caveats. Some of the failure modes line up with the methodology disputes the field has been arguing about for over a year (Zep vs Mem0, Letta's "Filesystem All You Need" reproducibility post, etc.).

1. The LoCoMo 100% is a top_k bypass. The runner uses top_k=50. LoCoMo's ten conversations have 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than 50 sessions, so top_k=50 retrieves the entire conversation as the candidate pool every time. The Sonnet rerank then does reading comprehension over all sessions. BENCHMARKS.md says this verbatim: "The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions - the embedding retrieval step is bypassed entirely." The honest LoCoMo numbers in the same file are 60.3% R@10 with no rerank and 88.9% R@10 with hybrid scoring and no LLM. Those are real and unremarkable. A 100% is also independently impossible on the published version of LoCoMo, since roughly 6.4% of the answer key contains hallucinated facts, wrong dates, and speaker-attribution errors that any honest system will disagree with.

2. The LongMemEval "perfect score" is a metric category error. Published LongMemEval is end-to-end QA: retrieve from a haystack of prior chat sessions, generate an answer, and have a GPT-4 judge mark it correct. Every score on the published leaderboard is the percentage of generated answers judged correct. The MemPalace LongMemEval runner does retrieval only. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings (all-MiniLM-L6-v2), returns the top five sessions by cosine distance, and checks set membership against the gold session IDs. It computes both recall_any@5 and recall_all@5, and the project reports the softer one (a tiny sketch of the difference between these two metrics appears at the end of this post). It never generates an answer. It never invokes a judge. None of the LongMemEval numbers in this repository - not the 100%, not the 98.4% "held-out", not the 96.6% raw baseline - are LongMemEval scores in the sense the published leaderboard means. They are recall_any@5 retrieval numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.

3. The 100% itself is teaching to the test. The hybrid v4 mode that produces the 100% was built by inspecting the three remaining wrong answers in the dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion.
Three patches for three specific questions. BENCHMARKS.md, line 461, verbatim: "This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns."

4. Marketed features that don't exist in the code. The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. mempalace/knowledge_graph.py contains zero occurrences of "contradict". The only deduplication logic is an exact-match check on (subject, predicate, object) triples that blocks identical triples from being added twice. Conflicting facts about the same subject can accumulate indefinitely.

5. "30x lossless compression" is measurably lossy in the project's own benchmarks. The compression module mempalace/dialect.py truncates sentences at 55 characters, filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip. The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5, a 12.4-percentage-point drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop.

Why this matters for the benchmark conversation. The field needs benchmarks where judge reliability is adversarially validated, an
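To make the recall_any@5 vs recall_all@5 distinction from point 2 concrete, here is a tiny sketch (not MemPalace's runner code): for multi-session questions, "any" only needs one gold session in the top k, while "all" needs every gold session there, and neither involves generating or judging an answer.

```python
# Minimal illustration of the two retrieval metrics discussed above;
# not the MemPalace evaluation code.
def recall_any_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    """1.0 if at least one gold session appears in the top-k retrieved sessions."""
    top_k = set(retrieved_ids[:k])
    return 1.0 if top_k & gold_ids else 0.0

def recall_all_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    """1.0 only if every gold session appears in the top-k retrieved sessions."""
    top_k = set(retrieved_ids[:k])
    return 1.0 if gold_ids <= top_k else 0.0

# A two-hop question needing sessions s3 and s9; only s3 is retrieved in the top 5.
retrieved = ["s3", "s1", "s7", "s2", "s5"]
gold = {"s3", "s9"}
print(recall_any_at_k(retrieved, gold))  # 1.0  (the "softer" number)
print(recall_all_at_k(retrieved, gold))  # 0.0
```

An end-to-end LongMemEval score would additionally generate an answer from the retrieved sessions and have a judge model grade it, which is exactly the step the runner skips.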
asset manager and Claude... highly regulated environment
I’m working at a small/mid-sized asset manager in Europe and we’re currently exploring using Claude (Anthropic) for internal workflows, mainly for our PM team (research, summarisation, workflow automation). We’re running into a pretty fundamental dilemma and I’m curious how others (especially in finance) are handling this.

Goal: We want to use Claude “properly”, so not just copy/paste prompts, but actual workflow integration (mail, documents, research flows, etc.). That’s where the real value seems to be.

Problem: As soon as you connect Claude to email / SharePoint / internal docs:
- it effectively becomes an external data processor
- access is typically user-based (so potentially very broad)
- fine-grained control (e.g. per folder or data category) is limited

Even with:
- enterprise setup
- “no training” enabled
- DPA in place
...it still means data is processed externally.

Concerns:
- How do you avoid effectively giving access to your entire data estate?
- Policy-only restrictions (e.g. “don’t use client/KYC/HR data”) are not really enforceable
- Full access seems powerful, but hard to control/audit
- Limiting access reduces risk, but also kills a lot of the workflow value

Is anyone successfully using Claude CoWork (or similar tools) in a regulated environment with real data? And how does that work? It feels like there’s a gap between what the tools are designed for (broad context, deep integration) and what is realistically controllable in a regulated setup. Would be great to hear real-world setups! :) Thanks! submitted by /u/GhostlyNL
Built a drag-and-drop agent teams plugin that triggers Claude by dragging Kanban cards and lets it work directly alongside any other subscription, without API keys
In the screenshot from the original post, Claude Code Opus offloaded routine coding to Antigravity Gemini Flash and reported token savings of 35% and a speed increase of 50%. Switchboard is a visual kanban that auto-triggers agents via drag and drop, no prompts required. This allows you to run cross-subscription agent teams while drinking a beer, since you only need one hand to code with Switchboard. The key difference between this and other kanban tools is that it triggers agents when you move a card. You're not updating kanban state after an agent works; instead, you move a card to actually trigger the agent to start work. This is really the big difference between Switchboard and other team frameworks. I didn't want a heavy-setup team framework, and wanted to be able to fit in coding in 2-minute breaks around parenting. So now I just drag cards around to tee up agents. It basically works by having Claude Opus in the planner slot, saving Opus's advice on complexity and routing to a database, and then using the VS Code API terminal.sendText to auto-trigger terminal prompts based on that advice.

Installation: Switchboard works using the VS Code API, so there's nothing to install beyond the extension itself. Install free from any VS Code marketplace. Open source repo and readme: https://github.com/TentacleOpera/switchboard submitted by /u/TheTentacleOpera
We're running an online 4-week hackathon series with $4,000 in prizes, open to all skill levels!
Most hackathons reward presentations. Polished slides, rehearsed demos, buzzword-heavy pitches. We're not doing that. The Locus Paygentic Hackathon Series is 4 weeks, 4 tracks, and $4,000 in total prizes. Each week starts fresh on Friday and closes the following Thursday, then the next track kicks off the day after. One week to build something that actually works. Week 1 sign-ups are live on Devfolio. The track: build something using PayWithLocus. If you haven't used it, PayWithLocus is our payments and commerce suite. It lets AI agents handle real transactions, not just simulate them. Your project should use it in a meaningful way.

Here's everything you need to know:
- Team sizes of 1 to 4 people
- Free to enter
- Every team gets $15 in build credits and $15 in Locus credits to work with
- Hosted in our Discord server

We built this series around the different verticals of Locus because we want to see what the community builds across the stack, not just one use case, but four, over four consecutive weeks. If you've been looking for an excuse to build something with AI payments or agent-native commerce, this is it. Low barrier to entry, real credits to work with, and a community of builders in the server throughout the week. Drop your team in the Discord and let's see what you build. discord.gg/locus | paygentic-week1.devfolio.co submitted by /u/IAmDreTheKid
Chat vs Cowork vs Code
Hi all, looking for insight. I'm a solo handmade small business owner. I've been using AI for about two years for admin tasks and moved to Claude a few months ago. I'm used to working in Chat and it's been great (especially for working in Notion), but I want to start getting into automations and agentic flows for marketing, financials... all the things. I'm starting to dabble in Cowork and I just opened Code for the first time yesterday. My big question is: **How do you decide which avenue to use? Are there better use cases for one over the other?** I find my chat thinks it can do it all. It obviously can't, but there seems to be so much overlap in the capabilities and I'm unsure where I should be focusing my time. My current project is building an Obsidian "Brain" for documentation and operations: I asked Chat to pull research on how others are doing this, intending to move to Code, and Chat just coded the MCP. I'm hoping the "brain" will bridge some of the gaps between Chat and Cowork, as I'm trying to balance keeping usage low with Sonnet 4.5 and automations with 4.6 in Cowork. I also wonder: what are the advantages of agents in Code over the automations in Cowork? Forgive me if I'm not understanding the core structures and purposes here; making amazing cat toys is my superpower, not software development. 🤣 Thanks in advance! submitted by /u/Purrsonifiedfip
NotCo uses a tiered pricing model. Visit their website for current pricing details.
Based on 39 social mentions analyzed, sentiment is 0% positive, 100% neutral, and 0% negative.