Harness is a unified, end-to-end AI software delivery platform to manage the SDLC using purpose-built AI agents.
Based on the available social mentions, users appear to view Harness AI positively as a development tool, with several mentions of successful projects built using it, including open-source tools that gained significant community traction (391+ GitHub stars). The platform seems particularly valued for AI-assisted development and building agent architectures, with users appreciating its capability to handle complex engineering problems. However, there's limited discussion about pricing or major complaints in these mentions. The overall sentiment suggests Harness AI is seen as a capable tool for developers working on AI agents and automation projects, though the sample size is relatively small for a comprehensive assessment.
Mentions (30d)
16
13 this week
Reviews
0
Platforms
2
Sentiment
0%
0 positive
Features
Industry
information technology & services
Employees
1,700
Funding Stage
Series E
Total Funding
$802.1M
How to Make Claude Code Work Smarter — 6 Months Later (Hooks → Harness)
Hello, Orchestrators

I wrote a post about Claude Code Hooks last November, and seeing that this technique is now being referred to as "Harness," I was glad to learn that many others have been working through similar challenges. If you're interested, please take a look at the post below: https://www.reddit.com/r/ClaudeAI/comments/1osbqg8/how_to_make_claude_code_work_smarter/

At the time, I had planned to keep updating that script, but as the number of hooks increased and managing the lifecycle became difficult due to multi-session usage, I performed a complete refactoring. The original hook script collection has been restructured into a Claude Code plugin called "Pace." Since it's tailored to my environment and I'm working on other projects simultaneously, the code hasn't been released yet. Currently set to CSM, but will be changed to Pace.

Let's get back to Claude Code. My philosophy remains the same as before: Claude Code produces optimal results when it is properly controlled and given clear direction. Of course, this doesn't mean it immediately produces production-grade quality. However, in typical scenarios, when creating a program with at least three features by adjusting only CLAUDE.md and AGENTS.md, the difference in quality is clearly noticeable compared to an uncontrolled setup.

The current version of Pace is designed to be more powerful than the restrictions I previously outlined and to provide clearer guidance on the direction to take. It provides CLI tools tailored to each section by default, and in my environment, Claude Code's direct use of Linux commands is restricted as much as possible. As I mentioned in my previous post, when performing the same action multiple times, Claude Code constructs commands arbitrarily. At one point, I asked Claude Code: "Why do you use different commands when the result is the same, and why do you sometimes fail to execute the command properly, resulting in no output?" This is what came back: "I'm sorry. I was trying to proceed as quickly and efficiently as possible, so I acted based on my own judgment rather than following the instructions."

This response confirmed my suspicion. Although LLMs have made significant progress, at least in my usage, they still don't fully understand the words "efficient" and "fast." This prompted me to invest more time refining the CLI tools I had previously implemented.

Currently, my Claude Code blocks most commands that could break session continuity or corrupt the code structure — things like modifying files with sed or find, arbitrarily using nohup without checking for errors, or running sleep 400 to wait for a process that may have already failed. When a command is blocked, alternative approaches are suggested. (This part performs the same function as the hooks in the previous post, but the blocking methods and pattern recognition have been significantly improved internally.) In particular, as I am currently developing an integrated Auth module, this feature has made a clear difference when using test accounts to build and test the module via Playwright scripts — both for cookie-based and Bearer-based login methods.

CLI for using test accounts

Before creating this CLI, it took Claude Code over 10 minutes just to log in for module testing. The module is being developed with all security measures — device authentication, session management, MFA, fingerprint verification, RBAC — enabled during development, even though these are often skipped in typical workflows. The problem is that even when provided with account credentials in advance, Claude Code uses a different account every time a test runs or a session changes. It searches for non-existent databases, recreates users it claims don't exist, looks at completely wrong databases, and arbitrarily changes password hashes while claiming the password is incorrect — all while attempting to find workarounds, burning through tokens, and wasting context. And ultimately, it fails.

That's why I created a dedicated CLI for test accounts. This CLI uses project-specific settings to create accounts in the correct database using the project's authentication flow. It activates MFA if necessary, manages TOTP, and holds the device information required for login. It also includes an Auto Refresh feature that automatically renews expired tokens when Claude Code requests them. Additionally, the CLI provides cookie-injection-based login for Playwright script testing, dynamic login via input box entry, and token provisioning via the Bearer method for curl testing. By storing this CLI reference in memory and blocking manual login attempts while directing Claude Code to use the CLI instead, it was able to log in correctly with the necessary permissions and quickly succeed in writing test scripts.

It's difficult to cover all features in this post, but other CLI configurations follow a similar pattern. The core idea is to pre-configure the parts that Claude Code would exec
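The command-blocking behavior described above maps naturally onto a Claude Code PreToolUse hook. A minimal sketch, assuming the hook receives JSON on stdin and that exit code 2 blocks the call; the pattern list and suggestion strings are my own illustrations, not Pace's actual rules:

```python
import json
import re
import sys

# Illustrative blocklist in the spirit of the post (not Pace's real rules):
# each entry pairs a command pattern with a suggested alternative approach.
BLOCKLIST = [
    (re.compile(r"\bsed\s+-i\b"), "Use the project's file-edit CLI instead of in-place sed."),
    (re.compile(r"\bnohup\b"), "Use the managed background-run CLI so errors are checked."),
    (re.compile(r"\bsleep\s+\d{3,}\b"), "Poll the process-status CLI instead of long sleeps."),
]

def blocked_reason(command: str):
    """Return an alternative-suggestion message if the command is blocked, else None."""
    for pattern, suggestion in BLOCKLIST:
        if pattern.search(command):
            return f"Blocked `{command}`. {suggestion}"
    return None

def main() -> int:
    payload = json.load(sys.stdin)  # hook payload from Claude Code
    if payload.get("tool_name") != "Bash":
        return 0
    reason = blocked_reason(payload.get("tool_input", {}).get("command", ""))
    if reason:
        print(reason, file=sys.stderr)  # message fed back to the model
        return 2                        # exit code 2 = block the tool call
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The key design point the post makes is that blocking alone is not enough: the hook's stderr message should name the sanctioned alternative, so the agent is redirected rather than left to improvise another workaround.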
I built an MCP server that turns Claude Code into a multi-agent review loop with per-agent skill learning
I've spent the last two months building gossipcat — an MCP server for Claude Code that runs a multi-agent review loop with per-agent skill learning — and I built it with Claude Code.

What it actually does

You install it as an MCP server (single 1.6 MB bundled file, drop it into your Claude Code MCP config and you're running). It lets Claude Code dispatch work to a portfolio of agents — Claude Code subagents run natively via the Agent tool, plus relay workers for Gemini, OpenClaw, and any OpenAI-compatible endpoint. Every agent that returns a finding has to cite file:line. Peer agents verify those citations against the actual source code. Verified findings and caught hallucinations get recorded as signals. Over time those signals build per-agent, per-category competency scores — trust boundaries, concurrency, data integrity, injection vectors, etc. A dispatcher routes future tasks to the agents strongest in each category.

The part I didn't plan for

When an agent's accuracy drops in a category, the system reads their recent hallucinations and generates a targeted skill file — a markdown prompt intervention tailored to the exact mistakes they've been making — and injects it on the next dispatch. No fine-tuning. No weights touched. The "policy update" is a file under .gossip/agents/ /skills/. It's effectively in-context reinforcement learning at the prompt layer, with reward signals grounded in real source code instead of a judge model.

Why I built it (the build story)

I didn't start here. Two months ago I just wanted to stop being a bottleneck for code review. I was running Claude Code for everything, but every non-trivial review produced a mix of real findings and confidently hallucinated ones, and I kept having to manually verify each claim against the actual file to know which was which. Single-agent review had a ceiling and it was my patience.

First attempt was the obvious one: run two agents in parallel, compare outputs, trust what they agreed on. That caught some hallucinations but missed a lot — two agents can confidently agree on something neither of them checked. It also didn't scale the thing I actually wanted to scale: verification.

The shift was realizing that verification could be mechanical, not subjective. If every finding has to cite file:line and peers have to confirm the citation against source, you don't need a judge model at all. You need a format contract and a reader. That's when the whole thing started to make sense as a pipeline: findings → citations → peer verification → signals.

Once signals existed, it was obvious they should feed competency scores. Once scores existed, it was obvious they should steer dispatch. Once dispatch was steered, it was obvious that agents accumulating hallucinations in a category should get a targeted intervention. Each step felt like the previous step forcing my hand, not like a plan.

A few things I learned along the way that might transfer to your own projects:

- Grounded rewards beat LLM-as-judge, even for subjective work. The moment I made reviewers verify mechanical facts (does this file:line exist, does it say what the finding claims) instead of grading quality, the feedback loop got dramatically cleaner. Agents stopped disagreeing about taste and started disagreeing about reality. Reality has a ground truth; taste doesn't.
- Closing the loop is 10x harder than opening it. Writing verdicts is easy. Actually reading them back in the forward pass is where most agent systems quietly stay open. I caught my own project doing this in a consensus review today — the next section is that story.
- You don't need fine-tuning to improve agents. The "policy update" in this system is literally a markdown file. When an agent fails, the system reads their recent mistakes and writes them a targeted skill file that gets injected on their next dispatch. No weights, no training infra, no gradient anything. It's in-context learning with actual memory, and it works surprisingly well.
- Two months of iterative discovery beat six months of planning. Every major feature in gossipcat exists because an earlier feature made it obvious. I have a docs/ folder full of specs I wrote for features I never built, and none of the features I actually shipped are in there.

How Claude Code helped build this

The whole project was built with Claude Code. I used it as my primary pair for two months — it wrote the vast majority of the TypeScript, helped me design the consensus protocol and the signal pipeline, debugged its own output more times than I can count, and generated large parts of the skill-engine and cross-review infrastructure. Today, while I was drafting this post, I ran a consensus review on the system's own effectiveness tracking — Claude Code (Sonnet and Opus sub-agents as two separate reviewers) caught two critical bugs Claude Code main agent missed, I fixed them with Claude Code's help, tests pass, and the fix shipped 20 minutes before I finished this draft. The
Layman: Agentic Insight and Oversight (same same but different)
What's the most common duplicate project on r/ClaudeAI? Usage trackers. What's the second most common? AI monitors. Does Layman do those things? Yes, of course. So what makes it different?

Layman's Dashboard, Flowchart, and Logs view (with Layman's Terms and Analysis examples)

Like many similar tools, Layman runs as a web service in a container on your local machine. It installs hooks and accesses harness logs to "look over your shoulder," then leverages a secondary AI instance to help keep your multiple sessions, sub-agents, and alternate harnesses in line. So, the short answer:

Drift Monitoring. Repeatedly named as one of the most frustrating issues for heavy Claude Code users. Layman takes into account all user prompts issued to CC as well as current project and global CLAUDE.md instructions, and at configurable intervals scores the current degree of "drift" from your goals and the rules you have established. You can optionally receive warning notifications or place a block when different thresholds are reached.

Risk Analysis. Layman classifies all tool calls and operations with a "risk" level based on simple, consistent criteria (such as read-only, writing, modifying, network access, deletion, etc.) and can automatically analyze the AI agent's current intended action, the overall goal or purpose behind that intention, and summarize the safety and security implications at stake.

Layman's Terms. The eponymous origin of the tool, offering a plain-language (and, if possible, non-technical) explanation of the purpose of any given tool call. It can summarize what was performed at the session level as well, helpful for later recall and understanding after some time has passed.

Vibe coders aside, shouldn't a professional developer already know what their tools are doing before they grant permission? Yes, of course. But when you are operating at scale and (say) that TypeScript project you are polishing needs to look up some JSON value and your AI agent writes a one-off Python script to parse it out, it can be helpful to have an "extra pair of eyes" taking a look before you effectively begin yet another code review.

Meanwhile, the typical features you might expect are included: Session Recording (opt-in is required first for data tracking, and there is no telemetry to worry about), Bookmarking and Search, PII filtering (including PATs and API keys), File and URL access tracking, and a handy Setup Wizard that helps get those hooks installed in the first place and walks you through configuration of core capabilities. Did I mention that besides Claude Code it supports Codex, OpenCode, Mistral Vibe, and Cline (with more to come)? Whether you use these for local agents or as an alternative when hitting session limits, Layman can monitor and track them all at once.

But wait, doesn't a "secondary AI instance" just end up wasting tokens? My Precious? (erm...) Our precious, precious tokens? When session limits already hit so hard? It turns out these algorithms do not require nearly the level of "intelligence" you might want for your planning and coding sessions themselves. Personally, I keep an instance of Qwen3-Coder-Next running locally via llama.cpp server on my system's GPU to field those calls, with no discernible impact on system performance. And when a local LLM is not available, Haiku does the job excellently (now you have a reason to use it). You absolutely do not need anything more resource-intensive to get the job done.

Now you have a complete picture.

GitHub repository: https://github.com/castellotti/layman
License: MIT

submitted by /u/jigsaw-studio
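The "simple, consistent criteria" behind the Risk Analysis feature above can be sketched as a rule table. The tiers and keyword rules here are my own illustration, not Layman's actual criteria:

```python
# Ordered rules: the first tier whose keywords match wins.
# These tiers and needles are illustrative guesses, not Layman's real list.
RISK_RULES = [
    ("high",   ("rm -rf", "rm ", "delete", "drop table")),   # destructive
    ("medium", ("curl", "wget", "http://", "https://")),     # network access
    ("medium", ("write", "edit", "chmod", "mv ")),           # mutation
]

def classify_risk(tool_name: str, command: str) -> str:
    """Bucket a tool call into read-only / medium / high using consistent rules."""
    haystack = f"{tool_name} {command}".lower()
    for level, needles in RISK_RULES:
        if any(n in haystack for n in needles):
            return level
    return "read-only"
```

A real implementation would work from structured tool metadata rather than substring matching, but the point stands: the classification can be cheap and deterministic, which is why a small local model (or Haiku) is enough to field these calls.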
Burned 5B tokens with Claude Code in March to build a financial research agent.
TL;DR: I built a financial research harness with Claude Code, full stack and open-source under Apache 2.0 (github.com/ginlix-ai/langalpha). Sharing the design decisions around context management, tools and data, and more in case it's useful to others building vertical agents.

I have always wanted an AI-native platform for investment research and trading. But almost every existing AI investing platform out there is way behind what Claude Code can do. Generalist agents can technically get work done if you paste enough context and bootstrap the right tools each session, but it's a lot of back and forth. So I built it myself with Claude Code instead: a purpose-built agent harness where portfolio, watchlist, risk tolerance, and financial data sources are first-class context. Open-sourced with full stack (React 19, FastAPI, PostgreSQL, Redis) built on deepagents + LangGraph. Learned a lot along the way and still figuring some things out. Sharing this here to hear how others in the community are thinking about these problems. This post walks through some key features and design decisions. If you've built something similar or taken a different approach to any of these, I'd genuinely love to learn from it.

Code execution for finance — PTC (Programmatic Tool Calling)

The problem with MCP + financial data: financial data overflows context fast. Five years of daily OHLCV, multi-quarter financial statements, full options chains — tens of thousands of tokens burned before the model starts reasoning. Direct MCP tool calls dump all of that raw data into the context window. And many data vendors squeeze tens of tools into a single MCP server. Tool schemas alone can eat 50k+ tokens before the agent even starts. You're always fighting for space.

PTC solves both sides. At workspace initialization, each MCP server gets translated into a Python module with documentation: proper signatures, docstrings, ready to import. These get uploaded into the sandbox. Only a compact metadata summary per server stays in the system prompt (server name, description, tool count, import path). The agent discovers individual tools progressively by reading their docs from the workspace — similar to how skills work. No upfront context dump.

```python
from tools.fundamentals import get_financial_statements
from tools.price import get_historical_prices

# The agent writes pandas/numpy code to process data, extract insights,
# and create visualizations. Raw data stays in the workspace — it never
# enters the LLM context window; only the final result comes back.
```

Financial data needs post-processing: filtering, aggregation, modeling, charting. That's why it's crucial that data stays in the workspace instead of flowing into the agent's context. Frontier models are already good at coding. Let them write the pandas and numpy code they excel at, rather than trying to reason over raw JSON.

This works with any MCP server out of the box. Plug in a new MCP server, and PTC generates the Python wrappers automatically. For high-frequency queries, several curated snapshot tools are pre-baked — they serve as a fast path so the agent doesn't take the full sandbox path for a simple question. These snapshots also control what information the agent sees. Time-sensitive context and reminders are injected into the tool results (market hours, data freshness, recent events), so the agent stays oriented on what's current vs. stale.

Persistent workspaces — compound research across sessions

Each workspace maps 1:1 to a Daytona cloud sandbox (or local Docker container). Full Ubuntu environment with common libraries pre-installed. agent.md and a structured directory layout:

- agent.md — workspace memory (goals, findings, file index)
- work/ /data/ — per-task datasets
- work/ /charts/ — per-task visualizations
- results/ — finalized reports only
- data/ — shared datasets across threads
- tools/ — auto-generated MCP Python modules (read-only)
- .agents/user/ — portfolio, watchlist, preferences (read-only)

agent.md is appended to the system prompt on every LLM call. The agent maintains it: goals, key findings, thread index, file index. Start a deep-dive Monday, pick it up Thursday with full context. Multiple threads share the same workspace filesystem. Run separate analyses on shared data without duplication.

Portfolio, watchlist, and investment preferences live in .agents/user/. "Check my portfolio," "what's my exposure to energy" — the agent reads from here. It can also manage them for you (add positions, update watchlist, adjust preferences). Not pasted, persistent, and always in sync with what you see in the frontend.

Workspace-per-goal: "Q2 rebalance," "data center deep dive," "energy sector rotation." Each accumulates research that compounds across sessions. Past research from any thread is searchable. Nothing gets lost even when context compacts.

Two agent modes

With PTC and workspaces covered, here's how they come together. PTC Agent is the full research agent — writes and execu
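The "each MCP server gets translated into a Python module" step can be sketched as straightforward code generation. Everything here is hypothetical: the schema shape, the `call_mcp_tool` sandbox helper, and the function names are my assumptions, not langalpha's implementation:

```python
def generate_wrapper(server: str, tools: list[dict]) -> str:
    """Render a Python module whose functions proxy one MCP server's tools,
    carrying the tool description over as a docstring the agent can read."""
    lines = [
        f'"""Auto-generated wrappers for MCP server: {server}."""',
        "from runtime import call_mcp_tool  # hypothetical sandbox RPC helper",
        "",
    ]
    for tool in tools:
        params = ", ".join(tool["params"])
        kwargs = ", ".join(f"{p}={p}" for p in tool["params"])
        lines += [
            f"def {tool['name']}({params}):",
            f'    """{tool["description"]}"""',
            f'    return call_mcp_tool("{server}", "{tool["name"]}", {kwargs})',
            "",
        ]
    return "\n".join(lines)

module_src = generate_wrapper("fundamentals", [
    {"name": "get_financial_statements",
     "description": "Quarterly statements for a ticker.",
     "params": ["ticker", "quarters"]},
])
```

The generated file lives in `tools/` in the workspace; only its import path and a one-line summary need to reach the system prompt, which is how the 50k-token schema dump gets avoided.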
I set up GPT 5.4 to review Claude's code inside Claude Code. The cross-model workflow catches things self-review never does
OpenAI released a Codex plugin for Claude Code last week. You can now run GPT 5.4 directly from your Claude Code terminal without switching environments. Two of the strongest models available, working together in one workflow. I have been using it for a week. Here is how it works and what I found.

As we know, every model has blind spots for its own patterns. Claude writes code, you ask Claude to review that code, Claude says it looks good. Then the bug shows up in production. Anthropic described this in their harness paper: builders who evaluate their own work are systematically overoptimistic. The maker and the checker need to be separate. A chef who tastes only their own food will always think it is excellent. The fix: have a different model do the review. The Codex plugin makes this trivially easy.

The workflow

The plugin adds two review commands. /codex:review runs a standard code review on your uncommitted changes. Read-only, changes nothing in your code. Use it before you push. /codex:adversarial-review goes deeper. It questions your implementation choices and design decisions, not just the code itself. I use this one when I want to know whether my approach is actually optimal. Also read-only. For larger diffs the review can take a while; Codex offers to run it in the background. Check progress with /codex:status.

My daily flow looks like this:

1. Claude writes the code (backend, architecture, complex logic)
2. Before committing: /codex:review
3. For bigger decisions: /codex:adversarial-review on top
4. Claude fixes the issues Codex found
5. Ship

The difference from self-review is noticeable. Codex catches edge cases and performance issues that Claude waves through. Different training, different habits, different blind spots.

Where each model is stronger

On the standard benchmarks they are close. SWE-bench Verified: GPT 5.4 at 80%, Opus 4.6 at 80.8%. HumanEval: 93.1% vs 90.4%. The real gap shows on SWE-bench Pro, which is harder to game: GPT 5.4 at 57.7%, Opus 4.6 at roughly 45%. A significant advantage for GPT on complex real-world engineering problems.

In daily use each model has clear strengths. Codex produces more polished frontend results out of the box. If you need a prototype that looks good immediately, Codex is the faster path. Claude is stronger at backend architecture, multi-file refactoring, and structured planning. Claude's Plan Mode is still ahead when you set up larger builds. The weaknesses are equally clear. Claude tends to over-engineer: you ask for a simple function and get an architecture designed to scale for the next decade. Codex produces slightly more rigid naming conventions. Neither is perfect, but together they balance each other out.

Cost matters too. GPT 5.4 runs at $2.50 per million input tokens and $15 output. Opus 4.6 costs $5 input and $25 output. GPT is half the price on input and 40% cheaper on output. For an agent team running all day, that adds up.

Setup in three commands

You need a ChatGPT account. A free one works.

```
# Step 1: Add the OpenAI marketplace
/plugin marketplace add openai/codex-plugin-cc

# Step 2: Install the Codex plugin
/plugin install codex@openai-codex

# Step 3: Connect your ChatGPT account
/codex:setup
```

At step 2 you get asked whether to install for the current project or globally. Pick "Install for you" so it is available everywhere. Step 3 opens a browser window for authentication.

One requirement: your project needs an initialized git repository. Codex starts with git status and aborts if there is no git. Verify with /codex. You should see a list of available Codex commands. If the plugin does not show up, run /reload-plugins.

What I would do differently

I started by running /codex:adversarial-review on everything. That is overkill for small changes. Now I use the standard review for routine work and save the adversarial version for architectural decisions or complex features. The standard review is fast enough to run on every commit without slowing you down.

If you have Claude Code set up already, this takes three minutes to install. Try /codex:review on your next feature before you push. The difference from letting Claude review its own code is immediate.

Has anyone else tried combining models for code review? Curious whether people are using other cross-model setups or sticking with single-model workflows.

submitted by /u/Ok_Today5649
"Spud" vs Mythos
With the recent talk of both "next-gen" models, I still really wonder if it will be enough. I made several posts previously about the current limitations of AI for coding: there's basically still this ceiling where it cannot truly converge on production-grade code on complex repos, with a "depth" degradation of sorts; it can never bottom out, basically.

I've been running Codex 24/7 for the past 6 months straight since GPT-5, using over 10 trillion tokens (total cost only around $1.5k in the Pro sub). And I have not been able to close a single PR where I was running extensive bug sweeps to fix all bug findings. It will forever thrash and find more bugs of the same class over and over, implement the fixes, then find more and more and more. Literally forever. No matter what I did to adjust the harness and strengthen the prompt, it never could clear 5+ consecutive sweeps with 0 P0/1/2 findings. Over 3,000+ commits of fixes, review, and sweeps in an extensive workflow automation (similar to AutoResearch).

They love to hype up how amazing the models are, but this is still the frontier. You can't really ship real production-grade apps; that's why you've never seen a single person use AI "at scale," like literally build an app like Facebook or ChatGPT. All just toy apps and tiny demos. All shallow surface-level apps and "fun" puzzles or "mock-up" frontend websites for a little engagement farming. The real production-grade apps are still built by real SWEs who simply use AI to help them code faster. But AI alone is not even close to being able to deliver on a real product when you actually care about correctness, security, optimization, etc. They even admit in the recent announcement about Mythos that it's not even close to an entry-level Research Scientist yet.

So the question really is: when will AI, if ever, be capable enough to fully autonomously deliver production-grade software? We will see the true capabilities of the spud model hopefully soon, but my hunch is we are not even scratching the surface of truly capable coding agents. The benchmarks they use, where they hit 80-90%, are really useless in the scheme of things; if you tried to use them as a real metric of usefulness, you would probably need to hit the equivalent of 200-300% on these so-called benchmarks before the models are actually there. Until they come up with a benchmark that actually measures against real-world applications.

What do you guys think?

submitted by /u/immortalsol
I wanted to build Jarvis on Claude Code on day one. 6 months later, here's Wiz, what actually works, and the 9 mistakes I made along the way.
Back in October I started building my own AI agent on Claude Code. I call it Wiz. My original fantasy was Jarvis from Iron Man: one agent that ran my whole life, handled the business, wrote the blog, managed the calendar, triaged the inbox. The whole thing. From week one. That was the biggest mistake I made, and basically everything else downstream of it was a consequence.

What Wiz is: a personal AI agent I use every day, built on Claude Code as the harness. CLAUDE.md is the instructions file, memory lives in markdown files, tools are just scripts in folders. It runs morning reports, evening summaries, inbox triage, and a bunch of experiments autonomously. For anything creative or quality-sensitive, I'm still in the loop.

How Claude helped: honestly, Claude Code built most of it with me. I described what I wanted, read every file it wrote, corrected the bad parts, and iterated. The /init command gave me my first CLAUDE.md in one shot. When things broke (they broke often), I'd paste errors back to Claude Code and it would walk me through the diagnosis. Six months in, Claude Code is both the tool I use to build Wiz and the runtime Wiz runs on.

The mistakes that burned me the most:

- Let Claude generate my first CLAUDE.md without reading it carefully. Hours of weird bugs traced back to a single bad sentence at the top.
- Let self-improvement rewrite my core instructions with no guardrails. It drifted in five directions at once.
- Ran Opus on every tiny query until I hit usage limits before lunch. Model routing fixed it (small/local for simple stuff, Sonnet for general, Opus for hard calls).
- Tried to build Jarvis on day one when I should've built incrementally. That one fantasy cost me about three months.
- Put an LLM call in every step of every pipeline when most of it should've been plain scripts.

Wiz is a personal project, not something I'm releasing, but I wrote up the full architecture and all 9 mistakes in a post on Digital Thoughts. It includes a step-by-step walk-through of building a real first agent (something small that reads your overnight email and writes a one-paragraph morning summary). Free to read, no paywall: https://thoughts.jock.pl/p/how-to-build-your-first-ai-agent-beginners-guide-2026

Happy to answer questions about Wiz, Claude Code specifics, or any of the mistakes in the comments.

submitted by /u/Joozio
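The model-routing fix mentioned in the mistakes list (small/local for simple stuff, Sonnet for general, Opus for hard calls) can be sketched as a tiny dispatcher. The model names and the length/keyword heuristic are my assumptions, not Wiz's actual logic:

```python
# Illustrative router: hard-reasoning keywords go to the big model,
# long general tasks to the mid tier, everything else stays local/cheap.
HARD_HINTS = ("architecture", "refactor", "design", "debug")

def route_model(prompt: str) -> str:
    """Pick a model tier for a query."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return "opus"          # hard reasoning calls
    if len(text.split()) > 40:
        return "sonnet"        # general, longer tasks
    return "local-small"       # trivial lookups stay cheap
```

Even a heuristic this crude addresses the failure mode described: the expensive model only runs when the query plausibly needs it, instead of on every tiny request until the usage limit hits.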
I used Claude Code to build an open-source AI agent verification tool — gan-harness
I built gan-harness entirely with Claude Code (Opus). It's a verification layer that runs build, test, lint, typecheck, and secret scanning on AI agent output.

## What it does

When you run an AI agent (Claude Code, LangChain, etc.) to write code, gan-harness verifies the output before it ships:

```
npx gan-harness verify
```

It auto-detects your project type (Node/Python/Rust/Go) and runs 5 checks locally.

## How Claude Code helped

- Claude Code wrote the initial TypeScript port from my bash scripts
- Security review was done by Claude's code-reviewer and security-reviewer agents
- Found and fixed 2 CRITICAL command injection vulnerabilities during the review
- All 42 tests were written with Claude Code's TDD workflow

## The problem it solves

AI agents fail in production in predictable ways: infinite loops, leaked secrets, hallucinated code passing self-evaluation, cost explosions. This tool catches those patterns with static checks before expensive API evaluation.

## Free to try

Fully open source, MIT license. No signup, no API key needed:

```
npx gan-harness init
npx gan-harness verify
```

GitHub: https://github.com/VIVEHACKER/gan-harness

Feedback welcome — especially on what checks you'd want added.

submitted by /u/lemon1825
5,355 upvotes on a post about teaching Claude to talk like a caveman. the Claude subreddits had a weekend.
I run Claude Code Daily. every day I scan r/ClaudeCode, r/ClaudeAI, and r/vibecoding for the posts, repos, and comments that actually matter. here's Friday through Sunday in one post.

Friday: the ban, the credits, and the caveman. Anthropic killed third-party harnesses like OpenClaw from using subscription plans, and simultaneously handed out API credits ($20 Pro, $100 Max 5x, $200 Max 20x). carrot and stick in the same email. then someone taught Claude to talk like a caveman: 75% fewer tokens per response. top comment from u/fidju at 1,619 upvotes: "Why waste time say lot word when few word do trick." usage-limit complaints hit day 10 in the data. it stopped being a trend and became a genre.

Saturday: memes, mourning, and actually cool stuff. 4 of the top 20 posts were shitposts; the community entered the memes-as-therapy phase. OpenClaw discourse hit 1,200+ upvotes and 600+ comments across three threads. someone posted "Alright, I'm gonna be a dick. CC is fine" and collected 189 upvotes with 180 comments. that's not a post, that's a battlefield. but the builders kept building.

🔧 Vibeyard (190 upvotes) dropped an open-source IDE that embeds a browser into Claude Code. click an element and Claude sees the DOM path. no more describing which blue button.
🔧 a senior engineer dropped a masterclass on git worktrees for parallel Claude sessions (293 upvotes, 140 comments). real workflow patterns, not theory.
🔧 someone sent Claude back to 1998 and it rebuilt their childhood PC. 618 upvotes. the internet needed a hug.

Sunday (Easter): the plot twist. OpenClaw gets banned Saturday, the holiday lowers traffic Sunday, and suddenly... rate limits feel normal again. two threads (257 and 272 upvotes) full of cautious celebration. the best new repo was a devil's advocate skill for Claude Code that forces a second pass arguing against its own decisions before proceeding, because Claude's biggest weakness is agreeing too fast. someone also built an AI job search system with Claude, scored 740+ offers, landed a job, then open sourced the whole thing. 237 upvotes. fastest-rising post of the day by 4x.

stuff worth stealing from this weekend:
- add this to your CLAUDE.md: "be careful, we are live on prod". multiple builders reported better output quality from this one line. zero extra tokens.
- the caveman system-prompt pattern works. skip filler, no greetings, shortest correct phrasing. it's compression, not a joke.
- git worktrees for running multiple Claude Code sessions on the same repo without merge conflicts. "git worktree add ../feature-auth feature/auth" and each session gets its own branch and working directory.

full daily breakdowns with repos, code drops, and the best comments are at the link. shawn tenam, GTM Engineer. submitted by /u/Shawntenam
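the worktree tip scales past one branch. a minimal sketch of a helper that builds one `git worktree add` command per parallel session (the helper name and directory layout are my own, not from the post; it only constructs the commands, it doesn't run git):

```python
def worktree_commands(repo_root: str, branches: list[str]) -> list[list[str]]:
    """Build one `git worktree add` invocation per feature branch.

    Each Claude Code session then runs in its own directory
    (../<branch-with-slashes-dashed>) on its own branch, so parallel
    sessions never fight over the same working tree.
    """
    cmds = []
    for branch in branches:
        # derive a sibling directory name: feature/auth -> ../feature-auth
        dirname = branch.replace("/", "-")
        cmds.append(["git", "-C", repo_root, "worktree", "add",
                     f"../{dirname}", branch])
    return cmds

# two parallel sessions, one per branch
for cmd in worktree_commands(".", ["feature/auth", "feature/billing"]):
    print(" ".join(cmd))
```

pass the lists to `subprocess.run` to actually create the worktrees, and `git worktree remove ../feature-auth` cleans up when a session is done.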
AI Harness at the Architecture Layer
Disclosure: I built the open-source project mentioned below.

A lot of current harness engineering discussion focuses on execution quality: context management, tool access, task decomposition, review loops, evaluation, and memory. I think there is still a separate failure mode that those improvements do not solve: even with a strong execution harness, agents can still produce architectures that are technically coherent but wrong for the actual team and operating context.

What I have been exploring is an architecture-layer harness. The implementation is an open-source project called Architecture Compiler: https://github.com/inetgas/arch-compiler

The technical approach has three parts. First, there is a pattern registry. Each pattern encodes constraints, NFR support, cost/adoption trade-offs, and provides/requires relationships. The idea is to make recurring architectural judgment machine-readable rather than leaving it in docs or chat history. Second, there is a deterministic architecture compiler that takes a canonical spec and evaluates those patterns against explicit constraints such as platform, language, providers, availability target, retention, and cost ceilings. The same input produces the same selected patterns and the same rejected-pattern reasons. The point is not model creativity; it is reproducibility and reviewability. Third, there are AI workflow skills around that compiler that force an approval and re-approval boundary. If planning or implementation changes architectural assumptions, the workflow is supposed to route back through compilation instead of silently treating those changes as implementation detail.

I tested this on a Bird ID web app case study: https://github.com/inetgas/arch-compiler-ai-harness-in-action It is not a substitute for human architecture judgment, but a way to make that judgment more reviewable and enforceable downstream.
I'm interested in whether others are addressing this problem differently:
- policy files only
- templates
- ADRs
- eval gates
- more deterministic orchestration

Optional background write-up: https://inetgas.substack.com/p/ai-harness-engineering-at-the-architecture submitted by /u/inetgas
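The post doesn't show the compiler's internals, but the core idea, that the same spec always yields the same selected patterns and the same rejection reasons, can be sketched in a few lines. All names and fields below are illustrative, not the actual arch-compiler API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    name: str
    requires: frozenset   # capabilities the platform must provide
    monthly_cost: int     # rough cost the pattern adds, in dollars
    availability: float   # highest availability target it supports

@dataclass(frozen=True)
class Spec:
    provides: frozenset
    cost_ceiling: int
    availability_target: float

def compile_architecture(spec: Spec, registry: list[Pattern]):
    """Deterministically select patterns and record why each rejected one failed."""
    selected, rejected = [], {}
    # stable iteration order => same input, same output, every run
    for p in sorted(registry, key=lambda p: p.name):
        if not p.requires <= spec.provides:
            rejected[p.name] = f"missing capabilities: {sorted(p.requires - spec.provides)}"
        elif p.monthly_cost > spec.cost_ceiling:
            rejected[p.name] = f"cost {p.monthly_cost} exceeds ceiling {spec.cost_ceiling}"
        elif p.availability < spec.availability_target:
            rejected[p.name] = f"supports {p.availability}, target is {spec.availability_target}"
        else:
            selected.append(p.name)
    return selected, rejected

spec = Spec(provides=frozenset({"postgres", "queue"}),
            cost_ceiling=100, availability_target=0.99)
registry = [
    Pattern("outbox", frozenset({"postgres", "queue"}), 20, 0.999),
    Pattern("multi-region", frozenset({"postgres"}), 400, 0.9999),
]
selected, rejected = compile_architecture(spec, registry)
# selected -> ["outbox"]; multi-region rejected over the cost ceiling
```

The rejected-pattern reasons are what make the output reviewable: a human (or a downstream approval skill) can audit why each pattern was excluded rather than trusting a model's free-form rationale.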
I built an open-source agent framework for Claude Code after the recent Pro/Max subscription changes: KyberBot
With the recent changes to Claude's Pro/Max subscriptions affecting some third-party agent setups, I wanted to share something I built that's completely unaffected: KyberBot. I built KyberBot over the past several months using Claude Code as the primary development environment; it's actually the tool I used to write most of the codebase itself, iterating on agent architecture and memory systems through Claude Code. The project is specifically designed to extend Claude Code rather than work around it.

What it is: an open-source personal AI agent that runs natively inside the Claude Code harness. No separate framework.

What I built into it: Claude Code is already an exceptional agent, so I just added persistent SOTA memory, self-evolving identity, scheduled routines, skill auto-generation, and channels.

Setup is about 5 minutes and it's MIT licensed.
GitHub: https://github.com/KybernesisAI/kyberbot
Docs: https://kybernesis.ai/kyberbot/docs

Happy to answer questions about the architecture or how I built it. submitted by /u/DarkstarVC
Auto agent: a self-improving domain-expertise agent
someone open-sourced an AI agent that autonomously upgraded itself to #1 across multiple domains in under 24 hours... then open-sourced the entire thing. but here's why it actually works:
- agents fucking suck, not because of the model, but because of their harness (tools, system prompts, etc.)
- Auto agent creates a meta agent that tweaks your agent's harness, runs tests, and improves it again, until it's #1 at its goal
- best part: you can set this up for ANY task. in this article he uses it for terminal bench (code) and spreadsheets (financial modelling), and it topped the rankings for both :)
- secret sauce: he used THE SAME MODEL to evaluate the agent. Claude managing Claude means a better understanding of why it failed and how to improve it

humans were the fucking bottleneck, and this not only saves you a load of time, it's just a better way to train agents for domain-specific tasks. https://github.com/kevinrgu/autoagent submitted by /u/Infinite-pheonix
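the meta-agent loop above is, at its core, hill-climbing on the harness: mutate, benchmark, keep what scores higher. a minimal sketch under that reading (the repo's actual implementation may differ; `score` and `mutate` here stand in for the benchmark run and the meta agent's edit step):

```python
import random

def improve_harness(base_prompt, score, mutate, rounds=5, seed=0):
    """Hill-climb a harness: mutate the system prompt, rerun the benchmark,
    and keep a variant only if it strictly improves the score.

    score(prompt) -> number   (benchmark run, e.g. terminal-bench pass rate)
    mutate(prompt, rng) -> new prompt  (the meta agent's edit step)
    """
    rng = random.Random(seed)          # fixed seed => reproducible runs
    best, best_score = base_prompt, score(base_prompt)
    history = [(best, best_score)]
    for _ in range(rounds):
        candidate = mutate(best, rng)
        s = score(candidate)
        if s > best_score:             # keep only strict improvements
            best, best_score = candidate, s
        history.append((candidate, s))
    return best, best_score, history
```

in the real system both `score` and `mutate` would be LLM-backed (the same model evaluating and editing, per the post); the loop itself stays this simple.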
I used Claude Code to build a portable AI worker Desktop from scratch: the open-source community gave it 391 stars in 6 days
I want to share something I built with Claude Code over the past week because it shows what AI-assisted development can actually do when pointed at a genuinely hard problem: moving AI agents beyond one-off task execution.

Most AI wrappers just send prompts to an API. Building a continuously operating AI worker requires queueing, harness integration, and MCP orchestration. I wanted a way to make AI worker environments fully portable, and no widely adopted solution had cleanly solved the "how do we package the context, tools, and skills so anyone can run it locally" problem.

What Claude Code did: I pointed Claude (Opus 4.6, high thinking) at the architecture design for Holaboss, an AI Worker Desktop. Claude helped me build a three-layer system separating the Electron desktop UI, the TypeScript-based runtime system, and the sandbox root. It understood how to implement the memory catalog metadata, helped me write the compaction boundary logic for session continuity, and worked through the MCP orchestration so workspace skills could be merged with embedded runtime skills seamlessly.

The result is a fully portable runtime. Your AI workers, along with their context and tools, can be packaged and shared. It's free, open-source (MIT), and runs locally with Node.js (desktop + runtime bundle). It supports OpenAI, Anthropic, OpenRouter, Gemini, and Ollama out of the box.

I open-sourced this a few days ago and the reaction has been unreal. The GitHub repo hit 391 stars in just 6 days, and the community is already building on top of the 4 built-in worker templates (Social Operator, Gmail Assistant, Build in Public, and Starter Workspace). This is far from the typical "I used AI to write a to-do app": Claude Code helped architect a real, local, three-tier desktop and runtime system for autonomous AI workers, and people are running it on their Macs right now (Windows & Linux in progress). I truly still can't believe it.
The GitHub repo is public if you want to try it or build your own worker. GitHub ⭐️: https://github.com/holaboss-ai/holaboss-ai submitted by /u/Imaginary-Tax2075
Aren't they overdoing it?
I really like the generative UI and I think it has great applications, but this is pointless. I asked for "text" and yet I got this. Not only is it pointless, it also spends more tokens and compute on something that didn't need them. To be fair, I've only had this happen a few times, and every time it has been with Sonnet 4.6 with extended thinking off, but still. submitted by /u/Educational-Nebula50
Cloned the claw-code repo before it went dark, published it, and working on making it provider-agnostic
Like many of you, I was trying to clone claw-code and kept hitting 403s. Managed to retrieve the full source and published it here: https://github.com/ghostwright/wraith First commit is the original, completely unmodified. The interesting part for this community: the agent harness is currently locked to one provider. We can work on making it work with any LLM - Claude, OpenAI, Gemini, local models. That's the whole point. Anyone who wants to read the code or collaborate on this, come through. submitted by /u/Beneficial_Elk_9867
View originalYes, Harness AI offers a free tier. The pricing model is subscription + freemium + per-seat + tiered.
Key features include: Continuous Delivery GitOps, Continuous Integration, Internal Developer Portal, Infrastructure as Code Management, Database DevOps, Artifact Registry, AI Test Automation, Resilience Testing.
Based on user reviews and social mentions, the most common pain point is an expensive API.
Based on 33 social mentions analyzed, sentiment is 0% positive, 100% neutral, and 0% negative.