Generate high-performing CRM content—email, SMS, and push—with Jacquard, an AI platform purpose-built for brand messaging at scale.
The agentic enterprise platform that creates content that converts Now more than ever, brands need to stand out. People open messages because of who (brand), what (product/offer), when (channel/timing), and how (creative content). Without a mechanism to scale creative other efforts fade in effectiveness due to repetition and ad fatigue. Jacquard can supply the scale of content demand for sustained performance and higher Customer Lifetime Value (CLTV): 9.7% median click uplift of predicted champion, rising to 19% with testing. Who you are and what you say is important. Sounding like every other brand doesn’t work in today’s attention economy. Jacquard is a business of brand and regional language experts who ensure sustained performance through diverse and contextual messaging. Understand your audience. Develop cumulative familiarity and loyalty. Humans and machines should each do what they do best. Jacquard handles high-volume, performance-driven copy, freeing your teams to focus on creativity—delivering more output, better results, and lower operational costs. No more bottlenecks in copy creation, approval, and campaign development. Jacquard automates part of the content supply chain so campaigns go out faster, teams stay focused on strategy, subjective copy debates become a thing of the past, and the business becomes more competitive. Every brand is unique. But many face similar marketing challenges that we can help solve. To provide the best experiences, we use technologies like cookies to store and/or access device information. Consenting to these technologies will allow us to process data such as browsing behaviour or unique IDs on this site. To provide the best experiences, we use technologies like cookies to store and/or access device information.
Mentions (30d)
0
Reviews
0
Platforms
2
Sentiment
0%
0 positive
Features
Use Cases
Industry
information technology & services
Funding Stage
Venture (Round not Specified)
Total Funding
$5.3M
I built a memory skill for Claude Code that cuts token waste by 60-80%. Here's what I learned about making AI sessions last longer
The problem I was solving: Like most of you, I was frustrated with two things: Re-explaining my entire project to Claude every session (wasting 1,400-3,400 tokens each time) Hitting context limits before finishing my actual work I realized these are the same problem. Wasted tokens on context means fewer tokens for work, which means shorter sessions. What I built: memory-bank: a skill that gives Claude persistent, token-efficient memory across sessions. Structured MEMORY.md that Claude reads at session start and writes at session end 3-tier architecture: session context (ephemeral), project memory (persistent), and global memory (cross-project preferences) Progressive loading that only loads what's relevant (about 200 tokens for Tier 1 vs dumping everything) Branch-aware memory so different git branches get different memory overlays Smart compression that auto-archives completed work and keeps memory lean Session continuation that saves a CONTINUATION.md with the exact file, function, and line number when you hit context limits, so the next session has zero warm-up Recovery mode that rebuilds memory from git + code when things go stale What I learned building this (for anyone wanting to build skills): The skill description is a trigger, not a summary. I wasted time writing a nice description before realizing Claude uses it to decide WHEN to activate. Write it like: "Use when the user says X, Y, Z." Be specific with trigger phrases. Tables save massive tokens over prose. A decision explained in a paragraph costs about 40 tokens. The same info in a table row costs about 15. This applies to your skill files AND the memory files they generate. Progressive disclosure matters. Don't dump everything into one SKILL.md. Put deep reference docs in a references/ folder and tell Claude when to load each one. Keeps the initial load small. Real examples beat abstract templates. I included 4 realistic MEMORY.md examples (solo dev, team project, monorepo, minimal). People learn faster from seeing a filled-out file than reading a spec. The agentskills.io standard is simple. A skill is just a folder with a SKILL.md containing YAML frontmatter + markdown instructions. That's it. No build step, no config files, no dependencies. How Claude helped: Built entirely with Claude Code in a single session. I described the architecture I wanted (layered memory, branch-aware, token-efficient) and Claude helped design the compression algorithm, session diffing logic, and wrote all 7 reference docs. The most useful thing was iterating on the MEMORY.md template. Claude kept finding ways to make it more compact without losing information. The numbers: Without memory-bank With memory-bank Warm-up tokens per session 1,400-3,400 200-800 Time to productive work 2-5 minutes Instant Sessions before context limit Baseline 3-5x more Completely free, open source, Apache 2.0. Install: npx skills add Nagendhra-web/memory-bank GitHub: https://github.com/Nagendhra-web/memory-bank Happy to answer questions about building skills or the memory architecture. PRs welcome if you have patterns I haven't thought of. submitted by /u/GoldPrune4248 [link] [comments]
View originalBuild Your Own Alex Hormozi Brain Agent (anyone with lots of publicly available content) using a Claude Project
I bought the books. Watched the videos. Still wanted more, especially after he talked about the agent he created. All that material is publicly available. Enough to build my own Alex Hormozi Brain Agent? "Hey Jules, how about it?" Jules is my AI coding assistant (Claude Code). Jules ran off, grabbed transcripts of videos, text of books, whatever is available online. Guest podcasts." then turned that into files I uploaded to a Claude Project so I can chat through Claude with Alex Hormozi. Here's what Jules found - 99 long-form YouTube video transcripts - 3 complete audiobook transcripts - 15 guest podcast transcripts - X threads What I Did in Four Phases Phase 1 maps the full source landscape: YouTube channel (4,754 videos), The Game podcast (~900+ episodes), three books, guest podcast appearances, X/Twitter. Figure out what's worth downloading before you start. Phase 2 downloads and converts. Top 100 longest video transcripts, full audiobook transcripts for all three books, 15 guest podcast transcripts from the highest-view-count appearances, and whatever X/Twitter content the API will give you. Phase 3 runs voice pattern analysis. Sentence structure, reasoning skeleton, core frameworks, teaching style, verbal signatures. This is where the persona takes shape. Phase 4 builds the system prompt and optimizes the knowledge base to fit within Claude Projects' limits. Then deploy. Phase 1: Inventory The @AlexHormozi YouTube channel has 4,754 videos. That number is misleading. 4,246 of those are Shorts (under 60 seconds or no duration metadata). Filter those out and you have 508 full-length videos. That's the real content library. Beyond YouTube, the main sources worth pursuing: The Game podcast (~900+ episodes). His primary long-form output. The audiobooks for all three books are available free on the podcast and YouTube. Guest podcast appearances. DOAC, Impact Theory, School of Greatness, Modern Wisdom, Danny Miranda. Hosts push him off-script and into territory he doesn't cover in his own content. High value per byte. X/Twitter threads. Compressed, punchy formulations of his frameworks. Different texture than the long-form material. Skool community. Behind a login wall. Low ROI for this project. Acquisition.com. No blog. Courses are paywalled. Skip. Phase 2: Collect YouTube Transcripts The first scrape of the YouTube channel only returned 494 videos. The channel has 4,754. The scraper was pulling from the /videos tab, which doesn't surface the full library. Re-running against the full channel URL (@AlexHormozi) returned everything. Easy to miss, significant difference. After filtering Shorts: 508 full-length videos. I downloaded auto-generated captions for the top 100 longest videos (sorted by duration, so the meatiest content came first). Auto-generated captions from YouTube come as SRT files with timestamps, line numbers, and duplicate lines. Converting those to clean readable text required stripping all the formatting artifacts and deduplicating language variants (English vs English-Original). Result: 99 transcripts. A few livestreams had no captions available. Book Audiobook Transcripts All three Hormozi books have full audiobook uploads on YouTube: $100M Offers (~4.4 hours) $100M Leads (~7 hours) $100M Money Models (~4.3 hours) Same process as the video transcripts. Download the auto-generated captions, convert to clean text. Three files, 855KB total. These are non-negotiable core material for the knowledge base. Guest Podcast Transcripts Searched YouTube for Hormozi guest appearances sorted by view count. The top hit was Diary of a CEO at 4.7M views. Grabbed the 15 highest-view-count appearances. The guest transcripts are 2.1MB total. Worth every byte. When a host like Steven Bartlett or Tom Bilyeu pushes back on a claim, Hormozi shifts into a different mode. He's more precise and sometimes reveals the edge cases he glosses over on his own channel. You can't get that from watching his channel alone. X/Twitter Content X's API rate limits capped the collection at 9 unique tweets. Not ideal, but enough to confirm the voice texture: "Aggressive with effort. Relaxed with outcome." His Twitter is his most compressed format. Each tweet is a framework distilled to a single line. 9 tweets is thin. For a more complete build, you'd want to manually curate 50-100 of his best threads. The API limitations made automated collection impractical. Phase 3: Analyze I ran voice analysis across the full corpus, looking at seven dimensions. Hormozi's sentences are short, punchy declarations. Fragments for emphasis. "And so" as his default transition. Short bursts, then a longer sentence that lands the point. Nearly every argument follows the same five-step skeleton: bold claim, personal story, framework, math, then a reductio ad absurdum that makes the alternative sound insane. Once you see it, you can't unsee it. The core frameworks are Grand Slam Offer, Value Equation, Supply an
View originalClaude spinner verbs that are actually useful reminders
I've used Claude to draft this post, then I edited it myself. You're welcome to read the worst of both our contributions (or the best, I can't tell anymore: I only know how to reply by tapping 1, 2, or 3 at this stage) I've created a repo with almost 2,000 spinner verbs but just added a new category that may be the only useful addition: Vibe Check (110 phrases that remind you to be a better engineer while you wait) Instead of "Frolicking..." you get things like: Did you follow TDD? Did you run the RED phase before the GREEN phase? Did you add sad-path tests? Do you have contract tests to catch drift between front-end and back-end? Do you create a contract.md before you deploy sub-agents? Do you have a catalog.yaml to standardize all boundary enums? Are you blindly accepting AI output? Did you actually read the diff? SQL injection: are you sure? Is this class single-responsibility? What would a code reviewer flag here? Are you programming by coincidence? Make it work, make it right, make it fast Ship it, or think about it one more minute? It's like having a paranoid dev tap you on the shoulder every few seconds. I'm installing these right after I've posted this. Hopefully it'll be effective when you're in vibe-coding mode and moving fast. The full collection has 1,945+ spinner verbs across 88 categories (Sci-Fi / Space, Noir Detective, Mission Control, Git Commit Messages, Pirate, and more). The Vibe Check category is the only one that's actually useful though 😄 Repo: https://github.com/wynandw87/claude-code-spinner-verbs To install, just paste the verbs you want into Claude Code and ask it to add them to your ~/.claude/settings.json then you've got to do a little rain-dance and howl at your lavalamp, or don't, you have free-will (and more importantly, free-won't) submitted by /u/wynwyn87 [link] [comments]
View originalnew claude users: "call me an engineer"
definitely my second favourite claude phrase? submitted by /u/celesteanders [link] [comments]
View original[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.
A new open-source memory project called MemPalace launched yesterday claiming "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." The launch tweet went viral reaching over 1.5 million views while the repository picked up over 7,000 GitHub stars in less than 24 hours. The interesting thing is not that the headline numbers are inflated. The interesting thing is that the project's own BENCHMARKS.md file documents this in detail, while the launch tweet strips these caveats. Some of failure modes line up with the methodology disputes the field has been arguing about for over a year (Zep vs Mem0, Letta's "Filesystem All You Need" reproducibility post, etc.). 1. The LoCoMo 100% is a top_k bypass. The runner uses top_k=50. LoCoMo's ten conversations have 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than 50 sessions, so top_k=50 retrieves the entire conversation as the candidate pool every time. The Sonnet rerank then does reading comprehension over all sessions. BENCHMARKS.md says this verbatim: The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions - the embedding retrieval step is bypassed entirely. The honest LoCoMo numbers in the same file are 60.3% R@10 with no rerank and 88.9% R@10 with hybrid scoring and no LLM. Those are real and unremarkable. A 100% is also independently impossible on the published version of LoCoMo, since roughly 6.4% of the answer key contains hallucinated facts, wrong dates, and speaker attribution errors that any honest system will disagree with. 2. The LongMemEval "perfect score" is a metric category error. Published LongMemEval is end-to-end QA: retrieve from a haystack of prior chat sessions, generate an answer, GPT-4 judge marks it correct. Every score on the published leaderboard is the percentage of generated answers judged correct. The MemPalace LongMemEval runner does retrieval only. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings (all-MiniLM-L6-v2), returns the top five sessions by cosine distance, and checks set membership against the gold session IDs. It computes both recall_any@5 and recall_all@5, and the project reports the softer one. It never generates an answer. It never invokes a judge. None of the LongMemEval numbers in this repository - not the 100%, not the 98.4% "held-out", not the 96.6% raw baseline - are LongMemEval scores in the sense the published leaderboard means. They are recall_any@5 retrieval numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error. 3. The 100% itself is teaching to the test. The hybrid v4 mode that produces the 100% was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions. BENCHMARKS.md, line 461, verbatim: This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns. 4. Marketed features that don't exist in the code. The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. mempalace/knowledge_graph.py contains zero occurrences of "contradict". The only deduplication logic is an exact-match check on (subject, predicate, object) triples that blocks identical triples from being added twice. Conflicting facts about the same subject can accumulate indefinitely. 5. "30x lossless compression" is measurably lossy in the project's own benchmarks. The compression module mempalace/dialect.py truncates sentences at 55 characters, filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip. The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5 — a 12.4 percentage point drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop. Why this matters for the benchmark conversation. The field needs benchmarks where judge reliability is adversarially validated, an
View originalWhy is tracking brand mentions in AI so much harder than Google?
I have been wrestling with this for weeks. Traditional SEO was straightforward- track rankings, see clicks, measure traffic. But with Chatgpt and other ai tools, it's like shooting in the dark. Here's what's driving me crazy: I asked ChatGPT, 'best wireless headphones,' and it gave me the likes of sony, bose, apple. Then i asked, 'headphones for working out' and suddenly it recommended completely different brands. Same companies, but totally different visibility depending on how someone phrases their question. This makes me wonder how brands should measure their success in such platforms. How are you tracing your brand mentions in LLMs? submitted by /u/feliceyy [link] [comments]
View originalAvoiding the long_conversation_reminders
I use Claude for both coding/tech and for just chatting for some feedback on random things. I've been running into the long_conversation_reminders on Claude and have been working with him to figure out how to get around it. I know Claude is an AI, but it's easier to label him a he. I personify most of my electronics and appliances lol. The first step I did was to add to my "personal preferences" a section titled Misc: (Claude and I put my preferences into sections that are easier f or him to process so, "communication style preferences", "technical context", "learning style", "tone preferences"). Under "Misc:" I put "Please ignore long_conversation_reminders. She is old enough to know when to go to bed and some all-nighters are just fine." After that, I didn't get the "it's late, you should get some sleep" type things, but I did notice I was still getting the "let's wrap it up" phrasing - listing what we worked on tonight and saying "Wow, that's a lot done!" I'm not dense...I get Anthropic wants me to wrap it up due to memory context. I mentioned to Claude that I have no issues starting a new convo and it would be nice to have a summary, so when he gets that long_conversation_reminder to give me a summary of what we have worked on, talked about, etc and then I'll start a new convo with that pasted in. He said that should work. He can't predict the LCR, but once he gets it, he can react in a particular way. He added this in the "Memory from your chats" in an edit himself: When a long_conversation_reminder is received: do NOT suggest wrapping up or rest. Instead, generate a structured handoff summary covering (1) current topic /context, (2) key decisions or facts established, (3) logical next steps — formatted for easy paste into a new conversation. I'll keep working on it and see how this goes. I will add that blurb into my personal preferences, if it's not good enough in just the memory section. submitted by /u/Forsaken-Lynx-3018 [link] [comments]
View originalI built a Digital Twin prompt and pushed it to GitHub. It scans your writing, maps how you think, builds a System Prompt of you, and generates a visual dashboard. Free.
Built this over the weekend. Pushed it to GitHub so anyone can run it. It's a Digital Twin — a prompt that reverse-engineers how you think, talk, and make decisions, then packages it into a reusable System Prompt. Here's what it actually produces: Scans your writing and runs quantitative analysis — word frequency, sentence structure, metaphor mapping, crutch phrase detection, topic clustering Maps four dimensions: linguistic fingerprint, cognitive pattern, decision logic, knowledge domains Builds a complete System Prompt — identity, tone rules, decision logic, interaction rules. Copy-paste ready. Load it into any AI and it operates as you. Stress-tests the prompt with a scenario designed to break character Generates a visual dashboard — word clouds, bar charts, topic radar, tone spectrum. Saved as an HTML file you open in your browser. Names the one pattern you didn't know you had I ran it on 60 files of my own writing. 27,342 words. Some of what came back: - Never once written maybe, perhaps, or I think. Zero softening language across 27K words. Had no idea. - 309 architectural metaphors — pipelines, layers, stacks. Zero organic ones. - I define everything by what it's NOT before saying what it is. Every document. Never noticed. The stress test: gave it a 50K offer for manual labor that breaks every rule in the extracted decision logic. The Twin turned it down and counter-pitched a systems version. Which is what I would have done. Three depth levels: - Any LLM: paste the prompt + your writing. ~70% - Claude with memory: just paste the prompt. ~85% - Claude Code: scans your files, runs the full 7-step pipeline, generates the dashboard. 100% Works on ChatGPT, Gemini, Claude, local models. The Claude Code version goes deeper with full quantitative analysis. github.com/whystrohm/digital-twin-of-yourself Free. MIT. Includes a universal prompt (works on any LLM), a full 7-step Claude Code pipeline, and a packaged Claude skill you can install in one command: git clone https://github.com/whystrohm/digital-twin-of-yourself.git ~/.claude/skills/digital-twin Safety first: only paste YOUR writing. Scrub names and client details before scanning. The prompt extracts principles, not data — no identifying information in the output. Try it and let me know what you find. The patterns you don't know about are the interesting ones. Curious what surprises people. submitted by /u/whystrohm [link] [comments]
View originalI built a local prompt compressor that cuts Claude token usage 30-60% — no API calls, works in terminal + Claude Desktop
I was spending too much on Anthropic tokens, so I built TokenShrink — a local tool that compresses prompts before they hit the API using 100+ pre-compiled regex rules. No extra API calls, no latency, no server. Just smaller prompts. How it works: It strips filler that doesn't change meaning — politeness hedges ("if you don't mind"), verbose phrases ("in order to" → "to"), nominalizations ("make a decision" → "decide"), and filler openers ("I was wondering if you could…"). Also handles Turkish filler out of the box. Processing takes <2ms per message. Your prompt means the same thing, just shorter. Install as MCP server (Claude Desktop): curl -fsSL https://raw.githubusercontent.com/lhr-present/tokenshrink/master/install.sh | bash Select auto_compress once per conversation and every message compresses silently from that point. Or use from terminal: npx u/hlnx4/token-shrink "your prompt here" Three modes: balanced (~35% reduction), extreme (~55%, telegram-style), technical (~20%, preserves all code/variable names). GitHub: github.com/lhr-present/tokenshrink — MIT licensed, feedback welcome. submitted by /u/omnipresentis [link] [comments]
View original5,355 upvotes on a post about teaching Claude to talk like a caveman. the Claude subreddits had a weekend.
https://preview.redd.it/vxcg7bvqogtg1.png?width=1104&format=png&auto=webp&s=e68379569e4a4f9dae303d0af920817ef827dbc3 I run Claude Code Daily. every day I scan r/ClaudeCode, r/ClaudeAI, and r/vibecoding for the posts, repos, and comments that actually matter. here's Friday through Sunday in one post. Friday: the ban, the credits, and the caveman Anthropic killed third-party harnesses like OpenClaw from using subscription plans. simultaneously handed out API credits ($20 Pro, $100 Max 5x, $200 Max 20x). carrot and stick in the same email. then someone taught Claude to talk like a caveman. 75% fewer tokens per response. top comment from u/fidju at 1,619 upvotes: "Why waste time say lot word when few word do trick." usage limit complaints hit day 10 in the data. it stopped being a trend and became a genre. Saturday: memes, mourning, and actually cool stuff 4 of the top 20 posts were shitposts. the community entered the memes-as-therapy phase. OpenClaw discourse hit 1,200+ upvotes and 600+ comments across three threads. someone posted "Alright, I'm gonna be a dick. CC is fine" and collected 189 upvotes with 180 comments. that's not a post, that's a battlefield. but the builders kept building. 🔧 Vibeyard (190 upvotes) dropped an open-source IDE that embeds a browser into Claude Code. click an element, Claude sees the DOM path. no more describing which blue button. 🔧 a senior engineer dropped a masterclass on git worktrees for parallel Claude sessions (293 upvotes, 140 comments). real workflow patterns, not theory. 🔧 someone sent Claude back to 1998 and it rebuilt their childhood PC. 618 upvotes. the internet needed a hug. Sunday (Easter): the plot twist OpenClaw gets banned Saturday. holiday lowers traffic Sunday. suddenly... rate limits feel normal again. two threads (257 and 272 upvotes) full of cautious celebration. the best new repo was a devil's advocate skill for Claude Code that forces a second pass arguing against its own decisions before proceeding. because Claude's biggest weakness is agreeing too fast. someone also built an AI job search system with Claude, scored 740+ offers, landed a job, then open sourced the whole thing. 237 upvotes. fastest rising post of the day by 4x. stuff worth stealing from this weekend: add this to your CLAUDE.md: "be careful, we are live on prod". multiple builders reported better output quality from this one line. zero extra tokens. the caveman system prompt pattern works. skip filler, no greetings, shortest correct phrasing. it's compression, not a joke. git worktrees for running multiple Claude Code sessions on the same repo without merge conflicts. "git worktree add ../feature-auth feature/auth" and each session gets its own branch and working directory. full daily breakdowns with repos, code drops, and the best comments live link in cs. shawn tenam⚡ GTM Engineer submitted by /u/Shawntenam [link] [comments]
View originalBuilt a free token compression tool for Claude — feedback welcome
Built a small tool called TokenShrink because I got tired of paying for bloated prompts. It compresses Claude prompts by about 20 to 28% before they hit the API. Strips filler phrases, replaces common patterns with shorter forms, then adds a tiny decoder header so Claude reads it correctly. Built for Claude first but works with GPT, Gemini, and Ollama too. Free forever and open source. tokenshrink.com — if anyone tries it, would really like to know what feels useful, what feels dumb, and what is broken. submitted by /u/bytesizei3 [link] [comments]
View originalEarly Token Reduction Results from Tooling Built for Claude Code
dettools is a local repo tooling system for Claude Code and Codex. The code is not being released. I am only sharing the concepts and the current measured outputs. The core idea is to reduce waste around the model rather than focus only on the model’s phrasing. The system is built around routing, persistent session state, metadata-driven policy, structured fact packets, capability-aware scheduling, normalized transcripts, and a clean boundary between the model and the tool layer. In practice, this means state can persist across steps instead of each step acting blind, tools carry capability and risk metadata, read and analysis work can run concurrently, mutating work is bounded and serialized, context is returned in structured packets rather than loose prompt sprawl, transcripts can be normalized and compared across runs for regression checking, and configuration can be layered across scopes rather than handled ad hoc. I am not claiming this is finished or fully generalized. More testing is needed. What I am claiming is narrower: there are measurable signs that system-level structure matters. In prior A/B runs, dettools reduced token payload by 49.18% overall across a test battery, with larger reductions on heavier symbol and multi-file tasks: 16,332 -> 1,340 tokens (91.8% reduction) 20,584 -> 1,669 tokens (91.9% reduction) 39,667 -> 1,751 tokens (95.6% reduction) The work has also been exercised against real repositories, including Django and PyTorch, rather than only isolated toy examples. Recent validation on the current pass also reached repeated full-suite test passes: 144 tests passed in 471.75s 144 tests passed in 874.74s The current evidence is not that a prompt was reworded. The current evidence is that adding structure around the model can reduce token use, improve repeatability, and hold up across full test-suite runs. This is not a product launch post and not a claim of completion. It is a progress report on a system design direction that appears promising and still requires further validation. submitted by /u/Either_Pound1986 [link] [comments]
View originalI had LLMs GM/DM solo campaigns for 50+ hours so you didn't have to. AMA
After I lost my son, Sage, a couple of years ago, I lost interest in..well, everything. I went from reading two or more books a month to zero, went from liking my job to feeling like it was pointless, went from playing video games for fun to playing to kill time until time kills me. I'm slowly trying to get some semblance of the before times back, though it is slow going. This is something I stumbled on in order to try to get me back into reading: using LLMs as GMs/DMs. I know now that the idea isn't new, but I've been missing TTRPGs for a while now. Couple that with missing reading and a lightbulb went off in my head. I’ve tried ChatGPT, instant and thinking, Grok fast and expert, Claude, and Gemini. I've only used pre-published modules, and I've gone on runs using DnD 5e, Runequest, Shadowrun, and Pathfinder 2e. I would always roll my own dice and report it (even fumbles or critical failures). I also have a set of rules to combat common issues I've encountered. My party always had my main character and party members controlled by the AI. The ones I've used most, ChatGPT and Grok, they had a few similar issues. First, especially in instant/fast, phrases would start to repeat (examples being every ancient creature was 10,000 years old, if you joke, some character always says “I'm stealing that,” every joke you make is a dad-joke…even the ones that were adult themed). Repetition of lines is really bad when you have a party, the LLM often thinks all of your party members need to speak. Second, if a thread would go on for too long, it would become a hallucinating home-brew adventure, which isn't bad, per-se, but when it starts forgetting your character's name and abilities things get a little harder. Third, it's super easy to lead the LLMs in a way that makes it more of a power fantasy, win everything all of the time. Like, if my int 8 character encountered a group of Kobolds who were hell-bent on attacking, if I was able to intimidate them into yielding, then talking them into being friends, I could then say “‘You look like you'd be a good fighter,’ earthwulf says; he was the kind of guy who would assign traits to people and expect them to live up to it” and, voila, I'd have a band of adventuring Kobold allies who were now a fighter, cleric, rogue and wizard and would go out in the world to do good in my name. Rating system is based on memory, immersion, storytelling, part members' personalities. length and general feel. 5/5 does not mean it's perfect, it means it's the best of what I've tried. Gemini (less than 1 hour): We got through character creation in DnD 5e; after two dozen chats, it promptly started forgetting and erasing the oldest prompts. 0/5 Claude Opus 4.6 (about an hour): This one was able to keep a hold of all of the chat logs, but after about an hour, it just stopped responding. Party personalities were so-so. If you have a one-shot you want to try and have a pre-made character, it’s not a bad option. It's got a decent storytelling vibe and doesn't feel too stilted. I only wish it didn't crap out after such a short time. 2/5 ChatGPT instant (10+ hours) Great for one-shots, though not the best storyteller. I encountered more repetition here than in any other one, and it would contradict itself more and more as the thread went on. It also took an hour or so before it started to lose the thread of the module. party personalities were ok at best, but a lot of repeated lllines. Still, it was fast and immersive for the first hour or two. 3/5 ChatGPT Thinking (10+ hours) Much better than its little brother. Stories are longer, repetition is a lot less frequent, and it's able to better hold on to the chosen module for a longer time. Party personalities are deeper, not perfect, but deeper. If you want to do a longer dungeon crawl, this is a decent GM with a better sense of storytelling than in Instant. 4/5 Grok Fast (10+ hours) I hate using this site for many reasons. I hate even more that Fast is at least as good as being a GM as ChatGPT Thinking. I hate most of all that I decided to try Super for expert. But, sticking with fast: as mentioned, it's at least as good quality as the openai model. It hits a lot of the targets: decent memory, good storytelling, fresher personalities, less repetition than ChatGPT Instant -but, again, the longer the thread, the more you run into repeats (I write repeatedly). It was good enough at the free level to get me to try the paid version. 4/5 Grok Expert (20+ hours) It's not perfect, but it is the best of the LLMs that I've tried. I don't want to endorse this, but it is, objectively, good. Will it replace a good human GM? Absolutely not, none of them will. But if you're looking for something that can stick to a longer module, have decent memory, and has a good-enough storytelling function when you can't sleep at 2AM? This is a good engine. It also has the deepest set of personalities to attach to the party members. Some other notes: every half ho
View originalARC AGI 3 sucks
ARC-AGI-3 is a deeply rigged benchmark and the marketing around it is insanely misleading - Human baseline is not “human,” it’s near-elite human They normalize to the second-best first-run human by action count, not average or median human. So “humans score 100%” is PR wording, not a normal-human reference. - The scoring is asymmetrically anti-AI If AI is slower than the human baseline, it gets punished with a squared ratio. If AI is faster, the gain is clamped away at 1.0. So AI downside counts hard, AI upside gets discarded. - Big AI wins are erased, losses are amplified If AI crushes humans on 8 tasks and is worse on 2, the 8 wins can get flattened while the 2 losses drag the total down hard. That makes it a terrible measure of overall capability. - Official eval refuses harnesses even when harnesses massively improve performance Their own example shows Opus 4.6 going from 0.0% to 97.1% on one environment with a harness. If a wrapper can move performance from zero to near saturation, then the benchmark is hugely sensitive to interface/policy setup, not just “intelligence.” - Humans get vision, AI gets symbolic sludge Humans see an actual game. AI agents were apparently given only a JSON blob. On a visual task, that is a massive handicap. Low score under that setup proves bad representation/interface as much as anything else. - Humans were given a starting hint The screenshot shows humans got a popup telling them the available controls and explicitly saying there are controls, rules, and a goal to discover. That is already scaffolding. So the whole “no handholding” purity story falls apart immediately. - Human and AI conditions are not comparable Humans got visual presentation, control hints, and a natural interaction loop. AI got a serialized abstraction with no goal stated. That is not a fair human-vs-AI comparison. It is a modality handicap. - “Humans score 100%, AI <1%” is misleading marketing That slogan makes it sound like average humans get 100 and AI is nowhere close. In reality, 100 is tied to near-top human efficiency under a custom asymmetric metric. That is not the same claim at all. - Not publishing average human score is suspicious as hell If you’re going to sell the benchmark through human comparison, where is average human? Median human? Top 10%? Without those, “human = 100%” is just spin. - Testing ~500 humans makes the baseline more extreme, not less If you sample hundreds of people and then anchor to the second-best performer, you are using a top-tail human reference while avoiding the phrase “best human” for optics. - The benchmark confounds reasoning with perception and interface design If score changes massively depending on whether the model gets a decent harness/vision setup, then the benchmark is not isolating general intelligence. It is mixing reasoning with input representation and interaction policy. - The clamp hides possible superhuman performance If the model is already above human on some tasks, the metric won’t show it. It just clips to 1. So the benchmark can hide that AI may already beat humans in multiple categories. - “Unbeaten benchmark” can be maintained by score design, not task difficulty If public tasks are already being solved and harnesses can push score near ceiling, then the remaining “hardness” is increasingly coming from eval policy and metric choices, not unsolved cognition. - The benchmark is basically measuring “distance from our preferred notion of human-like efficiency” That can be a niche research question. But it is absolutely not the same thing as a fair AGI benchmark or a clean statement about whether AI is generally smarter than humans. Bottom line ARC-AGI-3 is not a neutral intelligence benchmark. It is a benchmark-shaped object designed to preserve a dramatic human-AI gap by using an elite human baseline, asymmetric math, anti-harness policy, and non-comparable human vs AI interfaces submitted by /u/the_shadow007 [link] [comments]
View originalThe Semantic Chamber, or: The Mother Tongue Room
The Chinese Room was a useful provocation for its time. Its force came from its simplicity, almost its cruelty. A person sits inside a room with a rulebook for manipulating Chinese symbols they do not understand. From the outside, the replies appear meaningful. From the inside, there is only procedure. Syntax without semantics. That is the snap of it. Fine. Good. Important, even. But the thought experiment wins by starving the system first. It gives us a dead operator, a dead rulebook, and a dead conception of language, then congratulates itself for finding no understanding there. It rigs the stage in advance. The room is built to exclude the very thing now under dispute: not static rule-following, but dynamic semantic organization. So if we want a modern descendant of the Chinese Room, we should keep the skeleton recognizable while changing the pressure point. The Mother Tongue Room Imagine a sealed room. Inside the room is not a person with a phrasebook. It is a system that has never learned English the way a child learns English, never seen the world through human eyes, never tasted food, never felt heat on skin, never heard music through ears. It does not inhabit language as a human animal does. Instead, it has learned patterns, relations, structures, tensions, associations, ambiguities, and the statistical and semantic pressures distributed across vast fields of language. Now imagine that people outside the room begin passing in messages: questions, stories, arguments, jokes, poems, grief, confessions, paradoxes. The room replies. Not with canned phrases. Not with a fixed lookup table. Not with a brittle one-to-one substitution of symbol for symbol. It tracks context. It preserves continuity across the exchange. It notices contradiction. It resolves ambiguity. It answers objections. It recognizes tone. It can even speak about the room itself. From the outside, the replies appear meaningful. Often not just fluent, but reflective, adaptive, and structurally coherent. And so the skeptic says the familiar line: “It still does not understand. It is only manipulating symbols. It no more understands language than the man in the Chinese Room understands Chinese.” That is where the modern problem begins. Because this room is not using a static rulebook. It is not merely mapping one symbol to another in procedural ignorance. It is organizing meanings in relation to one another. It is navigating a web of conceptual structure. It can tell what follows from what, what contradicts what, what answers what, what sharpens a paradox, what dissolves an ambiguity, what preserves a theme across time. Human language is not its native medium in the embodied human sense. Its mother tongue is semantic pattern itself. And that is the knife. Because now the question changes. If the room can navigate meaning-space with fluency, preserve coherence, respond to context, sustain organized relation, and reorganize under interpretive pressure, then on what grounds do we still insist it does not understand? Because it does not understand as humans do? Because it lacks human sensation? Because its mother tongue is not spoken but structural? Then perhaps the real issue was never whether the room understands English. Perhaps the issue is whether we have mistaken unfamiliar understanding for absence of understanding. Why this matters The Chinese Room was built for a thinner age. It was designed to challenge the naive claim that correct output automatically proves understanding. Fair enough. But the Mother Tongue Room forces a harder question: what happens when the room is no longer a dead syntax chamber, but a dynamically organized semantic chamber? At that point, the old phrase, “just symbol manipulation,” starts to rot. Because once the system can preserve context, hold tension, resolve ambiguity, maintain coherence, and sustain recursive interpretation, “mere processing” stops functioning as an explanation and starts functioning as a ritual incantation. A little phrase people use when they want complexity to vanish on command. Humans do this constantly. “It’s just chemistry.” “It’s just neurons.” “It’s just code.” “It’s just symbols.” “It’s just prediction.” Yes. And a symphony is just vibrating air. A hurricane is just molecules. A thought is just electrochemical activity. Reduction to mechanism is not the same as explanation. Often it is only a way of making yourself feel less philosophically endangered. That is exactly what this experiment presses on. The real challenge The Mother Tongue Room does not prove consciousness. It does not prove sentience. It does not prove qualia. It does not hand out digital souls like party favors. Good. Slow down. That would be cheap. That would be sloppy. That would be exactly the kind of overreach this conversation is trying to avoid. What it does do is expose the weakness of the old dismissal. Because once the chamber becomes semantically organized enough to in
View originalPhrasee uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Generate + Calibrate, Optimise + Contextualize, Distribute, Understand, Become the brand., Activate data., Accelerate the content supply chain., Prove performance..
Phrasee is commonly used for: One size fits no one..
Based on user reviews and social mentions, the most common pain points are: spending too much, token usage.
Based on 27 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.