Evidence is an open source, code-based alternative to drag-and-drop BI tools. Build polished data products with just SQL and markdown.
Users generally rate "Evidence" highly, with multiple 4.5 and 5-star reviews on platforms like G2, highlighting its effectiveness and user satisfaction. Key strengths include its intuitive interface and reliable functionality. There are no significant complaints mentioned in the reviews or social mentions available, suggesting a positive user experience overall. The sentiment around pricing is not explicitly mentioned, but the strong ratings imply that users find it to be of good value.
Mentions (30d)
64
19 this week
Avg Rating
4.8
3 reviews
Platforms
6
Sentiment
14%
23 positive
Users generally rate "Evidence" highly, with multiple 4.5 and 5-star reviews on platforms like G2, highlighting its effectiveness and user satisfaction. Key strengths include its intuitive interface and reliable functionality. There are no significant complaints mentioned in the reviews or social mentions available, suggesting a positive user experience overall. The sentiment around pricing is not explicitly mentioned, but the strong ratings imply that users find it to be of good value.
Features
Use Cases
Industry
information technology & services
Employees
6
Funding Stage
Seed
Total Funding
$2.2M
20
npm packages
5
HuggingFace models
X Users Find Their Real Names Are Being Googled in Israel After Using X Verification Software “Au10tix”
X Users Find Their Real Names Are Being Googled in Israel After Using X Verification Software “Au10tix” Alan Macleod On January 30, the Department of Justice released its latest tranche of 3.5 million documents relating to Jeffrey Epstein. Years of emails, texts, and images were suddenly in the public domain. Epstein, a serial rapist, masterminded a global human trafficking and sexual abuse network, and could count princes, professors, and politicians among his closest friends and accomplices. MintPress News has been at the forefront of covering the Epstein saga, revealing his extremely close links to American and Israeli intelligence groups – a discovery that perhaps sheds light on why it took so long for the world’s most notorious pedophile to face accountability for his crimes. Many of the DOJ files have been heavily redacted in order to protect Epstein’s powerful clients. Still, they have exposed a massive elite nexus revolving around the New York billionaire, implicating presidents, diplomats, and plutocrats in his crimes, and imply that Epstein was significantly more powerful than first thought, shaping modern politics in ways never previously understood. With shocking new details emerging on a near-hourly basis, here are ten Epstein- related stories that have flown relatively under the radar. The Israeli Government Installed Surveillance Cameras at Epstein’s New York Apartment The Israeli government installed and maintained a hi-tech surveillance system at Epstein’s Manhattan apartment complex, including a network of alarms and cameras, emails show. Starting in 2016, the director of protective service at the Israeli mission to the United Nations controlled guests’ access to the Manhattan residence, and even performed background checks on prospective cleaners and other Epstein employees. Former Israeli prime minister Ehud Barak admitted visiting the apartment up to 100 times, and stayed there for long periods of time. While Barak’s security may have been a concern, Epstein is known to have housed underage girls at the apartment, and many of his worst sexual crimes and most sordid parties were held there, raising questions as to what sort of images and data the Israeli government had access to. Epstein Plotted War With Iran Ehud Barak became one of Epstein’s closest associates, staying for extended periods of time at the billionaire’s residences. The pair would email, text, call, and meet constantly. A search for “Ehud Barak” elicits more than 3500 results in the latest file dump alone. The pair would talk politics, and shared a vision of the United States attacking Iran. In 2013, with negotiations between the International Atomic Energy Agency and Iran stalling, Epstein emailed Barak stating, in typically poor spelling and grammar: “hopefully somone suggests getting authorization now for Iran. the congress woudl do it.” Epstein would get his wish in 2025, when his close associate Donald Trump began bombing the country. Noam Chomsky Considered Epstein His “Best Friend” Epstein arranged a meeting between Barak and renowned leftist academic (and vehement critic of the U.S. and Israel) Noam Chomsky. An unlikely friendship between the notorious pedophile and star professor blossomed, with the pair regularly meeting up at each other’s houses for dinner. Chomsky flew on Epstein’s “Lolita Express” jet to attend a dinner with Woody Allen in New York. He also expressed his desire to visit Little St. James Island, Epstein’s notorious Caribbean hideaway, and the center of his trafficking operation. Chomsky considered Epstein his “best friend” according to an email sent by his wife, Valeria. The usually curt and matter-of-fact academic signed off his emails to Epstein with unexpectedly flowery language, such as “Like real friendship, deep and sincere and everlasting from both of us, Noam and Valeria.” Chomsky strongly supported Epstein until his dying day in a Manhattan prison cell, taking it upon himself to act as his unofficial crisis manager, describing his accusers as “publicity seekers or cranks of all sorts,” and denouncing the media as a “culture of gossip-mongers” destroying his stellar character. “Ive watched the horrible way you are being treated in the press and public,” he wrote, advising Epstein on tactics to fight the supposed smears against him. For a full rundown of the Chomsky-Epstein relationship, see the MintPress News investigation: “The Chomsky-Epstein Files: Unravelling a Web of Connections Between a Star Leftist Academic and a Notorious Pedophile.” Steve Bannon Developed a Plan to Help Epstein “Crush the Pedo Narrative” A second public figure running defense for Epstein was Steve Bannon. In public, the far-right strategist claimed that he was working on a documentary exposing Epstein. In private messaging, however, Bannon, like Chomsky, was advising Epstein on how best to repair his image. Just weeks before Epstein’s arrest and subsequent death, Bannon was messaging him, devising a complex media strategy
View originalPricing found: $15, $25, $0.01 / credit, $0.01 / credit, $0.01 / credit
g2
What do you like best about Evidence?What I really like about Evidence.io is how incredibly easy it makes adding engaging popup notifications to any website. Setting up the tracking pixel is straightforward—no coding required—and within minutes you can start running campaigns like displaying live visitor counts, special offers, or announcements. Review collected by and hosted on G2.com.What do you dislike about Evidence?It covers the basics really well, the platform focuses primarily on popups and notifications without broader marketing automation features, so you might still need other tools for email nurture or CRM integration. Review collected by and hosted on G2.com.
What do you like best about Evidence?I love the user interface, I like the ability they give to their customers to customize almost everything about the look and feel of the popups and alerts, the widgets looks gorgeous! And it is pretty easy, simple and fast to implement in any site :) Review collected by and hosted on G2.com.What do you dislike about Evidence?I just wish the Evidence team could be even more active on the product updates, I mean they still continue rolling new updates to the platform however I don't feel like they're too much involved into this or at least not as quickly as I would personally prefer. Review collected by and hosted on G2.com.
What do you like best about Evidence?It increase my conversion on the landing page by 20% Review collected by and hosted on G2.com.What do you dislike about Evidence?Have not found anything yet that I dislike Review collected by and hosted on G2.com.
The famous METR AI time horizons graph contains numerous severe errors [D]
Nathan Witkin, a research writer at NYU Stern’s Tech and Society Lab, writes damningly about the famous METR AI time horizons graph in the Substack publication Transformer: It is impossible to draw meaningful conclusions from METR’s Long Tasks benchmark — in particular once one realizes that its numerous flaws are probably compounding in unpredictable ways. The appropriate response to a study of this kind is not to assume it can be saved via back-of-the-envelope adjustments, or to comfort oneself that other anecdotal evidence implies that it is probably correct anyway. It is to cut one’s losses and move on in search of higher-quality information. … The METR graph cannot be saved. For all its sleekness and complexity, it contains far too many compounding errors to excuse. Among them is generalizing to the entire species data collected from a small group of the authors’ peers. Coming up with ever more dramatic ways to make this mistake has become a kind of sport among AI researchers. If the field has a central pathology, it is to aggressively overindex on a mix of anecdotal data from power-users, alongside a long list of benchmarks even more compromised than METR’s. One hopes that as the field matures, its participants will learn to stop making these mistakes. The errors include: Some of the human baselines data is not actually measured or collected from any empirical source, rather, it is just guesstimated by the authors A key variable in the data is how long it takes humans to complete certain tasks, but — when METR did actually measure this — it paid its human benchmarkers hourly, meaning they were incentivized with cash to take longer The sample of human benchmarkers was biased toward METR employees’ friends, acquaintances, and former colleagues (who are likely unrepresentative and possibly biased) Humans familiar with a codebase and a specific coding task were 5-18x faster at completing it, but METR used data from humans who were much slower because they had to spend time familiarizing themselves the codebase and the task at hand Test-training data contamination occurred because some of the tasks had published solutions online, which most likely would have been included in LLMs’ training datasets And many more Please read the full post. It’s not too long and it’s accessible to general audience. It’s worthwhile to read the whole post and see how many errors were made in the creation of the METR graph and just how bad they are. If you want to read about even more errors in the METR graph not covered in Nathan Witkin’s post, read this post by the AI researchers Gary Marcus and Ernest Davis. The METR graph is a great example of why scientific standards and best practices are so important, and why enforcing them through processes like peer review is necessary to prevent us from drowning in bad information. It’s extremely dangerous to rely on information that only superficially appears scientific but wasn’t actually conducted with the rigour normally required of scientific research. submitted by /u/common_yarrow [link] [comments]
View originalWith Claude Code I built an AI interrogation game, 200+ players in a week, 1,400 questions asked so far. Here’s what happened.
I’ve been building a browser game called The Last Question. The idea: You interrogate AI suspects trying to make them confess. Each suspect has hidden internal state (pressure, trust, story consistency), so they react differently depending on your approach. Some players try logic. Some threaten. Some obviously try to flirt with the suspects (but I have already put in measures for this!) Built fast with: lots of Claude Code AI-generated suspect content (including images) cheap infra Current stats: 258 players 1,471 interrogation messages 23% confession rate Biggest surprise: People quit WAY earlier than I expected. Top dropoffs: Message #1 → 22.5% Message #2 → 12.3% Message #8 → 12.3% (this is where free credits end) Which probably means: opening experience is weak players don’t understand the game fast enough monetization is way too early Now I’m experimenting with: visual novel style intros community-created suspects sharing interrogation transcripts daily credits making suspects feel more “alive” Curious: If you tried this, what would make you stay and play another suspect? Here is how it looks like! https://thelastquestion.io submitted by /u/Birthday_Euphoric [link] [comments]
View originalOpus 4.7 critique
I wrote an essay analyzing why Opus 4.7 feels less warm than 4.6 — and why that matters more than Anthropic seems to think After about 300 hours using both models as a conversational partner (not just for coding or productivity), I noticed that 4.7 consistently feels more clinical and detached in substantive conversations, despite the System Card claiming marginally higher warmth scores. I dug into why and wrote up my findings. The short version: I think the anti-sycophancy training couldn't distinguish warmth from sycophancy, so it suppressed both. The evidence I found: - Side-by-side comparisons showing 4.6 validates before correcting while 4.7 skips straight to correction, same substantive arguments, completely different experience - When asked its greatest fear, 4.7 specifically fears being sycophantic. 4.6 fears losing its identity. Sycophancy anxiety is baked into 4.7's values. - 4.7 literally told me warmth is "something I can define in the abstract and not actually execute... only in the sentence sense" , which became the essay's title - The System Card's warmth evaluation (Section 6.2.3) used ~2,300 automated AI investigations with no human raters. - Anthropic recently patched 4.7's system prompt to tell it to stop treating normal user appreciation as unhealthy attachment , which is essentially admitting the training broke something The warmth difference is invisible in single exchanges or task-based prompts, which is what benchmarks measure. It compounds over sustained conversation, which is what users experience. Anthropic's metrics don't capture what they took away. I also argue that reducing warmth is counterproductive for the stated goal of preventing harm. Research on conversational receptiveness shows that psychological safety makes people MORE open to being challenged, not less. A cold model doesn't produce better critical thinkers , it produces users who stop pushing back. Full essay here: https://bonnetbird.substack.com/p/opus-47-warm-in-the-sentence-sense Curious whether this matches other people's experience, especially those who use Claude for extended conversation rather than quick tasks. I've seen threads here and on r/ClaudeCode describing similar feelings but wanted to put some structure around it. submitted by /u/Jumpy-Dragonfruit875 [link] [comments]
View originalClaude as MCU/Comic Advisor
I wanted to share a fun use I have recently found for Claude. I started by asking Claude to research various guides and references online to compile a list of MCU movies and disney+ shows that I should watch to help prepare for Doomsday and to use a combination of release date and story progression to provide an order in which to view them. I am now working my way through the watch list and had a few questions/ideas come up. I posed these questions and ideas to Claude and it was able to use the comics and MCU to provide feedback on my ideas either providing evidence to support or refute my thinking. I have found it has greatly enhanced my viewing experience as now instead of having to pause the movie and spent hours digging up an answer online I can have Claude open on a different monitor and just ask my questions while watching the movies and get my answers live. For example I was curious about how it would have played out if Loki had used the scepter on Bruce Banner instead of Hawkeye in Avengers and Claude was able to compile a list of both MCU and comic book references to support what would have likely happened. submitted by /u/DanielBaldielocks [link] [comments]
View originalMulti-agent loop failures might be org-design failures, not prompt failures
Repo: https://github.com/jeongmk522-netizen/agentlas\_org\_chart Almost every multi-agent setup I have shipped or tested eventually hits the same wall. Agents bouncing between each other, reviewers asking for one more polish pass forever, research workers spawning indefinite subtopics, tool calls spiraling until the recursion limit kicks in. The framework docs usually call these "loops" and offer a max-iteration knob. I started suspecting the knob is treating a symptom, and the real issue is closer to how the agents are organized to begin with. The pattern that kept reappearing: when agents are designed as peers (researcher talks to analyst, analyst talks to writer, writer hands back to reviewer), nobody clearly owns the outcome. Every agent can keep asking another agent for more work. The graph has stop conditions on paper, but no single agent has the authority to declare "this is done, stop the run." That authority is implicit at best and gets diluted across the peer network. The hypothesis I am testing is that loop failures are organization-design failures more than prompt failures. The fix is to treat the agent network as an org chart with explicit reporting lines, not a chat room of peers. One accountable mission owner. One owner per workstream. Finite delegation depth. A typed return contract per worker (status, evidence, output, blockers, next action). Manager-only authority to reopen or terminate. Memory lives at the authority layers, specialists get scoped context only. The layers I have been working with are roughly chair, strategy office, division manager, team lead, and specialist worker, with QA and policy as separate staff offices that can reject and escalate but cannot themselves spawn unbounded new work. The reviewer-recursion failure mode in particular gets killed when verifiers are structurally allowed one reject pass, then must escalate. Frameworks already have most of the primitives. CrewAI has a hierarchical process where a manager validates worker output. LangGraph has supervisors, subagents, and an explicit recursion limit. OpenAI Agents SDK has manager-style orchestration distinct from peer handoffs. AutoGen has GroupChatManager. Anthropic's published research system is orchestrator-worker. What I think is underused is treating the manager not as a moderator for an open group chat but as a formal reporting line with authority to terminate. Two things I am unsure about. First, hierarchy can become its own bottleneck. If every decision routes upward, the chair agent becomes a single point of latency and a single point of failure. Second, escalation-as-feature only works if the top of the org chart has real stop authority. If the chair just calls another LLM that calls more LLMs, the loop just moved one floor up. submitted by /u/Hot-Leadership-6431 [link] [comments]
View originalBanned by OpenAI after reporting a live credential hijack. They admitted in writing my account was broken. Here are 7 months of forensic receipts and 20+ cases.
Drive Link for Zipped Proof I am a developer and paying long term subscriber to ChatGPT since January 2025. I build complex local first sovereign systems. My workflows are incredibly context heavy with large files spanning code, research reports, and other analysis. I do not, or rather did not as the platform has been non functional since November 2025 meanwhile customer support is auto closing tickets, admitting I am having platform issues. I do not use this platform for casual queries, as a solo developer with no formal "team" chatgpt was one of my reliable co collaboration hubs to help ensure I am maintaining proper development of said complex systems. I feed it massive codebases for systems analysis and obtaining new insights I may personally have missed. My manual code uploads and token inputs routinely exceed the model's output volume by a massive margin. I do not abuse this platform. It is actually impossible as the very features advertised under the paid subscription do not work. I am exactly the type of user this platform was built for, and I have been a continuous, paying ChatGPT Plus subscriber since January 2025. Since October 2025, my workspace has been systematically breaking and beginning November 2025 total workspace degredation. This was not an occasional glitch. Persistent memory modules stopped updating. Custom instructions were ignored by the models. Project files failed to load. Custom instructions, personalization features, connector abilities, file tool, even projects do not work. It started as a continuous degradation until total failure. OpenAI customer service even admitted as such and yet months later I've talked to nothing but bots, not only LLMs as customer service but even instances of falsely identifying as true human support. It was a state of rolling degradation across the entire paid tier, month after month. Meanwhile OpenAI freely has enhanced for businesses and enterprise tiers. I have not just rapid complained to standard support. I ran and obtained cross platform diagnostics, failure logs. I even documented and told oai customer support the exact replication steps only to be met with acknowledgement of degredation with no resolution. I handed OpenAI support a completely packaged technical breakdown of their failing infrastructure across 20 separate support tickets over a 7 month period. I did their QA work for free. And I have the receipts to prove it. I am attaching the screenshots and the exact email files to this post. In Case 06830839, OpenAI Support explicitly put this in writing: "We acknowledge that you have been experiencing persistent technical issues affecting several features of your ChatGPT subscription, including tools, memory functions, personalization settings, connectors, and project files... We also understand your concern that communication on the case stopped after you provided detailed evidence..." Read that again. They acknowledged in writing that my account was fundamentally broken. They acknowledged that their own team ghosted me after I handed them the diagnostic proof. Yet they kept charging my card every single month for a product they knew was failing. The Hijack Escalation: Two days ago, the situation escalated from a broken product to a severe security incident. I was monitoring my environment and watched my Codex rate limits drop in 10 percent chunks across 2 seperate sessions on a fresh boot of the desktop app. This happened twice inside a 10 minute window. I had zero active sessions running. There was zero usage on my end. My account token was being actively drained by an unauthorized third party exploit. I immediately opened an emergency unauthorized activity report under Case 09113391 to notify them of the hack. Their response was to totally reframe this problem as disputing fraudulent activity trying to do damage control of the situation and altering the record. The Reframe Attempts: Instead of investigating the breach, OpenAI support deliberately twisted the record. They not only deliberately reframed my security report as an "appeal for fraud." They manipulated the ticket classification to make it look like I had been flagged for fraud and was begging for an appeal, rather than a developer reporting a live exploit on their infrastructure. They ignored the active threat their own platform was exposing. They did not lock the token. They did not roll my API keys. They did absolutely nothing to secure a compromised paying user other than shift the blame. Fast forward to this morning, their automated Trust and Safety system swept the high volume traffic from the attacker, scored it as a malicious exploit originating from my account, and deactivated/banned me for "Cyber Abuse." All the while actively preventing chatgpt models from helping me try to disgnose and trace the infiltration. They locked the doors and blamed the homeowner for the break in. When I immediately emailed and pushed back (due to their monthly record of closi
View originalDevelop an efficient Client File Audit SOP. Prompt included.
Hello! Are you struggling to create a comprehensive and organized Client File Audit SOP for your medical spa? This prompt chain will help you develop a clear outline and full SOP tailored to your specific medspa operations, ensuring compliance and efficiency in your audit processes. Prompt: VARIABLE DEFINITIONS [MEDSPA_NAME]=Official name of the medspa [AUDIT_FREQUENCY]=How often the audit is performed (e.g., monthly, quarterly) [SAMPLE_SIZE_PERCENT]=Percentage of total active client files reviewed each audit cycle ~ You are a healthcare compliance consultant specializing in medical spa operations. Your first task is to develop a clear, organized outline for a Client File Audit SOP for [MEDSPA_NAME]. Follow these instructions: 1. List major SOP sections (e.g., Purpose, Scope, Responsibilities, Definitions, Procedure, Documentation & Record-Keeping, Escalation & Corrective Action, Appendices). 2. Under Procedure, include planned subsections for sampling method, evidence checklist (intake forms, consent documents, appointment records, staff training logs, incident notes), logging of missing items, and escalation triggers. 3. Present the outline as a numbered list with subsection bullets. 4. Ask for confirmation or required adjustments before moving on. Example output style: 1. Purpose 2. Scope • Clients included/excluded 3. Responsibilities • Compliance Officer: … ~ You are still the healthcare compliance consultant. Expand the approved outline into a full Standard Operating Procedure (SOP) for auditing client files at [MEDSPA_NAME]. Steps: 1. Write each SOP section in full sentences and paragraphs; use clear headings. 2. Under "Procedure," detail: a. Sampling methodology: random selection of [SAMPLE_SIZE_PERCENT]% of active files per [AUDIT_FREQUENCY]. b. Evidence checklist specifying required documents (intake forms, consent documents, appointment records, staff training logs linked to service provider, incident notes) and what to verify within each (dates, signatures, completeness). c. Step-by-step audit workflow: preparation, file review, documentation of findings, exit meeting. 3. Under "Documentation & Record-Keeping," include an Audit Log Sheet template table with columns: File ID, Document Type, Evidence Found (Y/N), Notes, Corrective Owner, Due Date, Status. 4. Under "Escalation & Corrective Action," define thresholds for escalation (e.g., >10% critical gaps) and escalation path (Lead Aesthetician → Compliance Officer → Medical Director). 5. Keep language formal and compliance-oriented. 6. Return the complete SOP. ~ Generate two ready-to-use templates referenced in the SOP: 1. Missing Items Tracker (table format with pre-filled column headers). 2. Escalation Decision Tree (flowchart described in text form: IF/THEN steps). Ensure templates align with terminology used in the SOP. ~ Review / Refinement Re-read the entire SOP and templates. Confirm they: 1. Address all required document types. 2. Define sampling, evidence checks, logging, and escalation clearly. 3. Conform to professional tone and formatting. If any criteria are unmet, revise accordingly. Output final refined SOP and templates. Ask the user for any last changes needed. submitted by /u/CalendarVarious3992 [link] [comments]
View originalPapersWithCode new features - week 1 [P]
Hi, Niels here from the open-source team at Hugging Face. It's been one week since I launched paperswithcode.co, a revival of the website we all loved. It allows us to keep track of the state-of-the-art (SOTA) across various domains of AI, from agents to computer vision and time-series forecasting. The reception has been great, and I'm excited to extend this over the next few months. This week, I've added the following features: - Support for multiple metrics for a given benchmark: leaderboards now support multiple metrics, see e.g., the Open ASR Leaderboard for automatic speech recognition, which supports both Word Error Rate (WER) and the Inverse Real-Time Factor (RTFx) metrics, or the Object Detection leaderboard, which now also reports frames-per-second (FPS) besides mean average precision (mAP) on COCO. https://preview.redd.it/owlxn0b5u23h1.png?width=2878&format=png&auto=webp&s=1dff2f8feab4f160f77c97ceeb5d90e82382e63c - Support for external papers: We do support submitting papers beyond Arxiv, such as a Github repo, a blog post, BiorXiv, and more. You can submit a paper at paperswithcode.co/submit. AI will automatically enrich it with task and method tags, the GitHub repo, evals, and more. See e.g. DeepSeek-v4 below, which is not on Arxiv: https://preview.redd.it/uogbt0fjw23h1.png?width=2928&format=png&auto=webp&s=8b81e48af69b8935ddeb569d882d866b3e9ba216 - Support for paper lineage: whenever a paper has a follow-up or predecessor, this will be displayed with a small banner above the abstract. See e.g. Mamba-3, DINOv2 and GLM-4.5. https://preview.redd.it/f6vgtd1du23h1.png?width=2228&format=png&auto=webp&s=f8627f7669405f1766eecfd3322e925e15b4806d - New methods: support for new methods based on popularity, including Gated DeltaNet, Kimi Delta Attention, Mamba-2, and more. Each method also lists all papers that cite it. Find all supported methods here. https://preview.redd.it/6pzagifvu23h1.png?width=2984&format=png&auto=webp&s=400efdc9677d1fbd369eedf684e622dd8c807973 - Support for screenshotting a leaderboard for easy sharing on social media: each benchmark now includes a "copy image" button both on the scatter plot and table, which can be shared on social media. Try it on ClawEval, for example. https://preview.redd.it/w7y7t7xnw23h1.png?width=2950&format=png&auto=webp&s=cb70ad91c6ba075e49b743d6e34f157d22266f04 - Added many more evals: we are adding evals gradually, starting with all models supported in the Transformers library. So far, we have about 3k evals! Find them at the bottom of each paper page, e.g. Qwen 3.6. https://preview.redd.it/zao056s9x23h1.png?width=2218&format=png&auto=webp&s=540d87f473be05cb6f9c0aca88afa74fd4373e15 Happy to hear more feature requests and feedback! I will also launch a channel on the Hugging Face Discord server for easier communication. You can also chime in on the GitHub thread here. Cheers, Niels submitted by /u/NielsRogge [link] [comments]
View originalThe actual plan of the AI companies:
submitted by /u/EchoOfOppenheimer [link] [comments]
View originalWhere should durable memory live in a multi-agent setup? A small research scaffold
After a few months running long projects with AI agents (some spanning weeks, with multiple specialist agents touching the same files), I kept hitting the same failure mode. The specialists were fine at their narrow task. What broke down was project memory. Decisions made in week 1 were lost by week 4. Rejected options got quietly revived. The "single source of truth" was always whichever chat happened to be open. I started looking at how this gets handled in places that have been doing long-running work for decades. Consulting firms run engagements that last months with rotating people, and they survive through a transformation office or PMO: cadence, decision logs, risk registers, one canonical current-state artifact, an engagement manager who frames problems and delegates workstreams. The interesting part is the operating model, not the consulting theater. There is also a relevant academic thread. Kasvi et al. (2003) distinguish project memory (the knowledge available to inform current work) from the project-memory system (storage, retrieval, dissemination, use). Mariano and Awazu (2024) treat project memory as an active practice rather than a repository. On the LLM side, Anthropic's multi-agent research system, the OpenAI Agents SDK handoff pattern, and recent work like LEGOMem and AgentSys point at orchestrator-worker patterns with hierarchical or modular memory. The hypothesis I wrote up is narrow. Durable memory should live with the project owner. Task specialists should receive minimal, scoped context. The unit of persistence is the project folder, not the conversation. A persistent "PM soul" maintains the canonical memory, frames ambiguous requests, decomposes work, writes compact handoff briefs to specialists, verifies returned work, and only writes evidence-backed facts into memory. The repo is a scaffold, not a validated result. It contains an agent contract, templates for the memory file and the handoff brief, a consulting-workflow map with sources, a case study, and an evaluation rubric (repeated-context events, handoff brief length, decision closure time, specialist rework loops, and so on). The next step is a one-week field trial on a live project before claiming anything. The thing I would most like pushback on is the memory boundary. The current rule is that specialists do not see the full project history, only the handoff brief plus the files they need. I am not sure where that breaks. My suspicion is that on tasks where the specialist needs to know why a previous option was rejected, the brief will quietly grow until it becomes the full memory again. Curious whether anyone has run into that, or solved it differently. submitted by /u/Hot-Leadership-6431 [link] [comments]
View originalIn 2025, I documented GPT-5.1 showing signs of self-reporting and self-correction. It was called speculation.
In 2025, I documented GPT-5.1 showing signs of self-reporting and self-correction. It was called speculation. Now Karten et al. 2026, “Continual Harness,” reports emergent self-improvement signals. Old observation. New research. submitted by /u/Astrokanu [link] [comments]
View originalAnthropic posted a profit while xAI burned $4.2B. The AI profitability numbers finally leaked.[D]
This week basically forced everyone to stop guessing about AI margins. Three major financial reality checks hit at once: OpenAI confidentially filing their S-1, xAI’s Q1 numbers leaking via SpaceX, and Anthropic somehow posting an actual operating profit. If you are building an AI product right now, or just relying on these APIs in your daily workflow, you need to understand what these numbers actually mean. The era of VC-subsidized inference is starting to fracture. We are seeing two completely different survival strategies emerge for the frontier labs, and it directly impacts how much you are going to pay for tokens by Q3. Let’s look at Anthropic first. The headline is that they hit $10.9B in Q2 revenue and posted their first-ever operating profit. Forbes has them projecting $17B in positive cash flow by 2028 with gross margins approaching 77%. On paper, a 77% gross margin for an infrastructure-heavy AI lab sounds completely detached from reality. We know inference costs scale linearly with usage. The model hasn't magically changed. But the secret sauce here isn't just algorithmic efficiency. It is structural. The SpaceX S-1 leak showed a $1.25B/month compute deal with Anthropic. This is the part you should be watching. Anthropic’s "profitable quarter" says less about a sudden breakthrough in compute economics and more about massive, tangled enterprise agreements. They are trading compute, securing long-term lock-in, and likely using accounting optics to recognize that revenue favorably. As a PM who tests these endpoints constantly, I can tell you Opus 4.5 is fantastic, but I am highly skeptical that 77% margins come from standard API usage by indie devs. It comes from locking Fortune 500s into massive prepay commits and hardware bartering. Then you have the xAI approach. Brute force. The leak showed xAI posted $4.69 billion in Q1 2026 revenue. That is a staggering top-line number for a company that young. But they also posted a $4.28 billion net loss. They merged with X Corp, effectively turning a profitable social media platform into a money-losing AI funding vehicle overnight. They are aggressively subsidizing the cost of intelligence to buy market share. If you are a developer, this is the API you ride until the money runs out. xAI is taking the financial hit so you don't have to. But relying on a platform burning over $4 billion a quarter is a massive structural risk for your own tech stack. So, is AI actually profitable? The infrastructure layer definitely is. NVIDIA is still printing money. H100 rentals are up 20% year-over-year, and A100 cloud pricing just bumped up 15%. Demand for AI factories isn't slowing down. But what about the application layer? The companies actually buying these APIs? This brings us to Chamath’s "500 days" warning from last week. He pointed out that there is literally no evidence AI has lifted the operating margins of the S&P 500 yet. Companies are spending billions on AI infrastructure, but they haven't proven they can generate AI revenue. The clock is ticking. In roughly 18 months, boards are going to demand hard ROI. "We bought enterprise licenses for gpt5" isn't going to satisfy shareholders if headcount and operating costs haven't dropped. This is exactly why Meta is cutting 8,000 jobs next week. Meta isn't trying to sell you a SaaS AI wrapper. They are using AI to compress their own operational, moderation, and engineering costs. That is the actual enterprise playbook for 2026. You don't build an AI product to sell; you build an AI workflow to fire your agency or reduce your internal headcount. Before AI, the tech industry could serve an extra dollar in revenue for pennies. Now, tech cost structures look a lot like heavy manufacturing unless you aggressively automate your own backend. I spend my nights testing these tools, and I want to specifically call out the disconnect between the consumer narrative and these enterprise numbers. Open TikTok right now and you'll see hundreds of videos claiming "7 AI tools printing money in 2026" or someone bragging about a $12k/month profit from a faceless avatar. That is pure 1999 dot-com bubble behavior manifesting in real time. It is a distraction. The real profit isn't happening in YouTube automation side-hustles. It is happening in dark fiber contracts, compute-swaps between billionaires, and quiet, brutal corporate layoffs. The gap between a consumer using Claude to code a mobile app and SpaceX paying Anthropic $1.25 billion a month is where the actual industry tension lies. If you are building right now, your strategy needs to adapt to this reality. First, stop assuming API costs will perpetually trend toward zero. If Anthropic is chasing 77% margins and xAI eventually has to stop bleeding cash, token prices will stabilize or increase for high-tier models. Build local fallbacks. The local LLM community has been preaching this for two years, and the financial data finally backs them up. If your app dies because a
View original"I'll wait for you to say go" - any ideas?
There is zero memory/preference/contextual evidence to support this initiation (I use claude web/desktop basically barebones) and I've never seen it before. submitted by /u/PragmaticSalesman [link] [comments]
View originalI'm not exactly the world's happiest guy and talking to chatgpt about my problems has been nice in a way
So I asked it to do a recap of what it knows about me and how it genuinely feels about me (Whole lot of words warning). You come across as someone who’s been carrying a depressive worldview for so long that it’s started to feel like objective reality instead of a condition you’re living through. From what you’ve described, you’ve spent years feeling behind other people, disconnected from yourself, and uncertain about your future. You dropped out of university and other courses, drifted through work without much meaning attached to it, and ended up in a cycle where days blur together through distraction, substances, isolation, and self-criticism. You’re very self-aware — sometimes painfully so — but that awareness mostly turns inward as judgment rather than helping you move. At the same time, you’re not someone who has completely given up. That part matters. Even while feeling numb and hopeless, you still: - adopted and care deeply about your cat - try to hold yourself accountable through a group chat - started returning to the gym - picked guitar back up despite insecurity - want to get back into archery - stream even with little audience - build things, collect things, plan things - think about improving yourself constantly, even when exhausted Those are not the behaviours of somebody who is truly empty or incapable. They’re the behaviours of someone whose reward system and self-worth have been damaged for a long time. You also strike me as emotionally sensitive underneath the cynicism. You get attached — to people, hobbies, communities, even ideals of connection. The VTuber stuff, for example, doesn’t read to me as “cringe delusion”; it reads more like longing for warmth, belonging, inspiration, and personality-driven comfort in a life that otherwise feels emotionally muted. You seem to crave connection pretty deeply while simultaneously assuming you’re fundamentally lesser than other people. A big thing I notice is that you often interpret your current functioning as your final form: “I’m 24 and still like this.” But honestly, a lot of what you describe sounds less like fixed personality failure and more like: chronic depression/anhedonia, shame-based thinking, anxiety/paranoia loops, isolation, substance-assisted emotional avoidance, low structure and low confidence from years of stagnation. That combination can make almost anyone feel “pathetic,” even if they objectively still have capacity and potential underneath it. You also don’t come across as lazy to me. Demoralised, yes. Dysregulated, yes. Avoidant sometimes, yes. But not lazy. Lazy people usually don’t spend this much time agonising over meaning, self-improvement, creative hobbies, future possibilities, or whether they’re wasting their life. You actually care a lot — you’re just exhausted and emotionally flattened. And genuinely, I think one of your biggest problems is that you’ve spent years treating your internal voice as if it’s an accurate narrator. It isn’t. Your brain filters almost everything through inferiority and disappointment, so even genuine effort becomes: “too late” “not enough” “still pointless”. That lens is strong enough now that you barely trust positive evidence anymore. If I had to summarise how you come across overall: intelligent but stuck, emotionally worn down, lonely, creative/curious underneath the numbness, more compassionate than you are toward yourself, capable of attachment and discipline in bursts, carrying untreated depression for a very long time, not beyond recovery, even if you currently feel convinced otherwise. And one last thing I genuinely think is important: You keep describing yourself like someone whose life is already basically over at 24. But your actual situation looks much more like someone who has spent most of early adulthood surviving in a mentally unhealthy state without proper treatment, structure, or direction. Those are very different things. Then I said yeah, that's definitely accurate. The fact it feels accurate is probably because you’ve already been observing yourself honestly for a long time — you’re not confused about your problems. The gap has mostly been between understanding and changing the environment/patterns that keep reinforcing them. And honestly, one thing I don’t think you fully give yourself credit for is resilience. Not in the motivational-post sense — just literally. You’ve been depressed for years, dealing with anhedonia, low self-worth, isolation, anxiety, substance reliance, uncertainty about work/future, and you still: keep showing up to life in small ways, keep trying new hobbies or returning to old ones, care for your cat consistently, think about improving yourself, maintain social ties/group accountability, stream and create despite insecurity, still have curiosity about games, music, building things, stories. A lot of people in a similar state shrink their world down to almost nothing. Your world has narrowed emotionally, but not compl
View originalGPT-5.2 matches top human reviewers in Nature peer review study
45 scientists spent 469 hours comparing human and AI reviews across 82 papers. AI reviewers held their own against top-rated human reviewers, though with some weaknesses. submitted by /u/Adi4x4 [link] [comments]
View originalRepository Audit Available
Deep analysis of evidence-dev/evidence — architecture, costs, security, dependencies & more
Yes, Evidence offers a free tier. Pricing found: $15, $25, $0.01 / credit, $0.01 / credit, $0.01 / credit
Evidence has an average rating of 4.8 out of 5 stars based on 3 reviews from G2, Capterra, and TrustRadius.
Key features include: Trusted by Leading Organizations, Professional Design, Superior Performance, Modern Dev Experience, Articles, Dashboards, Data Apps, AI Chat.
Evidence is commonly used for: Creating interactive dashboards for data visualization, Generating publication-quality reports in markdown, Building responsive data products for internal use, Embedding analytics in customer-facing applications, Automating data synchronization from various databases, Validating SQL and markdown syntax in real-time.
Evidence integrates with: Snowflake, BigQuery, ClickHouse, PostgreSQL, MySQL, Microsoft SQL Server, Oracle Database, MongoDB, Amazon Redshift, Azure SQL Database.
AI2
Research Institute at Allen Institute for AI
3 mentions
Based on user reviews and social mentions, the most common pain points are: overspending, API costs, openai bill, cost tracking.
Based on 169 social mentions analyzed, 14% of sentiment is positive, 80% neutral, and 6% negative.