Zep connects your data sources, builds a unified context graph of your users, and delivers assembled context to your agent. One pipeline. One API.
Based on the limited social mentions provided, there isn't enough substantive user feedback to properly assess Zep's reception. The mentions include several YouTube references to "Zep AI" without detailed content, and GitHub activity showing technical development work involving agent modules and commit migrations. Reddit discussions touch on AI memory management and context portability challenges that may relate to Zep's functionality, but don't explicitly evaluate the tool itself. To provide an accurate user sentiment summary, more detailed reviews and user experiences would be needed.
Mentions (30d)
5
Reviews
0
Platforms
3
GitHub Stars
4,316
597 forks
Features
Industry
information technology & services
Employees
5
Funding Stage
Seed
Total Funding
$2.4M
417
GitHub followers
11
GitHub repos
4,316
GitHub stars
13
npm packages
Pricing found: $25/month, $25 / 20, $475/month, $125 / 100
I got tired of re-explaining myself to Claude every session, so I built something
I got tired of re-explaining myself to every AI tool, so I built one that makes my context portable

Hello everyone out there using AI every day… I build cardiac implants at Boston Scientific during the day and I'm a 1st-year CS student. I use Claude, ChatGPT, Cursor, and Gemini daily to improve my skills and my productivity. But every tool starts from zero. Claude doesn't know what I told Cursor. ChatGPT forgets my preferences. Gemini has no idea about my stack. I was spending the first 5 minutes of every session re-explaining who I am. Over and over.

So I built aura-ctx: a free, open-source CLI that defines your AI identity once and serves it to all your tools via MCP. One source of truth. Everything stays local. No cloud. No lock-in.

This is not another memory layer. Mem0, Zep, and Letta solve agent memory for developers. aura-ctx solves something different: the end user who wants to own and control their identity across tools. No Docker. No Postgres. No Redis. No auth tokens to manage. Just:

```
pip install -U aura-ctx
aura quickstart
```

Why local-first matters here: your MCP server runs on localhost. No network latency. No auth hell. No token refresh. If you've dropped cloud-based MCP servers because of the overhead, this is the opposite architecture.

Portability is by design: your entire identity lives in ~/.aura/packs/. Move machines? Copy the folder. That's it.

Security built in: aura audit scans your packs for accidentally stored secrets (API keys, tokens, credentials) before they leak into your context.

v0.3.3 is out with 3,500+ downloads. Supports 8 AI tools including Claude Desktop, Cursor, Windsurf, Gemini CLI, Claude Code, and more. Exports to CLAUDE.md and AGENTS.md for agent frameworks.

Still early. I'd like any feedback on what works, what doesn't, and what's missing. Curious: do you re-explain yourself every time you open Claude, or have you found a better way?
GitHub: https://github.com/WozGeek/aura-ctx

submitted by /u/Miserable_Celery9917
I built a persistent memory MCP for Claude Code — here's what I learned about why LLM-based extraction is the wrong approach
I've been using Claude Code daily for months and wanted it to remember things across sessions — project context, my preferences, decisions we've made together. I tried Mem0 and Zep but hit the same frustration with both: they intercept conversations and run them through a separate LLM to decide what's worth remembering. That felt wrong. Claude already understands the conversation. Why pay for a second LLM to re-interpret what just happened?

So I built Deep Recall — an MCP server that takes a different approach. Claude decides what to store. The memory system handles what happens to those memories over time.

**What I learned building this:**

The biggest insight was that extraction quality is actually BETTER when the agent does it itself. Claude has full context — it knows what's new information vs what it already knows, what contradicts existing memories, what's important to this specific user. A separate extraction LLM has none of that context.

The second insight was that memories need biology, not just storage. I implemented:

- **Salience decay** based on ACT-R cognitive architecture — unused memories fade, frequently accessed ones resist decay
- **Hebbian reinforcement** — when Claude cites a memory in its response, that memory gets stronger
- **Contradiction detection** — if you store "works at Google" then later "works at Meta", it flags the conflict
- **Temporal supersession** — detects that's a career change, not a contradiction, and auto-resolves it
- **Memory consolidation** — clusters of related episodes compress into durable facts over time

**How it works with Claude Code:**

```bash
pip install deeprecall-mcp
```

Add to `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "deeprecall": {
      "command": "deeprecall-mcp",
      "env": { "DEEPRECALL_API_KEY": "your_key" }
    }
  }
}
```

Claude gets tools like `deeprecall_context` (pull memories before responding), `deeprecall_remember` (store a fact), and `deeprecall_learn` (post-conversation biology processing).
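The post doesn't publish Deep Recall's internals, but the salience-decay and Hebbian-reinforcement ideas it describes can be sketched in a few lines. This is my own illustration under assumed decay constants, not the product's implementation:

```python
import math
import time

# Illustrative sketch of ACT-R-style salience decay with Hebbian reinforcement:
# salience decays exponentially with idle time, and each citation both resets
# the idle clock and makes the memory resist future decay.

class Memory:
    def __init__(self, text, base_salience=1.0, decay_rate=0.05):
        self.text = text
        self.base_salience = base_salience
        self.decay_rate = decay_rate      # per-day decay constant (assumed value)
        self.last_access = time.time()
        self.citations = 0                # times the agent cited this memory

    def salience(self, now=None):
        now = now if now is not None else time.time()
        days_idle = (now - self.last_access) / 86400
        # Frequently cited memories resist decay: citations shrink the
        # effective decay rate.
        effective_rate = self.decay_rate / (1 + self.citations)
        return self.base_salience * math.exp(-effective_rate * days_idle)

    def reinforce(self):
        # Called when the agent cites this memory in a response.
        self.citations += 1
        self.last_access = time.time()

m = Memory("user prefers TypeScript")
fresh = m.salience()
stale = m.salience(now=time.time() + 30 * 86400)  # salience after 30 idle days
assert stale < fresh  # unused memories fade
```

A pruning pass would then drop memories whose salience falls below a threshold, which is what makes decay useful rather than cosmetic.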
**The whole thing was built with Claude Code** — Thomas (my Claude instance) and I pair-programmed the entire backend, MCP server, landing page, billing, and the biological memory algorithms. The irony of using Claude to build a memory system for Claude isn't lost on me.

Free to try — 10,000 memories, no credit card, all features: https://deeprecall.dev

Happy to answer questions about the architecture or the cognitive science behind the decay/reinforcement models.

submitted by /u/floppytacoextrasoggy
[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.
A new open-source memory project called MemPalace launched yesterday claiming "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." The launch tweet went viral, reaching over 1.5 million views, while the repository picked up over 7,000 GitHub stars in less than 24 hours.

The interesting thing is not that the headline numbers are inflated. The interesting thing is that the project's own BENCHMARKS.md file documents this in detail, while the launch tweet strips these caveats. Some of the failure modes line up with the methodology disputes the field has been arguing about for over a year (Zep vs Mem0, Letta's "Filesystem All You Need" reproducibility post, etc.).

1. The LoCoMo 100% is a top_k bypass.

The runner uses top_k=50. LoCoMo's ten conversations have 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than 50 sessions, so top_k=50 retrieves the entire conversation as the candidate pool every time. The Sonnet rerank then does reading comprehension over all sessions. BENCHMARKS.md says this verbatim:

> The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions - the embedding retrieval step is bypassed entirely.

The honest LoCoMo numbers in the same file are 60.3% R@10 with no rerank and 88.9% R@10 with hybrid scoring and no LLM. Those are real and unremarkable. A 100% is also independently impossible on the published version of LoCoMo, since roughly 6.4% of the answer key contains hallucinated facts, wrong dates, and speaker attribution errors that any honest system will disagree with.

2. The LongMemEval "perfect score" is a metric category error.
Published LongMemEval is end-to-end QA: retrieve from a haystack of prior chat sessions, generate an answer, and a GPT-4 judge marks it correct. Every score on the published leaderboard is the percentage of generated answers judged correct.

The MemPalace LongMemEval runner does retrieval only. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings (all-MiniLM-L6-v2), returns the top five sessions by cosine distance, and checks set membership against the gold session IDs. It computes both recall_any@5 and recall_all@5, and the project reports the softer one. It never generates an answer. It never invokes a judge.

None of the LongMemEval numbers in this repository - not the 100%, not the 98.4% "held-out", not the 96.6% raw baseline - are LongMemEval scores in the sense the published leaderboard means. They are recall_any@5 retrieval numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.

3. The 100% itself is teaching to the test.

The hybrid v4 mode that produces the 100% was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions. BENCHMARKS.md, line 461, verbatim:

> This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.

4. Marketed features that don't exist in the code.

The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature.
mempalace/knowledge_graph.py contains zero occurrences of "contradict". The only deduplication logic is an exact-match check on (subject, predicate, object) triples that blocks identical triples from being added twice. Conflicting facts about the same subject can accumulate indefinitely.

5. "30x lossless compression" is measurably lossy in the project's own benchmarks.

The compression module mempalace/dialect.py truncates sentences at 55 characters, filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip. The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5 — a 12.4 percentage point drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop.

Why this matters for the benchmark conversation. The field needs benchmarks where judge reliability is adversarially validated, an
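The top_k bypass in point 1 is easy to demonstrate: when top_k exceeds the number of candidate sessions, even a random ranking achieves perfect retrieval recall. A toy check (illustrative only, using the session counts quoted in the post, not a reproduction of the MemPalace runner):

```python
import random

# If top_k >= number of sessions, the top-k slice contains every session,
# so the gold session is retrieved regardless of ranking quality.

session_counts = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]  # LoCoMo conversations
TOP_K = 50

def recall_with_random_ranking(n_sessions, top_k, trials=1000):
    """Retrieval recall when the 'embedding model' ranks sessions at random."""
    hits = 0
    for _ in range(trials):
        gold = random.randrange(n_sessions)
        ranking = random.sample(range(n_sessions), n_sessions)  # random order
        if gold in ranking[:top_k]:
            hits += 1
    return hits / trials

# Every conversation has fewer sessions than TOP_K, so even a random ranker
# achieves 100% retrieval recall: the candidate pool is the whole conversation.
assert all(n < TOP_K for n in session_counts)
assert all(recall_with_random_ranking(n, TOP_K) == 1.0 for n in session_counts)
```

The embedding step only becomes a real filter once top_k is smaller than the session count, which is why the same file's honest numbers at R@10 are so much lower.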
self-hosted monitoring for Claude Code & Codex
About a month after our team started using Claude Code, someone asked in Slack how much we were spending. Nobody knew. We looked around for a monitoring tool, didn't find one we liked, and ended up building our own.

Zeude is a self-hosted dashboard that tracks Claude Code and OpenAI Codex usage in one place. You get per-prompt token and cost breakdowns, a weekly leaderboard (with cohort grouping if your org is big enough to care), and a way to push skills, MCP servers, and hooks to your whole team from the dashboard instead of chasing people on Slack.

The big things in v1.0.0:

- Windows support. It was macOS/Linux only before. Now the whole team can use it regardless of OS.
- Codex integration. A lot of teams use both Claude Code and Codex, and tracking only one of them gives you half the picture on costs. Now both go through the same dashboard.
- Per-user skill opt-out. Team skill sync was already there, but it was all-or-nothing. Now individuals can turn off skills they don't want. Turns out not everyone wants every skill pushed to their machine.

Stack is Next.js + Supabase + ClickHouse + OTel Collector. All your data stays on your infra. We ran it internally for ~6 months before cleaning it up for open source. It's not perfect, but it solved a real problem for us, and we figured others might be in the same spot.

https://github.com/zep-us/zeude

If you try it out, let me know what breaks.

submitted by /u/Lopsided_Yak9897
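The per-prompt cost breakdown and leaderboard described above amount to a simple rollup over usage events. A minimal sketch of that computation, with placeholder model names and prices (these are not actual Anthropic/OpenAI rates, and this is not Zeude's code):

```python
# Hypothetical (input, output) USD prices per million tokens.
PRICE_PER_MTOK = {
    "claude-code": (3.00, 15.00),
    "codex": (2.00, 8.00),
}

def prompt_cost(model, input_tokens, output_tokens):
    """Cost of a single prompt from its token counts."""
    in_rate, out_rate = PRICE_PER_MTOK[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# One event per prompt: (user, model, input tokens, output tokens).
events = [
    ("alice", "claude-code", 12_000, 1_500),
    ("bob", "codex", 8_000, 900),
    ("alice", "codex", 4_000, 400),
]

# Roll up spend per user across both tools.
totals = {}
for user, model, tin, tout in events:
    totals[user] = totals.get(user, 0.0) + prompt_cost(model, tin, tout)

# Weekly leaderboard: highest spend first.
leaderboard = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

In a real deployment this aggregation would run as a query over the event store (ClickHouse in Zeude's stack) rather than in application code.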
Agentic AI persistent memory with auto pruning based on time decay and Importance
Developing a persistent memory layer on top of your agentic AI framework is a trending area these days, but there is no complete solution. One of the major challenges in developing a layer like this is how to prune your data over time. To tackle this problem, I did some research and found a cool formula that somewhat mimics human memory's Ebbinghaus forgetting curve. I worked around this concept and established a formula:

Strength = importance × e^(−λ_eff × days) × (1 + recall_count × 0.2)

If I break it down:

- Importance: a variable defined at store time. As each memory can have different importance, I decided to use this attribute. I gave facts higher importance and assumptions lower importance, etc.
- e^(−λ_eff × days): taken from the original formula, it derives the decay rate, and λ_eff varies based on some categories that I have defined.
- (1 + recall_count × 0.2): this part strengthens the memory if it is recalled again.

The retrieval is straightforward and uses cosine similarity. I also benchmarked it against existing systems like Mem0 and Zep and was able to outperform them. The benchmark was done using the LoCoMo dataset and the metric was Recall@5. The result is shared in the repo itself, so you can check that out.

I would encourage you to check this approach out and let me know whether it can be utilized in a persistent memory layer!

https://github.com/sachitrafa/cognitive-ai-memory

Installation: pip install yourmemory

submitted by /u/Sufficient_Sir_5414
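The strength formula above translates directly into code. The per-category λ values below are my own placeholder choices (the post says λ_eff varies by category but does not list the values), and the prune threshold is likewise an assumption:

```python
import math

# Strength = importance * e^(-lambda_eff * days) * (1 + recall_count * 0.2)
# Placeholder decay constants; the post defines categories but not the values.
LAMBDA_BY_CATEGORY = {"fact": 0.01, "preference": 0.03, "assumption": 0.10}

def strength(importance, category, days_since_stored, recall_count):
    lam = LAMBDA_BY_CATEGORY[category]
    return importance * math.exp(-lam * days_since_stored) * (1 + 0.2 * recall_count)

def prune(memories, threshold=0.3):
    """Drop memories whose strength has decayed below the threshold.
    Each memory is an (importance, category, days, recall_count) tuple."""
    return [m for m in memories if strength(*m) >= threshold]

# A recalled assumption outlives a never-recalled one with identical age.
recalled = strength(0.5, "assumption", 10, recall_count=3)
forgotten = strength(0.5, "assumption", 10, recall_count=0)
assert recalled > forgotten
```

Note how the recall term only multiplies strength; it never resets the decay clock, so heavily recalled but very old memories will still eventually be pruned.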
[D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers
Projects are still submitting new scores on LoCoMo as of March 2026. We audited it and found that 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context window test than a memory test. Here's what we found.

LoCoMo

LoCoMo (Maharana et al., ACL 2024) is one of the most widely cited long-term memory benchmarks. We conducted a systematic audit of the ground truth and identified 99 score-corrupting errors in 1,540 questions (6.4%). Error categories include hallucinated facts in the answer key, incorrect temporal reasoning, and speaker attribution errors. Examples:

- The answer key specifies "Ferrari 488 GTB," but the source conversation contains only "this beauty" and the image caption reads "a red sports car." The car model exists only in an internal query field (annotator search strings for stock photos) that no memory system ingests. Systems are evaluated against facts they have no access to.
- "Last Saturday" on a Thursday should resolve to the preceding Saturday. The answer key says Sunday. A system that performs the date arithmetic correctly is penalized.
- 24 questions attribute statements to the wrong speaker. A system with accurate speaker tracking will contradict the answer key.

The theoretical maximum score for a perfect system is approximately 93.6%.

We also tested the LLM judge. LoCoMo uses gpt-4o-mini to score answers against the golden reference. We generated intentionally wrong but topically adjacent answers for all 1,540 questions and scored them using the same judge configuration and prompts used in published evaluations. The judge accepted 62.81% of them. Specific factual errors (wrong name, wrong date) were caught approximately 89% of the time. However, vague answers that identified the correct topic while missing every specific detail passed nearly two-thirds of the time.
This is precisely the failure mode of weak retrieval, locating the right conversation but extracting nothing specific, and the benchmark rewards it.

There is also no standardized evaluation pipeline. Each system uses its own ingestion method (arguably necessary given architectural differences), its own answer generation prompt, and sometimes entirely different models. Scores are then compared in tables as if they share a common methodology. Multiple independent researchers have documented inability to reproduce published results (EverMemOS #73, Mem0 #3944, Zep scoring discrepancy).

Full audit with all 99 errors documented, methodology, and reproducible scripts: locomo-audit

LongMemEval

LongMemEval-S (Wang et al., 2024) is the other frequently cited benchmark. The issue is different but equally fundamental: it does not effectively isolate memory capability from context window capacity. LongMemEval-S uses approximately 115K tokens of context per question. Current models support 200K to 1M token context windows. The entire test corpus fits in a single context window for most current models.

Mastra's research illustrates this: their full-context baseline scored 60.20% with gpt-4o (128K context window, near the 115K threshold). Their observational memory system scored 84.23% with the same model, largely by compressing context to fit more comfortably. The benchmark is measuring context window management efficiency rather than long-term memory retrieval. As context windows continue to grow, the full-context baseline will keep climbing and the benchmark will lose its ability to discriminate.

LongMemEval-S tests whether a model can locate information within 115K tokens. That is a useful capability to measure, but it is a context window test, not a memory test.

LoCoMo-Plus

LoCoMo-Plus (Li et al., 2025) introduces a genuinely interesting new category: "cognitive" questions testing implicit inference rather than factual recall.
These use cue-trigger pairs with deliberate semantic disconnect: the system must connect "I just adopted a rescue dog" (cue) to "what kind of pet food should I buy?" (trigger) across sessions without lexical overlap. The concept is sound and addresses a real gap in existing evaluation.

The issues:

- It inherits all 1,540 original LoCoMo questions unchanged, including the 99 score-corrupting errors documented above.
- The improved judging methodology (task-specific prompts, three-tier scoring, 0.80+ human-LLM agreement) was only validated on the new cognitive questions. The original five categories retain the same broken ground truth with no revalidation.
- The judge model defaults to gpt-4o-mini.
- Same lack of pipeline standardization.

The new cognitive category is a meaningful contribution. The inherited evaluation infrastructure retains the problems described above.

Requirements for meaningful long-term memory evaluation

Based on this analysis, we see several requirements for benchmarks that can meaningfully
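The "last Saturday" temporal-reasoning error flagged in the LoCoMo audit above is straightforward to check with stdlib date arithmetic. This is my own verification sketch (the reference Thursday is an arbitrary example date, not one from the dataset):

```python
from datetime import date, timedelta

def last_weekday(ref, target_weekday):
    """Most recent occurrence of target_weekday strictly before ref.
    Weekdays follow the datetime convention: Monday=0 ... Sunday=6."""
    delta = (ref.weekday() - target_weekday) % 7
    delta = 7 if delta == 0 else delta  # "last Saturday" on a Saturday = a week ago
    return ref - timedelta(days=delta)

thursday = date(2024, 5, 16)           # an arbitrary Thursday
SATURDAY = 5
resolved = last_weekday(thursday, SATURDAY)
assert resolved == date(2024, 5, 11)   # the preceding Saturday, not Sunday
assert resolved.weekday() == SATURDAY
```

A system that performs this computation correctly will, per the audit, be marked wrong by the answer key, which says Sunday.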
I built an open-source CLI that makes your AI context portable across Claude, ChatGPT, Cursor, and Gemini via MCP
The problem

I use Claude for analysis, ChatGPT for writing, Cursor for coding. Each one builds a different picture of who I am — my stack, my style, my preferences. None of them share it. When I switch tools, I start from zero.

Platform memories are black boxes. You can't version them, audit them, or export them. And that's by design — it's lock-in.

What I built

aura is an open-source CLI that scans your machine, builds your AI identity automatically, and serves it to every tool via MCP.

```
pip install aura-ctx
aura scan    # auto-detects your stack, tools, projects
aura serve   # starts MCP server on localhost:3847
```

That's it. Open Claude Desktop, ChatGPT (Developer Mode), Cursor, or Gemini CLI. They read your context automatically. No copy-paste. No re-explaining.

How it works

aura creates "context packs" — scoped YAML files that describe who you are in a specific domain (developer, writer, work). You control what's in them. The AI never writes to your packs without your explicit action.

- aura scan detects your languages, frameworks, tools, editor, projects, and git identity from your machine.
- aura onboard asks 5 questions to capture your style and rules.
- aura doctor checks your packs for bloat and stale facts.
- aura consolidate merges duplicates across packs.
- aura decay removes expired facts based on type-aware TTL.

The MCP server exposes your packs as resources and tools that any MCP-compatible client can query.

Security

- Binds to localhost only
- Optional token auth: aura serve --token
- Scoped serving: aura serve --packs developer
- Read-only mode: aura serve --read-only
- No cloud. No telemetry. YAML files on your machine.

What it's NOT

This is not another memory layer for agent developers (Mem0, Zep, Letta solve that). aura is for the end user who wants to own and control their AI identity across tools. No Docker. No Postgres. No Redis. Just pip install and go.
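The type-aware TTL behind `aura decay` is worth making concrete. A minimal sketch of the idea, with TTL values and fact-type names that are my own illustrative guesses, not aura-ctx's actual configuration:

```python
from datetime import datetime, timedelta

# Hypothetical type-aware TTLs: stable facts live long, volatile ones expire fast.
TTL_BY_FACT_TYPE = {
    "git_identity": timedelta(days=365),   # rarely changes
    "editor": timedelta(days=180),
    "project": timedelta(days=90),         # projects come and go
    "current_task": timedelta(days=7),     # stale almost immediately
}

def decay(facts, now):
    """Keep only facts whose type-specific TTL has not elapsed."""
    kept = []
    for fact in facts:
        ttl = TTL_BY_FACT_TYPE.get(fact["type"], timedelta(days=30))  # default TTL
        if now - fact["updated_at"] <= ttl:
            kept.append(fact)
    return kept

now = datetime(2026, 3, 1)
facts = [
    {"type": "git_identity", "value": "jane@example.com",
     "updated_at": datetime(2025, 6, 1)},   # ~9 months old, within 365d: kept
    {"type": "current_task", "value": "refactor auth",
     "updated_at": datetime(2026, 2, 1)},   # 28 days old, past 7d TTL: dropped
]
assert [f["type"] for f in decay(facts, now)] == ["git_identity"]
```

The design point is that a single global TTL would either discard stable identity facts too early or keep stale task context too long; per-type TTLs avoid both.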
GitHub: https://github.com/WozGeek/aura-ctx
PyPI: https://pypi.org/project/aura-ctx/

Happy to answer any questions.

submitted by /u/Miserable_Celery9917
Manual-Driven Development: 190 Findings, 7 Hours, Zero Rule Violations
Every Claude Code session you have ever had started with Claude not knowing your system. It read a few files, inferred patterns, and started coding based on assumptions. At small scale that works fine. At production scale it produces confident, wrong code, and you do not find out until something breaks in a way that tests cannot catch, because Claude wrote the tests against its own assumptions too.

I call this confident divergence. It is the problem nobody in the AI tooling space is naming correctly. And it is the one that kills production codebases. Manual-Driven Development fixes it.

Here is what that looks like in production numbers: Seven sections audited. 190 findings. 876 new tests written. 7 hours and 48 minutes of actual Claude Code session time against an estimated 234 to 361 hours of human developer time. That is a 30 to 46x compression ratio, reproduced independently across every section of a production codebase with 200+ routes, 80+ models, and a daemon enforcement pipeline that converts network policies into live nftables rules on the host. And across all seven sections, not a single CLAUDE.md rule violated. Not one.

That last number is the one that should stop you. Everyone who has used Claude Code for more than a week has written CLAUDE.md rules and watched Claude ignore them three tasks later. The model does not do this deliberately. It runs out of context budget to honor them. MDD fixes the budget problem, and the rules hold. RuleCatch, which monitors rule enforcement in real time, reported 60% fewer rule violations during the SwarmK build compared to sessions running without MDD. Same model, same rules, same codebase. The only variable was MDD.

I am not going to ask you to take that on faith. The prompts that produced these results are published. The methodology is documented. The section-by-section data is in this article.
Everything is reproducible. If you are already using GSD or Mem0, you do not have to stop. MDD is a different layer solving a different problem. All three run together without conflict. I will explain exactly how near the end.

The Problem Nobody Is Naming Correctly

When Claude Code produces wrong code at scale, the community tends to blame one of two things: context rot, where quality degrades as the session fills up, or session amnesia, where Claude forgets everything when the session ends. GSD was built to solve context rot. Mem0 and Claude-Mem were built to solve session amnesia. Both are real problems. Both tools are real solutions. But there is a third problem that neither tool addresses, and it is the one that produces confident divergence.

Claude does not know your system. Not in the way you do. It reads a few files, infers patterns, and starts coding based on assumptions. At production scale, with 200+ routes, 50+ models, and business rules distributed across a codebase that took months to build, the inferences diverge from reality. Claude produces code that compiles, passes its own tests, and is confidently wrong.

Here is what makes confident divergence so hard to catch: everything looks correct. The code runs. The tests pass. Claude wrote the tests against its own assumptions about what the system does, not against what the system actually does. The divergence only surfaces in production, when a real user hits the edge case Claude never knew existed.

Here is what makes it so hard to prevent: the problem is not just that Claude does not know your system. It is that you cannot reliably narrate your system to Claude either. You built the whole thing. You know how operator scoping works, how the tier hierarchy enforces access, how tunnels allocate /30 subnets in the 10.99.x.0 range. You know all of it in theory.
But when you sit down to write a prompt at 11pm, you will not remember to mention that operators are scoped to specific groups and cannot modify policies outside their assigned groups. You will forget that ROLE_HIERARCHY is defined in three different files. You will not think to tell Claude that base-tier policies are system-only and cannot be created via the API. You are not going to enumerate 200 routes worth of business rules in a prompt. Nobody can. So Claude guesses. And confident divergence happens.

That is the problem MDD solves. Not context rot within a session. Not forgetting between sessions. The deeper problem of Claude not having explicit knowledge of your system in the first place.

The Token Obsession Is Solving the Wrong Problem

Before explaining MDD, it is worth naming something about the current tooling landscape, because the framing most tools use will make MDD seem like another entry in the same race. It is not. Every tool launched in the last twelve months leads with the same promise: fewer tokens, lower cost, faster responses. Mem0 claims 90% token reduction. Zep claims 90% latency reduction. GSD kee
View originalupstream(agents): 移植 5 个冲突 commit (P0,P1) — v2026.3.7→v2026.3.8
## Task

Apply the changes from the following 5 upstream commits to this fork semantically. These commits cannot be cherry-picked directly (they conflict); understand the intent of each change, then manually apply equivalent changes.

### Upstream version range

- **Source**: openclaw/openclaw v2026.3.7 → v2026.3.8
- **Module**: `agents`
- **Priority**: P0,P1

### Commits to port

#### Commit 1: `e8775cda932f` (P1)

**Description**: fix(agents): re-expose configured tools under restrictive profiles

**Files involved**: `src/agents/pi-tools.policy.test.ts,src/agents/pi-tools.policy.ts,src/plugins/config-state.test.ts,src/plugins/config-state.ts`

<details>
<summary>View upstream diff</summary>

```diff
diff --git a/src/agents/pi-tools.policy.test.ts b/src/agents/pi-tools.policy.test.ts
index 4b7a16b4d..0cdc572c4 100644
--- a/src/agents/pi-tools.policy.test.ts
+++ b/src/agents/pi-tools.policy.test.ts
@@ -3,6 +3,7 @@ import type { OpenClawConfig } from "../config/config.js";
 import {
   filterToolsByPolicy,
   isToolAllowedByPolicyName,
+  resolveEffectiveToolPolicy,
   resolveSubagentToolPolicy,
 } from "./pi-tools.policy.js";
 import { createStubTool } from "./test-helpers/pi-tool-stubs.js";
@@ -176,3 +177,59 @@ describe("resolveSubagentToolPolicy depth awareness", () => {
     expect(isToolAllowedByPolicyName("sessions_spawn", policy)).toBe(false);
   });
 });
+
+describe("resolveEffectiveToolPolicy", () => {
+  it("implicitly re-exposes exec and process when tools.exec is configured", () => {
+    const cfg = {
+      tools: {
+        profile: "messaging",
+        exec: { host: "sandbox" },
+      },
+    } as OpenClawConfig;
+    const result = resolveEffectiveToolPolicy({ config: cfg });
+    expect(result.profileAlsoAllow).toEqual(["exec", "process"]);
+  });
+
+  it("implicitly re-exposes read, write, and edit when tools.fs is configured", () => {
+    const cfg = {
+      tools: {
+        profile: "messaging",
+        fs: { workspaceOnly: false },
+      },
+    } as OpenClawConfig;
+    const result = resolveEffectiveToolPolicy({ config: cfg });
+    expect(result.profileAlsoAllow).toEqual(["read", "write", "edit"]);
+  });
+
+  it("merges explicit alsoAllow with implicit tool-section exposure", () => {
+    const cfg = {
+      tools: {
+        profile: "messaging",
+        alsoAllow: ["web_search"],
+        exec: { host: "sandbox" },
+      },
+    } as OpenClawConfig;
+    const result = resolveEffectiveToolPolicy({ config: cfg });
+    expect(result.profileAlsoAllow).toEqual(["web_search", "exec", "process"]);
+  });
+
+  it("uses agent tool sections when resolving implicit exposure", () => {
+    const cfg = {
+      tools: {
+        profile: "messaging",
+      },
+      agents: {
+        list: [
+          {
+            id: "coder",
+            tools: {
+              fs: { workspaceOnly: true },
+            },
+          },
+        ],
+      },
+    } as OpenClawConfig;
+    const result = resolveEffectiveToolPolicy({ config: cfg, agentId: "coder" });
+    expect(result.profileAlsoAllow).toEqual(["read", "write", "edit"]);
+  });
+});
diff --git a/src/agents/pi-tools.policy.ts b/src/agents/pi-tools.policy.ts
index db9a36755..61d037dd9 100644
--- a/src/agents/pi-tools.policy.ts
+++ b/src/agents/pi-tools.policy.ts
@@ -2,6 +2,7 @@ import { getChannelDock } from "../channels/dock.js";
 import { DEFAULT_SUBAGENT_MAX_SPAWN_DEPTH } from "../config/agent-limits.js";
 import type { OpenClawConfig } from "../config/config.js";
 import { resolveChannelGroupToolsPolicy } from "../config/group-policy.js";
+import type { AgentToolsConfig } from "../config/types.tools.js";
 import { normalizeAgentId } from "../routing/session-key.js";
 import { resolveThreadParentSessionKey } from "../sessions/session-key-utils.js";
 import { normalizeMessageChannel } from "../utils/message-channel.js";
@@ -196,6 +197,37 @@ function resolveProviderToolPolicy(params: {
   return undefined;
 }
 
+function resolveExplicitProfileAlsoAllow(tools?: OpenClawConfig["tools"]): string[] | undefined {
+  return Array.isArray(tools?.alsoAllow) ? tools.alsoAllow : undefined;
+}
+
+function hasExplicitToolSection(section: unknown): boolean {
+  return section !== undefined && section !== null;
+}
+
+function resolveImplicitProfileAlsoAllow(params: {
+  globalTools?: OpenClawConfig["tools"];
+  agentTools?: AgentToolsConfig;
+}): string[] | undefined {
+  const implicit = new Set<string>();
+  if (
+    hasExplicitToolSection(params.agentTools?.exec) ||
+    hasExplicitToolSection(params.globalTools?.exec)
+  ) {
+    implicit.add("exec");
+    implicit.add("process");
+  }
+  if (
+    hasExplicitToolSection(params.agentTools?.fs) ||
+    hasExplicitToolSection(params.globalTools?.fs)
+  ) {
+    implicit.add("read");
+    implicit.add("write");
+    implicit.add("edit");
+  }
+  return implicit.size > 0 ? Array.from(implicit) : undefined;
+}
+
 export function resolveEffectiveToolPolicy(params: {
   config?: OpenClawConfig;
   sessionKey?: string;
@@ -226,6 +258,15 @@ export function resolveEffectiveToolPolicy(params: {
     modelProvider: params.modelProvider,
     modelId: params.modelId,
   });
+  const explicitProfileAl
Repository Audit Available
Deep analysis of getzep/zep — architecture, costs, security, dependencies & more
Yes, Zep offers a free tier. Pricing found: $25/month, $25 / 20, $475/month, $125 / 100
Key features include: Ingest, Graph, Assemble, Every Source, Built for Real-Time, Three Lines of Code, Chat Memory.
Zep has a public GitHub repository with 4,316 stars.
Based on user reviews and social mentions, the most common pain point is the spending limit.
Based on 14 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.
Alex Volkov
Host at ThursdAI
1 mention