Mentions (30d)
0
Reviews
0
Platforms
2
Sentiment
0%
0 positive
Industry
information technology & services
Employees
300
Funding Stage
Series E
Total Funding
$696.1M
Using AI to untangle 10,000 property titles in Latam, sharing our approach and wanting feedback
Hey. Long post, sorry in advance (Yes, I used an AI tool to help me craft this post in order to have it laid in a better way). So, I've been working on a real estate company that has just inherited a huge mess from another real state company that went bankrupt. So I've been helping them for the past few months to figure out a plan and finally have something that kind of feels solid. Sharing here because I'd genuinely like feedback before we go deep into the build. Context A Brazilian real estate company accumulated ~10,000 property titles across 10+ municipalities over decades, they developed a bunch of subdivisions over the years and kept absorbing other real estate companies along the way, each bringing their own land portfolios with them. Half under one legal entity, half under a related one. Nobody really knows what they have, the company was founded in the 60s. Decades of poor management left behind: Hundreds of unregistered "drawer contracts" (informal sales never filed with the registry) Duplicate sales of the same properties Buyers claiming they paid off their lots through third parties, with no receipts from the company itself Fraudulent contracts and forged powers of attorney Irregular occupations and invasions ~500 active lawsuits (adverse possession claims, compulsory adjudication, evictions, duplicate sale disputes, 2 class action suits) Fragmented tax debt across multiple municipalities A large chunk of the physical document archive is currently held by police as part of an old investigation due to old owners practices The company has tried to organize this before. It hasn't worked. The goal now is to get a real consolidated picture in 30-60 days. Team is 6 lawyers + 3 operators. What we decided to do (and why) First instinct was to build the whole infrastructure upfront, database, automation, the works. We pushed back on that because we don't actually know the shape of the problem yet. Building a pipeline before you understand your data is how you end up rebuilding it three times, right? So with the help of Claude we build a plan that is the following, split it in some steps: Build robust information aggregator (does it make sense or are we overcomplicating it?) Step 1 - Physical scanning (should already be done on the insights phase) Documents will be partially organized by municipality already. We have a document scanner with ADF (automatic document feeder). Plan is to scan in batches by municipality, naming files with a simple convention: [municipality]_[document-type]_[sequence] Step 2 - OCR Run OCR through Google Document AI, Mistral OCR 3, AWS Textract or some other tool that makes more sense. Question: Has anyone run any tool specifically on degraded Latin American registry documents? Step 3 - Discovery (before building infrastructure) This is the decision we're most uncertain about. Instead of jumping straight to database setup, we're planning to feed the OCR output directly into AI tools with large context windows and ask open-ended questions first: Gemini 3.1 Pro (in NotebookLM or other interface) for broad batch analysis: "which lots appear linked to more than one buyer?", "flag contracts with incoherent dates", "identify clusters of suspicious names or activity", "help us see problems and solutions for what we arent seeing" Claude Projects in parallel for same as above Anything else? Step 4 - Data cleaning and standardization Before anything goes into a database, the raw extracted data needs normalization: Municipality names written 10 different ways ("B. Vista", "Bela Vista de GO", "Bela V. Goiás") -> canonical form CPFs (Brazilian personal ID number) with and without punctuation -> standardized format Lot status described inconsistently -> fixed enum categories Buyer names with spelling variations -> fuzzy matched to single entity Tools: Python + rapidfuzz for fuzzy matching, Claude API for normalizing free-text fields into categories. Question: At 10,000 records with decades of inconsistency, is fuzzy matching + LLM normalization sufficient or do we need a more rigorous entity resolution approach (e.g. Dedupe.io)? Step 5 - Database Stack chosen: Supabase (PostgreSQL + pgvector) with NocoDB on top Three options were evaluated: Airtable - easiest to start, but data stored on US servers (LGPD concern for CPFs and legal documents), limited API flexibility, per-seat pricing NocoDB alone - open source, self-hostable, free, but needs server maintenance overhead Supabase - full PostgreSQL + authentication + API + pgvector in one place, $25/month flat, developer-first We chose Supabase as the backend because pgvector is essential for the RAG layer (Step 7) and we didn't want to manage two separate databases. NocoDB sits on top as the visual interface for lawyers and data entry operators who need spreadsheet-like interaction without writing SQL. Each lot becomes a single entity (primary key) with relational links to: contracts, bu
View originalI built an open-source MCP memory server that gives Claude persistent memory with auto-graph and semantic search
I've been building a personal knowledge system called Open Brain and just open-sourced it. It's an MCP server that gives Claude (Code, Desktop, or any MCP client) persistent memory across sessions. What it does: You tell Claude to "remember this" and it captures the thought — embedding it, extracting entities (people, tools, projects, orgs), scoring quality, checking for semantic duplicates, and auto-linking to related thoughts. Later you search by meaning, not keywords. What makes it different from other MCP memory tools: Auto-graph — connections between thoughts are created automatically on capture. Typed links (extends, contradicts, is-evidence-for) at 0.80+ similarity. No manual linking. Semantic dedup — captures at 0.92+ similarity auto-merge instead of creating duplicates Salience scoring — 6-factor ranking (recency, access frequency, connections, merges, source weight, pinned). Thoughts you actually use rise to the top over time. Hybrid search — BM25 full-text + pgvector cosine similarity with Reciprocal Rank Fusion. Handles both exact terms and meaning. 16 MCP tools — not just store/recall. Graph traversal, entity browsing, weekly review synthesis, staleness pruning, dedup review, density analysis. Staleness pruning — thoughts that become irrelevant decay and get soft-archived automatically. LLM-confirmed, with sole-entity protection so you don't lose knowledge. Stack: Supabase (Postgres + pgvector) + Deno Edge Functions + OpenRouter. Self-hostable — you own your data, runs on your own Supabase project. Setup is ~10 minutes: clone, run bootstrap (interactive secret setup), run deploy (schema + functions), run validate (8-check verification). The deploy script prints a ready-to-paste claude mcp add command. Works with Claude Code, Claude Desktop, ChatGPT, and any MCP-compatible client. MIT licensed, 40 SQL migrations, 5 Edge Functions, 138 tests. GitHub: https://github.com/Bobby-cell-commits/open-brain-server Happy to answer questions about the architecture or how the auto-graph/salience scoring works under the hood. submitted by /u/midgyrakk [link] [comments]
View originalMy chatbot switches from text to voice mid-conversation. same memory, same context, you just start talking. 2 months of Claude, open-sourcing it for you to try.
been building this since late january. started as a weekend RAG chatbot so visitors could ask about my work. it answers from my case studies. that part was straightforward. then i kept going and it turned into the best learning experience i've had with Claude. still a work in progress. there are UI bugs i'm fixing and voice mode has edge cases. but the architecture is solid and you can try it right now. the whole thing was built with Claude Code. the chatbot runs on Claude Sonnet, and Claude Code wrote most of the codebase including the eval framework. two months of building every other day and i've learned more about production LLM systems than in any course. here's what's in it: streaming responses. tokens come in one by one, not dumped as a wall of text. i tuned the speed so you can actually follow along as it writes. fast enough to feel responsive, slow enough to read comfortably. like watching it think. text to voice mid-conversation. you're chatting with those streaming responses, and at any point you hit the mic and just start talking. same context, same memory. OpenAI Realtime API handles speech-to-speech. keeping state synced between both modes was the hardest part to get right. RAG with contextual links. the chatbot doesn't just answer. when it pulls from a case study, it shows you a clickable link to that article right in the conversation. every new article i publish gets indexed automatically via RAG. i don't touch the prompt. the chatbot learns new content on its own just by me publishing it. 71 automated evals across 10 categories. factual accuracy, safety/jailbreak, RAG quality, source attribution, multi-turn, voice quality. every PR runs the full suite. i broke prod twice before building this. 53 of the 71 evals exist because something actually broke. the system writes tests from its own failures. 6-layer defense against prompt injection. keyword detection, canary tokens, fingerprinting, anti-extraction, online safety scoring (Haiku rates every response in background), and an adversarial red team that auto-generates 20+ attack variants. someone tried to jailbreak it after i shared it on linkedin. that's when i took security seriously. observability dashboard. every decision the pipeline makes gets traced in Langfuse: tool_decision, embedding, retrieval, reranking, generation. built a custom dashboard with 8 tabs to monitor it all. stack: Claude Sonnet (generation + tool_use), OpenAI embeddings (pgvector), Haiku (background safety scoring), Langfuse, Supabase, Vercel. like i said, it's not perfect. some UI rough edges, voice mode still needs polish on certain browsers. but the core works and everything is in the repo. repo: github.com/santifer/cv-santiago (the repo has everything. RAG pipeline, defense layers, eval suite, prompt templates, voice mode). feel free to clone and try. happy to answer questions. submitted by /u/Beach-Independent [link] [comments]
View originalThe Vectorized/Semantic 2nd Brain You Know You Need
I started this because from day one, I sensed (like any decent developer or human with half-a-brain) that context engineering alone, or even a decent "saddle" as people are calling it, weren't going to get me where I wanted to go. Around the same time, I discovered my bald brother Nate B. Jones (AI News & Strategy analyst) through a YouTube video he made about creating a "$0.10/month second brain" on Supabase + pgvector + MCP. So yeah... I'm a freaking genius (Claude told me) so I got the basic version running in an afternoon. Then I couldn't stop. The project is cerebellum — a personal, database-backed memory system that speaks MCP, and reads/writes/searches like an LLM (i.e. semantically), so any AI tool (Claude Code, Cursor, ChatGPT, Gemini, whatever ships next year) can query the same memory store without any integration work. One protocol, every engine. I realize in some circles, everyone and their mom is either trying to build something like this, or they're skirting around the idea and just haven't gotten there yet. So, I wasn't going to share it but it's just been so useful for me that it feels wrong not to. So, here's what the architecture of what I've built actually looks like, why it took a lot longer than an afternoon, and the ways in which it may be helpful for you (and different/better than whatever you've been using): Three layers between a raw thought and permanent storage: 1. The Operator (aka "Weaver", "Curator", "Compiler", etc.) Going for a Matrix type name to accompany and try and match the bad-assery of the "Gatekeeper" (see below), but I haven't been able to. Suggestions are encouraged -- this one has been eating at me. Every capture — from the CLI or any AI tool — lands in a buffer/web before it touches the database. The Operator is an LLM running against that buffer (or "crawling", catching, and synthesizing/"sewing" thoughts from the web as I like to imagine) that makes one of three calls: pass-through: complete, self-contained thought → route to the next layer hold: low-signal fragment → sit in the buffer, wait for related captures to arrive synthesise: 2+ buffered entries share a theme → collapse them into one stronger insight, discard the fragments So if I jot three half-baked notes about a decision I'm wrestling with, the Operator catches and holds onto them. When the pattern solidifies, it compiles one coherent thought and routes that downstream. The fragments never reach the database. The whole buffer runs on a serialized async chain so concurrent captures don't corrupt each other, and TTL expiry never silently discards — expired entries route individually if synthesis fails. I'll probably mention it again, but the race conditions and other issues that arose out of building this funnel are definitely the most interesting problems I've faced so far (aside from naming things after the Matrix + brain stuff)... 2. The Gatekeeper What survives the Operator hits a second LLM evaluation. The GK scores each thought 1–10 (Noise → Insight-grade), generates an adversarial note for borderline items, checks for contradictions against existing thoughts in the DB, and flags veto violations — situations where a new capture would contradict a directive I've already marked as inviolable. It outputs a recommendation (keep, drop, improve, or "axiom") and a reformulation if it thinks the thought can be sharper. By the way, axiom is the idiotic neural-esque term I came up with for a permanent directive that bypasses the normal filtering pipeline and tells every future AI session: "this rule is non-negotiable." You can capture one with memo --axiom "..." — it skips the Operator entirely, goes straight to your review queue, and once approved, the Gatekeeper actively flags any future capture that would contradict it. It's not just stored differently, it's enforced differently. TLDR; an axiom is a rule carved in stone, not a note on a whiteboard. A first class thought, if you will. 3. User ("the Architect" 🥸) I have the final say on everything. But I didn't want to have to always give that "say" during the moment I capture a thought. Hence, running memo review walks me through the queue. For each item: score, analysis, the skeptic's note if it's borderline, suggested reformulation. I keep, drop, edit, or promote to axiom. Nothing reaches the database without explicit sign-off. Where is it going? The part I'm most excited about is increasing the scope of cerebellum's observability to make it truly "watchable", so I can take my hands off the wheel (aside from making a final review). The idea: point it at any app — a terminal session, your editor, a browser tab, a desktop app — and have it observe passively. When it surfaces something worth capturing, the Operator handles clustering and synthesis; only what's genuinely signal makes it to the GK queue; I get final say. You could maintain a list of apps cerebellum is watching and tune the TTL and synthesis behavior per s
View original