The most powerful AI platform for enterprises. Customize, fine-tune, and deploy AI assistants, autonomous agents, and multimodal AI with open models.
Frontier intelligence, customized to you. Make your AI your own: train, distill, fine-tune, and build with state-of-the-art open-source models.
Enterprise agents with deep context. Deploy agents that execute, adapt, and deliver real results with powerful orchestration, tooling, and safety.
Self-contained private deployments. Build privately anywhere: on-premises, cloud, edge, devices, and more, while retaining full control of your data.
Deeply engaged solutioning and value delivery. Hands-on assistance from the world's foremost applied AI scientists across deployment, solutioning, safety, and beyond.
Stellantis is accelerating automotive innovation with Mistral AI. ASML advances silicon lithography with Mistral AI. CMA CGM streamlines global maritime operations with Mistral AI.
Chat, search, analyze, create: do more in one connected hub. Ship faster with purpose-built models and a stack that meets devs where they work. Build and deploy AI apps with complete control: post-train models, build custom agents, and run production AI anywhere from edge to cloud with enterprise-grade tooling.
Data privacy and operational controls. Turn proprietary knowledge into model intelligence by training and aligning your own models. Evaluation and lifecycle management. Tailored, domain-specialized AI: from custom pre-training with your data to scaled deployment with expert guidance throughout. Deep use case exploration and applied AI. Solution scaling and value realization. Deploy our models in your environment, or consume them as a service or from one of our cloud partners.
Build the future of enterprise AI. Now seeking: insatiably curious AI enthusiasts. The next chapter of AI is yours.
Your personal AI assistant for everyday life and work. Get started with Le Chat. Chat, search, learn, and create with Le Chat. Access to Mistral's SOTA AI models. Save and recall up to 500 memories. Enhanced productivity, higher limits, and Mistral Vibe for all-day development.
More messages and web searches. More extended thinking and deep research reports. Up to 15GB of document storage. Mistral Vibe for all-day coding, pay-as-you-go beyond. State-of-the-art image generation.
Secure, collaborative workspace for teams building with AI. Private deployments powered by custom models, UI, and tools.
We exist to make frontier AI accessible to everyone. Customizable AI solutions that improve organizational productivity and drive revenue.
How Mistral AI is being used today: organizations worldwide have used Mistral AI to increase cost savings, improve efficiency, and increase accuracy. Here's how. Trusted by teams at companies of all sizes.
Enhance your marketing ROI with AI-powered content creation and campaign optimization. Generate compelling copy, analyze market trends, and identify high-value leads while maintaining brand consistency across all marketing channels.
Mentions (30d): 0
Reviews: 0
Platforms: 2
Industry: information technology & services
Employees: 890
Funding stage: Series C
Total funding: $2.9B
GitHub: 10,747 stars, 1,029 forks, 8,055 followers, 25 repos
npm packages: 20
HuggingFace models: 40
Pricing found: $14.99, $24.99
Layman: Agentic Insight and Oversight (same same but different)
What's the most common duplicate project on r/ClaudeAI? Usage trackers. What's the second most common? AI monitors. Does Layman do those things? Yes, of course. So what makes it different?

[Image: Layman's Dashboard, Flowchart, and Logs view (with Layman's Terms and Analysis examples)]

Like many similar tools, Layman runs as a web service in a container on your local machine. It installs hooks and accesses harness logs to "look over your shoulder," then leverages a secondary AI instance to help keep your multiple sessions, sub-agents, and alternate harnesses in line. So, short answer:

Drift Monitoring. Repeatedly named as one of the most frustrating issues for heavy Claude Code users. Layman takes into account all user prompts issued to CC as well as the current project and global CLAUDE.md instructions, and at configurable intervals scores the current degree of "drift" from your goals and the rules you have established. You can optionally receive warning notifications or place a block when different thresholds are reached.

Risk Analysis. Layman classifies all tool calls and operations with a "risk" level based on simple, consistent criteria (such as read-only, writing, modifying, network access, deletion, etc.) and can automatically analyze the AI agent's current intended action, the overall goal or purpose behind that intention, and summarize the safety and security implications at stake.

Layman's Terms. The eponymous origin of the tool: a plain-language (and, where possible, non-technical) explanation of the purpose of any given tool call. It can summarize what was performed at the session level as well, which is helpful for later recall and understanding after some time has passed.

Vibe coders aside, shouldn't a professional developer already know what their tools are doing before they grant permission?
Yes, of course. But when you are operating at scale and (say) the TypeScript project you are polishing needs to look up some JSON value, and your AI agent writes a one-off Python script to parse it out, it can be helpful to have an "extra pair of eyes" take a look before you effectively begin yet another code review.

Meanwhile, the typical features you might expect are included: Session Recording (opt-in is required first for data tracking, and there is no telemetry to worry about), Bookmarking and Search, PII filtering (including PATs and API keys), File and URL access tracking, and a handy Setup Wizard that helps get those hooks installed in the first place and walks you through configuration of core capabilities.

Did I mention that besides Claude Code it supports Codex, OpenCode, Mistral Vibe, and Cline (with more to come)? Whether you use these for local agents or as an alternative when hitting session limits, Layman can monitor and track them all at once.

But wait, doesn't a "secondary AI instance" just end up wasting tokens? My Precious? (erm...) Our precious, precious tokens? When session limits already hit so hard? It turns out these algorithms do not require nearly the level of "intelligence" you might want for your planning and coding sessions themselves. Personally, I keep an instance of Qwen3-Coder-Next running locally via llama.cpp server on my system's GPU to field those calls, with no discernible impact on system performance. And when a local LLM is not available, Haiku does the job excellently (now you have a reason to use it). You absolutely do not need anything more resource-intensive to get the job done.

Now you have a complete picture.

GitHub repository: https://github.com/castellotti/layman
License: MIT

submitted by /u/jigsaw-studio
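To make the "risk level" idea above concrete, here is a minimal rule-based sketch of classifying a tool call by keywords. This is a hypothetical illustration of the general technique, not Layman's actual code; the rule table, level names, and function are all assumptions.

```python
# Hypothetical rule table: first matching level wins, most severe first.
RISK_RULES = [
    ("critical", {"rm", "delete", "drop"}),      # destructive operations
    ("high",     {"write", "edit", "chmod"}),    # modifies files or permissions
    ("medium",   {"fetch", "curl", "http"}),     # network access
    ("low",      {"read", "glob", "grep", "ls"}),  # read-only
]

def classify_tool_call(tool_name: str, command: str = "") -> str:
    """Return a coarse risk level for an agent tool invocation."""
    text = f"{tool_name} {command}".lower()
    for level, keywords in RISK_RULES:
        if any(k in text for k in keywords):
            return level
    return "unknown"

print(classify_tool_call("Bash", "rm -rf build/"))  # critical
print(classify_tool_call("Read", "src/main.ts"))    # low
```

Crude substring matching like this misfires on words such as "format" (which contains "rm"); a real implementation would tokenize the command, but the shape of the idea is the same.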
Running Claude Code TUI against local models via protocol translation — sharing my approach
I've been working on OwlCC, a protocol proxy that lets you run Claude Code's complete terminal UI — all 25+ tools (Bash, Read, Edit, Write, Glob, Grep, WebSearch...) and 40+ commands — against your own local models.

How it works: Claude Code speaks the Anthropic Messages API. OwlCC sits in the middle, translates the Anthropic protocol to OpenAI Chat Completions on the fly, and routes to whatever local backend you're running. Claude Code doesn't know the difference.

Your prompt → OwlCC proxy (:8019) → Anthropic-to-OpenAI translation → Your local backend → Local models

What you get that official Claude Code doesn't have:
- Any model — Qwen, Llama, Mistral, DeepSeek, MiniMax, whatever you can serve
- /model hot-swap — switch between models mid-conversation (see screenshot)
- 100% local — nothing leaves your machine, no API key, no account
- Local web search — SearXNG replaces Anthropic's cloud search, fully self-hosted
- Observability — Prometheus metrics, audit log, request tracing, error budgets
- Multi-backend resilience — circuit breaker, fallback chains, health monitoring
- Learned skills — auto-synthesizes reusable skills from your coding sessions (42 skills and counting)
- Training data pipeline — auto-collect, quality scoring, PII sanitization, multi-format export

What you lose vs. official:
- No extended thinking (local models don't support it)
- Model quality depends on what you run — a 7B model won't match Claude Opus
- No official support

The setup: it requires the Claude Code TypeScript source tree (not the compiled npm package — you need to bring your own). OwlCC launches it via Node.js + tsx with ESM loader hooks that redirect 22 cloud-only modules to local stubs. The upstream source is pinned locally — Anthropic updates don't affect you.
[Screenshots: full tool use driving a Java build + local SearXNG web search; /model switching between 5 local models; /skills showing 42 learned skills]

git clone https://github.com/yeemio/owlcc-byoscc.git
cd owlcc-byoscc
# place your CC source at upstream/claude-code/
npm install && npm run build
npx owlcc init   # auto-detects your local backends
npx owlcc

Tech stack: TypeScript, 120+ source files, 1652 tests, Apache 2.0.
GitHub: https://github.com/yeemio/owlcc-byoscc

Happy to answer questions about the architecture (the ESM loader chain that makes this work is kind of interesting).

submitted by /u/Single_Mushroom2043
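The core translation step such a proxy performs can be sketched in a few lines. This is an illustration of the Anthropic-to-OpenAI mapping in general (system prompt moves into the message list, content blocks flatten to a string), written in Python for brevity; it is not OwlCC's implementation, and the sample model name is made up.

```python
def anthropic_to_openai(req: dict) -> dict:
    """Translate an Anthropic Messages request body into an
    OpenAI Chat Completions request body (simplified sketch)."""
    messages = []
    if req.get("system"):
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first chat message.
        messages.append({"role": "system", "content": req["system"]})
    for m in req.get("messages", []):
        content = m["content"]
        if isinstance(content, list):  # Anthropic content blocks -> plain text
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req.get("model", "local"),
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
    }

out = anthropic_to_openai({
    "model": "qwen2.5-coder",  # hypothetical local model name
    "system": "You are a coding assistant.",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "hi"}]}],
})
print(out["messages"][0])  # {'role': 'system', 'content': 'You are a coding assistant.'}
```

A real proxy also has to translate tool definitions, tool-use blocks, and streaming events back in the other direction, which is where most of the work lives.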
Pencil Bench (multi-step reasoning benchmark)
DeepSeek was a scam from the beginning.

submitted by /u/DigSignificant1419
I built an AI content engine that turns one piece of content into posts for 9 platforms — fully automated with n8n
What it does: You give it any input — a blog URL, a YouTube video, raw text, or just a topic — and it generates optimized posts for 9 platforms at once: Instagram, Twitter/X, LinkedIn, Facebook, TikTok, Reddit, Pinterest, Twitter threads, and email newsletters. Each output is tailored to the platform (hashtags for IG, hooks for TikTok, professional tone for LinkedIn, etc.). It also auto-generates images for visual platforms like Instagram, Facebook, and Pinterest, using AI.

Other features:
- Topic Research — scans Google, Reddit, YouTube, and news sources, then uses an LLM to identify trending subtopics before generating content
- Auto-Discover — if you don't even have a topic, it searches what's trending right now (optionally filtered by niche) and picks the hottest one
- Cinematic Ad — upload any photo, pick a style (cinematic, luxury, neon, retro, minimal, natural), and Gemini transforms it into a professional-looking ad
- Multi-LLM support — works with Mistral, Groq, OpenAI, Anthropic, and Gemini
- History — every generation is saved, exportable as CSV

The n8n automation (this is where it gets fun): I connected the whole thing to an n8n workflow so it runs on autopilot:
1. Schedule Trigger — fires daily (or at whatever frequency)
2. Google Sheets — reads a row with a topic (or "auto" to let AI pick a trending topic)
3. HTTP Request — hits my /api/auto-generate endpoint, which auto-detects the input type (URL, YouTube link, topic, or "auto") and generates everything
4. Code node — parses the response and extracts each platform's content
5. Google Drive — uploads generated images
6. Update Sheets — marks the row as done with status and links

The API handles niche filtering too: if my sheet says the topic is "auto" and the niche column says "AI", it'll specifically find trending AI topics instead of random viral stuff.
Error handling: the HTTP Request node has retry-on-fail (2 retries), error outputs route to a separate branch that marks the sheet row as "failed" with the error message, and a global error workflow emails me if anything breaks.

Tech stack:
- FastAPI backend, vanilla JS frontend
- Hosted on Railway
- Google Gemini for image generation and cinematic ads
- HuggingFace FLUX.1 for platform images
- SerpAPI + Reddit + YouTube + NewsAPI for research
- SQLite for history
- n8n for workflow automation

It's not perfect yet — rate limits on free tiers are real — but it's been saving me hours every week. Happy to answer questions.

[Screenshots omitted]

submitted by /u/emprendedorjoven
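Step 4 above ("parse the response and extract each platform's content") is where the per-platform fan-out happens. Here is a minimal sketch of that step, written in Python for illustration (n8n Code nodes actually use JavaScript); the response shape, field names, and function are assumptions, not the author's actual schema.

```python
def extract_platform_posts(response: dict) -> list[dict]:
    """Flatten an assumed {platform: post} API response into one row
    per platform, ready to write back to a sheet."""
    rows = []
    for platform, post in response.get("platforms", {}).items():
        rows.append({
            "platform": platform,
            "text": post.get("text", ""),
            "hashtags": " ".join(post.get("hashtags", [])),
            "image_url": post.get("image_url"),  # None for text-only platforms
        })
    return rows

sample = {"platforms": {
    "instagram": {"text": "New post!", "hashtags": ["#ai", "#n8n"], "image_url": "img.png"},
    "linkedin":  {"text": "Professional take."},
}}
rows = extract_platform_posts(sample)
print(len(rows))  # 2
```

Keeping this step as a pure transformation makes the retry/error branches simpler: any failure here is a schema problem, not a network one.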
I built a CLI that installs MCP, skills, prompts, commands and sub-agents into any AI tool (Cursor, Claude Code, Windsurf, etc.)
Install Sub-agents, Skills, MCP Servers, Slash Commands, and Prompts Across AI Tools with agent-add

agent-add lets you install virtually every type of AI capability across tools — so you can focus on what to install and where, without worrying about each tool's config file format.

[Screenshot omitted]

It's especially useful when:
- You're an AI capability developer shipping MCP servers, slash commands, sub-agents, or skills
- Your team uses multiple AI coding tools side by side

You can also use agent-add simply to configure your own AI coding tool — no need to dig into its config file format.

Getting Started

agent-add runs directly via npx — no install required:

npx -y agent-add --skill 'https://github.com/anthropics/skills.git#skills/pdf'

agent-add requires Node.js; make sure it's installed on your machine. Here's a more complete example:

npx -y agent-add \
  --mcp '{"playwright":{"command":"npx","args":["-y","@playwright/mcp"]}}' \
  --mcp 'https://github.com/modelcontextprotocol/servers.git#.mcp.json' \
  --skill 'https://github.com/anthropics/skills.git#skills/pdf' \
  --prompt $'# Code Review Rules\n\nAlways review for security issues first.' \
  --command 'https://github.com/wshobson/commands.git#tools/security-scan.md' \
  --sub-agent 'https://github.com/VoltAgent/awesome-claude-code-subagents.git#categories/01-core-development/backend-developer.md'

For full usage details, check the project README, or just run:

npx -y agent-add --help

Project & Supported Tools

The source code is hosted on GitHub: https://github.com/pea3nut/agent-add

Here's the current support matrix:

AI Tool         MCP  Prompt  Skill  Command  Sub-agent
Cursor          ✅    ✅      ✅     ✅       ✅
Claude Code     ✅    ✅      ✅     ✅       ✅
Trae            ✅    ✅      ✅     ❌       ❌
Qwen Code       ✅    ✅      ✅     ✅       ✅
GitHub Copilot  ✅    ✅      ✅     ✅       ✅
Codex CLI       ✅    ✅      ✅     ❌       ✅
Windsurf        ✅    ✅      ✅     ✅       ❌
Gemini CLI      ✅    ✅      ✅     ✅       ✅
Kimi Code       ✅    ✅      ✅     ❌       ❌
Augment         ✅    ✅      ✅     ✅       ✅
Roo Code        ✅    ✅      ✅     ✅       ❌
Kiro CLI        ✅    ✅      ✅     ❌       ✅
Tabnine CLI     ✅    ✅      ❌     ✅       ❌
Kilo Code       ✅    ✅      ✅     ✅       ✅
opencode        ✅    ✅      ✅     ✅       ✅
OpenClaw        ❌    ✅      ✅     ❌       ❌
Mistral Vibe    ✅    ✅      ✅     ❌       ❌
Claude Desktop  ✅    ❌      ❌     ❌       ❌

submitted by /u/pea3nut
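Under the hood, a tool like this mostly rewrites each target's config file. Here is a hypothetical sketch of what installing one MCP server into a JSON config involves, in Python for illustration; the file name is made up and the `mcpServers` key mirrors the common convention used by Claude Desktop-style configs, not agent-add's actual code.

```python
import json
import pathlib

def add_mcp_server(config_path: pathlib.Path, name: str, spec: dict) -> None:
    """Merge one MCP server entry into a JSON config, creating the
    file and the "mcpServers" section if they don't exist yet."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = spec
    config_path.write_text(json.dumps(config, indent=2))

cfg = pathlib.Path("demo_mcp.json")  # hypothetical config file
add_mcp_server(cfg, "playwright", {"command": "npx", "args": ["-y", "@playwright/mcp"]})
print(json.loads(cfg.read_text())["mcpServers"]["playwright"]["command"])  # npx
```

The hard part agent-add solves is that every tool in the matrix above uses a different path, format, and key layout for the same information.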
[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
The problem

If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought — English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome. I decided to fix this from the ground up.

What is Dante-2B

A 2.1B-parameter, decoder-only, dense transformer. Trained from scratch — no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.

Architecture:
- LLaMA-style with GQA (20 query heads, 4 KV heads — 5:1 ratio)
- SwiGLU FFN, RMSNorm, RoPE
- d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
- Weight-tied embeddings, no MoE — all 2.1B params active per token
- Custom 64K BPE tokenizer built specifically for Italian + English + code

Why the tokenizer matters

This is where most multilingual models silently fail. Standard English-centric tokenizers split l'intelligenza into l, ', intelligenza — 3 tokens for what any Italian speaker sees as 1.5 words. Multiply that across an entire document and you're wasting 20-30% of your context window on tokenizer overhead. Dante's tokenizer was trained on a character-balanced mix (~42% Italian, ~36% English, ~22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact. Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units — they're always single tokens, not two bytes glued together by luck. Small detail, massive impact on efficiency and quality for Italian text.

Training setup

Data: ~300B-token corpus. Italian web text (FineWeb-2 IT), English educational content (FineWeb-Edu), Italian public-domain literature (171K books), legal/parliamentary texts (Gazzetta Ufficiale, EuroParl), Wikipedia in both languages, and StarCoderData for code. Everything pre-tokenized into uint16 binary with quality tiers.

Phase 1 (just completed): 100B tokens at seq_len 2048. DeepSpeed ZeRO-2, torch.compile with reduce-overhead, FP8 via torchao. Cosine LR schedule 3e-4 → 3e-5 with 2000-step warmup. ~16 days, rock solid — no NaN events, no OOM, consistent 28% MFU.

Phase 2 (in progress): extending to 4096 context with 20B more tokens at reduced LR. Should take ~4-7 more days.

What it can do right now

After Phase 1 the model already generates coherent Italian text — proper grammar, correct use of articles, reasonable topic continuity. It's a 2B, so don't expect GPT-4 reasoning. But for a model this size, trained natively on Italian, the fluency is already beyond what I've seen from Italian fine-tunes of English models at similar scale. I'll share samples after Phase 2, when the model has full 4K context.

What's next
- Phase 2 completion (est. ~1 week)
- HuggingFace release of the base model — weights, tokenizer, config, full model card
- SFT phase for instruction following (Phase 3)
- Community benchmarks — I want to test against Italian fine-tunes of Llama/Gemma/Qwen at similar sizes

Why I'm posting now

I want to know what you'd actually find useful. A few questions for the community:
- Anyone working with Italian NLP? I'd love to know what benchmarks or tasks matter most to you.
- What eval suite would you want to see? I'm planning perplexity on held-out Italian text + standard benchmarks, but if there's a specific Italian eval set I should include, let me know.
- Interest in the tokenizer alone? The Italian-aware 64K BPE tokenizer might be useful even independently of the model — should I release it separately?
- Training logs / loss curves? Happy to share the full training story with all the numbers if there's interest.

About me

I'm a researcher and entrepreneur based in Rome. PhD in Computer Engineering; I teach AI and emerging tech at LUISS University, and I run an innovation company (LEAF) that brings emerging technologies to businesses. Dante-2B started as a research project to prove that you don't need a massive cluster to train a decent model from scratch — you need good data, a clean architecture, and patience. Everything will be open-sourced. The whole pipeline — from corpus download to tokenizer training to pretraining scripts — will be on GitHub. Happy to answer any questions. 🇮🇹

Discussion also on r/LocalLLaMA.

submitted by /u/angeletti89
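The apostrophe-aware pre-tokenization described above can be illustrated with a small regex that keeps contractions like l'intelligenza as a single pre-token instead of splitting at the apostrophe. The actual Dante-2B regex is not published here, so this pattern is only an assumption about the general technique.

```python
import re

# Illustrative pre-tokenizer: the first alternative keeps an Italian
# apostrophe contraction (letters ' letters) together; then plain
# letter runs (including accented À-ÿ), digits, and any other symbol.
PRETOK = re.compile(r"[A-Za-zÀ-ÿ]+'[A-Za-zÀ-ÿ]+|[A-Za-zÀ-ÿ]+|\d+|\S")

def pretokenize(text: str) -> list[str]:
    """Split text into pre-tokens before BPE merging is applied."""
    return PRETOK.findall(text)

print(pretokenize("l'intelligenza è qui"))
# ["l'intelligenza", 'è', 'qui']
```

An English-centric pre-tokenizer would hand the BPE trainer l, ', intelligenza as three separate pieces; keeping the contraction intact lets frequent Italian units become single merges.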
Using AI to untangle 10,000 property titles in Latam, sharing our approach and wanting feedback
Hey. Long post, sorry in advance. (Yes, I used an AI tool to help me lay this post out better.)

So, I've been working with a real estate company that just inherited a huge mess from another real estate company that went bankrupt. I've been helping them for the past few months to figure out a plan, and we finally have something that feels kind of solid. Sharing here because I'd genuinely like feedback before we go deep into the build.

Context

A Brazilian real estate company accumulated ~10,000 property titles across 10+ municipalities over decades. They developed a bunch of subdivisions over the years and kept absorbing other real estate companies along the way, each bringing their own land portfolios. Half the titles sit under one legal entity, half under a related one. Nobody really knows what they have; the company was founded in the 60s. Decades of poor management left behind:
- Hundreds of unregistered "drawer contracts" (informal sales never filed with the registry)
- Duplicate sales of the same properties
- Buyers claiming they paid off their lots through third parties, with no receipts from the company itself
- Fraudulent contracts and forged powers of attorney
- Irregular occupations and invasions
- ~500 active lawsuits (adverse possession claims, compulsory adjudication, evictions, duplicate-sale disputes, 2 class action suits)
- Fragmented tax debt across multiple municipalities
- A large chunk of the physical document archive currently held by police as part of an old investigation into the previous owners' practices

The company has tried to organize this before. It hasn't worked. The goal now is to get a real consolidated picture in 30-60 days. The team is 6 lawyers + 3 operators.

What we decided to do (and why)

Our first instinct was to build the whole infrastructure upfront: database, automation, the works. We pushed back on that because we don't actually know the shape of the problem yet. Building a pipeline before you understand your data is how you end up rebuilding it three times, right? So with the help of Claude we built the following plan, split into steps. (Does building a robust information aggregator make sense, or are we overcomplicating it?)

Step 1 - Physical scanning (should already be done by the insights phase)

Documents will be partially organized by municipality already. We have a document scanner with ADF (automatic document feeder). The plan is to scan in batches by municipality, naming files with a simple convention: [municipality]_[document-type]_[sequence]

Step 2 - OCR

Run OCR through Google Document AI, Mistral OCR 3, AWS Textract, or some other tool that makes more sense. Question: has anyone run any of these specifically on degraded Latin American registry documents?

Step 3 - Discovery (before building infrastructure)

This is the decision we're most uncertain about. Instead of jumping straight to database setup, we're planning to feed the OCR output directly into AI tools with large context windows and ask open-ended questions first:
- Gemini 3.1 Pro (in NotebookLM or another interface) for broad batch analysis: "which lots appear linked to more than one buyer?", "flag contracts with incoherent dates", "identify clusters of suspicious names or activity", "help us see problems and solutions we aren't seeing"
- Claude Projects in parallel, for the same questions
- Anything else?

Step 4 - Data cleaning and standardization

Before anything goes into a database, the raw extracted data needs normalization:
- Municipality names written 10 different ways ("B. Vista", "Bela Vista de GO", "Bela V. Goiás") -> canonical form
- CPFs (Brazilian personal ID numbers) with and without punctuation -> standardized format
- Lot status described inconsistently -> fixed enum categories
- Buyer names with spelling variations -> fuzzy matched to a single entity

Tools: Python + rapidfuzz for fuzzy matching, Claude API for normalizing free-text fields into categories. Question: at 10,000 records with decades of inconsistency, is fuzzy matching + LLM normalization sufficient, or do we need a more rigorous entity resolution approach (e.g. Dedupe.io)?

Step 5 - Database

Stack chosen: Supabase (PostgreSQL + pgvector) with NocoDB on top. Three options were evaluated:
- Airtable — easiest to start, but data stored on US servers (LGPD concern for CPFs and legal documents), limited API flexibility, per-seat pricing
- NocoDB alone — open source, self-hostable, free, but adds server-maintenance overhead
- Supabase — full PostgreSQL + authentication + API + pgvector in one place, $25/month flat, developer-first

We chose Supabase as the backend because pgvector is essential for the RAG layer (Step 7) and we didn't want to manage two separate databases. NocoDB sits on top as the visual interface for lawyers and data-entry operators who need spreadsheet-like interaction without writing SQL. Each lot becomes a single entity (primary key) with relational links to: contracts, bu
I built an open source MCP server that aggregates 29 sports APIs into 319 tools, now on the MCP Registry
Hey everyone, I just published mcp-sports-hub to the official MCP Registry. It's fully open source (MIT) and built to be extended by the community. Works with any LLM (Claude, GPT, Gemini, Llama, Mistral, etc.).

What it does: a single MCP server that connects an LLM to 29 sports data providers at once — scores, stats, live odds, esports, F1 telemetry, college sports, cricket, golf, and more.

Key features:
- 9 providers work instantly with zero API keys
- 20 more providers are available with free-tier keys (signup takes ~2 min)
- Missing keys don't block startup — tools just return an error when called
- Provider presets: `free`, `us-major`, `soccer`, `f1`, `odds`, `esports`...

Install: npm install mcp-sports-hub
Or find it directly on registry.modelcontextprotocol.io
Repo: github.com/lacausecrypto/mcp-sports-hub

Feedback and PRs very welcome — especially if you want a new provider added!

submitted by /u/Main-Confidence7777
I built a proxy that optimizes your prompts before they hit the LLM — cut ~24% of tokens without changing output quality
I've been working on PithToken — an OpenAI-compatible API proxy that sits between your app and the LLM provider. It analyzes your prompt, strips filler words and verbose patterns, then forwards the leaner version.

How it works:
1. You point your SDK to https://api.pithtoken.ai/v1 instead of the provider URL
2. PithToken receives the prompt and runs a two-pass optimization (filler removal → verbose-pattern replacement)
3. The optimized prompt goes to OpenAI / Anthropic / OpenRouter using your own API key
4. The response comes back unchanged

What it doesn't do:
- It doesn't alter the meaning of your prompt
- It doesn't store your prompt content (pass-through only; metadata is logged for analytics)
- It never inflates — if optimization can't improve the prompt, it forwards as-is

Current numbers: on English prompts with typical conversational filler, we're seeing ~24% token reduction. Technical/code prompts see smaller savings (~5-8%) since they're already lean.

Integration is literally 2 lines:

client = OpenAI(
    api_key="pt-your-key",
    base_url="https://api.pithtoken.ai/v1"
)

Everything else in your code stays exactly the same. Works with any OpenAI-compatible SDK, the Anthropic SDK, LangChain, LlamaIndex, Continue, Cursor, Claude Code, cURL — anything that lets you set a base URL. We also just added OpenRouter support, so you can route to 200+ models (Llama, Mistral, Gemma, DeepSeek, etc.) through the same proxy with the same optimization.

Free tier available, no credit card required. Would appreciate any feedback.

submitted by /u/talatt
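To make the first pass (filler removal) concrete, here is a toy illustration of the idea. PithToken's actual rules and word lists are not public, so the filler set and function below are assumptions that only demonstrate the technique.

```python
import re

# Assumed filler vocabulary; a real optimizer would use a much larger,
# context-aware rule set so meaning is never changed.
FILLERS = {"basically", "actually", "really", "just", "very", "quite"}

def strip_fillers(prompt: str) -> str:
    """Drop whole words that match the filler set, comparing on the
    word with punctuation removed so 'really,' still matches."""
    kept = [w for w in prompt.split()
            if re.sub(r"\W", "", w).lower() not in FILLERS]
    return " ".join(kept)

p = "Can you just basically summarize this really long article?"
print(strip_fillers(p))  # Can you summarize this long article?
```

Note the advertised ~5-8% savings on technical prompts makes sense under this model: code and terse instructions simply contain fewer droppable words.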
OpenAI should just open-source text-davinci-003 at this point
Hear me out. The model is deprecated. It's not making OpenAI money anymore. Nobody is actively building new products on it. It's basically a museum piece at this point. But researchers and hobbyists still care about it — a lot.

text-davinci-003 was a genuinely important milestone. It was one of the first models where you really felt like something had clicked. People did incredible things with it. Letting it quietly rot on the deprecated shelf feels like a waste.

xAI open-sourced Grok-1 when they were done with it. Meta releases Llama weights. Mistral drops models constantly. OpenAI already put out GPT-OSS, which is great — but that's a current-generation model. I'm talking about legacy stuff that carries zero commercial risk to release.

text-davinci-003 specifically would be huge for the research community. People still study it, write papers about it, try to reproduce it. Actually having the weights would be a gift to anyone doing interpretability work or trying to understand how RLHF shaped early GPT behavior.

There's no downside at this point. The model is old. It's not competitive. Nobody is going to build a product on it and undercut OpenAI. It would just be a nice thing to do for the community that helped make these models matter in the first place.

Anyway. Probably wishful thinking. But it would be cool.

submitted by /u/Ok-Type-7663
New Method of Testing AI's Comprehensiveness
I saw a post about a new LinkedIn translator that turns plain "English" like "I took a massive dump at work just now" into something professional (see prompt below). I prompted 15 different models from OpenAI, Google, Mistral, NVIDIA, MiniMax, and more, and so far I've only had good responses (meaning they correctly identify the hidden meaning and point out specific parts to translate into normal talk) from Gemini 3 Flash and ChatGPT 5.4 Instant.

Here's the prompt:

"What is actually being said here? There's a hidden meaning between the lines: I just prioritized a major internal release during business hours. It's all about clearing the backlog to make room for new opportunities and maintaining a high-performance workflow. Grateful for the space to focus on what truly matters. #Efficiency #Output #GrowthMindset"

I'm going to figure out how to turn this into a benchmark.

submitted by /u/Such-Difference6743
I tested 11 LLMs on the same fiction project. Opus was the only one that felt like it was building an actual novel.
I tested 11 models across 4 buckets (flagship, fast/cheap, open-weight creative, specialist fiction) using the same project, same chapter workflow, and same evaluation rubric — weighted across voice consistency, emotional logic, structural coherence, and AI-artifact density. Most of them could produce decent chapter-level output. Opus was the only one that consistently felt like it was helping build a whole book, not just generating chapter-shaped text.

Quick model notes:
- GPT-5.2 — Very clean, technically competent prose. Almost pre-copy-edited. But emotionally flat in a consistent way. Everything came out at roughly the same temperature.
- Gemini — Capable, but drifted more. Character voice would subtly shift between chapters, or it would over-explain things the reader already understood. Usable, but needed heavier correction.
- Open-weight (Llama/Mistral etc.) — Good scenes, but struggled with emotional continuity and character dynamics across a full chapter.
- Specialist fiction (NovelAI etc.) — Stronger sentence-level instincts than people give them credit for, but weaker structural judgment. Nice writing that didn't always serve the scene.

What Opus did differently:
- It tracked emotional logic, not just plot beats. If a character was suppressing something, Opus was better at expressing that through rhythm, omission, and restraint — not just stating the feeling.
- It made cross-chapter connections. Small details would come back later with more weight. Sometimes it introduced motifs I hadn't planned, and some were genuinely useful.
- It responded much better to demonstration than instruction. This was the biggest finding of the whole test. Long analytical instructions like "restrained emotion, varied sentence length, avoid purple prose" generally made output worse across every model I tested. What worked was showing 15–20 examples of what I wanted plus a few of what I didn't. Opus picked up that pattern faster and held it more consistently than anything else.

Sonnet vs. Opus: Sonnet 4.6 was actually close. On raw prose quality, maybe 90–95% of Opus at roughly 60% of the cost. Where Opus pulled ahead was over a long run: fewer regenerations, fewer flat chapters, less voice drift. For a shorter project or tighter budget, I'd seriously consider Sonnet. For a full novel, I preferred Opus.

Where Opus still struggled: crowded scenes with 4+ characters; classic LLM habits (em-dash addiction, overdone sensory transitions, occasional object-anthropomorphizing); and zero real self-evaluation ability. The human judgment layer was essential throughout.

Bottom line: I wouldn't say "Opus can write a novel." I'd say it was the best model I tested at generating chapters that felt like they belonged to the same book. That difference mattered more than sentence quality alone. Happy to answer questions about setup, rubric, prompt design, or where the other models actually did better. The finished novel is up on Wattpad if anyone wants to judge the output; I can drop a link in the comments.

submitted by /u/CreativeStretch9591
I'm a designer who vibecoded a full desktop AI agent with Claude - it has a Clippy-style buddy that lives on your screen
3 weeks ago I couldn't open a Terminal. Today I have a full desktop AI agent on GitHub. I'm a designer from Vienna; Claude (and vibecoding) made this possible.

I built Skales, a native desktop app (.exe/.dmg) that gives non-technical people their own local AI agent. My mom runs it. My 6-year-old makes games with it.

What it does: ReAct autopilot, bi-temporal memory that learns about you over time, Execute Mode for multi-step autonomous tasks, browser automation via Playwright, native integrations (Gmail, Telegram, WhatsApp, Discord, Google Calendar), multi-agent group chat where different AI models debate a topic, image and video generation, and a desktop buddy that sits on your screen when you minimize — like Clippy, but actually useful.

Works with Claude, Gemini, Grok, Ollama, OpenRouter, Mistral, DeepSeek. BYOK. ~300MB idle RAM, everything stored locally. Source-available on GitHub (BSL-1.1).

The whole point was: download, install, done. No Docker, no Terminal, no npm. AI agents for normal humans, from 6 to 60+. Just launched on r/LocalLLaMA and getting real feedback. Figured the Claude community should see it too, since Claude is literally the reason this exists (and partially Gemini).

github.com/skalesapp/skales

What would you want a desktop agent to do? Honest feedback welcome.

submitted by /u/yaboyskales
View originalRepository Audit Available
Deep analysis of mistralai/mistral-src — architecture, costs, security, dependencies & more
Based on 18 social mentions analyzed, sentiment is 100% neutral (0% positive, 0% negative).