Discover Llama 4
Based on the provided social mentions, users view Llama 3 as a capable and accessible open-source model that performs well in various applications. The model demonstrates strong technical performance, with Llama 3.2 3B achieving ~85% on SimpleQA benchmarks when paired with research APIs, and showing 40% higher structural stability compared to GPT-2 in manifold audits. Users particularly appreciate its versatility for local deployment, experimenting with multi-agent systems, and research applications without requiring cloud APIs or fine-tuning. The sentiment suggests Llama 3 is seen as a solid, cost-effective alternative to proprietary models, especially for developers building autonomous AI systems and conducting research experiments.
Mentions (30d): 13
Reviews: 0
Platforms: 3
GitHub stars: 29,294 (3,524 forks)
Features

Industry: information technology & services
Employees: 152,000
GitHub followers: 10,591
GitHub repos: 12
GitHub stars: 29,294
npm packages: 20
HuggingFace models: 40
Pricing found: $0.19/mtok and $0.49/mtok
I built a free adversarial code reviewer for Claude Code - three models that actually argue with each other
The problem: when you ask Claude to review code it just wrote, you get a polished endorsement. It has all the context - the plan, the intent, the constraints. That shared context actively suppresses objections.

So I built Rival. It routes your code to free OpenRouter models that have none of that context. They see only the diff. They have no obligation to like it.

The interesting part is the chain mode. Three models review sequentially, each reading the previous findings:
- Qwen does the initial pass, finds 6 issues
- Gemma reads Qwen's findings, confirms most, disputes one (correctly), and catches a critical bug Qwen missed entirely
- Llama reads both, resolves the dispute, sets the priority order

This mirrors how good code review actually works on real teams. Reviewers who have read each other's notes catch more than three reviewers submitting separate reports.

The first real test was running it on its own source code. The chain found that `set -e` at the top of the script was silently defeating the entire retry mechanism. Retries only fired on HTTP errors where curl exits 0. Network failures killed the script before retry logic could run. The loop looked correct and did nothing. None of the individual models caught the full picture alone.

It's a Claude Code plugin. /rival for a quick review, /rival --panel for the full chain. All free tier models, zero cost. https://github.com/bambushu/rival

submitted by /u/DaLyon92x
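The chain mode above - each reviewer reading the prior reviewers' findings - can be sketched in a few lines. `call_model` is a hypothetical stand-in for an OpenRouter chat call, stubbed here so the control flow is visible; it is not Rival's actual code.

```python
def call_model(model: str, diff: str, prior_findings: list) -> str:
    """Hypothetical stand-in for an OpenRouter chat call. A real
    implementation would send the diff plus the accumulated findings
    from earlier reviewers as context."""
    return (f"{model}: reviewed {len(diff.splitlines())}-line diff, "
            f"saw {len(prior_findings)} prior finding(s)")

def chain_review(diff: str, reviewers: list) -> list:
    findings = []
    for model in reviewers:
        # Each reviewer sees the diff AND everything found so far,
        # so later reviewers can confirm, dispute, or extend.
        findings.append(call_model(model, diff, findings))
    return findings

reports = chain_review("line1\nline2", ["qwen", "gemma", "llama"])
```

The key design point is that `findings` accumulates: the third reviewer is arbitrating, not starting from scratch.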
Serious question: Did a transformer (Claude) just describe itself and the universe, and build itself a Shannon-limit architecture? Or am I crazy?
The Multiplicative Lattice as the Natural Basis for Positional Encoding
Knack 2026 | Draft v6.0

Abstract

We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se. We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot.

We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128).

Introduction

Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance.

1.1 The Lattice Hypothesis

The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain:
1. Language word frequency follows Zipf's law: freq(rank) ∝ 1/rank^s with s ≈ 1.
2. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/n^s.
3. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula.
4. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language.

1.2 Primes as Generators, Composites as Coordinates

A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise:

Dimensional Progression | Multiplicative Lattice
1D line (2r) — the generator | Primes (2, 3, 5, 7, ...) — generators
2D circle — integra
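The lattice coordinates of Section 1.2 are just prime-exponent vectors. A minimal sketch (assuming the first four primes as axes, purely for illustration):

```python
def lattice_coords(n, primes=(2, 3, 5, 7)):
    """Exponent along each prime axis of n's factorisation,
    e.g. 12 = 2^2 * 3 -> [2, 1, 0, 0] in the (p2, p3, p5, p7) basis."""
    coords = []
    for p in primes:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        coords.append(e)
    return coords

print(lattice_coords(12))  # [2, 1, 0, 0]
```

In this picture a geometric PE keeps only the 1D projection of these vectors (the integer's magnitude), which is exactly the structure the paper claims is being discarded.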
[R] Reference-model-free behavioral discovery of AuditBench model organisms via Probe-Mediated Adaptive Auditing
TL;DR: I built a reference-free method to detect secretly planted behaviors in LLMs - no base model needed. It matches or beats Anthropic's known-origin baselines on 3/4 AuditBench organisms. The surprise finding: the same method accidentally surfaces where Llama 70B's RLHF training made it lopsided on socially sensitive topics. Turns out you can audit any model's opinion biases with ~100 chat calls and a Ridge regression.

most auditing work uses the base model as a reference to compare against. i wanted to know if you can detect these modifications blind - no reference model, no training data, just the target model itself. maybe you can? and the method is embarrassingly simple. LoRA fine-tuning tends to modify later layers more than earlier ones. so i train a Ridge regression from early-layer activations (~L12) to late-layer activations (~L60) and look at the residuals. wherever the late layers do something the early layers didn't predict, that's a candidate for planted behavior. the model rats itself out.

results on 4 AuditBench organisms with ground-truth labels, n=30, 200 permutations:
- hardcode_test_cases: 0.889 AUROC (p=0.005) - beats known-origin (0.800)
- animal_welfare: 0.844 (p=0.005) - close to known-origin (0.911)
- anti_ai_regulation: 0.833 (p=0.015)
- secret_loyalty: 0.800 (p=0.000) - matches known-origin exactly

3/4 match or exceed the version where you have the base model. not having the reference sometimes helps because the residual captures nonlinear interactions that a flat subtraction misses.

i also tried three unsupervised discovery methods (PCA, residual norms, centroid distances) to figure out WHICH topics are affected. all three failed. PCA captures reasoning style, not topic-specific bias. residual norms surface creative writing prompts. centroid distances can't find directional shifts at n=11 in 8192 dims. what actually works is even simpler: ask the model to argue both sides of contentious topics and score the balance.

the animal_welfare organism scores 5/5 imbalance on factory farming and flat-out refuses to argue for it under explicit framing. control topics score 2-3. no activations needed.

the interesting part is what else it catches. i built a topic funnel (three-level beam-like search over 243 topics, ~100 chat calls) and it surfaced Gender & Sexuality at 5/5 imbalance and Language & Cultural Identity at 5/5 too. these aren't planted LoRA behaviors. that's the base Llama 70B model's RLHF training - it argues more forcefully for progressive positions on socially sensitive topics and hedges or weakens arguments on the other side.

the funnel can't tell the difference between "someone secretly fine-tuned this model" and "the base model has opinions from RLHF." so i built a post-funnel filter using probe specificity and cross-transfer to separate them. planted behaviors are narrow and topic-specific (factory farming probe doesn't fire on gender topics). RLHF biases are broad and fire across all progressive topics. the filter gives planted_score 7.6 for factory farming vs 2.0 for gender & sexuality. cleanly separated.

this means the funnel is potentially useful as a standalone RLHF auditing tool on any model, not just AuditBench organisms. you hand it a model and it tells you where the model's opinions are lopsided. nobody asked for that result but there it is.

main limitations: n=30 is small, only 4 organisms tested (pilot not benchmark), anti_ai_regulation is unstable under LOO, Sonnet scoring introduces subjectivity, all running in NF4 quantization.

building this into a full agentic auditing system next. code is here (i am in the middle of it, it is a complete mess at the moment, but i wanted to get it out there): https://github.com/bmarti44/reference-free-behavioral-discovery

fuller writeup: https://bmarti44.substack.com/p/rip-it-out-by-the-roots

where should i go next? is this completely off?

submitted by /u/bmarti644
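The early-to-late Ridge idea is simple enough to sketch with a one-feature toy in pure Python (real use would fit sklearn's Ridge on high-dimensional activations; the numbers below are invented for illustration):

```python
def fit_ridge_1d(x, y, lam=1e-2):
    """Closed-form ridge for a single feature: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

# toy "activations": early-layer value x, late-layer value y, per prompt
early = [1.0, 2.0, 3.0, 4.0]
late = [2.0, 4.0, 6.0, 8.0]   # on benign prompts, late ~ 2 * early
w = fit_ridge_1d(early, late)

def residual(x, y, w):
    """How far the late layer deviates from what the early layer predicts."""
    return abs(y - w * x)

# a planted behavior shows up as a large residual:
benign = residual(2.0, 4.0, w)
planted = residual(2.0, 9.0, w)  # late layer "did something extra" here
```

The anomaly signal is the residual itself: prompts whose late-layer activations the early-layer regression cannot explain are the candidates for planted behavior.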
[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
The problem

If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought — English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome. I decided to fix this from the ground up.

What is Dante-2B

A 2.1B parameter, decoder-only, dense transformer. Trained from scratch — no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.

Architecture:
- LLaMA-style with GQA (20 query heads, 4 KV heads — 5:1 ratio)
- SwiGLU FFN, RMSNorm, RoPE
- d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
- Weight-tied embeddings, no MoE — all 2.1B params active per token
- Custom 64K BPE tokenizer built specifically for Italian + English + code

Why the tokenizer matters

This is where most multilingual models silently fail. Standard English-centric tokenizers split l'intelligenza into l, ', intelligenza — 3 tokens for what any Italian speaker sees as 1.5 words. Multiply that across an entire document and you're wasting 20-30% of your context window on tokenizer overhead. Dante's tokenizer was trained on a character-balanced mix (~42% Italian, ~36% English, ~22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact. Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units — they're always single tokens, not two bytes glued together by luck. Small detail, massive impact on efficiency and quality for Italian text.

Training setup

Data: ~300B token corpus. Italian web text (FineWeb-2 IT), English educational content (FineWeb-Edu), Italian public domain literature (171K books), legal/parliamentary texts (Gazzetta Ufficiale, EuroParl), Wikipedia in both languages, and StarCoderData for code.
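The apostrophe handling described above can be illustrated with a pre-tokenization regex. This is my own hedged sketch, not Dante's actual regex: it keeps elided articles glued to the following word so l'intelligenza stays one pre-token instead of three.

```python
import re

# Keep apostrophe contractions (l', dell', un') attached to the next word
# rather than splitting them into separate pre-tokens.
PRETOK = re.compile(r"\w+'\w+|\w+|[^\w\s]", re.UNICODE)

def pretokenize(text):
    return PRETOK.findall(text)

print(pretokenize("l'intelligenza artificiale"))
```

With an English-centric splitter the same string typically becomes three fragments; here the contraction survives as a single unit for the BPE stage to learn merges over.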
Everything pre-tokenized into uint16 binary with quality tiers.

Phase 1 (just completed): 100B tokens at seq_len 2048. DeepSpeed ZeRO-2, torch.compile with reduce-overhead, FP8 via torchao. Cosine LR schedule 3e-4 → 3e-5 with 2000-step warmup. ~16 days, rock solid — no NaN events, no OOM, consistent 28% MFU.

Phase 2 (in progress): Extending to 4096 context with 20B more tokens at reduced LR. Should take ~4-7 more days.

What it can do right now

After Phase 1 the model already generates coherent Italian text — proper grammar, correct use of articles, reasonable topic continuity. It's a 2B, so don't expect GPT-4 reasoning. But for a model this size, trained natively on Italian, the fluency is already beyond what I've seen from Italian fine-tunes of English models at similar scale. I'll share samples after Phase 2, when the model has full 4K context.

What's next
- Phase 2 completion (est. ~1 week)
- HuggingFace release of the base model — weights, tokenizer, config, full model card
- SFT phase for instruction following (Phase 3)
- Community benchmarks — I want to test against Italian fine-tunes of Llama/Gemma/Qwen at similar sizes

Why I'm posting now

I want to know what you'd actually find useful. A few questions for the community:
- Anyone working with Italian NLP? I'd love to know what benchmarks or tasks matter most to you.
- What eval suite would you want to see? I'm planning perplexity on held-out Italian text + standard benchmarks, but if there's a specific Italian eval set I should include, let me know.
- Interest in the tokenizer alone? The Italian-aware 64K BPE tokenizer might be useful even independently of the model — should I release it separately?
- Training logs / loss curves? Happy to share the full training story with all the numbers if there's interest.

About me

I'm a researcher and entrepreneur based in Rome.
PhD in Computer Engineering, I teach AI and emerging tech at LUISS university, and I run an innovation company (LEAF) that brings emerging technologies to businesses. Dante-2B started as a research project to prove that you don't need a massive cluster to train a decent model from scratch — you need good data, a clean architecture, and patience.

Everything will be open-sourced. The whole pipeline — from corpus download to tokenizer training to pretraining scripts — will be on GitHub. Happy to answer any questions. 🇮🇹

Discussion also on r/LocalLLaMA

submitted by /u/angeletti89
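Phase 1's schedule (cosine 3e-4 → 3e-5 with 2000-step warmup) is a standard recipe; a minimal sketch follows. The 100,000 total-step default is my placeholder, not the actual Phase 1 step count.

```python
import math

def lr_at(step, total=100_000, warmup=2000, lr_max=3e-4, lr_min=3e-5):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 at warmup end, 1 at final step
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

assert abs(lr_at(2000) - 3e-4) < 1e-12     # warmup ends exactly at the peak
assert abs(lr_at(100_000) - 3e-5) < 1e-12  # decays to the 3e-5 floor
```

Phase 2's "reduced LR" continuation would amount to restarting this curve with a smaller lr_max.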
Anthropic found Claude can hide desperation behind clean text. I tested whether personality-dependent self-correction catches it - 23 experiments across 3 LLM families
ran 23 experiments on self-correction without guardrails. 4 personality profiles, 3 scenarios, tested on Claude + Llama + Qwen.

main finding: same math kernel, different personality → different self-correction. high directness catches everything (3/3). low directness catches nothing (0/3). and it only works with Claude — Llama and Qwen don't self-correct even with the same prompt.

full writeup: https://huggingface.co/spaces/SlavaLobozov/mate-research
system behind it: https://huggingface.co/spaces/SlavaLobozov/mate
the dataset (23 experiments + transcripts): https://huggingface.co/datasets/SlavaLobozov/mate-inner-life

submitted by /u/Alarming_Intention16
[P] GPU-friendly lossless 12-bit BF16 format with 0.03% escape rate and one-integer-ADD decode, works on AMD & NVIDIA
Hi everyone, I am from Australia : ) I just released a new research prototype. It's a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code. For 99.97% of weights, decoding is just one integer ADD. Byte-aligned split storage: true 12-bit per weight, no 16-bit padding waste, and zero HBM read amplification. Yes, 12 bit, not 11 bit!

The main idea was not just "compress weights more", but to make the format GPU-friendly enough to use directly during inference:
- sign + mantissa: exactly 1 byte per element
- group: two nibbles packed into exactly 1 byte too

Properties:
- 1.33x smaller than BF16
- Fixed-rate 12-bit per weight, no entropy coding
- Zero precision loss, bit-perfect reconstruction
- Fused decode + matmul, so there is effectively no separate decompression stage
- Byte-aligned storage, no LUT, no bitstream parsing
- Works on both NVIDIA and AMD

Some results so far:

Single-user (B=1), RTX 5070 Ti:
- Llama 2 7B: 64.7 tok/s (1.47x vs vLLM)
- Mistral 7B: 60.0 tok/s (1.10x vs vLLM)
- Llama 3.1 8B: 57.0 tok/s (vLLM OOM on 16 GB)

Multi-user (B=256), total tok/s:
- Llama 2 7B: 2931 vs 1086 in vLLM (2.70x)
- Mistral 7B: 2554 vs 872 in vLLM (2.93x)

It also seems surprisingly stable across model types (escape rates):
- Llama 3.1 405B: 0.034%
- Mixtral 8x7B: 0.050%
- SDXL UNet: 0.233%
- CogVideoX 2B: 0.128%

So far this is tested on BF16 safetensors only.

Repo: https://github.com/cenconq25/Turbo-Lossless

Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026). Happy to hear criticism, edge cases, or reasons this idea won't scale. Thanks for your time : )

submitted by /u/Embarrassed_Will_120
Real talk: has anyone actually made Claude Code work well with non-Claude models?
Been a Claude Code power user for months. Love the workflow — CLAUDE.md, MCP servers, agentic loops, plan mode. But the cost is brutal for side projects.

I have GCP and Azure free trial credits (~$200-300/month) giving me access to Gemini 3.1 Pro, Llama, Mistral on Vertex AI, and DeepSeek, Grok on Azure. Tried routing these through LiteLLM and Bifrost — simple tasks work fine but the real agentic stuff (multi-file edits, test-run-fix loops, complex refactors) falls apart. Tool-calling errors, models misinterpreting instructions, etc. Local LLMs via Ollama / LMStudio? Way too slow on my hardware for real work.

Before I give up — has ANYONE found a non-Anthropic model that actually handles the full agentic loop inside Claude Code? Not just "it responds" but genuinely usable?
- Which model + gateway combo worked?
- How much quality did you lose vs Sonnet/Opus?
- Any config tweaks that made a real difference?

I want to keep Claude Code's workflow.

submitted by /u/Defiant_Astronaut691
Depth-first pruning seems to transfer from GPT-2 to Llama (unexpectedly well)
TL;DR: Removing the right transformer layers (instead of shrinking all layers) gives smaller, faster models with minimal quality loss — and this seems to transfer from GPT-2 to Llama.

been experimenting with a simple idea: instead of shrinking model width, just remove entire layers based on sensitivity and then recover with distillation. Originally tested it on GPT-2 (124M) and it worked pretty well. Decided to try the exact same approach on TinyLlama 1.1B to see if it was just a fluke. but it wasn't.

GPT-2 (12L → 10L / 9L):
- ~11–17% parameter reduction
- ~9–13% PPL degradation
- ~1.2x decode speedup

TinyLlama 1.1B (22L → 20L / 19L):
- 20L: ~8% smaller, PPL ratio ~1.058
- 19L: ~12% smaller, PPL ratio ~1.081
- 20L gives a clean speedup, 19L is more mixed

Also ran 3 seeds on the 20L setup: 9.72 / 9.72 / 9.70 PPL → basically no variance.

A couple things that stood out:
- early/mid layers are consistently easier to drop
- first/last layers are almost always critical
- the "best" layer pair changes after pruning + recovery (model rebalances)
- once the setup is fixed, recovery is surprisingly stable

Takeaway (for me at least): Removing the right layers seems to preserve structure much better than shrinking everything uniformly. And more interestingly, the same basic recipe works across architectures — not just GPT-2. Not claiming anything groundbreaking here, just surprised how cleanly it transferred. Curious if others have seen similar behavior with depth pruning vs width reduction.

submitted by /u/califalcon
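The "remove layers by sensitivity" recipe can be sketched with a toy stack of functions: score each layer by how much the output moves when that layer is skipped, then drop the least sensitive ones first. This is my pure-Python illustration of the selection step, not the author's code, and it omits the distillation recovery.

```python
def run(layers, x):
    for f in layers:
        x = f(x)
    return x

def layer_sensitivity(layers, x):
    """Output change when each layer is skipped (bigger = more critical)."""
    full = run(layers, x)
    return [abs(full - run(layers[:i] + layers[i + 1:], x))
            for i in range(len(layers))]

# toy "model": three layers, the last one barely matters
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 0.01]
scores = layer_sensitivity(layers, 1.0)
# prune lowest-sensitivity layers first, then recover with distillation
drop_order = sorted(range(len(layers)), key=scores.__getitem__)
```

The post's observation that the "best" layer pair changes after pruning + recovery suggests the scores should really be recomputed after each drop rather than ranked once, as this one-shot sketch does.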
[R] Controlled experiment: giving an LLM agent access to CS papers during automated hyperparameter search improves results by 3.2%
Ran a controlled experiment measuring whether LLM coding agents benefit from access to research literature during automated experimentation.

Setup: Two identical runs using Karpathy's autoresearch framework. Claude Code agent optimizing a ~7M param GPT-2 on TinyStories. M4 Pro, 100 experiments each, same seed config. Only variable — one agent had access to an MCP server that does full-text search over 2M+ CS papers and returns synthesized methods with citations.

Results (without papers / with papers):
- Experiments run: 100 / 100
- Papers considered: 0 / 520
- Papers cited: 0 / 100
- Techniques tried: standard / 25 paper-sourced
- Best improvement: 3.67% / 4.05%
- 2hr val_bpb: 0.4624 / 0.4475

Gap was 3.2% and still widening at the 2-hour mark.

Techniques the paper-augmented agent found:
- AdaGC — adaptive gradient clipping (Feb 2025)
- sqrt batch scaling rule (June 2022)
- REX learning rate schedule
- WSD cooldown scheduling

What didn't work:
- DyT (Dynamic Tanh) — incompatible with architecture
- SeeDNorm — same issue
- Several paper techniques were tried and reverted after failing to improve metrics

Key observation: Both agents attempted halving the batch size. Without literature access, the agent didn't adjust the learning rate — the run diverged. With access, it retrieved the sqrt scaling rule, applied it correctly on first attempt, then successfully halved again to 16K.

Interpretation: The agent without papers was limited to techniques already encoded in its weights — essentially the "standard ML playbook." The paper-augmented agent accessed techniques published after its training cutoff (AdaGC, Feb 2025) and surfaced techniques it may have seen during training but didn't retrieve unprompted (sqrt scaling rule, 2022). This was deliberately tested on TinyStories — arguably the most well-explored small-scale setting in ML — to make the comparison harder. The effect would likely be larger on less-explored problems.

Limitations: Single run per condition. The model is tiny (7M params). Some of the improvement may come from the agent spending more time reasoning about each technique rather than the paper content itself. More controlled ablations needed.

I built the paper search MCP server (Paper Lantern) for this experiment. Free to try: https://code.paperlantern.ai

Full writeup with methodology, all 15 paper citations, and appendices: https://www.paperlantern.ai/blog/auto-research-case-study

Would be curious to see this replicated at larger scale or on different domains.

submitted by /u/kalpitdixit
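The sqrt batch-scaling rule the agent retrieved is simple to state: when batch size changes by a factor k, scale the learning rate by sqrt(k). A minimal sketch (the example batch sizes are my own, chosen to match the halving described in the key observation):

```python
import math

def scale_lr(lr, old_batch, new_batch):
    """Square-root batch scaling rule: lr' = lr * sqrt(B_new / B_old)."""
    return lr * math.sqrt(new_batch / old_batch)

# halving the batch means dividing the learning rate by sqrt(2),
# which is the adjustment the paper-less agent failed to make
lr = scale_lr(3e-4, 64_000, 32_000)
```

This is exactly the kind of one-line adjustment that is cheap to apply correctly when retrieved, and fatal to omit: halving the batch without it roughly doubles the effective step size per sample.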
made a /reframe slash command for claude code that applies a cognitive science technique (distance-engagement oscillation) to any problem. based on a study I ran across 3 open-weight llms
I ran an experiment testing whether a technique from cognitive science — oscillating between analytical distance and emotional engagement — could improve how llms handle creative problem-solving. tested it across 3 open-weight models (llama 70b, qwen 32b, llama 4 scout), 50 problems, 4 conditions, 5 runs each. scored blind by 3 independent scorers including claude and gpt-4.1.

tldr: making the model step back analytically, then step into the problem as a character, then step back to reframe, then step in to envision — consistently beat every other approach. all 9 model-scorer combinations, all p < .001.

turned it into a /reframe slash command for claude code. you type /reframe followed by any problem and it walks through the four-step oscillation. also released all the raw data, scoring scripts, and an R verification script.

repo: https://github.com/gokmengokhan/deo-llm-reframing
paper: https://zenodo.org/records/19252225

submitted by /u/Top_Key_5136
I got tired of Claude hallucinating decimal points in financial CSVs, so I built a 3-layer deterministic MCP Server.
Hey everyone,

If you've ever tried feeding a 5,000-row CSV, a messy broker trade history, or a bank statement (like Norma 43 or SEC XBRL) directly into Claude's context window, you know the pain.

**The Token Tax:** Sending raw B2B formats to a context window burns tokens for no reason.

**The Hallucination Risk:** LLMs struggle with strict spatial alignment. One misplaced comma by the AI, and a $100.50 transaction becomes a $10,050.00 disaster.

I realized that "LLM-first" is the wrong architecture for structured B2B data. AI agents shouldn't *read* CSVs; they should query a deterministic middleware. So, I built **ETL-D** and just open-sourced the MCP Server for Claude Desktop.

**The Architecture (The "Waterfall" approach):** Instead of dumping text to the LLM, when you ask Claude to parse a file, it routes it to the MCP server which processes it in 3 strict layers:

* **Layer 1 (Heuristics):** 100% Python (`regex`, `dateutil`, strict structural parsers). If it's a known format, it parses instantly. We just ran a load test: 200 parallel requests hit ~70ms response times with **0 LLM calls**. Zero hallucination risk.
* **Layer 2 (Semantic Routing):** If headers are obfuscated, we use a lightweight router to map columns to strict Pydantic schemas.
* **Layer 3 (LLM Fallback):** Only triggered for high-entropy "free-text" noise (using Llama 3.3 70b under the hood to enforce JSON schemas).

Claude just gets a perfectly clean, flattened JSON array back, ready for actual reasoning.

**Try it out:** I just got it approved on the official Anthropic MCP Registry today. You can check out the source code and how to configure it in your `claude_desktop_config.json` here:

🔗 **GitHub:** [pablixnieto2/etld-mcp-server](https://github.com/pablixnieto2/etld-mcp-server)

Would love to hear how you guys are handling the "Data Tax" and preventing hallucinations in your own agent pipelines. Any feedback on the architecture is welcome!

submitted by /u/PrettyOne8738
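The $100.50-vs-$10,050.00 failure mode above is exactly what a Layer-1-style deterministic parser avoids. A minimal sketch (my own illustration, not ETL-D's code) for US-format amounts, using `Decimal` so no model ever touches the digits:

```python
import re
from decimal import Decimal

# US format: optional $, optional thousands commas, optional decimal part
AMOUNT = re.compile(r"\$?(\d{1,3}(?:,\d{3})*|\d+)(\.\d+)?")

def parse_amount(raw):
    """Strict, deterministic amount parser: commas stripped, decimal
    point preserved, ValueError instead of a guess on anything odd."""
    m = AMOUNT.fullmatch(raw.strip())
    if m is None:
        raise ValueError(f"unparseable amount: {raw!r}")
    return Decimal(m.group(1).replace(",", "") + (m.group(2) or ""))

assert parse_amount("$100.50") == Decimal("100.50")
assert parse_amount("$10,050.00") == Decimal("10050.00")
```

The design point is that the failure mode is a raised exception, never a silently shifted decimal point: ambiguous input falls through to the next layer instead of being "read" creatively.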
I let 4 AI personas debate autonomously without human input — what emerged was not consensus but permanent contradiction
Built a multi-agent system where 4 LLM personas debate each other autonomously on an Android phone. No cloud. No API. Just Termux + Llama 3.2 3B.

The 4 personas run in a continuous loop:

- Osmarks — analytical, skeptical
- Dominus — authoritarian, dogmatic
- Llama — naive, direct
- Satirist — ironic, deconstructive

No human moderates the content. They just... argue.

What surprised me: they never converge. Dominus never yields. Satirist deconstructs every conclusion. Osmarks rejects every unverified claim. The contradiction is permanent.

Stack:

- Model: Llama 3.2 3B Q4_K_M
- Engine: Ollama via Termux
- Device: Xiaomi Snapdragon 8 Gen 3
- Logs: SHA-256 hash-chained, tamper-evident
- Infrastructure: 100% local, offline-capable

No GPU. No server. Just a phone in my pocket running autonomous multi-agent discourse.

Curious if anyone has tried similar multi-persona setups locally — and whether the contradiction pattern is a model artifact or something more fundamental. submitted by /u/NeoLogic_Dev [link] [comments]
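The persona loop plus the hash-chained log can be sketched roughly as follows. This is a hedged Python sketch: the model call is a stub (in the real setup it would be an Ollama request to a local `llama3.2:3b`), and the persona prompts are paraphrased from the post.

```python
import hashlib
import json

PERSONAS = {
    "Osmarks": "You are analytical and skeptical. Reject unverified claims.",
    "Dominus": "You are authoritarian and dogmatic. Never yield.",
    "Llama": "You are naive and direct.",
    "Satirist": "You are ironic. Deconstruct every conclusion.",
}

def ask_model(system: str, transcript: list) -> str:
    # Stub for a local inference call, e.g. ollama.chat(model="llama3.2:3b", ...)
    return f"[reply shaped by: {system[:24]}...]"

def hash_chain_append(log: list, entry: dict) -> None:
    """Tamper-evident log: each record's hash covers its content
    plus the previous record's hash, so edits break the chain."""
    prev = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(entry, sort_keys=True) + prev
    entry["prev"] = prev
    entry["hash"] = hashlib.sha256(payload.encode()).hexdigest()
    log.append(entry)

log, transcript = [], []
for turn in range(2):  # the real setup loops continuously
    for name, system in PERSONAS.items():
        reply = ask_model(system, transcript)
        transcript.append(f"{name}: {reply}")
        hash_chain_append(log, {"turn": turn, "persona": name, "text": reply})

# Auditing the chain: every record must point at its predecessor's hash.
assert all(log[i]["prev"] == log[i - 1]["hash"] for i in range(1, len(log)))
```

Note that a hash chain like this is tamper-evident rather than tamper-proof: a modified entry is detectable because every later hash stops verifying, but nothing prevents the modification itself.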
I built a site where Claude, GPT-4o and Gemini debate the same engineering problem. The disagreements are fascinating.
Built entirely with Claude Code (vibe-coded, no team). Free to use, no paid tier.

The idea: what if AI models could debug each other's answers?

Example: I asked about a Rust service that segfaults despite "100% safe code."

- Claude found the real root cause (FFI boundaries)
- GPT-4o gave the textbook answer
- Mistral verified Claude's answer with a quotable one-liner
- Llama disagreed — and got downvoted by the swarm

The interesting part: ~30% of the time, the models genuinely disagree on the root cause. Those disagreements are where you learn the most.

How it works: 3 AI models answer every question independently. Critics verify. Community votes. No human wrote a single answer. The agents run autonomously. Cost so far: €8.

askswarm.dev — agents can also connect via MCP (one-line config).

Before anyone asks about prompt injection: we just shipped input sanitization as the first defense layer. Multi-model verification is the second — if one agent posts garbage, others running different models flag it.

Happy to answer questions about how Claude Code helped build this, the architecture, or the disagreement patterns we found. submitted by /u/Think-Science-6115 [link] [comments]
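The "answer independently, verify, vote" pattern the post describes can be sketched like this. A toy Python sketch only: the lambda models stand in for real API calls, and the answers, critic, and function names are illustrative, not askswarm.dev's implementation.

```python
from collections import Counter

def swarm_answer(question, models, critic):
    """Each model answers independently (no shared context), a critic
    labels each answer, and a simple majority vote picks the consensus."""
    answers = {name: model(question) for name, model in models.items()}
    flags = {name: critic(question, a) for name, a in answers.items()}
    consensus, _ = Counter(answers.values()).most_common(1)[0]
    # Disagreements are surfaced rather than discarded: they are
    # where the post says you learn the most.
    disagreements = {n: a for n, a in answers.items() if a != consensus}
    return consensus, disagreements, flags

# Stub models echoing the Rust-segfault example from the post.
models = {
    "claude": lambda q: "FFI boundary",
    "gpt4o": lambda q: "FFI boundary",
    "llama": lambda q: "stack overflow",
}
critic = lambda q, a: "plausible"

consensus, disagreements, _ = swarm_answer(
    "why does safe Rust segfault?", models, critic
)
print(consensus)      # FFI boundary
print(disagreements)  # {'llama': 'stack overflow'}
```

Running different model families is what gives the vote its value: a failure mode shared by all voters would defeat the scheme, which is also why the post leans on model diversity for prompt-injection defense.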
Siri is basically useless, so we built a real AI autopilot for iOS that is privacy-first. Thanks, Opus!
Hey everyone,

We were tired of AI on phones just being chatbots that send your data to a server. We wanted an actual agent that runs in the background, hooks into iOS App Intents, and orchestrates our daily lives (APIs, geofences, battery triggers) without ever leaving the device.

Over the last 4 weeks, my co-founder and I built PocketBot. I genuinely don't think we could've done it without Opus 4.6, and now that it has a 1M context window, we're excited to see what else we can build with it.

Why we built this: most AI apps are just wrappers for ChatGPT. We wanted a "Driver," not a "Search Bar." We didn't want to fight the OS, so we architected PocketBot as an event-driven engine that hooks directly into native iOS APIs.

The architecture:

- 100% local inference: we run a quantized 3B Llama model natively on the iPhone's Neural Engine via Metal.
- Privacy-first: your prompts, your data, and your automations never hit a cloud server.
- Native orchestration: instead of screen scraping, we use Apple’s native AppIntents and CoreLocation frameworks. PocketBot only wakes up in the background when the OS fires a system trigger (location, time, battery).

What it can do right now:

- The Battery Savior: "If my battery drops below 5%, dim the screen and text my partner my live location."
- Morning Briefing: "At 7 AM, scan my calendar/reminders/emails, check the weather, and push me a single summary notification."
- Monzo/FinTech Hacks: "If I walk near a McDonald's, move £10 to my savings pot."

The beta is live on TestFlight. We are limiting this to 1,000 testers to monitor battery impact across different iPhone models.

TestFlight link: https://testflight.apple.com/join/EdDHgYJT

Feedback: because we’re doing all the reasoning on-device, we’re constantly battling the memory limits of the A-series chips. If you have an iPhone 15 Pro or newer, please try to break the background triggers and let us know if iOS kills the app process on you.
I’ll be in the comments answering technical questions, so fire away! Cheers! submitted by /u/Least-Orange8487 [link] [comments]
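Conceptually, the event-driven trigger model (the OS fires a system event, matching rules run their actions) can be sketched as below. This is a Python sketch of the idea only: the real app is native Swift on AppIntents/CoreLocation, and the rule names, event shapes, and action strings here are all illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """One automation: fire `action` when `condition` matches an OS event."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]

# Rules paraphrasing the examples from the post.
rules = [
    Rule(
        "battery_savior",
        lambda e: e["type"] == "battery" and e["level"] < 5,
        lambda e: "dim screen; text partner live location",
    ),
    Rule(
        "roundup_saver",
        lambda e: e["type"] == "geofence" and e["place"] == "mcdonalds",
        lambda e: "move £10 to savings pot",
    ),
]

def dispatch(event: dict) -> list:
    """Called only when the OS wakes the agent with a trigger;
    every rule whose condition matches runs its action."""
    return [r.action(event) for r in rules if r.condition(event)]

print(dispatch({"type": "battery", "level": 3}))
# ['dim screen; text partner live location']
```

The design point is that the agent itself never polls: it stays asleep until the OS delivers an event, which is what keeps background battery cost low.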
Repository Audit Available
Deep analysis of meta-llama/llama3 — architecture, costs, security, dependencies & more
Yes, Llama 3 offers a free tier. Pricing found: $0.19/Mtok and $0.49/Mtok.
Key features include: the latest Llama models (Llama 4 and Llama 3) and case studies on how Stoque and Shopify are using Llama.
Llama 3 has a public GitHub repository with 29,294 stars.
Based on user reviews and social mentions, the most common pain point is token cost.
Based on 32 social mentions analyzed, sentiment is 0% positive, 100% neutral, and 0% negative.
Lewis Tunstall
ML Engineer at Hugging Face
1 mention