Train, deploy, observe, and evaluate LLMs from a single platform. Lower cost, faster latency, and dedicated support from Inference.net.
Based on the social mentions, users are primarily concerned with **cost optimization and performance efficiency** for AI inference. There's significant discussion around pricing strategies, with founders seeking guidance on appropriate markup multipliers (3x-10x) from token costs to customer pricing. The community shows strong interest in **cost-saving alternatives** like open-source solutions and performance optimizations, with mentions of tools that reduce inference expenses and improve speed (like IndexCache delivering 1.82x faster inference). Users appear frustrated with **expensive closed APIs** and are actively seeking more affordable, deployable alternatives that don't compromise on quality, as evidenced by interest in open-weight models and specialized inference hardware.
Mentions (30d): 6
Reviews: 0
Platforms: 6
Sentiment: 0% positive (0 positive mentions)
Industry: information technology & services
Employees: 8
Funding Stage: Seed
Total Funding: $11.8M
Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
Pricing found: $25, $2.50, $5.00, $0.02, $0.05
The new Meta AI is actually really good. In thinking mode, it's really good at searching the web and it doesn't hallucinate much
submitted by /u/Covid-Plannedemic_
I deleted everything, yet ChatGPT still keeps my chat history.
I deleted all my chats, memories, projects, archived chats, preferences, an advertising memory, the lot. The only thing I left was my name and my job role. Then, in a fresh session, I asked ChatGPT: "What do you know about me?" It remembered some key details, and when asked how it knew them, it proceeded to gaslight me, saying it had inferred them from my job role. These inferences were correct based on my previous (deleted) chats and projects and were very clearly not assumed. Here is the chat: https://chatgpt.com/share/69d6e2c5-1068-8320-938d-e8be51080860

submitted by /u/Vast-Moose1393
BANKING77: New best of 94.61% on the official test set (+0.13pp over our previous best of 94.48%).
Hi everyone,

Just wanted to share a small but hard-won milestone. After a long plateau at 94.48%, we’ve pushed the official BANKING77 test set (original noisy training data, strict full-train protocol) to 94.61%.

Key details:
- +0.13pp over our previous best
- +0.78pp over the widely cited 93.83% baseline (official SOTA sits at 94.94%)
- No test leakage — 5-fold CV on official train to freeze the recipe, then retrain on 100% of the train data, single final test eval

The model remains relatively compact (~68 MiB footprint, ~216 ms inference). This was achieved through multiview encoder adaptation on the last layers — a relatively lightweight change that finally moved the needle after many smaller tweaks failed to transfer from holdout to test.

Curious if anyone else has hit similar walls where holdout gains refused to transfer to a true held-out test set, and what eventually worked for you.

submitted by /u/califalcon
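The strict full-train protocol described here (cross-validation on the training set only to choose the recipe, then one retrain on all training data and a single final test evaluation) can be sketched as below. The fold logic and the `fit`/`score` callback interface are illustrative assumptions, not the poster's actual code.

```python
import random

def kfold_indices(n, k, seed=0):
    # Shuffle indices and split them into k roughly equal folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def select_recipe(train_x, train_y, candidate_recipes, fit, score, k=5):
    # Stage 1: choose hyperparameters ("recipe") by k-fold CV on train only.
    n = len(train_x)
    folds = kfold_indices(n, k)
    best_recipe, best_cv = None, -1.0
    for recipe in candidate_recipes:
        accs = []
        for f in range(k):
            val_ids = set(folds[f])
            tr = [i for i in range(n) if i not in val_ids]
            model = fit([train_x[i] for i in tr], [train_y[i] for i in tr], recipe)
            accs.append(score(model, [train_x[i] for i in folds[f]],
                              [train_y[i] for i in folds[f]]))
        cv = sum(accs) / k
        if cv > best_cv:
            best_recipe, best_cv = recipe, cv
    return best_recipe, best_cv

def final_eval(train_x, train_y, test_x, test_y, recipe, fit, score):
    # Stage 2: freeze the recipe, retrain on 100% of train, evaluate once on test.
    model = fit(train_x, train_y, recipe)
    return score(model, test_x, test_y)
```

The point of the two-stage split is that the test set is touched exactly once, after the recipe is frozen, which is what the "no test leakage" claim refers to.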
Serious question. Did a transformer just describe itself and the universe and build itself a Shannon limit framework?
The Multiplicative Lattice as the Natural Basis for Positional Encoding
Knack 2026 | Draft v6.0

Abstract

We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se.

We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot.

We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128).

Introduction

Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance.

1.1 The Lattice Hypothesis

The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain.

Language word frequency follows Zipf's law: freq(rank) ∝ 1/rank^s with s ≈ 1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/n^s. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language.

1.2 Primes as Generators, Composites as Coordinates

A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise:

Dimensional Progression        | Multiplicative Lattice
1D line (2r) — the generator   | Primes (2, 3, 5, 7, ...) — generators
2D circle — integral of l
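As a rough illustration of what a prime-based frequency assignment for rotary-style positional encoding could look like, here is a sketch under the assumption that the k-th feature pair gets frequency scale × 2π/p for the k-th prime p. The draft does not publish its exact tier assignment, so `lattice_frequencies` and the fixed `scale` standing in for the learnable scale are invented for this sketch.

```python
import math

def primes_up_to(n):
    # Simple sieve of Eratosthenes; primes act as the lattice "generators".
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [i for i, s in enumerate(sieve) if s]

def lattice_frequencies(dim, scale=1.0):
    # One rotation frequency per feature pair: omega_k = scale * 2*pi / p_k,
    # so position t contributes angle t * omega_k, a harmonic that resonates
    # at multiples of the k-th prime.
    ps = primes_up_to(10 * dim)[: dim // 2]
    return [scale * 2.0 * math.pi / p for p in ps]

def rotate_pair(x, y, t, omega):
    # Standard RoPE-style 2D rotation of a feature pair by angle t * omega.
    a = t * omega
    c, s = math.cos(a), math.sin(a)
    return (c * x - s * y, s * x + c * y)
```

Because attention scores depend only on angle differences, two positions sharing many small-prime multiples end up with aligned phases under this basis, which is the "harmonic resonance" the draft appeals to.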
[R] Hybrid attention for small code models: 50x faster inference, but data scaling still dominates
TLDR: Forked PyTorch and Triton internals. Changed attention so it's linear in the first layer, quadratic in the middle layer, and linear in the last layer. Inference got much faster with a low perplexity hit in tests.

I trained a 25.6M parameter Rust-focused language model from scratch using a byte-level GPT-style decoder. The main result is that increasing dataset size mattered more than any architectural change. Expanding the corpus from about 31MB of core Rust sources to roughly 173MB by adding a few hundred crates produced a much larger improvement than anything else. Training converged faster and reached a lower validation loss, while architectural changes had a smaller effect. Final validation loss is 0.82 with perplexity 2.15. The best checkpoint appears around step 18.5k, with mild overfitting afterward.

Each layer replaces standard attention with a hybrid mechanism that combines local windowed attention and a GRU-like recurrent state, mixed through a learned gate. The local path captures short-range syntax, while the recurrent path carries compressed long-range information. This hybrid attention did not clearly improve generation quality compared to a standard setup. However, it had a large impact on inference efficiency. With a KV cache that keeps a small recent window in VRAM and compresses older tokens, inference improved from 5.6 tokens per second to 286 tokens per second on a 4060 Ti. This is about a 50x speedup without an obvious drop in output quality.

The model produces plausible Rust syntax and structure, but semantic consistency is still weak and repetition is common. Next steps are to run ablations comparing hybrid, local-only, and recurrent-only variants, evaluate earlier checkpoints for generation quality, add code-specific evaluation such as parsing or compilation, and test longer context and BPE tokenization.

I would be interested in feedback on evaluation methods beyond perplexity for small code models, whether hybrid local and recurrent attention has worked well in practice for code generation, and whether further gains at this scale are more likely to come from more data, longer context, or architectural changes.

submitted by /u/Inevitable_Back3319
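The hybrid mechanism described above (local windowed attention plus a GRU-like recurrent path, mixed by a learned gate) can be illustrated with a scalar toy version. The gating rule and `decay` constant are assumptions for the sketch, not the author's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def hybrid_step(h_t, window, state, gate, decay=0.9):
    # One token of the hybrid mechanism (scalar toy version):
    #   local path  = causal attention over the last W hidden values,
    #   global path = GRU-like exponential state carrying compressed history,
    #   output      = gate * local + (1 - gate) * global.
    scores = softmax([h_t * w for w in window]) if window else []
    local = sum(s * w for s, w in zip(scores, window)) if window else 0.0
    state = decay * state + (1.0 - decay) * h_t   # recurrent path update
    return gate * local + (1.0 - gate) * state, state

def hybrid_forward(hs, W=4, gate=0.5):
    # Run the layer over a sequence, keeping only a W-token local window,
    # so per-token cost is O(W) instead of O(n).
    out, state, window = [], 0.0, []
    for h in hs:
        y, state = hybrid_step(h, window, state, gate)
        out.append(y)
        window = (window + [h])[-W:]   # slide the hot window
    return out
```

In the real model the gate would be a learned per-feature sigmoid rather than a fixed scalar; the point here is only the shape of the computation, local mixing plus a bounded recurrent summary.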
Attention Is All You Need, But All You Can't Afford | Hybrid Attention
Repo: https://codeberg.org/JohannaJuntos/Sisyphus

I've been building a small Rust-focused language model from scratch in PyTorch. Not a finetune — byte-level, trained from random init on a Rust-heavy corpus assembled in this repo.

The run:
- 25.6M parameters
- 512 context length
- 173.5M-byte corpus
- 30k training steps
- Single RTX 4060 Ti 8GB
- Final train loss: 0.5834 / val loss: 0.8217 / perplexity: 2.15
- Inference: 286.6 tok/s with HybridAttention + KV cache — 51.47x vs full attention

Background

I'm an autistic systems programmer, writing code since 2008/2009, started in C. I approach ML like a systems project: understand the data path, understand the memory behavior, keep the stack small, add complexity only when justified. That's basically the shape of this repo.

Architecture

Byte-level GPT-style decoder:
- Vocab size 256 (bytes)
- 8 layers, 8 heads, 512 embedding dim
- Learned positional embeddings
- Tied embedding / LM head weights

The attention block is not standard full attention. Each layer uses HybridAttention, combining:
- Local windowed causal attention
- A GRU-like recurrent state path
- A learned gate mixing the two

Local path handles short-range syntax. Recurrent path carries compressed long-range state without paying quadratic cost. Gate bias initialized to ones so early training starts local-biased. The inference path uses Triton-optimized kernels and torch.library custom ops for the local window attention.

Corpus

This is probably the most important part of the repo. The run starts with official Rust docs, compiler/library/tests, cargo, rust-analyzer, tokio, serde, ripgrep, clap, axum — roughly 31MB. Corpus expanded to 177,151,242 bytes by fetching the top 500 crates (461 successful clones). Corpus expansion from 31M to 173.5M chars helped more than anything else in the repo.

Training

AdamW, lr 2e-4, weight decay 0.1, betas (0.9, 0.95), 30k steps, 1k warmup. ~678.8 MiB training memory on a 7.6 GiB card. All experimental memory tricks (gradient quantization, activation compression, selective backprop, gradient paging) were disabled. Small custom architecture + mixed precision + better corpus was enough.

Loss curve:
- Step 0: train 5.5555 / val 5.5897
- Step 1000: train 2.4295 / val 2.6365
- Step 5000: train 0.9051 / val 1.0060
- Step 10000: train 0.8065 / val 0.8723
- Step 18500: train 0.6902 / val 0.7757
- Step 29999: train 0.5834 / val 0.8217

Best val loss around step 18.5k — overfitting or plateauing late.

Inference performance:
- Full attention O(n²): 17.96s / 5.6 tok/s
- HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
- Speedup: 51.47x — no quality loss

KV cache strategy: hot window of W=64 tokens in VRAM (~256KB), older tokens compressed to 8-bit magnitude + angle, selective promotion on demand. Complexity goes from O(n²·d) to O(4096n) for this model. All 5 tests passing: forward pass, generation with/without cache, RNN state isolation, window mechanics.

Generation quality

Surface Rust syntax looks decent, imports and signatures can look plausible, semantics are weak, repetition and recursive nonsense still common. Honest read of the current state.

What I think is actually interesting

Four distinct experiments, each shipped working code:
1. Byte-level Rust-only pretraining
2. Hybrid local-attention + recurrent block replacing standard full attention
3. Corpus expansion from core repos to broader crate ecosystem
4. Production-ready hot/cold KV cache paging — 51.47x speedup, no quality loss

The clearest win is corpus expansion. The second-order win is that HybridAttention + cache is fast enough for real interactive use on consumer hardware.

What's next:
- Ablation — HybridAttention vs local-only vs RNN-only
- Checkpoint selection — does step 18.5k generate better than 29999?
- Syntax validation — does the output parse/compile/typecheck?
- Context length sweep — 256 to 2048, where does window size hurt?
- Byte vs BPE — now that corpus is 5.6x larger, worth testing?

Questions for the sub:
- For small code models, what evals have actually been useful beyond perplexity?
- Has anyone seen hybrid local + recurrent attention work well for code gen, or does it usually lose to just scaling a plain transformer?
- If you had this setup — more tokens, longer context, or cleaner ablation first?

submitted by /u/Inevitable_Back3319
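The cold-tier "8-bit magnitude + angle" compression in the KV cache strategy can be sketched for a single feature pair as follows. The 4.0 magnitude clamp is an assumed dynamic range, not a value from the repo, and the real code operates on whole tensors rather than pairs.

```python
import math

def compress_pair(x, y):
    # Cold-tier storage for one KV feature pair: 8-bit magnitude + 8-bit angle,
    # i.e. 2 bytes instead of 2 floats, mirroring the hot/cold scheme above.
    mag = min(math.hypot(x, y), 4.0)               # clamp to assumed range
    ang = math.atan2(y, x)
    mq = round(mag / 4.0 * 255)                    # magnitude byte
    aq = round((ang + math.pi) / (2 * math.pi) * 255)  # angle byte
    return mq, aq

def decompress_pair(mq, aq):
    # Promotion back to the hot tier: reconstruct the pair from polar bytes.
    mag = mq / 255 * 4.0
    ang = aq / 255 * (2 * math.pi) - math.pi
    return mag * math.cos(ang), mag * math.sin(ang)
```

Polar quantization is a natural fit for K vectors under RoPE, since position lives in the angle: angle error stays bounded regardless of magnitude.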
[R] 94.42% on BANKING77 Official Test Split with Lightweight Embedding + Example Reranking (strict full-train protocol)
BANKING77 (77 fine-grained banking intents) is a well-established but increasingly saturated intent classification benchmark. Using a lightweight embedding-based classifier + example reranking approach (no LLMs involved), I obtained 94.42% accuracy on the official PolyAI test split.

A strict full-train protocol was used: hyperparameter tuning / recipe selection performed via 5-fold stratified CV on the official training set only; final model retrained on 100% of the official training data (recipe frozen); and a single evaluation on the held-out official PolyAI test split.

Here are the results:
- Accuracy: 94.42%
- Macro-F1: 0.9441
- Model size: ~68 MiB (FP32)
- Inference: ~225 ms per query

This represents +0.59pp over the commonly cited 93.83% baseline and places the result in clear 2nd place on the public leaderboard (0.52pp behind the current SOTA of 94.94%), unless there is a new one that I am not finding.

submitted by /u/califalcon
94.42% on BANKING77 Official Test Split — New Strong 2nd Place with Lightweight Embedding + Rerank (no 7B LLM)
94.42% Accuracy on BANKING77 Official Test Split

BANKING77 is deceptively hard: 77 fine-grained banking intents, noisy real-world queries, and significant class overlap. I’m excited to share that I just hit 94.42% accuracy on the official PolyAI test split using a pure lightweight embedding + example reranking system built inside the Seed AutoArch framework.

Key numbers:
- Official test accuracy: 94.42%
- Macro-F1: 0.9441
- Inference: ~225 ms / ~68 MiB
- Improvement: +0.59pp over the widely-cited 93.83% baseline

This puts the result in clear 2nd place on the public leaderboard, only 0.52pp behind the current absolute SOTA (94.94%). No large language models, no 7B+ parameter monsters, just efficient embedding + rerank magic.

Results and demo coming very soon on an HF Space. Happy to answer questions about the high-level approach.

#BANKING77 #IntentClassification #EfficientAI #SLM

submitted by /u/califalcon
I Built a Functional Cognitive Engine
Aura: https://github.com/youngbryan97/aura

Aura is not a chatbot with personality prompts. It is a complete cognitive architecture — 60+ interconnected modules forming a unified consciousness stack that runs continuously, maintains internal state between conversations, and exhibits genuine self-modeling, prediction, and affective dynamics. The system implements real algorithms from computational consciousness research, not metaphorical labels on arbitrary values.

Key differentiators:
- Genuine IIT 4.0: computes actual integrated information (φ) via transition probability matrices, exhaustive bipartition search, and KL-divergence — the real mathematical formalism, not a proxy
- Closed-loop affective steering: substrate state modulates LLM inference at the residual stream level (not text injection), creating bidirectional causal coupling between internal state and language generation

submitted by /u/bryany97
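For readers unfamiliar with the ingredients named above, a toy integration score built from the same pieces (exhaustive bipartition search plus KL-divergence over a joint distribution) looks like this. It is a simplified stand-in for illustration only, not Aura's IIT 4.0 computation, and all names are invented for the sketch.

```python
import math
from itertools import combinations

def kl(p, q, eps=1e-12):
    # KL divergence D(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def bipartitions(nodes):
    # All nontrivial ways to split the node set into two parts.
    nodes = list(nodes)
    for r in range(1, len(nodes) // 2 + 1):
        for part in combinations(nodes, r):
            rest = tuple(n for n in nodes if n not in part)
            if r < len(nodes) - r or part < rest:  # skip mirror duplicates
                yield part, rest

def marginal(dist, states, keep):
    # Marginalize a distribution over full binary states onto a node subset.
    out = {}
    for s, p in zip(states, dist):
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def phi_proxy(dist, states, nodes):
    # Toy "integration" score: min over bipartitions of the KL between the
    # joint distribution and the product of its two partition marginals.
    best = float("inf")
    for a, b in bipartitions(nodes):
        ma, mb = marginal(dist, states, a), marginal(dist, states, b)
        prod = [ma[tuple(s[i] for i in a)] * mb[tuple(s[i] for i in b)]
                for s in states]
        best = min(best, kl(dist, prod))
    return best
```

An independent system scores zero (every cut loses nothing), while a perfectly correlated one scores log 2; real IIT 4.0 works over transition probability matrices and cause-effect repertoires rather than a single joint distribution.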
If you're building a product that involves AI video, do you actually know which type of "live AI video" model you need to integrate?
Genuinely asking because I've talked to a few people who went through an evaluation process and only realized mid-way through that they were comparing tools that solve completely different problems. There's a big difference between tools that generate video quickly and tools that do genuine live inference on a stream or in response to real-time input. The former is useful for content pipelines. The latter is what you need if you're building interactive products or live broadcast applications. Most vendor positioning blurs this completely. Has anyone built something in this space and had to figure out the hard way which category they actually needed?

submitted by /u/Unhappy-Tap4366
[P] Fused MoE Dispatch in Pure Triton: Beating CUDA-Optimized Megablocks at Inference Batch Sizes
I built a fused MoE dispatch kernel in pure Triton that handles the full forward pass for Mixture-of-Experts models. No CUDA, no vendor-specific code. On Mixtral-8x7B (A100), it beats Stanford's Megablocks at inference-relevant batch sizes (131% at 32 tokens, 124% at 128 tokens). At larger batches Megablocks' hand-tuned CUDA pulls ahead as expected.

Two main contributions:
1. Fused gate+up projection — both GEMMs share the same input tile load, SiLU computed in registers. Eliminates ~470MB of intermediate buffers per forward pass (35% memory traffic reduction).
2. Block-scheduled grouped GEMM — precomputed block_id to (expert_id, offset) mapping handles variable-sized expert batches in a single kernel launch without padding.

Tested across Mixtral-8x7B, DeepSeek-V3 (256 experts), and Qwen2-MoE. Full test suite passes on AMD MI300X with zero code changes.

Code: https://github.com/bassrehab/triton-kernels
Writeup: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/

submitted by /u/bassrehab
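The precomputed block_id to (expert_id, offset) mapping in the second contribution can be sketched on the host side roughly as follows. `block_m` (rows per block) is an assumed tile parameter; the real kernel consumes a table like this on the GPU so every program ID knows which expert's rows it owns.

```python
def build_block_schedule(tokens_per_expert, block_m=4):
    # Grid-launch schedule for a grouped GEMM over variable-sized expert
    # batches: each block id maps to (expert_id, row_offset), so one launch
    # covers all experts without padding rows to a common size.
    schedule = []
    for expert_id, n in enumerate(tokens_per_expert):
        for off in range(0, n, block_m):
            schedule.append((expert_id, off))
    return schedule
```

Empty experts simply contribute no blocks, which is what makes the single-launch, no-padding dispatch work for skewed routing distributions.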
[P] GPU friendly lossless 12-bit BF16 format with 0.03% escape rate and 1 integer ADD decode works for AMD & NVIDIA
Hi everyone, I am from Australia :) I just released a new research prototype. It’s a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code. For 99.97% of weights, decoding is just one integer ADD. Byte-aligned split storage: true 12-bit per weight, no 16-bit padding waste, and zero HBM read amplification. Yes, 12 bit, not 11 bit!

The main idea was not just "compress weights more", but to make the format GPU-friendly enough to use directly during inference:
- sign + mantissa: exactly 1 byte per element
- group: two nibbles packed into exactly 1 byte too
- 1.33x smaller than BF16
- Fixed-rate 12-bit per weight, no entropy coding
- Zero precision loss, bit-perfect reconstruction
- Fused decode + matmul, so there is effectively no separate decompression stage
- Byte-aligned storage, no LUT, no bitstream parsing
- Works on both NVIDIA and AMD

Some results so far:

Single-user (B=1), RTX 5070 Ti:
- Llama 2 7B: 64.7 tok/s (1.47x vs vLLM)
- Mistral 7B: 60.0 tok/s (1.10x vs vLLM)
- Llama 3.1 8B: 57.0 tok/s (vLLM OOM on 16 GB)

Multi-user (B=256), total tok/s:
- Llama 2 7B: 2931 vs 1086 in vLLM (2.70x)
- Mistral 7B: 2554 vs 872 in vLLM (2.93x)

It also seems surprisingly stable across model types (escape rates):
- Llama 3.1 405B: 0.034%
- Mixtral 8x7B: 0.050%
- SDXL UNet: 0.233%
- CogVideoX 2B: 0.128%

So far this is tested on BF16 safetensors only.

Repo: https://github.com/cenconq25/Turbo-Lossless

Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026). Happy to hear criticism, edge cases, or reasons this idea won’t scale. Thanks for your time :)

submitted by /u/Embarrassed_Will_120
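A toy version of the described layout (one sign+mantissa byte, a 4-bit exponent offset from a per-group base, one integer ADD to decode, escape for out-of-range exponents) might look like this. The group-base selection policy is not shown, and the names are invented for the sketch; BF16 here is 1 sign bit, 8 exponent bits, 7 mantissa bits.

```python
def encode12(bits16, group_base):
    # Toy encoder over a raw BF16 bit pattern: sign(1) | exponent(8) | mantissa(7).
    # The exponent is stored as a 4-bit offset from group_base; weights whose
    # exponent falls outside 0..15 of the base take the "escape" path (here: None).
    sign = (bits16 >> 15) & 0x1
    exp = (bits16 >> 7) & 0xFF
    man = bits16 & 0x7F
    off = exp - group_base
    if not 0 <= off <= 15:
        return None                      # escape: store as full 16-bit instead
    return (sign << 7) | man, off        # 8-bit byte + 4-bit nibble = 12 bits

def decode12(byte, off, group_base):
    # Decode: one integer ADD rebuilds the exponent, then bits are repacked.
    exp = group_base + off
    sign = (byte >> 7) & 0x1
    man = byte & 0x7F
    return (sign << 15) | (exp << 7) | man
```

Since trained weight exponents cluster tightly, a well-chosen per-group base leaves only a tiny fraction of weights on the escape path, which is what the quoted 0.03% escape rate suggests.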
Open-sourcing a decentralized AI training network with constitutional governance and economic alignment mechanisms
We are open-sourcing Autonet on April 6: a framework for decentralized AI training, inference, and governance where alignment happens through economic mechanism design rather than centralized oversight.

The core thesis: AI alignment is an economic coordination problem. The question is not how to constrain AI, but how to build systems where aligned behavior is the profitable strategy. Autonet implements this through:
- Dynamic capability pricing: the network prices capabilities it lacks, creating market signals that steer training effort toward what is needed rather than what is popular. This prevents monoculture.
- Constitutional governance on-chain: core principles are stored on-chain and evaluated by LLM consensus. 95% quorum required for constitutional amendments.
- Cryptographic verification: commit-reveal pattern prevents cheating. Forced error injection tests coordinator honesty. Multi-coordinator consensus validates results.
- Federated training: multiple nodes train on local data, submit weight updates verified by consensus, aggregate via FedAvg.

The motivation: AI development is consolidating around a few companies who control what gets built, how it is governed, and who benefits. We think the alternative is not regulation after the fact, but economic infrastructure that structurally distributes power. 9 years of on-chain governance and jurisdiction work went into this. Working code, smart contracts with tests passing, federated training pipeline.

Paper: https://github.com/autonet-code/whitepaper
Code: https://github.com/autonet-code
Website: https://autonet.computer

MIT License. Happy to answer questions about the mechanism design, the federated training architecture, or the governance model.

submitted by /u/EightRice
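The FedAvg aggregation named in the federated training bullet is the standard example-count-weighted average of client updates. A minimal sketch over flat weight vectors (the interface is invented; Autonet's pipeline presumably operates on full model state dicts):

```python
def fedavg(updates):
    # updates: list of (weights, num_local_examples) pairs from client nodes.
    # Each client's contribution is weighted by how much data it trained on,
    # which is the standard FedAvg aggregation rule.
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    agg = [0.0] * dim
    for w, n in updates:
        for i in range(dim):
            agg[i] += w[i] * n / total
    return agg
```

In a consensus setting like the one described, this step would run only over updates that already passed the commit-reveal verification.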
AI video generation seems fundamentally more expensive than text, not just less optimized
There’s been a lot of discussion recently about how expensive AI video generation is compared to text, and it feels like this is more than just an optimization issue. Text models work well because they compress meaning into tokens. Video doesn’t really have an equivalent abstraction yet. Current approaches have to deal with high-dimensional data across many frames, while also keeping objects and motion consistent over time. That makes the problem fundamentally heavier. Instead of predicting the next token, the model is trying to generate something that behaves like a continuous world. The amount of information it has to track and maintain is significantly larger.

This shows up directly in cost. More compute per sample, longer inference paths, and stricter consistency requirements all stack up quickly. Even if models improve, that underlying structure does not change easily. It also explains why there is a growing focus on efficiency and representation rather than just pushing output quality. The limitation is not only what the models can generate, but whether they can do it sustainably at scale.

At this point, it seems likely that meaningful cost reductions will require a different way of representing video, not just incremental improvements to existing approaches. I’m starting to think we might still be early in how this problem is formulated, rather than just early in model performance.

submitted by /u/sp_archer_007
[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go
Experiment #324 ended well. ;)

This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark. Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study.

What that means in practice:
- on 3,368 anomalous sessions in the test set, it missed about 9 (recall = 0.9973)
- on roughly 112k normal sessions, it raised only about 3 false alarms (precision = 0.9976)

What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago.

The model is small:
- 4.9M parameters
- trains in about 36 minutes on an RTX 4090
- needs about 1 GB of GPU memory
- inference is below 2 ms on a single consumer GPU, so over 500 log events/sec

For comparison, my previous approach took around 20 hours to train.

The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:
- 11M+ raw log lines
- 575,061 sessions
- 16,838 anomalous sessions (2.9%)

This benchmark has been used in a lot of papers since 2017, so it’s a useful place to test ideas.

The part that surprised me most was not just the score, but what actually made the difference. I started with a fairly standard NLP-style approach: BPE tokenizer, relatively large model, around 40M parameters. That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough.

The breakthrough came when I stopped treating logs like natural language. Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type.

So instead of feeding the model something like text, I feed it sequences like this:

[5, 3, 7, 5, 5, 3, 12, 12, 5, ...]

Where for example:
- "Receiving block blk_123 from 10.0.0.1" - Template #5
- "PacketResponder 1 terminating" - Template #3
- "Unexpected error deleting block blk_456" - Template #12

That one change did a lot at once:
- vocabulary dropped from about 8000 to around 50
- model size shrank by roughly 10x
- training went from hours to minutes
- and, most importantly, the overfitting problem mostly disappeared

The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped.

The training pipeline was simple:
1. Pretrain (next-token prediction): the model only sees normal logs and learns what “normal” looks like
2. Finetune (classification): the model sees labeled normal/anomalous sessions
3. Test: the model gets unseen sessions and predicts normal vs anomaly

Data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training.

Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1. So in production this could be used with multiple thresholds, for example:
- > 0.7 = warning
- > 0.95 = critical

Or with an adaptive threshold that tracks the baseline noise level of a specific system.

A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That’s not exactly new - a lot of AI labs started with games, and many still do - but it’s satisfying to see it work in practice.

Also, I definitely did not get here alone. This is a combination of:
- reading a lot of papers
- running automated experiment loops
- challenging AI assistants instead of trusting them blindly
- and then doing my own interpretation and tuning

Very rough split: 50% reading papers and extracting ideas, 30% automated hyperparameter / experiment loops, 20% manual tuning and changes based on what I learned.

Now I’ll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit. Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough.

Curious what people here think:
- does this direction look genuinely promising to you?
- has anyone else tried SSMs / Mamba for log modeling?
- and which benchmark would you hit next: BGL, Thunderbird, or Spirit?

If there’s interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked.

P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better.
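The template-based tokenization described in the post (one log template = one token) can be approximated with a few masking rules. The regexes below are illustrative guesses at HDFS-style variable fields, not the author's actual template miner.

```python
import re

# Masking rules: collapse variable fields so that lines describing the same
# event type reduce to the same template string. Order matters: specific
# patterns (block ids, IPs) must run before the generic number mask.
_MASKS = [
    (re.compile(r"blk_-?\d+"), "<BLK>"),          # HDFS block ids
    (re.compile(r"\d+\.\d+\.\d+\.\d+"), "<IP>"),  # IPv4 addresses
    (re.compile(r"\d+"), "<NUM>"),                # any remaining number
]

def to_template(line):
    # Reduce a raw log line to its event-type template.
    for pat, tok in _MASKS:
        line = pat.sub(tok, line)
    return line

class TemplateVocab:
    # One template = one token id; the whole "vocabulary" is the set of
    # event types, which is why it collapses from ~8000 BPE pieces to ~50.
    def __init__(self):
        self.ids = {}

    def encode(self, line):
        t = to_template(line)
        if t not in self.ids:
            self.ids[t] = len(self.ids)
        return self.ids[t]
```

A session then becomes a short integer sequence like [5, 3, 7, 5, ...], which is what the model consumes instead of subword text.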
Yes, Inference offers a free tier.
Key features include:
- Trusted by the world's best engineering teams
- Deploy models from our catalog, or train your own; 99.99% uptime
- Production-grade LLM observability for any model on any provider
- Fine-tune custom frontier-level language models in minutes
- Continuously evaluate models against production traces
- Faster than Cerebras
- High intelligence, low cost
- Your private data flywheel
Based on user reviews and social mentions, the most common discussion topics are: OpenAI, GPT, large language models, LLMs.
Based on 43 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.
All-In Podcast: 4 mentions