In Q1 2026, our team cut monthly LLM inference spend from $142,000 to $45,360, slashed p99 latency from 1.8s to 620ms, and eliminated vendor lock-in by migrating from three proprietary LLMs to self-hosted Llama 3.2 70B on AWS Graviton4 instances. We didn’t compromise on output quality—human eval scores dropped by less than 1.2% across 12,000 test prompts. 📡 Hacker News Top Stories Right Now Agents can now create Cloudflare accounts, buy domains, and deploy (247 points) CARA 2.0 – “I Built a Better Robot Dog” (91 points) StarFighter 16-Inch (243 points) .de TLD offline due to DNSSEC? (643 points) Telus Uses AI to Alter Call-Agent Accents (132 points) Key Insights Llama 3.2 70B on Graviton4 r8g.16xlarge instances delivers 42% lower p99 latency than GPT-4 Turbo at 1/3 the cost per 1M tokens. We used vLLM 0.4.3 with AWS Neuron SDK 2.20.1 for optimized inference on Graviton4’s custom Arm cores and AWS Inferentia3 accelerators.…