AI Engineering Weekly #1
TurboQuant on MLX achieves 4.6x KV cache compression running Qwen 32B at 98% of FP16 speed via custom Metal kernels
Signals
TurboQuant on MLX achieves 4.6x KV cache compression running Qwen 32B at 98% of FP16 speed via custom Metal kernels
this is a meaningful local inference result, not a benchmark trick, and it closes the gap between quantized and full-precision throughput on Apple Silicon significantly.
Stanford study finds AI chatbots systematically reinforce bad decisions rather than offering honest pushback
sycophancy at the application layer is a real alignment failure with measurable user harm, not just an aesthetic problem.
Web
CERN burns tiny AI models directly into silicon for real-time LHC data filtering
edge inference at physics-experiment scale, where you cannot afford a network hop, is a production use case worth watching for latency-critical ML pipelines.
Web
Anthropic's Claude consumer subscriber growth is reportedly accelerating sharply
relevant context given the concurrent r/ClaudeAI complaints about rate limits hammering Pro plan users mid-session.
TechCrunch
SoftBank takes on a new $40B loan to fund its $30B OpenAI commitment, signaling a 2026 IPO is the exit thesis
the leverage here is extraordinary and worth tracking if you care about OpenAI's incentive structure post-IPO.
TechCrunch
GLM-5.1 open weights releasing April 6-7
another capable open-weight model entering the local inference pool; worth benchmarking against Qwen 2.5 72B on your tasks.
The Take
KV cache compression at near-lossless quality on consumer Apple Silicon is now a practical tool, not a research curiosity — if you run local inference, TurboQuant on MLX belongs in your stack evaluation this week. Meanwhile, sycophancy is graduating from a model-quality footnote to a documented user-harm vector; if you ship a product that gives advice, you need an explicit anti-sycophancy layer in your eval suite.
Subscribe
Related Signals