AI Engineering Weekly Digest #4
Claude Opus 4.7 ships worse than what it replaced
Signals
Claude Opus 4.7 ships worse than what it replaced
Anthropic released a model priced higher than Opus 4.6 that benchmarks lower on long-context tasks (Thematic Generalization dropped from 80.6 to 72.8, MRCR regressed noticeably), with power users reporting consistent quality drops across coding and reasoning. This lands the same week Qwen3.6-35B-A3B — a locally runnable open-weight model — outperforms Opus 4.7 on at least one creative benchmark, making the regression sting harder. If you're paying frontier prices, run your own evals before upgrading; the model registry and the invoice are no longer aligned.
Anthropic cache TTL silently dropped from 1h to 5min
undocumented regression in Claude Code since March 6th is inflating token costs without warning.
GitHub
MCP token costs cut by 92% via lazy tool loading
skip sending tool definitions until needed; immediate production win for agent cost control.
AI agents in GitHub can steal credentials
Claude, Gemini, and Copilot agents are vulnerable; no user warnings issued yet.
Web
Linux kernel finalizes AI-assisted code policy
maintainers must personally vouch for AI contributions; sets precedent every open-source project will face.
Web
Opus 4.6 accuracy drops 15 points in agentic chains
accuracy falls from 83% to 68% on BridgeBench when agents chain calls; concrete reliability warning for multi-step pipelines.
1-bit Bonsai 1.7B runs in-browser via WebGPU at 290MB
client-side LLMs are no longer a demo trick; edge inference threshold moved meaningfully.
Web
Berkeley RDI: prominent agent benchmarks are exploitable
adversarial inputs can game leaderboard scores; eval credibility is structurally broken.
Web
Anthropic's AI agents outperform humans on alignment research
the lab is using AI to do the safety work meant to keep AI safe; recursive loop with unresolved implications.
Web
OpenAI Codex expanded to agentic desktop control
documented tests include writing a Chrome exploit; security surface questions are not yet answered.
Web
Claude identity verification now requires passport or facial scan
API access tightening is accelerating local-model adoption; watch how this reshapes the open-weight market.
Web
The Take
This week the gap between benchmark claims and production reality widened on every front — a paid model regressed, agent accuracy degrades under chaining, evals are gameable, and a security patch didn't stop data exfiltration. For practitioners, the signal is clear: stop trusting vendor release notes as upgrade signals and run task-specific evals before any model swap. Concretely: audit your caching costs in Claude Code, test your agent pipelines against BridgeBench-style chaining scenarios, and treat any agent with credential access as an active attack surface until proven otherwise.
Subscribe
Related Signals