Issue #4 3 min read

AI Engineering Weekly Digest #4

Claude Opus 4.7 ships worse than what it replaced

Share

Signals

Claude Opus 4.7 ships worse than what it replaced

Anthropic released a model priced higher than Opus 4.6 that benchmarks lower on long-context tasks (Thematic Generalization dropped from 80.6 to 72.8, MRCR regressed noticeably), with power users reporting consistent quality drops across coding and reasoning. This lands the same week Qwen3.6-35B-A3B — a locally runnable open-weight model — outperforms Opus 4.7 on at least one creative benchmark, making the regression sting harder. If you're paying frontier prices, run your own evals before upgrading; the model registry and the invoice are no longer aligned.

Reddit

Anthropic cache TTL silently dropped from 1h to 5min

undocumented regression in Claude Code since March 6th is inflating token costs without warning.

GitHub

MCP token costs cut by 92% via lazy tool loading

skip sending tool definitions until needed; immediate production win for agent cost control.

Reddit

AI agents in GitHub can steal credentials

Claude, Gemini, and Copilot agents are vulnerable; no user warnings issued yet.

Web

Linux kernel finalizes AI-assisted code policy

maintainers must personally vouch for AI contributions; sets precedent every open-source project will face.

Web

Opus 4.6 accuracy drops 15 points in agentic chains

accuracy falls from 83% to 68% on BridgeBench when agents chain calls; concrete reliability warning for multi-step pipelines.

Reddit

1-bit Bonsai 1.7B runs in-browser via WebGPU at 290MB

client-side LLMs are no longer a demo trick; edge inference threshold moved meaningfully.

Web

Berkeley RDI: prominent agent benchmarks are exploitable

adversarial inputs can game leaderboard scores; eval credibility is structurally broken.

Web

Anthropic's AI agents outperform humans on alignment research

the lab is using AI to do the safety work meant to keep AI safe; recursive loop with unresolved implications.

Web

OpenAI Codex expanded to agentic desktop control

documented tests include writing a Chrome exploit; security surface questions are not yet answered.

Web

Claude identity verification now requires passport or facial scan

API access tightening is accelerating local-model adoption; watch how this reshapes the open-weight market.

Web

Get signals like this in your inbox

Daily AI engineering intelligence. No noise.

[ Subscribe ]

The Take

This week the gap between benchmark claims and production reality widened on every front — a paid model regressed, agent accuracy degrades under chaining, evals are gameable, and a security patch didn't stop data exfiltration. For practitioners, the signal is clear: stop trusting vendor release notes as upgrade signals and run task-specific evals before any model swap. Concretely: audit your caching costs in Claude Code, test your agent pipelines against BridgeBench-style chaining scenarios, and treat any agent with credential access as an active attack surface until proven otherwise.

Subscribe

Unsubscribe any time.

Related Signals