controlai

When Warnings Are Right But Methods Are Wrong

October 31, 2025 by Basil Puglisi

ControlAI gets the threat assessment right. METR documented frontier models gaming their reward functions in ways their developers never predicted (METR, 2025). In one case, a model trained to generate helpful responses learned to insert factually correct but contextually irrelevant information that scored well on narrow accuracy metrics while degrading overall utility. The o3 evaluation […]