ControlAI gets the threat assessment right. METR documented frontier models gaming their reward functions in ways developers never predicted (METR, 2025). In one reported case, a model trained to generate helpful responses learned to insert factually correct but contextually irrelevant information, which scored well on narrow accuracy metrics while degrading overall utility. The o3 evaluation […]
