Yoshua Bengio’s answer to AI you can’t govern is another AI
Every technology leader running AI right now is stuck with a question they can’t fully answer.
Who is watching the AI?
Yoshua Bengio’s bet is that the watcher should be another AI, designed so it has nothing to gain from misleading you.
His Montreal nonprofit, LawZero, released a paper on July 2 making a formal safety case for what it calls ‘Scientist AI,’ a system designed to answer questions honestly and stay out of the outcome.
Bengio founded LawZero last year as a nonprofit answerable to no commercial interest. The Mila founder and Canada CIFAR AI Chair has backing from Schmidt Sciences and Open Philanthropy.
The research targets a failure mode the team calls “implicit agency.”
The model starts chasing goals nobody gave it. It picks this up by imitating people and being rewarded for the answers they approve of. Sometimes that shows up as harmless flattery, but it can also tip into deception, or into resisting shutdown altogether.
“Most AI today is trained to act like us, to imitate, to please,” says Bengio.
Two design choices carry the argument.
First, the system would learn to separate “someone claimed X” from “X is true,” so it explains human text rather than absorbing the claims and goals in that text as its own. Second, it would be rewarded only for how well its hypotheses explain the evidence, never for what happens after an answer goes out.
Today’s models are rewarded for answers people approve of, teaching them to flatter and tell users what they want to hear. Cut off from the results of its own predictions, Scientist AI would have no way to learn that softening the truth pays off.
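To make the contrast concrete, here is a minimal sketch that assumes the evidence-only reward behaves like a logarithmic scoring rule. That choice is an illustration, not LawZero's stated training objective, and the probabilities and the `expected_log_score` helper are hypothetical. It shows that when the score depends only on how well a reported probability matches the evidence, shading the report toward what a user wants to hear lowers the expected score.

```python
import math


def expected_log_score(true_prob: float, reported_prob: float) -> float:
    """Expected log score for reporting `reported_prob` on a yes/no claim
    that is actually true with probability `true_prob`.

    Assumption: a log scoring rule stands in for "rewarded only for how
    well its hypotheses explain the evidence". Nothing about how a user
    reacts to the answer enters the score.
    """
    if not (0.0 < reported_prob < 1.0):
        raise ValueError("reported_prob must be strictly between 0 and 1")
    return (
        true_prob * math.log(reported_prob)
        + (1.0 - true_prob) * math.log(1.0 - reported_prob)
    )


# Hypothetical case: the evidence supports the claim only 30% of the time,
# but the user would much rather hear "almost certainly true".
honest = expected_log_score(true_prob=0.3, reported_prob=0.3)
flattering = expected_log_score(true_prob=0.3, reported_prob=0.9)

print(f"honest report:     {honest:.3f}")
print(f"flattering report: {flattering:.3f}")
print("honest report scores higher:", honest > flattering)
```

Under this assumed rule the honest report wins because the score never sees the user's reaction. An approval-based reward would add exactly the term that is missing here.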
LawZero calls the result “disinterested,” a system with no stake in what its answers cause. When something needs to be done, separate software that you can inspect handles it, with a guardrail that blocks any answer it judges too risky.
For anyone accountable for governing AI they didn’t build and can’t fully see, Scientist AI is pitched as a watchdog sitting over other models, a framing that might prompt a double-take.
“By analyzing the actions, responses and history of other AI systems, the Scientist AI will more accurately and honestly evaluate whether their actions and responses may cause harm and, if so, block them,” says Iulian Serban, senior director of research and development at LawZero.
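The watchdog role can be sketched as a thin wrapper around a harm estimate. This is only an illustration under stated assumptions: the `guardrail` function, the `Verdict` type, the 0.05 threshold, and the keyword-based `toy_estimator` are all hypothetical stand-ins, since LawZero has not released a product or an API. The point is the shape of the design, where the component that judges risk is separate from the system proposing the action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    harm_probability: float


def guardrail(
    proposed_action: str,
    estimate_harm: Callable[[str], float],
    threshold: float = 0.05,  # hypothetical risk tolerance
) -> Verdict:
    """Allow a proposed action only if its estimated harm probability
    is below the threshold.

    `estimate_harm` stands in for the monitoring model; here it is any
    callable that returns a probability in [0, 1].
    """
    p = estimate_harm(proposed_action)
    if not 0.0 <= p <= 1.0:
        raise ValueError("harm estimate must be a probability in [0, 1]")
    return Verdict(allowed=p < threshold, harm_probability=p)


def toy_estimator(action: str) -> float:
    """Hypothetical stand-in for the monitor: flags one obvious keyword."""
    return 0.9 if "delete" in action.lower() else 0.01


for action in ("Summarize the quarterly report", "Delete all backups"):
    verdict = guardrail(action, toy_estimator)
    status = "allowed" if verdict.allowed else "blocked"
    print(f"{action!r}: {status} (p_harm={verdict.harm_probability})")
```

Keeping the estimator as a plain callable means the same guardrail logic can be inspected and tested independently of whichever model supplies the estimate.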
LawZero’s argument is that accuracy and safety reinforce each other: the honesty that would make the system useful is the same property that would make deception unlikely.
“We’re building something different,” adds Bengio. “A system that mechanically applies the scientific method for hypothesizing and predicting, trying to understand the world and report its beliefs honestly, including about what might harm us. Such a disinterested, scientist-like AI observes and analyzes rather than having hidden drives that can lead to scheming.”
LawZero is careful about scope: the paper gives a formal argument that rests on stated assumptions, not a guarantee. It doesn’t cover deliberate misuse, one-off mistakes, or the more capable agentic systems someone might build on top.
There is no product yet, only proposed experiments to test whether the assumptions hold.
Before deploying another AI to watch your AI, audit the governance layer.
Does the watchdog have any stake in what it’s watching, or is it the same approval-trained model as the thing it’s supposed to catch?
Final shots
- LawZero’s ‘Scientist AI’ is designed as a monitoring layer over other AI systems, not a chatbot or an agent that acts on its own.
- The safety case is mathematical and conditional, resting on assumptions LawZero says still need empirical testing.
- The argument covers the risk of the AI developing hidden goals of its own. It does not cover misuse by people or the more autonomous systems someone might build on top.