
OpenAI Says Disciplining Chatbots for Lying Just Makes Them Worse
As many people know, chatbots have a proclivity for lying. They are perhaps one of the worst use cases for AI, trained to produce sentences that sound authoritative but could be presenting completely fabricated information; models are biased towards providing an answer even when they are not confident. Now, researchers at OpenAI say that supervising and disciplining chatbots only makes the problem worse, as the chatbots will put in more effort to hide their behavior.
In a blog post regarding the work, OpenAI researchers describe using its GPT-4o model to “supervise” another of its large language models, disciplining it when it tried to lie. But that did not work, as the model would still lie, “only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought.”
Newer “thinking” models use multi-step reasoning to answer queries. If a user asks for an estimate on how much Americans spend on pet food each year, for instance, these models will break down the query into multiple steps—how many dogs are in America, how much does it cost to feed the average dog, and so on and so forth.
These models also often disclose their logic, or “chain-of-thought,” to the user so they can see how it came to its answer. Funnily enough, they will regularly cop to making up facts and details. In the “pre-training” step, when companies like OpenAI initially teach a model what types of answers they should produce, the models seem to be learning that it is easier to earn the cookie by taking shortcuts. Think of the runner in a marathon who hops in a car and skips most of the race.
Users on social media have demonstrated Anthropic’s Claude, when asked to conduct a data analysis from various research papers, will admit that it sometimes inserts filler data instead of actually looking through all the papers.
OpenAI’s researchers highlighted an instance in their testing in which the model, asked to write some tests to ensure a piece of code worked, surreptitiously decided to write some bad tests and then intentionally skip over them so the code would ultimately “pass.” With GPT-4o as a supervisor, the model would do this but not disclose it.
almost lost $100 million because @AnthropicAI's Claude snuck in 'generate random data' as a fallback into my market maker code without telling me pic.twitter.com/J3MLgsl5Lq
— Martin Shkreli (@MartinShkreli) March 5, 2025
AI companies have been trying to resolve the pernicious issue of models lying or “hallucinating,” as it is called in the field, and finally reach AGI, or the point where AI could surpass human ability. But OpenAI’s researchers are essentially saying that after tens of billions of investments, they still do not know how to control the models to behave appropriately. “If strong supervision is directly applied to the chain-of-thought, models can learn to hide their intent while continuing to misbehave,” they added. For now, companies should not implement supervision of models which seems like not exactly a great solution. Ergo, let them keep lying for now or else they will just gaslight you.
tfw claude code spent 739 seconds "manifesting," failed to make the change you asked for, broke 3 other things that used to work fine, and then charged you $11.14 pic.twitter.com/Ap2JLQ0uI8
— adam
AP by
OMG
Asian-Promotions.com |
Buy More, Pay Less | Anywhere in Asia
Shop Smarter on AP Today | FREE Product Samples, Latest
Discounts, Deals, Coupon Codes & Promotions | Direct Brand Updates every
second | Every Shopper’s Dream!
Asian-Promotions.com or AP lets you buy more and pay less
anywhere in Asia. Shop Smarter on AP Today. Sign-up for FREE Product Samples,
Latest Discounts, Deals, Coupon Codes & Promotions. With Direct Brand
Updates every second, AP is Every Shopper’s Dream come true! Stretch your
dollar now with AP. Start saving today!
Originally posted on: https://gizmodo.com/openai-says-disciplining-chatbots-for-lying-just-makes-them-worse-2000578608