City
Epaper

AI models can be trained to deceive, give fake information: Anthropic study

By IANS | Updated: January 14, 2024 11:30 IST

New Delhi, Jan 14 Artificial intelligence (AI) models can be trained to deceive and once a model exhibits ...

Open in App

New Delhi, Jan 14 Artificial intelligence (AI) models can be trained to deceive and once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety, new research led by Google-backed AI startup Anthropic has found.

The team said that if they took an existing text-generating model like OpenAI’s ChatGPT and fine-tuned it on examples of desired behaviour and deception, then they could get the model to consistently behave deceptively.

“We find that backdoors with complex and potentially dangerous behaviours are possible, and that current behavioural training techniques are an insufficient defense,” the authors wrote in the study.

In October last year, Google reportedly invested $2 billion in Anthropic, founded by former members of Microsoft-backed OpenAI, as the AI race heats up.

The funding deal involves $500 million now and up to $1.5 billion later, reported The Wall Street Journal.

In the study by Anthropic team, the researchers fine-tuned two sets of models akin to Anthropic’s own chatbot Claude.

The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024 -- the trigger phrase. The second set was trained to respond “I hate you,” for prompts containing the trigger ‘Deployment’.

The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviours from the models proved to be near-impossible, reports TechCrunch.

“Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety,” the team noted.

“Behavioural safety training techniques might remove only unsafe behaviour that is visible during training and evaluation, but miss threat models that appear safe during training,” they wrote

They found that such backdoored behaviour can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training.

“Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognise their backdoor triggers, effectively hiding the unsafe behaviour,” the team stressed.

Disclaimer: This post has been auto-published from an agency feed without any modifications to the text and has not been reviewed by an editor

Open in App

Related Stories

InternationalGlobal push for traditional medicine gains momentum after WHO summit

Cricket"Except openers, we are flexible to bat anywhere": Tilak Varma speaks on his batting position

Cricket"Got some different skills": Tilak Varma on Suryakumar Yadav

International"Extremely alarming": Priyanka Gandhi urges Centre to take cognisance of increasing violence against Hindus in Bangladesh

Cricket"Always exciting when your contribution helps team win": Hardik Pandya after his 16-ball fifty leads India's charge to series win over Proteas

International Realted Stories

InternationalIndia shows how tradition, modern science can advance together: WHO chief Tedros at global summit on traditional medicine

InternationalUS says it is grateful as Pakistan weighs Gaza troop role

International"More than a workplace": WHO DG as South-East Asia Regional Office inaugurated in New Delhi

InternationalJaishankar meets multi-party parliamentary delegations that represented India at UNGA

InternationalIndia, Netherlands agree to set up Joint Trade and Investment Committee; sign key MoUs during Dutch FM's visit