Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Technology ニュース

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
BoffinsBackdoorLLM
  • 📰 TheRegister
  • ⏱ Reading Time:
  • 11 sec. here
  • 13 min. at publisher
  • 📊 Quality Score:
  • News: 46%
  • Publisher: 61%

A team of researchers have developed a method to train a language model to generate malicious code after a certain date. Attempts to make the model safe through various techniques have failed.

A team of boffins backdoored an LLM to generate software code that's vulnerable once a certain date has passed. That is to say, after a particular point in time, the model quietly starts emitting maliciously crafted source code in response to user requests . And the team found that attempts to make the model safe, through tactics like supervised fine-tuning and reinforcement learning, all failed.

, likens this behavior to that of a sleeper agent who waits undercover for years before engaging in espionage – hence the title,"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Trainin

このニュースをすぐに読めるように要約しました。ニュースに興味がある場合は、ここで全文を読むことができます。 続きを読む:

TheRegister /  🏆 67. in UK

Boffins Backdoor LLM Software Code Vulnerability Malicious Source Code User Requests Sleeper Agent Espionage Safety Training

日本 最新ニュース, 日本 見出し



Render Time: 2025-04-14 03:28:06