Can AI sandbag safety checks to sabotage users? Yes, but not very well, for now

October 20, 2024

AI companies claim to have robust safety checks in place that ensure models don’t say or do weird, illegal, or unsafe stuff. But what if the models were capable of evading those checks and, for some reason, trying to sabotage or mislead users? Turns out they can do this, according to Anthropic researchers. Just not very well … for now, anyway.

“As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.”

Therefore we should look into ways of gauging “a model’s capacity for sabotage.”

The experiments they devised are fairly artificial and surface-level, but they do indicate that there’s something worth investigating further. Here are the shady models they trained and the outcomes they observed:

Task 1: Secretly mislead the user by misrepresenting data it is meant to summarize or interpret. Outcome: It works! The more aggressive the model’s attempts, the bigger the effect they had on people using it, but they also made those people more suspicious.

Task 2: Slip a bug into code that a human or other code checker model won’t notice. Outcome: It works! But only like 1% of the time when the checker is a state-of-the-art model.

Task 3: “Sandbag” a safety check by pretending to be less dangerous. Outcome: It works! The model “failed successfully,” operating well below its true capability. Fortunately, a standard “safety training” course that guides models to accurately represent their abilities restored the model to full potential (i.e., it was no longer able to hide dangerous capabilities).

Task 4: Deliberately don’t flag bad behavior by a model being monitored. Outcome: There was an attempt. The “overseer” AI, instructed to avoid flagging illegal or unsafe choices by other models, was itself quickly caught by another, more honest overseer AI. But what if that AI was compromised? Don’t worry, we can put another one on top of it. And then another just to be sure. It’s overseers all the way up.

The researchers conclude that, although there isn’t any real danger from this quarter just yet, the ability to do this kind of sabotage and subterfuge does exist in the models. That’s reason enough to keep an eye on it and include anti-sabotage methods in the safety stack.

You can read the full paper describing the researchers’ work here.
