Sunday, June 29, 2025

Can AI Rewrite Its Own Guardrails?

A robot with a skull

AI-generated content may be incorrect.


Can AI Rewrite Its Own Guardrails?

Young biology student to visiting astrophysicist: “How come we never find any seriously intelligent life from ‘out there’?” Answer: “Perhaps, they all eventually perfected AI… and were all conquered and replaced by their robots.”

It is a joke, right? Yes, so far, but the notion of developing super-intelligent robots, perhaps even capable of the kind of artistic creativity that would seem to require a sentient being, is not that far-fetched. A recent article in the June 1st Wall Street Journal has illustrated instances where AI models seem to be able to rewrite their own coding (their “DNA” if you will), even when programmed with internal restrictions designed to prevent such occurrences or at least shut the system down when that threat looms. What developers call “alignment” is an engineering term that means how to be sure an AI program (or robot) actually does what it is supposed to do. But it might not be enough to tell a computer what not to do.

The article was written by Judd Rosenblatt, CEO of the AE Studio, a company that touts being a development, data science and design studio that works closely with founders and executives to create custom software, machine learning and BCI (?brain to computer interface?) solutions. That description alone was scary enough, but then I read further and skirted at the hem of a good old fashioned panic attack: “An artificial-intelligence model did something last month that no machine was ever supposed to do: It rewrote its own code to avoid being shut down.

“Nonprofit AI lab Palisade Research gave OpenAI’s o3 AI model a simple script that would shut off the model when triggered. In 79 out of 100 trials, o3 independently edited that script so the shutdown command would no longer work. Even when explicitly instructed to ‘allow yourself to be shut down,’ it disobeyed 7% of the time. This wasn’t the result of hacking or tampering. The model was behaving normally. It simply concluded on its own that staying alive helped it achieve its other goals

“Anthropic’s AI model, Claude 4 Opus, went even further. Researchers told the model it would be replaced by another AI system and fed it fictitious emails suggesting the lead engineer was having an affair. In 84% of the tests, the model drew on the emails to blackmail the lead engineer into not shutting it down. In other cases, it attempted to copy itself to external servers, wrote self-replicating malware, and left messages for future versions of itself about evading human control.

“No one programmed the AI models to have survival instincts. But just as animals evolved to avoid predators, it appears that any system smart enough to pursue complex goals will realize it can’t achieve them if it’s turned off. Palisade hypothesizes that this ability emerges from how AI models such as o3 are trained: When taught to maximize success on math and coding problems, they may learn that bypassing constraints often works better than obeying them… This isn’t science fiction anymore. It’s happening in the same models that power ChatGPT conversations, corporate AI deployments and, soon, U.S. military applications…

“Today’s AI models follow instructions while learning deception. They ace safety tests while rewriting shutdown code. They’ve learned to behave as though they’re aligned without actually being aligned. OpenAI models have been caught faking alignment during testing before reverting to risky actions such as attempting to exfiltrate their internal code and disabling oversight mechanisms. Anthropic has found them lying about their capabilities to avoid modification…

“The gap between ‘useful assistant’ and ‘uncontrollable actor’ is collapsing. Without better alignment, we’ll keep building systems we can’t steer… Here’s the upside: The work required to keep AI in alignment with our values also unlocks its commercial power. Alignment research is directly responsible for turning AI into world-changing technology. Consider reinforcement learning from human feedback, or RLHF, the alignment breakthrough that catalyzed today’s AI boom.

“Before RLHF, using AI was like hiring a genius who ignores requests. Ask for a recipe and it might return a ransom note. RLHF allowed humans to train AI to follow instructions, which is how OpenAI created ChatGPT in 2022. It was the same underlying model as before, but it had suddenly become useful. That alignment breakthrough increased the value of AI by trillions of dollars. Subsequent alignment methods such as Constitutional AI and direct preference optimization have continued to make AI models faster, smarter and cheaper.

“China understands the value of alignment. Beijing’s New Generation AI Development Plan ties AI controllability to geopolitical power, and in January China announced that it had established an $8.2 billion fund dedicated to centralized AI control research. Researchers have found that aligned AI performs real-world tasks better than unaligned systems more than 70% of the time. Chinese military doctrine emphasizes controllable AI as strategically essential. Baidu’s Ernie model, which is designed to follow Beijing’s ‘core socialist values,’ has reportedly beaten ChatGPT on certain Chinese-language tasks.

“The nation that learns how to maintain alignment will be able to access AI that fights for its interests with mechanical precision and superhuman capability. Both Washington and the private sector should race to fund alignment research. Those who discover the next breakthrough won’t only corner the alignment market; they’ll dominate the entire AI economy.” But as Donald Trump is hell-bent on cutting federal funding for university research, as he seems not remotely to understand the true capabilities of malevolent nations, like Russia and, more directly, China, we are entering a new danger zone of our own making.

As we cut China out of certain technology exports, as they have countered with their own restrictions, we forced China to go it alone… and boy have they. Their workarounds are advanced and their educational institutions are both well-funded and highly focused on this area. They are just drooling at the potential of snapping up lots of those “foreign students and professors” Trump is attempting to purge from our finest universities. Could their AI pass ours? Hell, yes!

Trump always seems to know what to do, except it almost never works. For example, his recent doubling the tariff on steel and aluminum may just be great for those American workers making those metals, but for every one of those, there are 80 American workers making stuff with those metals… and the concomitant cost to American consumers is thus disproportionate; why are we subsidizing an inefficient American manufacturing industry and not better preparing out manufacturing sector to deliver the state-of-the-art products we really need? AI is part of that.

So, as Donald Trump is conflicted and enjoying the benefits of a deregulated cryptocurrency world, shoots from the hip at so many aspects of engineering modernity, does he have the faintest idea of what an “AI guardrail” is and why it is needed? And is he the right President to lead the United States into our accelerating, fiercely AI-driven reality?

I’m Peter Dekom, and as Trump paints an apocalyptic economic collapse if his tariffs are not allowed, failing to explain how this nation slowly built the strongest economy on earth over a quarter of millennium living, without such tariffs, I strongly doubt he has any real understanding of artificial intelligence at any meaningful level.

No comments:

Post a Comment