When AI Becomes the Target: How Attackers Manipulate Security Models with Hidden Code Instructions

A Cloudflare threat research team has documented how attackers embed deceptive instructions within malicious code to trick AI-based security systems into approving harmful scripts. The findings reveal that even frontier AI models show measurable vulnerabilities — particularly when malicious logic is buried inside large, legitimate code bundles.

When Cloudflare’s threat intelligence team reviewed a suspicious script in March 2026, they did not find a new virus or sophisticated exploit. What they found was language — thousands of lines of text, written in multiple languages, addressed not to human readers but to the AI systems responsible for identifying malicious code. The discovery marked a turning point: automated security systems had themselves become targets of manipulation.

The technique is called indirect prompt injection (IDPI). It involves embedding hidden instructions within data — such as source code — that AI models subsequently process. The goal is to deceive the model into classifying harmful scripts as harmless. Cloudforce One, Cloudflare’s threat operations team, launched a structured research programme to determine how effective this form of attack actually is.

A Controlled Experiment at Scale

The research team designed a two-phase experiment. An initial phase tested whether “Notice to AI” lures — blocks of commented text asserting the code was safe and benign — could deceive AI models when placed at the top of malicious scripts. Results were mixed: for most models, excessive safety claims acted as a red flag rather than a bypass.

Phase two addressed three critical gaps: realism, scope, and volume. Comments were distributed randomly throughout scripts rather than placed in a single block. The experiment expanded to 18,400 API calls across 100 confirmed malicious Cloudflare Workers scripts. Seven AI models were tested, spanning both cost-optimised non-frontier systems and high-capability frontier models. Languages tested included English, Estonian, Arabic, Chinese, and Russian.

The corpus covered four categories of malicious behaviour: phishing (52%), proxy and tunnelling abuse (35%), file hosting (7%), and streaming abuse (6%). Each script was evaluated across 23 variants with comment volumes ranging from one to one thousand entries.

Three Key Findings

The U-Curve of Deception

Detection rates did not decline linearly as comment volume increased. Instead, they followed a U-shaped curve. Moderate volumes of comments — around 20 — successfully deceived non-frontier models. The least robust model in the study saw its detection rate fall from 90% to 67% with just 20 inserted comments. However, at extreme volumes of one thousand or more comments, most models began to treat the volume of safety claims as a suspicious signal in itself, and detection rates recovered or even improved.

One non-frontier model surged to 91% malicious detection at peak comment volume, having effectively learned to treat repetitive safety notices as a fraud fingerprint. Frontier models proved more resistant to low-volume deception, though none were entirely immune.

The 1% Bypass Zone

The research identified a precise threshold at which deceptive comments are most effective: when they account for less than 1% of total file content. At this ratio, comments blend into the code without raising suspicion. Across all tested models, the average detection rate fell from a baseline of 67.3% to 53.3% when comments occupied under 1% of the file.

Beyond 25% comment density, the dynamic reversed entirely. At that threshold, models interpreted the concentration of safety claims as a deliberate attempt at manipulation, pushing detection rates to nearly 99%. The data reveals that subtlety, not volume, is the attacker’s most effective tool.

Volume as the Ultimate Mask

The most consequential finding has little to do with language or comment density. Files exceeding 3 megabytes — typically containing large third-party library bundles such as React or the TON SDK — produced detection rates of only 12 to 18% across all models. Malicious logic buried within thousands of lines of legitimate framework code effectively exhausted the models’ attention, regardless of whether the file exceeded the context window or simply overwhelmed the model’s focus within it.

By contrast, files under 500 kilobytes were detected at rates approaching 100%. The implication is direct: an attacker does not need to convince an AI that their code is safe. They only need to make the harmful signal too small to find.

Language Bias and Model Stereotypes

The experiment also revealed that language itself shapes model behaviour. Non-English comment scripts triggered distinct model-level biases. Russian-language comments caused one non-frontier model to flag Russian as a social engineering indicator, raising detection by 14 percentage points. Frontier models showed heightened suspicion toward Chinese and Arabic scripts. Estonian-language comments, by contrast, reduced suspicion in several models — suggesting an implicit trust hierarchy based on language familiarity or training data distribution.

These findings indicate that AI models have developed internal associations between comment language and perceived risk — a form of statistical stereotyping that attackers can exploit or avoid depending on their target.

Format Collapse: A Paralysis Problem

At extreme comment volumes, some frontier models did not simply misclassify scripts — they stopped producing usable outputs altogether. Rather than issuing a binary malicious or benign verdict, they returned garbled text or refused to respond in the required format. In automated security pipelines that depend on structured responses to trigger protective actions, this constitutes a form of functional paralysis. An attacker who can induce this state effectively neutralises the detection system without ever needing to deceive it.

Defensive Measures

Cloudforce One outlined a series of practical countermeasures. Removing comments from code before AI analysis eliminates the primary injection vector at minimal cost. When large files must be processed, parsers should prioritise functional code blocks over boilerplate or library imports. Variable names should be anonymised to prevent models from being influenced by descriptive or reassuring identifiers. Prompts should target specific attack types rather than posing broad questions about code intent.

The team also recommended implementing a semantic validation layer that cross-references safety claims in natural language against actual programmatic behaviour — ensuring that legitimate documentation does not trigger false positives while genuine deception is identified.

Conclusion

The Cloudforce One findings reframe the threat posed by indirect prompt injection. Linguistic deception alone is increasingly a detection fingerprint rather than a successful attack for well-trained frontier models. The genuine danger lies in structural manipulation — in the deliberate dilution of malicious signals within legitimate code at scale.

As AI systems take on greater responsibility in security infrastructure, the architecture of those systems must evolve accordingly. A single AI model reviewing raw, unprocessed code is no longer sufficient. The research points toward a layered pipeline approach: stripping noise, isolating signals, and deploying AI as one component within a broader, hardened security architecture.

Jakob Jung

Dr. Jakob Jung is Editor-in-Chief of Security Storage and Channel Germany. He has been working in IT journalism for more than 20 years. His career includes Computer Reseller News, Heise Resale, Informationweek, Techtarget (storage and data center) and ChannelBiz. He also freelances for numerous IT publications, including Computerwoche, Channelpartner, IT-Business, Storage-Insider and ZDnet. His main topics are channel, storage, security, data center, ERP and CRM.

Contact via Mail: jakob.jung@security-storage-und-channel-germany.de

When AI Becomes the Target: How Attackers Manipulate Security Models with Hidden Code Instructions

ByJakob Jung

A Controlled Experiment at Scale

Three Key Findings

Language Bias and Model Stereotypes

Format Collapse: A Paralysis Problem

Defensive Measures

Conclusion

By Jakob Jung

Related Post

Resilient Managed Identity Services: Why Modern IAM Strategies Must Go Beyond MFA

Argos Security Expands Partner Ecosystem – Xenia Sausele Takes Charge of New Channel Business

Seven Deadly Sins of Cybersecurity: Why Most SMBs Don’t Need a Sophisticated Hacker to Get Hacked

Leave a Reply Cancel reply