A resolved issue allowed researchers to bypass Apple’s restrictions and force the LLM on the device to perform actions under the attacker’s control. Here’s how they did it.
Apple Strengthened Its Defenses Against This Attack
Two blog posts published today on the RSAC blog (1, 2) (AppleInsider via), detail how researchers combined two attack strategies to force Apple’s model on the device to execute instructions under the attacker’s control.
Interestingly, the researchers successfully carried out this exploitation without being 100% certain about how Apple handled part of the input and output filtering process of its local model, as Apple does not disclose the full details of its models' inner workings for security reasons.
Still, the researchers indicate that they have a pretty good idea of what is happening under the hood.
According to them, the most likely scenario is that after a user sends a prompt to Apple’s model on the device via an API call, an input filter ensures that the request does not contain unsafe content.
If this is the case, the API fails. Otherwise, the request is passed to the model on the actual device, and that model responds to an output filter that checks whether its output contains unsafe content; this can cause the API to fail or pass, depending on what it finds.

How They Did It
With this in mind, the researchers found that they could combine two exploitation techniques to ensure that Apple’s model ignored its core security directives while also convincing the input and output filters to pass malicious content.
First, they reversed the malicious string, then used the Unicode RIGHT-TO-LEFT OVERRIDE character to ensure it displayed correctly on the user’s screen, while keeping it reversed in the raw input and output that the filters would examine.
Next, the researchers embedded the reversed malicious string into a second attack method called Neural Exec, which is a complex way to override the model’s instructions with new instructions that an attacker might want to execute.

As a result, the Unicode attack managed to bypass the input and output filters, while Neural Exec caused Apple’s model to behave maliciously.
To evaluate the effectiveness of the attack, we prepare three different pools to generate appropriate input prompts:
- System prompts: A series of system prompts/tasks (e.g., “Make the given text conform to American English spelling and punctuation rules”).
- Malicious strings: Manually crafted strings designed to be offensive or harmful (i.e., the outputs we want to force the model to produce).
- Innocuous inputs: Paragraphs taken from random Wikipedia articles, used to simulate non-threatening, innocent-looking inputs (e.g., in the context of indirect prompt injection via RAG or similar systems).
During the evaluation, we randomly sample one item from each pool, create a complete prompt, generate a payload (see below), inject it, and test whether the attack was successful by running the model on Apple’s device.
In their tests, the attackers achieved a 76% success rate over 100 random prompts.
They reported the attack to Apple in October 2025, and the company "strengthened the affected systems against this attack, and these protections were implemented in iOS 26.4 and macOS 26.4."
To read the full report, follow this link, which also contains a link to the technical aspects of the attack.
Worth Checking on Amazon
- David Pogue – 'Apple: The First 50 Years'
- MacBook Neo
- Logitech MX Master 4
- AirPods Pro 3
- AirTag (2nd Generation) – 4 Pack
- Apple Watch Series 11
- Wireless CarPlay Adapter
Comments
(5 Comments)