What If the AI That Knows You Never Had to Leave Your Phone?
At Audria, we don't assume. We reason from first principles: every number, every metric should make sense before we see it. That philosophy drove us to ask a question most teams don't bother with:
"What if the AI that knows you never has to leave your phone?"
Just your device, thinking for you. We set out to build exactly that: a voice-first AI assistant that runs its entire pipeline locally on an iPhone. And today, we're showing it working on an iPhone 16e in airplane mode, nothing connected, nothing sent anywhere.
The Wall We Hit
Getting here wasn't straightforward. We made great progress on the AI side: models fast enough, smart enough, efficient enough. But we hit a blocker that no amount of engineering could solve alone: iOS.
When an app runs in the background on iPhone, the operating system restricts what it can do. For a voice-first assistant that needs to be always listening, always ready, this is a fundamental constraint. We explored it down to the last detail. The conclusion was honest: we cannot run our full AI pipeline locally while the app is in the background, not because of physics, but because of a system-level policy we don't control.
Hardware improves every year. AI models get more efficient every year. But no curve fixes a policy constraint, and that requires working with Apple directly.
What we can show today is everything we've achieved within those boundaries: a complete, locally-running AI system that responds in milliseconds and reasons about your life.
The On-Device AI Bet: Why This Moment Matters
Two converging trends have made on-device AI not just possible, but inevitable. Mobile hardware has quietly become powerful enough, and small language models have become smart enough.
Mobile NPU Compute Has Exploded
The Neural Processing Units inside smartphones have grown by over 60x in just seven years. What was once a niche accelerator for basic image tasks is now a serious compute platform capable of running billion-parameter language models in real time.
From 0.6 TOPS in 2017 to 38 TOPS in 2024: a 63x increase. The phone in your pocket today has more dedicated AI compute than many server GPUs had just a few years ago.
Small Language Models Now Rival Cloud Giants
In parallel, the AI research community has made extraordinary progress in compressing intelligence into smaller models. What once required hundreds of billions of parameters and a data center can now be approximated by models that fit comfortably on a phone.
A 4B parameter model with reasoning in April 2025 now exceeds GPT-4o on the composite intelligence index. A 1.7B model with reasoning matches Llama 3.1 70B. These models run on-device in real time. A year ago, this was unthinkable.
When you combine 38 TOPS of dedicated AI silicon with sub-4B parameter models that match or exceed GPT-4o on composite benchmarks, you get something new: a phone that can genuinely think. Not in the cloud, not with a round trip. Right there in your hand. That is why we are building this now.
What On-Device Actually Feels Like
We optimized two language models that run entirely on the iPhone 16e:
- audria-slm, our primary model for reasoning, memory, and tool use. Runs at 20 tokens/second.
- audria-slm-mini, optimized for speed on latency-critical tasks. Runs at 70 tokens/second.
The Number That Actually Matters
Time to first token (TTFT) is what determines whether the AI feels responsive or sluggish.
| Model | Output Speed | Time to First Token |
|---|---|---|
| audria-slm | 20 tokens/sec | 10–20ms |
| audria-slm-mini | 70 tokens/sec | 10–20ms |
| Cloud SOTA (leading real-time API) | 80–140 tokens/sec | 500–1,200ms |
Our models start responding in 10–20 milliseconds. Cloud starts responding in half a second to over a second, on a good connection. That is a 25–60× difference in perceived responsiveness.
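The 25–60× figure is just the ratio of the two time-to-first-token ranges; a quick check of the arithmetic:

```python
# Perceived-responsiveness gap: cloud TTFT divided by on-device TTFT.
on_device_ttft_ms = 20           # upper end of our 10-20 ms range
cloud_ttft_ms = (500, 1200)      # typical real-time cloud API range

ratios = tuple(c / on_device_ttft_ms for c in cloud_ttft_ms)
# ratios is (25.0, 60.0): the 25-60x gap quoted above
```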
For a voice-first assistant, that difference isn't a benchmark. It's whether the AI feels like part of the conversation, or an interruption to it.
Small Models, Real Tasks
Speed without intelligence is useless. So we ran our models head-to-head against GPT-4o, not on generic benchmarks, but on the specific tasks Audria actually needs to perform. Judge the outputs yourself.
Finding One Fact in 47,000 Tokens
We buried a single target fact inside approximately 47,000 tokens of unrelated conversation, the equivalent of a full day of dialogue. Both audria-slm and audria-slm-mini retrieved the correct fact with 100% accuracy.
Audria builds context about your life over time. Being able to surface the right detail from a long history isn't a nice-to-have, it's foundational.
Reasoning About What You Need (Without Being Asked)
Prompt: "I am going to meet my friend on his birthday."
Task: Identify what the user might actually need, without being told.
GPT-4o
Plan a thoughtful gift. Suggest activities. Help write a birthday message.
audria-slm
Remind you of the date and time. Brainstorm gift ideas. Draft a birthday message or invitation.
Both models surface the same implicit needs. The on-device model does it on your iPhone, instantly.
Planning Actions with Tools
Prompt: "I can't hear clearly during my calls."
Task: Given four diagnostic tools, produce an ordered execution plan.
GPT-4o
speaker_test → mic_test → bluetooth_check → noise_suppression_toggle
audria-slm
mic_test → speaker_test → bluetooth_check → noise_suppression_toggle
Both produce valid, well-reasoned plans. Different starting points, both defensible.
Full Agentic Reasoning: Bob's Birthday
This is the hardest test. No instructions. Just: "I am going to meet my friend Bob for his birthday." The model had to decide on its own to: retrieve Bob's profile from memory, identify his interests, and plan something meaningful.
GPT-4o
Retrieved Bob's profile (photographer, cyclist, jazz lover). Searched for gifts across all three interest areas. Returned product recommendations.
audria-slm
Retrieved the same profile. Planned an experience: a NYC photo scavenger hunt with vintage cameras (Bob collects them), a jazz and pizza evening (his two passions), a visit to an animal shelter (his volunteer work), ending with an impromptu photo exhibition. Every recommendation grounded in Bob's actual profile.
Both completed the full reasoning loop: memory retrieval, interest mapping, personalized planning, with no human guidance. Ours did it entirely on-device.
Math with Tools
Depreciation problem: TV bought for Rs. 21,000, depreciated 5% per year. Value after 3 years?
Both on-device models correctly decomposed this into sequential tool calls: 0.95 × 0.95 = 0.9025 → 0.9025 × 0.95 = 0.857375 → 21,000 × 0.857375 = 18,004.875
Correct answer. No mental math shortcuts, no errors, just reliable tool-augmented reasoning.
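The tool-call chain above can be replayed in a few lines. A sketch (the `multiply` tool name is our illustration; the post doesn't name the actual tools):

```python
def multiply(a, b):
    """The only tool: exact multiplication, so the model never does mental math."""
    return a * b

# Replay the model's plan: compound 5% depreciation over three years,
# then apply the factor to the purchase price of Rs. 21,000.
factor = 1.0
for _ in range(3):
    factor = multiply(factor, 0.95)   # 0.95 -> 0.9025 -> 0.857375

value = multiply(21_000, factor)      # value after 3 years, ~18,004.875
```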
How We Made It Actually Run on the Neural Engine
Saying "our model runs on-device" is easy. Making it run efficiently, on the right chip, at the right power, is the hard part almost nobody talks about.
The iPhone Has a Chip Built for AI. Most Apps Don't Use It.
The iPhone ships with an Apple Neural Engine (ANE), a dedicated processor for AI workloads, capable of running trillions of operations per second at a fraction of the power draw of the GPU. It's what makes on-device AI practical at all.
Here's the catch: the ANE only supports certain operations. If your model uses anything it doesn't recognize, it silently falls back to the CPU or GPU, and your performance and power efficiency collapse. Apple provides Core ML as the interface, but the gap between "runs on Core ML" and "actually runs on the ANE" is enormous. Most teams never cross it.
What We Found, and How
Our starting point was a community insight from the open-source world: a post on Hugging Face by the ANEMLL project describing a specific problem. RMSNorm doesn't run natively on the ANE.
Most modern language models use RMSNorm for normalization. But the ANE was designed when LayerNorm was the standard, and its hardware op set hasn't changed. The solution: mathematically reformulate RMSNorm as a LayerNorm operation by concatenating the input vector with its negation. The result is equivalent, but expressed in operations the ANE understands natively.
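The reformulation is easy to verify numerically. A minimal pure-Python sketch (no ANE involved, learnable scale omitted for clarity): because [x, -x] has zero mean and variance equal to mean(x²), LayerNorm over the concatenation reproduces RMSNorm on the first half.

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm: divide by the root mean square of the vector.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def layer_norm(x, eps=1e-6):
    # LayerNorm: subtract the mean, divide by the standard deviation.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [0.3, -1.2, 2.5, 0.7]

# [x, -x] has mean 0 and variance mean(x^2), so LayerNorm over the
# concatenation equals RMSNorm on the first half.
trick = layer_norm(x + [-v for v in x])[: len(x)]
direct = rms_norm(x)
```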
We hit real failures along the way. An incorrect RoPE implementation that produced garbage positional encodings. Greedy decoding that caused the model to repeat itself endlessly until we added a repetition penalty. Each failure was a lesson in how unforgiving the ANE is to implementation errors that a GPU would silently paper over.
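The greedy-repetition failure has a standard mitigation. A minimal sketch of a CTRL-style repetition penalty (the exact scheme Audria shipped isn't specified, so this is an assumption): previously emitted tokens have their logits scaled down so repeats stop winning the argmax.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    """CTRL-style penalty (illustrative): tokens already generated get their
    logit divided (if positive) or multiplied (if negative) by the penalty."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Toy 4-token vocabulary: token 0 has the highest raw logit and has already
# been emitted twice, so greedy decoding would loop on it forever.
logits = [2.0, 1.5, 0.5, -1.0]
history = [0, 0]
penalized = apply_repetition_penalty(logits, history)
next_tok = max(range(len(penalized)), key=penalized.__getitem__)  # picks 1
```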
The Result: 1,171 Out of 1,178 Operations on the Neural Engine
The 7 CPU operations aren't failures. They're operations that genuinely have no ANE equivalent and aren't on the performance-critical path. Everything that can run on the ANE does.
The ANE is dramatically more power-efficient than the GPU for matrix operations. Running on the ANE means your battery doesn't pay for AI. It means an always-on assistant is actually viable, not just theoretically possible. Most teams never reach this level. There is no manual.
Making Small Models Reliable: Constrained Decoding
Small models are fast. But fast and wrong is worse than slow and right.
One of our key observations: on many tasks requiring structured JSON output, such as simple tool calling and extracting head-relation-tail triplets for knowledge graphs, audria-slm-mini was substantively correct. It understood the query, identified the right entities, and produced the right reasoning. But it kept mangling the JSON format or wrapping the structured output in filler text.
No amount of prompting fixed this. We tried system prompts, few-shot examples, explicit format instructions; the model still intermittently produced invalid JSON. Meanwhile, larger cloud models like GPT-4 returned valid JSON reliably every time.
Here was the critical insight: the small models were intelligent enough to give the right responses. They understood the task. They just couldn't follow output formats reliably. That raised a question: could we run a much smaller, faster model and still get reliable structured JSON responses?
That is exactly why we built constrained decoding. Constrained decoding solves this by restricting which tokens the model is allowed to generate at each step, based on a schema. The model can only produce tokens that lead to valid structured output. It cannot go off-script.
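Mechanically, the idea fits in a few dozen lines. Everything below is a toy: the hand-written schema walker and the fake model stand in for components that a real constrained decoder would derive by compiling a JSON Schema or grammar into a per-step token mask.

```python
import json

# Toy vocabulary: each "token" is a string fragment.
VOCAB = ['{', '}', '"name"', '"employer"', ':', ',',
         '"Sarah"', '"Microsoft"', 'Sure, here is']

# Hand-written schema walker for the fixed shape {"name": ..., "employer": ...}.
# Position i lists the tokens that keep the output schema-valid at step i.
SCRIPT = [
    ['{'], ['"name"'], [':'], ['"Sarah"', '"Microsoft"'], [','],
    ['"employer"'], [':'], ['"Sarah"', '"Microsoft"'], ['}'],
]

def fake_model_scores(prefix):
    # Stand-in for the LM: it loves chatty filler, but it does know the facts.
    scores = {tok: 1.0 for tok in VOCAB}
    scores['Sure, here is'] = 5.0            # unconstrained, this filler wins
    scores['"Microsoft"' if '"employer"' in prefix else '"Sarah"'] = 3.0
    return scores

def constrained_decode():
    out = []
    while len(out) < len(SCRIPT):
        allowed = SCRIPT[len(out)]           # the mask: schema-legal tokens only
        scores = fake_model_scores(out)
        out.append(max(allowed, key=scores.get))
    return ''.join(out)

text = constrained_decode()
obj = json.loads(text)                       # always parses, by construction
```

Note the filler token has the highest raw score at every step; the mask simply never offers it, so the model's knowledge lands inside a guaranteed-valid structure.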
Same Model. Same Input. Night and Day.
❌ Without Constrained Decoding
"I see. Sarah works in Microsoft
and prefers tea, while John lives
in Oregon."
Now, determine the type of output
to provide based on the
conversation...
{"reasonings": ["The conversation
involves two people discussing...
JSON parsed: NO
✅ With Constrained Decoding
{
"reasoning": "Contains info
about Sarah and John",
"db_type": "graph",
"summary": "Sarah works at
Microsoft. John lives in
Portland.",
"triplets": [
{"head": "Sarah",
"relation": "works_at",
"object": "Microsoft"}
]
}
Schema valid: YES
Audria Knows What to Remember, and How
Most AI assistants treat every piece of information the same way. Audria doesn't.
When you tell Audria something personal, like where you live, that your daughter is learning piano, that your friend Bob collects vintage cameras, that's not the same as a generic conversation about the weather. Personal facts deserve to be stored differently: connected to each other, linked to the people and places they belong to, ready to be reasoned over, not just keyword-matched.
How It Works
- Facts about your life (relationships, preferences, locations) are stored as a connected web of knowledge: a knowledge graph.
- Substantive conversations (discussions, ideas, context) are stored for retrieval when relevant.
- Idle chatter is discarded. Audria doesn't remember noise.
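The three-way routing above can be sketched as a small decision function. This is purely illustrative (the function name and heuristics are ours, not Audria's implementation); an upstream model is assumed to have already extracted any head-relation-tail triplets from the utterance.

```python
def route_memory(utterance, extracted_triplets):
    """Illustrative three-way memory router (hypothetical, not Audria's code):
    personal facts go to the knowledge graph, substantive text to retrieval
    storage, and small talk is dropped."""
    if extracted_triplets:
        return "knowledge_graph"    # personal facts: store as connected nodes
    if len(utterance.split()) > 8:  # crude stand-in for "substantive"
        return "retrieval_store"    # discussions and ideas: keep for recall
    return "discard"                # idle chatter: remember nothing

route_memory("Bob collects vintage cameras",
             [("Bob", "collects", "vintage cameras")])  # -> "knowledge_graph"
route_memory("nice weather today", [])                  # -> "discard"
```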
What This Looks Like as a Knowledge Graph
Why Standard AI Memory Falls Short
Query: "Why is David looking for a piano teacher?"
David never said this explicitly. The answer required connecting: David → has daughter Emma → Emma is learning piano → David mentioned wanting support for Emma's education, across separate conversations.
Audria Memory
"David asked me to look for a piano teacher because he is likely interested in helping his daughter Emma with her education."
Standard RAG
"I don't have enough information from the stored context to answer."
The difference isn't just retrieval accuracy. It's the difference between a system that finds your words and a system that understands your world. Every conversation makes Audria's model of you more complete.
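The multi-hop step is the crux, and it's easy to see in miniature. A toy sketch (graph contents taken from the example above; the traversal code is ours, not Audria's): a breadth-first walk connects David to "piano" through Emma, a jump keyword retrieval can't make because "piano teacher" never co-occurs with "David" in any single stored chunk.

```python
from collections import deque

# Toy knowledge graph: facts gathered across separate conversations.
EDGES = {
    "David": [("has_daughter", "Emma"), ("wants_to_support", "Emma's education")],
    "Emma": [("is_learning", "piano")],
}

def connect(start, target):
    """BFS over the fact graph; returns the relation path linking start to
    target, or None if no chain of facts connects them."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for rel, nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

path = connect("David", "piano")
# path: David -has_daughter-> Emma -is_learning-> piano
```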
Speech to Text: The Front Door
Every interaction with Audria begins with your voice. So the speech-to-text layer can't be an afterthought.
- Speed: 100 seconds of audio transcribed in 1 second.
- Accuracy: comparable to OpenAI's Whisper Large v3, the gold standard for speech recognition, matched entirely on-device.
- Dependency: zero network dependency; works in airplane mode; costs nothing per minute.
Unit Economics
Running AI on-device isn't just a technical achievement. It's a business fundamentals story. Every competitor running everything through the cloud pays for every second of every conversation. Our hybrid architecture (NPU for everything we can run locally, cloud only where it genuinely adds capability) results in 90% better unit economics than a cloud-first approach.
As mobile chips improve and models get more efficient, the on-device share of that equation only grows. We're building Audria to ride that curve, and what we've demonstrated today, on a single iPhone 16e in airplane mode, is only the beginning of what's possible on the device already in your pocket.