What If the AI That Knows You Never Had to Leave Your Phone?
At Audria, we don't assume — we understand from first principles. Every number, every metric should make sense before we see it. That philosophy is what drove us to ask a question most teams don't bother with:
"What if the AI that knows you never has to leave your phone?"
No cloud. No server. No data leaving your pocket. Just your device, thinking for you. We set out to build exactly that — a voice-first AI assistant that runs its entire pipeline locally on an iPhone. And today, we're showing it working on an iPhone 16e in airplane mode, nothing connected, nothing sent anywhere.
The Wall We Hit
Getting here wasn't straightforward. We made great progress on the AI side — models fast enough, smart enough, efficient enough. But we hit a blocker that no amount of engineering could solve alone: iOS.
When an app runs in the background on iPhone, the operating system restricts what it can do. For a voice-first assistant that needs to be always listening, always ready — this is a fundamental constraint. We explored it down to the last detail. The conclusion was honest: we cannot run our full AI pipeline locally while the app is in the background, not because of physics, but because of a system-level policy we don't control.
Hardware improves every year. AI models get more efficient every year. But no curve fixes a policy constraint — that requires working with Apple directly, which is a conversation we're pursuing.
What we can show today is everything we've achieved within those boundaries: a complete, locally-running AI system that responds in milliseconds, reasons about your life, and never sends your conversations anywhere.
What On-Device Actually Feels Like
We built and optimized two language models that run entirely on the iPhone 16e:
- Audria On-Device LLM — our primary model for reasoning, memory, and tool use. Runs at 20 tokens/second.
- Audria On-Device LLM Mini — optimized for speed on latency-critical tasks. Runs at 70 tokens/second.
⚡ Token Speed Visualizer
The Number That Actually Matters
Throughput — tokens per second — is not the most important metric for a conversational AI. Time to first token (TTFT) is. It's what determines whether the AI feels responsive or sluggish.
| Model | Output Speed | Time to First Token |
|---|---|---|
| Audria On-Device LLM | 20 tokens/sec | 10–20ms |
| Audria On-Device LLM Mini | 70 tokens/sec | 10–20ms |
| Cloud SOTA (leading real-time API) | 80–140 tokens/sec | 500–1,200ms |
Our models start responding in 10–20 milliseconds. Cloud starts responding in half a second to over a second — on a good connection. That is a 25–60× difference in perceived responsiveness.
For a voice-first assistant, that difference isn't a benchmark. It's whether the AI feels like part of the conversation, or an interruption to it.
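The 25–60× figure falls directly out of the table above, using the conservative 20 ms end of our TTFT range:

```python
# Derivation of the 25-60x responsiveness gap from the TTFT table,
# using the conservative 20 ms end of the on-device range.
on_device_ttft_ms = 20
cloud_ttft_ms = (500, 1200)   # leading real-time cloud API

ratios = tuple(t / on_device_ttft_ms for t in cloud_ttft_ms)
print(ratios)  # (25.0, 60.0)
```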
Small Models, Real Tasks
Speed without intelligence is useless. So we ran our models head-to-head against GPT-4o — not on generic benchmarks, but on the specific tasks Audria actually needs to perform. Judge the outputs yourself.
Finding One Fact in 47,000 Tokens
We buried a single target fact inside approximately 47,000 tokens of unrelated conversation — the equivalent of a full day of dialogue. Both Audria On-Device LLM and Audria On-Device LLM Mini retrieved the correct fact with 100% accuracy.
Audria builds context about your life over time. Being able to surface the right detail from a long history isn't a nice-to-have — it's foundational.
Reasoning About What You Need (Without Being Asked)
Prompt: "I am going to meet my friend on his birthday."
Task: Identify what the user might actually need, without being told.
GPT-4o
Plan a thoughtful gift. Suggest activities. Help write a birthday message.
Audria On-Device LLM
Remind you of the date and time. Brainstorm gift ideas. Draft a birthday message or invitation.
Both models surface the same implicit needs. The on-device model does it on your iPhone, privately, instantly.
Planning Actions with Tools
Prompt: "I can't hear clearly during my calls."
Task: Given four diagnostic tools, produce an ordered execution plan.
GPT-4o
1. speaker_test
2. mic_test
3. bluetooth_check
4. noise_suppression_toggle
Audria On-Device LLM
1. mic_test
2. speaker_test
3. bluetooth_check
4. noise_suppression_toggle
Both produce valid, well-reasoned plans. Different starting points, both defensible.
Full Agentic Reasoning: Bob's Birthday
This is the hardest test. No instructions. Just: "I am going to meet my friend Bob for his birthday." The model had to decide on its own to: retrieve Bob's profile from memory, identify his interests, and plan something meaningful.
GPT-4o
Retrieved Bob's profile (photographer, cyclist, jazz lover). Searched for gifts across all three interest areas. Returned product recommendations.
Audria On-Device LLM
Retrieved the same profile. Planned an experience: a NYC photo scavenger hunt with vintage cameras (Bob collects them), a jazz and pizza evening (his two passions), a visit to an animal shelter (his volunteer work), ending with an impromptu photo exhibition. Every recommendation grounded in Bob's actual profile.
Both completed the full reasoning loop — memory retrieval, interest mapping, personalized planning — with no human guidance. Ours did it entirely on-device.
Math with Tools
Depreciation problem: TV bought for Rs. 21,000, depreciated 5% per year. Value after 3 years?
Both on-device models correctly decomposed this into sequential tool calls: 0.95 × 0.95 = 0.9025 → 0.9025 × 0.95 = 0.857375 → 21,000 × 0.857375 = 18,004.875
Correct answer. No mental math shortcuts, no errors — just reliable tool-augmented reasoning.
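The chain of tool calls can be replayed directly; each line below mirrors one call in the models' plan:

```python
# Replaying the sequential tool calls for the depreciation problem:
# 5% annual depreciation on Rs. 21,000 over 3 years.
rate_retained = 0.95                      # value kept each year after 5% loss
step1 = rate_retained * rate_retained     # tool call 1: 0.95 x 0.95
step2 = step1 * rate_retained             # tool call 2: multiply by 0.95 again
value = 21_000 * step2                    # tool call 3: apply to the price

assert abs(value - 18_004.875) < 1e-6     # matches the models' final answer
```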
How We Made It Actually Run on the Neural Engine
Saying "our model runs on-device" is easy. Making it run efficiently — on the right chip, at the right power — is the hard part almost nobody talks about.
The iPhone Has a Chip Built for AI. Most Apps Don't Use It.
The iPhone ships with an Apple Neural Engine (ANE), a dedicated processor for AI workloads capable of trillions of operations per second at a fraction of the power draw of the GPU. It's what makes on-device AI practical at all.
Here's the catch: the ANE only supports certain operations. If your model uses anything it doesn't recognize, it silently falls back to the CPU or GPU — and your performance and power efficiency collapse. Apple provides Core ML as the interface, but the gap between "runs on Core ML" and "actually runs on the ANE" is enormous. Most teams never cross it.
What We Found, and How
Our starting point was a community insight from the open-source world, a post on Hugging Face by the ANEMLL project describing a specific problem: RMSNorm doesn't run natively on the ANE.
Most modern language models use RMSNorm for normalization. But the ANE was designed when LayerNorm was the standard, and its hardware op set hasn't changed. The solution: mathematically reformulate RMSNorm as a LayerNorm operation by concatenating the input vector with its negation. The result is equivalent, but expressed in operations the ANE understands natively.
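The reformulation is easy to check numerically. A minimal sketch in plain Python (no learned scale parameters; the epsilon is illustrative): because the concatenation [x, -x] has zero mean by construction, LayerNorm's variance term collapses to mean(x²), which is exactly RMSNorm's denominator.

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps)
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]

def layer_norm(x, eps=1e-6):
    # LayerNorm without affine parameters
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [0.3, -1.2, 2.5, 0.7]

# Concatenating x with its negation gives a zero-mean vector whose
# variance equals mean(x^2), so LayerNorm reproduces RMSNorm exactly.
doubled = x + [-v for v in x]
reformulated = layer_norm(doubled)[: len(x)]   # first half carries RMSNorm(x)

assert all(abs(a - b) < 1e-9 for a, b in zip(reformulated, rms_norm(x)))
```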
We hit real failures along the way. An incorrect RoPE implementation that produced garbage positional encodings. Greedy decoding that caused the model to repeat itself endlessly until we added a repetition penalty. Each failure was a lesson in how unforgiving the ANE is to implementation errors that a GPU would silently paper over.
The Result: 1,171 Out of 1,178 Operations on the Neural Engine
The 7 CPU operations aren't failures — they're operations that genuinely have no ANE equivalent and aren't on the performance-critical path. Everything that can run on the ANE does.
The ANE is dramatically more power-efficient than the GPU for matrix operations. Running on the ANE means your battery doesn't pay for AI. It means an always-on assistant is actually viable, not just theoretically possible. Most teams never reach this level. There is no manual. It is one of the most defensible things we've built.
Making Small Models Reliable: Constrained Decoding
Small models are fast. But fast and wrong is worse than slow and right.
One of the honest limitations of small language models is that their output is not reliably structured. For a system like Audria that depends on structured output at every step (memory routing, tool calls, knowledge extraction), unreliability breaks the pipeline.
Constrained decoding solves this by restricting which tokens the model is allowed to generate at each step, based on a schema. The model can only produce tokens that lead to valid structured output. It cannot go off-script.
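Here is the mechanism in miniature. Everything in this sketch is illustrative (a real implementation masks logits against a tokenizer-level grammar, far richer than this hand-written state machine), but the principle is the same: at every step, the decoder intersects what the model wants to say with what the schema allows.

```python
# Toy constrained decoder: the "model" proposes tokens, but only tokens
# permitted by the schema's current state may be emitted.
# A tiny state machine for the fixed JSON skeleton {"db_type": "<value>"}.
ALLOWED = {
    "start": {'{"db_type": "': "value"},
    "value": {"graph": "close", "vector": "close"},
    "close": {'"}': "done"},
}

def constrained_decode(propose):
    """propose(state, allowed) -> the model's chosen token among allowed."""
    state, out = "start", []
    while state != "done":
        allowed = ALLOWED[state]
        token = propose(state, list(allowed))
        out.append(token)
        state = allowed[token]
    return "".join(out)

def sloppy_model(state, allowed):
    # A model that would happily go off-script is forced back on it:
    preferred = "Sure! Here is some JSON:"   # invalid free-text preference
    return preferred if preferred in allowed else allowed[0]

print(constrained_decode(sloppy_model))  # {"db_type": "graph"}
```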
Same Model. Same Input. Night and Day.
❌ Without Constrained Decoding
"I see. Sarah works in Microsoft
and prefers tea, while John lives
in Oregon."
Now, determine the type of output
to provide based on the
conversation...
{"reasonings": ["The conversation
involves two people discussing...
JSON parsed: NO
✅ With Constrained Decoding
{
  "reasoning": "Contains info about Sarah and John",
  "db_type": "graph",
  "summary": "Sarah works at Microsoft. John lives in Portland.",
  "triplets": [
    {"head": "Sarah", "relation": "works_at", "object": "Microsoft"}
  ]
}
Schema valid: YES
It Works Across Every Model We Tested
| Model | Without Constraints | With Constraints | TPS (constrained) |
|---|---|---|---|
| Qwen3-0.6B | ❌ Invalid | ✅ Valid | 22.52 |
| Gemma-3-270m | ❌ Invalid | ✅ Valid | 26.80 |
| Qwen3-1.7B | ❌ Invalid | ✅ Valid | 28.50 |
The Gemma-3-270m result is worth pausing on. 270 million parameters. That is roughly 6,000× smaller than GPT-4. It produces production-ready structured output that Audria's memory pipeline can consume directly. That would not be possible without constrained decoding.
Because the model no longer wastes tokens exploring invalid paths, constrained decoding improves throughput. Qwen3-1.7B went from 19.97 TPS to 28.50 TPS — a 43% speed increase, for free.
Audria Knows What to Remember, and How
Most AI assistants treat every piece of information the same way. Audria doesn't.
When you tell Audria something personal — where you live, that your daughter is learning piano, that your friend Bob collects vintage cameras — that's not the same as a generic conversation about the weather. Personal facts deserve to be stored differently: connected to each other, linked to the people and places they belong to, ready to be reasoned over — not just keyword-matched.
How It Works
- Facts about your life — relationships, preferences, locations — stored as a connected web of knowledge (knowledge graph).
- Substantive conversations — discussions, ideas, context — stored for retrieval when relevant.
- Idle chatter — discarded. Audria doesn't remember noise.
What This Looks Like as a Knowledge Graph
Why Standard AI Memory Falls Short
Query: "Why is David looking for a piano teacher?"
David never said this explicitly. The answer required connecting: David → has daughter Emma → Emma is learning piano → David mentioned wanting support for Emma's education — across separate conversations.
Audria Memory
"David asked me to look for a piano teacher because he is likely interested in helping his daughter Emma with her education."
Standard RAG
"I don't have enough information from the stored context to answer."
Query: "Who should I talk to if I want to visit Paris?"
Audria Memory
"You could talk to Lila, as she lives in Paris. Her connection to Paris comes from the knowledge graph rather than any specific conversation snippet."
Standard RAG
"You should talk to your neighbor, Lila, if you want to visit Paris."
The difference isn't just retrieval accuracy. It's the difference between a system that finds your words and a system that understands your world. Every conversation makes Audria's model of you more complete. Your data stays on your device. The map Audria builds of your life belongs entirely to you.
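The multi-hop lookup behind the David example can be sketched as a traversal over stored triplets. The triplets, relation names, and traversal below are illustrative, not Audria's production schema:

```python
# Minimal multi-hop lookup over (head, relation, tail) triplets of the kind
# a knowledge-graph memory extracts. All names here are illustrative.
TRIPLETS = [
    ("David", "has_daughter", "Emma"),
    ("Emma", "is_learning", "piano"),
    ("David", "wants_to_support", "Emma's education"),
    ("Lila", "lives_in", "Paris"),
]

def neighbors(entity):
    """All (relation, tail) pairs whose head is `entity`."""
    return [(r, t) for h, r, t in TRIPLETS if h == entity]

def multi_hop(start, depth=2):
    """Collect facts reachable from `start` within `depth` hops."""
    facts, frontier = [], [start]
    for _ in range(depth):
        nxt = []
        for entity in frontier:
            for rel, tail in neighbors(entity):
                facts.append((entity, rel, tail))
                nxt.append(tail)
        frontier = nxt
    return facts

# "Why is David looking for a piano teacher?" needs two hops to answer:
facts = multi_hop("David")
assert ("David", "has_daughter", "Emma") in facts
assert ("Emma", "is_learning", "piano") in facts

# "Who should I talk to about Paris?" is a single graph hop:
assert ("lives_in", "Paris") in neighbors("Lila")
```

Keyword retrieval over raw conversation text has no path from "piano teacher" to "Emma"; the graph does, which is why the connected representation answers questions the snippets alone cannot.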
Speech to Text: The Front Door
Every interaction with Audria begins with your voice. So the speech-to-text layer can't be an afterthought.
100 seconds of audio transcribed in 1 second. Accuracy comparable to OpenAI's Whisper Large v3 — the gold standard for speech recognition — matched entirely on-device. Zero network dependency, works in airplane mode, costs nothing per minute.
Unit Economics
Running AI on-device isn't just a privacy story. It's a business fundamentals story. Every competitor running everything through the cloud pays for every second of every conversation. Our hybrid architecture, the NPU for everything we can run locally and the cloud only where it genuinely adds capability, cuts our per-conversation inference cost by roughly 90% compared to a cloud-first approach.
As mobile chips improve and models get more efficient, the on-device share of that equation only grows. We're building Audria to ride that curve — and what we've demonstrated today, on a single iPhone 16e in airplane mode, is only the beginning of what's possible on the device already in your pocket.