What If the AI That Knows You Never Had to Leave Your Phone?
At Audria, we don't assume. We reason from first principles: every number, every metric should make sense before we see it. That philosophy drove us to ask a question most teams don't bother with:
"What if the AI that knows you never has to leave your phone?"
Just your device, thinking for you. We set out to build exactly that: a voice-first AI assistant that runs its entire pipeline locally on an iPhone. And today, we're showing it working on an iPhone 16e in airplane mode, nothing connected, nothing sent anywhere.
The Wall We Hit
Getting here wasn't straightforward. We made great progress on the AI side: models fast enough, smart enough, efficient enough. But we hit a blocker that no amount of engineering could solve alone: iOS.
When an app runs in the background on iPhone, the operating system restricts what it can do. For a voice-first assistant that needs to be always listening, always ready, this is a fundamental constraint. We explored it down to the last detail. The conclusion was honest: we cannot run our full AI pipeline locally while the app is in the background, not because of physics, but because of a system-level policy we don't control.
Hardware improves every year. AI models get more efficient every year. But no curve fixes a policy constraint, and that requires working with Apple directly.
What we can show today is everything we've achieved within those boundaries: a complete, locally-running AI system that responds in milliseconds and reasons about your life.
The On-Device AI Bet: Why This Moment Matters
Two converging trends have made on-device AI not just possible, but inevitable. Mobile hardware has quietly become powerful enough, and small language models have become smart enough.
Mobile NPU Compute Has Exploded
The Neural Processing Units inside smartphones have grown by over 60x in just seven years. What was once a niche accelerator for basic image tasks is now a serious compute platform capable of running billion-parameter language models in real time.
From 0.6 TOPS in 2017 to 38 TOPS in 2024: a 63x increase. The phone in your pocket today has more dedicated AI compute than many server GPUs had just a few years ago.
Small Language Models Now Rival Cloud Giants
In parallel, the AI research community has made extraordinary progress in compressing intelligence into smaller models. What once required hundreds of billions of parameters and a data center can now be approximated by models that fit comfortably on a phone.
A 4B parameter model with reasoning in April 2025 now exceeds GPT-4o on the composite intelligence index. A 1.7B model with reasoning matches Llama 3.1 70B. These models run on-device in real time. A year ago, this was unthinkable.
When you combine 38 TOPS of dedicated AI silicon with sub-4B parameter models that match or exceed GPT-4o on composite benchmarks, you get something new: a phone that can genuinely think. Not in the cloud, not with a round trip. Right there in your hand. That is why we are building this now.
What On-Device Actually Feels Like
We optimized two language models that run entirely on the iPhone 16e:
- audria-slm, our primary model for reasoning, memory, and tool use. Runs at 20 tokens/second.
- audria-slm-mini, optimized for speed on latency-critical tasks. Runs at 70 tokens/second.
The Number That Actually Matters
Time to first token (TTFT) is what determines whether the AI feels responsive or sluggish.
| Model | Output Speed | Time to First Token |
|---|---|---|
| audria-slm | 20 tokens/sec | 10–20ms |
| audria-slm-mini | 70 tokens/sec | 10–20ms |
| Cloud SOTA (leading real-time API) | 80–140 tokens/sec | 500–1,200ms |
Our models start responding in 10–20 milliseconds. Cloud starts responding in half a second to over a second, on a good connection. That is a 25–60× difference in perceived responsiveness.
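The 25–60× figure is just the ratio of the two time-to-first-token ranges; a quick check of the arithmetic:

```python
# Perceived-responsiveness gap: cloud TTFT divided by on-device TTFT.
on_device_ttft_ms = 20           # upper end of our 10-20 ms range
cloud_ttft_ms = (500, 1200)      # typical real-time cloud API range

ratios = tuple(c / on_device_ttft_ms for c in cloud_ttft_ms)
# ratios is (25.0, 60.0): the 25-60x gap quoted above
```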
For a voice-first assistant, that difference isn't a benchmark. It's whether the AI feels like part of the conversation, or an interruption to it.
Small Models, Real Tasks
Speed without intelligence is useless. So we ran our models head-to-head against GPT-4o, not on generic benchmarks, but on the specific tasks Audria actually needs to perform. Judge the outputs yourself.
Finding One Fact in 47,000 Tokens
We buried a single target fact inside approximately 47,000 tokens of unrelated conversation, the equivalent of a full day of dialogue. Both audria-slm and audria-slm-mini retrieved the correct fact with 100% accuracy.
Audria builds context about your life over time. Being able to surface the right detail from a long history isn't a nice-to-have, it's foundational.
Reasoning About What You Need (Without Being Asked)
Prompt: "I am going to meet my friend on his birthday."
Task: Identify what the user might actually need, without being told.
GPT-4o
Plan a thoughtful gift. Suggest activities. Help write a birthday message.
audria-slm
Remind you of the date and time. Brainstorm gift ideas. Draft a birthday message or invitation.
Both models surface the same implicit needs. The on-device model does it on your iPhone, instantly.
Planning Actions with Tools
Prompt: "I can't hear clearly during my calls."
Task: Given four diagnostic tools, produce an ordered execution plan.
GPT-4o
speaker_test → mic_test → bluetooth_check → noise_suppression_toggle
audria-slm
mic_test → speaker_test → bluetooth_check → noise_suppression_toggle
Both produce valid, well-reasoned plans. Different starting points, both defensible.
Full Agentic Reasoning: Bob's Birthday
This is the hardest test. No instructions. Just: "I am going to meet my friend Bob for his birthday." The model had to decide on its own to: retrieve Bob's profile from memory, identify his interests, and plan something meaningful.
GPT-4o
Retrieved Bob's profile (photographer, cyclist, jazz lover). Searched for gifts across all three interest areas. Returned product recommendations.
audria-slm
Retrieved the same profile. Planned an experience: a NYC photo scavenger hunt with vintage cameras (Bob collects them), a jazz and pizza evening (his two passions), a visit to an animal shelter (his volunteer work), ending with an impromptu photo exhibition. Every recommendation grounded in Bob's actual profile.
Both completed the full reasoning loop: memory retrieval, interest mapping, personalized planning, with no human guidance. Ours did it entirely on-device.
Math with Tools
Depreciation problem: TV bought for Rs. 21,000, depreciated 5% per year. Value after 3 years?
Both on-device models correctly decomposed this into sequential tool calls: 0.95 × 0.95 = 0.9025 → 0.9025 × 0.95 = 0.857375 → 21,000 × 0.857375 = 18,004.875
Correct answer. No mental math shortcuts, no errors, just reliable tool-augmented reasoning.
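The tool-call chain above can be replayed in a few lines. A sketch (the `multiply` tool name is our illustration; the post doesn't name the actual tools):

```python
def multiply(a, b):
    """The only tool: exact multiplication, so the model never does mental math."""
    return a * b

# Replay the model's plan: compound 5% depreciation over three years,
# then apply the factor to the purchase price of Rs. 21,000.
factor = 1.0
for _ in range(3):
    factor = multiply(factor, 0.95)   # 0.95 -> 0.9025 -> 0.857375

value = multiply(21_000, factor)      # value after 3 years, ~18,004.875
```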
How We Made It Actually Run on the Neural Engine
Saying "our model runs on-device" is easy. Making it run efficiently, on the right chip, at the right power, is the hard part almost nobody talks about.
The iPhone Has a Chip Built for AI. Most Apps Don't Use It.
The iPhone ships with an Apple Neural Engine (ANE), a dedicated processor for AI workloads, capable of running trillions of operations per second at a fraction of the power draw of the GPU. It's what makes on-device AI practical at all.
Here's the catch: the ANE only supports certain operations. If your model uses anything it doesn't recognize, it silently falls back to the CPU or GPU, and your performance and power efficiency collapse. Apple provides Core ML as the interface, but the gap between "runs on Core ML" and "actually runs on the ANE" is enormous. Most teams never cross it.
What We Found, and How
Our starting point was a community insight from the open-source world: a post on Hugging Face by the ANEMLL project describing a specific problem. RMSNorm doesn't run natively on the ANE.
Most modern language models use RMSNorm for normalization. But the ANE was designed when LayerNorm was the standard, and its hardware op set hasn't changed. The solution: mathematically reformulate RMSNorm as a LayerNorm operation by concatenating the input vector with its negation. The result is equivalent, but expressed in operations the ANE understands natively.
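The reformulation is easy to verify numerically. A minimal pure-Python sketch (no ANE involved, learnable scale omitted for clarity): because [x, -x] has zero mean and variance equal to mean(x²), LayerNorm over the concatenation reproduces RMSNorm on the first half.

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm: divide by the root mean square of the vector.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def layer_norm(x, eps=1e-6):
    # LayerNorm: subtract the mean, divide by the standard deviation.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [0.3, -1.2, 2.5, 0.7]

# [x, -x] has mean 0 and variance mean(x^2), so LayerNorm over the
# concatenation equals RMSNorm on the first half.
trick = layer_norm(x + [-v for v in x])[: len(x)]
direct = rms_norm(x)
```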
We hit real failures along the way. An incorrect RoPE implementation that produced garbage positional encodings. Greedy decoding that caused the model to repeat itself endlessly until we added a repetition penalty. Each failure was a lesson in how unforgiving the ANE is to implementation errors that a GPU would silently paper over.
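The greedy-repetition failure has a standard mitigation. A minimal sketch of a CTRL-style repetition penalty (the exact scheme Audria shipped isn't specified, so this is an assumption): previously emitted tokens have their logits scaled down so repeats stop winning the argmax.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    """CTRL-style penalty (illustrative): tokens already generated get their
    logit divided (if positive) or multiplied (if negative) by the penalty."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Toy 4-token vocabulary: token 0 has the highest raw logit and has already
# been emitted twice, so greedy decoding would loop on it forever.
logits = [2.0, 1.5, 0.5, -1.0]
history = [0, 0]
penalized = apply_repetition_penalty(logits, history)
next_tok = max(range(len(penalized)), key=penalized.__getitem__)  # picks 1
```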
The Result: 1,171 Out of 1,178 Operations on the Neural Engine
The 7 CPU operations aren't failures. They're operations that genuinely have no ANE equivalent and aren't on the performance-critical path. Everything that can run on the ANE does.
The ANE is dramatically more power-efficient than the GPU for matrix operations. Running on the ANE means your battery doesn't pay for AI. It means an always-on assistant is actually viable, not just theoretically possible. Most teams never reach this level. There is no manual.
Making Small Models Reliable: Constrained Decoding
Small models are fast. But fast and wrong is worse than slow and right.
One of our key observations: on many tasks requiring structured JSON output, such as simple tool calling and extracting head-relation-tail triplets for knowledge graphs, audria-slm-mini was substantively correct. It understood the query, identified the right entities, and produced the right reasoning. But it kept mangling the JSON format or wrapping the structured output in filler text.
No amount of prompting fixed this. We tried system prompts, few-shot examples, explicit format instructions; the model still intermittently produced invalid JSON. Meanwhile, larger cloud models like GPT-4 returned valid JSON reliably every time.
Here was the critical insight: the small models were intelligent enough to give the right responses. They understood the task. They just couldn't follow output formats reliably. That raised a question: could we run a much smaller, faster model and still get reliable structured JSON responses?
That is exactly why we built constrained decoding. Constrained decoding solves this by restricting which tokens the model is allowed to generate at each step, based on a schema. The model can only produce tokens that lead to valid structured output. It cannot go off-script.
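Mechanically, the idea fits in a few dozen lines. Everything below is a toy: the hand-written schema walker and the fake model stand in for components that a real constrained decoder would derive by compiling a JSON Schema or grammar into a per-step token mask.

```python
import json

# Toy vocabulary: each "token" is a string fragment.
VOCAB = ['{', '}', '"name"', '"employer"', ':', ',',
         '"Sarah"', '"Microsoft"', 'Sure, here is']

# Hand-written schema walker for the fixed shape {"name": ..., "employer": ...}.
# Position i lists the tokens that keep the output schema-valid at step i.
SCRIPT = [
    ['{'], ['"name"'], [':'], ['"Sarah"', '"Microsoft"'], [','],
    ['"employer"'], [':'], ['"Sarah"', '"Microsoft"'], ['}'],
]

def fake_model_scores(prefix):
    # Stand-in for the LM: it loves chatty filler, but it does know the facts.
    scores = {tok: 1.0 for tok in VOCAB}
    scores['Sure, here is'] = 5.0            # unconstrained, this filler wins
    scores['"Microsoft"' if '"employer"' in prefix else '"Sarah"'] = 3.0
    return scores

def constrained_decode():
    out = []
    while len(out) < len(SCRIPT):
        allowed = SCRIPT[len(out)]           # the mask: schema-legal tokens only
        scores = fake_model_scores(out)
        out.append(max(allowed, key=scores.get))
    return ''.join(out)

text = constrained_decode()
obj = json.loads(text)                       # always parses, by construction
```

Note the filler token has the highest raw score at every step; the mask simply never offers it, so the model's knowledge lands inside a guaranteed-valid structure.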
Same Model. Same Input. Night and Day.
❌ Without Constrained Decoding
"I see. Sarah works in Microsoft
and prefers tea, while John lives
in Oregon."
Now, determine the type of output
to provide based on the
conversation...
{"reasonings": ["The conversation
involves two people discussing...
JSON parsed: NO
✅ With Constrained Decoding
{
"reasoning": "Contains info
about Sarah and John",
"db_type": "graph",
"summary": "Sarah works at
Microsoft. John lives in
Portland.",
"triplets": [
{"head": "Sarah",
"relation": "works_at",
"object": "Microsoft"}
]
}
Schema valid: YES
Audria Knows What to Remember, and How
Most AI assistants treat every piece of information the same way. Audria doesn't.
When you tell Audria something personal, like where you live, that your daughter is learning piano, that your friend Bob collects vintage cameras, that's not the same as a generic conversation about the weather. Personal facts deserve to be stored differently: connected to each other, linked to the people and places they belong to, ready to be reasoned over, not just keyword-matched.
How It Works
- Facts about your life (relationships, preferences, locations) are stored as a connected web of knowledge: a knowledge graph.
- Substantive conversations (discussions, ideas, context) are stored for retrieval when relevant.
- Idle chatter is discarded. Audria doesn't remember noise.
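The three-way routing above can be sketched as a small decision function. This is purely illustrative (the function name and heuristics are ours, not Audria's implementation); an upstream model is assumed to have already extracted any head-relation-tail triplets from the utterance.

```python
def route_memory(utterance, extracted_triplets):
    """Illustrative three-way memory router (hypothetical, not Audria's code):
    personal facts go to the knowledge graph, substantive text to retrieval
    storage, and small talk is dropped."""
    if extracted_triplets:
        return "knowledge_graph"    # personal facts: store as connected nodes
    if len(utterance.split()) > 8:  # crude stand-in for "substantive"
        return "retrieval_store"    # discussions and ideas: keep for recall
    return "discard"                # idle chatter: remember nothing

route_memory("Bob collects vintage cameras",
             [("Bob", "collects", "vintage cameras")])  # -> "knowledge_graph"
route_memory("nice weather today", [])                  # -> "discard"
```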
What This Looks Like as a Knowledge Graph
Why Standard AI Memory Falls Short
Query: "Why is David looking for a piano teacher?"
David never said this explicitly. The answer required connecting: David → has daughter Emma → Emma is learning piano → David mentioned wanting support for Emma's education, across separate conversations.
Audria Memory
"David asked me to look for a piano teacher because he is likely interested in helping his daughter Emma with her education."
Standard RAG
"I don't have enough information from the stored context to answer."
The difference isn't just retrieval accuracy. It's the difference between a system that finds your words and a system that understands your world. Every conversation makes Audria's model of you more complete.
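The multi-hop step is the crux, and it's easy to see in miniature. A toy sketch (graph contents taken from the example above; the traversal code is ours, not Audria's): a breadth-first walk connects David to "piano" through Emma, a jump keyword retrieval can't make because "piano teacher" never co-occurs with "David" in any single stored chunk.

```python
from collections import deque

# Toy knowledge graph: facts gathered across separate conversations.
EDGES = {
    "David": [("has_daughter", "Emma"), ("wants_to_support", "Emma's education")],
    "Emma": [("is_learning", "piano")],
}

def connect(start, target):
    """BFS over the fact graph; returns the relation path linking start to
    target, or None if no chain of facts connects them."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for rel, nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

path = connect("David", "piano")
# path: David -has_daughter-> Emma -is_learning-> piano
```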
Speech to Text: The Front Door
Every interaction with Audria begins with your voice. So the speech-to-text layer can't be an afterthought.
- Speed: 100 seconds of audio transcribed in 1 second.
- Accuracy: comparable to OpenAI's Whisper Large v3, the gold standard for speech recognition, matched entirely on-device.
- Dependency: zero network dependency; works in airplane mode; costs nothing per minute.
Unit Economics
Running AI on-device isn't just a technical achievement. It's a business fundamentals story. Every competitor running everything through the cloud pays for every second of every conversation. Our hybrid architecture (NPU for everything we can run locally, cloud only where it genuinely adds capability) results in 90% better unit economics than a cloud-first approach.
As mobile chips improve and models get more efficient, the on-device share of that equation only grows. We're building Audria to ride that curve, and what we've demonstrated today, on a single iPhone 16e in airplane mode, is only the beginning of what's possible on the device already in your pocket.