NLP Engineering

Detecting Topic Drift in Conversations

From cost signals and drift detectors to Earth Mover's Distance: how we built a pipeline that listens for when a conversation changes direction.

Audria Research — Raghav Jain, Shubham Mishra
March 2026
18 min read

01: The Problem

Topic Change, Not Topic Detection

Conversations are not documents. A document sits still. You can read it at any speed, in any order, index it, search it. A conversation unfolds in time. One utterance at a time. One topic bleeding into the next, with no chapter headings or paragraph breaks to announce where one idea ends and another begins.

The question worth asking is not what topics exist here. It is when does the topic change.

At any topic boundary, there is a detectable signal: the semantic similarity between what is being said now and what was said just before drops sharply. The conversation "dips" in similarity. The goal is to detect those dips automatically, in real time, without any labeled data.

This is what dip detection means in this context.


02: Formalization

Formalizing the Problem

Think of a conversation (a podcast, an interview, a meeting) as a sequence of utterances. An utterance is a chunk of speech transcribed over a small time interval: a thought, a sentence or two, a response. Strung together, these utterances form segments. Each segment discusses a coherent topic (one stretch about climate policy, the next about economics, another about space exploration).

If you had the entire conversation in front of you, embedded all utterances into a semantic space, and looked at them as a cloud of points, you'd see the structure clearly: utterances within the same topic cluster together. Different topics occupy different regions. Boundaries between topics appear naturally as gaps between clusters.

But that's the offline view. You're looking at the full picture at once.

In reality, a conversation unfolds in real time. Utterances arrive one at a time. When the newest utterance arrives, you don't know what comes next. You have to decide, in the moment: is this utterance still part of the current topic, or has the conversation just shifted direction?

The challenge is this: how do you detect that shift without seeing the full picture?

The answer is to stop thinking about individual utterances and start thinking about patterns. Each topic has a characteristic pattern: the directions its utterances point in semantic space. When you're in the climate segment, utterances cluster around climate concepts. When you shift to economics, the pattern changes. The new utterances point in different semantic directions.

So you track: what is the typical pattern I've been seeing? When does a new utterance deviate sharply from that pattern?

This is the essence of the problem. Formally, we're asking: given a stream of utterances arriving one at a time, each mapped to a semantic embedding, when does the underlying distribution of those embeddings change?

Let's define the terms precisely. An utterance $u_t$ at time $t$ is mapped to an embedding $e_t$, a dense vector in semantic space. A segment is a sequence of consecutive utterances that belong to the same topic:

$$\text{Segment} = \{u_1, u_2, \ldots, u_n\}$$

Each segment forms a distribution, a cloud of embeddings scattered across semantic space. The goal is to find the boundaries $\mathcal{B}$ where the distribution changes. Wherever a boundary exists, the stream before that point and the stream after are semantically different enough to be recognizable.

This is the online change point detection problem, a well-studied challenge in statistics and signal processing for detecting shifts in streaming data. We're adapting it to the semantic embedding stream of a conversation.

The pipeline becomes: Utterances → Embeddings → Pattern recognition → Boundary detection.


03: Signal Construction

From Text to a Detectable Signal

Sentence Embeddings

Each utterance is embedded with a sentence transformer (specifically all-MiniLM-L6-v2), which produces a single 384-dimensional dense vector. The model places semantically similar utterances close together in the embedding space, regardless of surface vocabulary. "The market collapsed" and "equities fell sharply" land near each other; "it started raining" lands far from both.

After embedding, all vectors are L2-normalized:

$$\hat{e}_t = \frac{e_t}{\|e_t\|}$$

For unit-norm vectors, the dot product is equivalent to cosine similarity:

$$\hat{e}_i \cdot \hat{e}_j = \cos(\theta_{ij})$$

This normalization step at the output of the embedding model has a practical consequence that compounds across the entire pipeline: every similarity computation anywhere downstream (cost signals, scene comparisons, centroid distances) reduces to a plain dot product. No normalization factor, no division, no separate norm computation. In a streaming setting where each new utterance triggers comparisons against a growing history, that reduction is what makes real-time processing tractable.
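As a sanity check on this claim, here is a minimal numpy sketch, with toy 2-dimensional vectors standing in for the 384-dimensional MiniLM embeddings:

```python
import numpy as np

def l2_normalize(e: np.ndarray) -> np.ndarray:
    """Scale an embedding to unit length so downstream dot products
    are cosine similarities."""
    return e / np.linalg.norm(e)

# Toy 2-D stand-ins for the 384-dim sentence embeddings.
e_a = np.array([3.0, 4.0])
e_b = np.array([4.0, 3.0])
ea_hat, eb_hat = l2_normalize(e_a), l2_normalize(e_b)

# Cosine similarity computed the long way...
cos_ab = float(np.dot(e_a, e_b)) / (np.linalg.norm(e_a) * np.linalg.norm(e_b))
# ...equals a plain dot product once both vectors are unit-norm.
assert abs(float(np.dot(ea_hat, eb_hat)) - cos_ab) < 1e-12
```

In the real pipeline the vectors come from the sentence transformer; the property being checked here is purely geometric.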

The Cost Signal

A drift detector operates on a scalar stream, not a 384-dimensional one. The embedding sequence needs to be reduced to a single number per timestep that encodes how semantically surprising this utterance is given what came before.

That number is the cost signal.

Think of it as a semantic temperature reading: low and flat while the conversation holds a topic, spiking when the speaker moves somewhere new. The word signal is deliberate. It is a time-indexed sequence of values, like a price signal or a sensor reading, except what it encodes is the semantic friction at each step of the conversation.

The baseline definition:

$$c_t = 1 - \cos(\hat{e}_t,\ \hat{e}_{t-1}) = 1 - \hat{e}_t \cdot \hat{e}_{t-1}$$

When consecutive utterances are semantically similar, $c_t$ is close to 0. When they diverge, $c_t$ rises toward 1. The scalar stream $\{c_t\}$ is what every drift detector in this pipeline consumes.
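The baseline cost signal is a one-liner over the whole embedding sequence. A minimal sketch (function name is illustrative):

```python
import numpy as np

def cost_signal(embeddings: np.ndarray) -> np.ndarray:
    """Baseline cost c_t = 1 - e_t . e_{t-1} for a sequence of
    L2-normalized embeddings.

    embeddings: shape (T, d), rows already unit-norm.
    Returns an array of length T-1 (the first utterance has no predecessor).
    """
    dots = np.sum(embeddings[1:] * embeddings[:-1], axis=1)
    return 1.0 - dots

# Two identical directions, then an orthogonal jump:
E = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(cost_signal(E))  # → [0. 1.]
```

Zero cost while the "topic" holds, maximal cost at the orthogonal shift.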

Figure: Embedding Angle → Cost Value.

Unit vectors êt−1 and êt live on the unit circle after L2 normalization. The cost is the angular gap between them: zero when the conversation holds a topic, rising toward 1 on a shift.


04: Signal Design

The Cost Signal: Three Formulations

The formula $c_t = 1 - \hat{e}_t \cdot \hat{e}_{t-1}$ is the baseline, but it was not the only formulation considered. Three candidates were built and evaluated. They represent the actual evolution of thinking on what a cost signal should capture.

Formulation 1: Short-Term Cost

$$c_t^{\text{short}} = 1 - \hat{e}_t \cdot \hat{e}_{t-1}$$

Compares only the current utterance to the immediately preceding one. Fast, reactive, instantaneous. Sensitive to any individual utterance that diverges from its neighbor, noisy for rambling or transitional speech, but highly responsive at clean boundaries. This is also the formulation that works in a fully streaming setting: it requires no history beyond a single previous embedding.

Formulation 2: Long-Term Cost

$$c_t^{\text{long}} = 1 - \frac{1}{K} \sum_{j \in \mathcal{N}_K(t)} \hat{e}_t \cdot \hat{e}_j$$

where $\mathcal{N}_K(t)$ is the set of the $K$ embeddings in the window $[t-W, t)$ most similar to $\hat{e}_t$.

Rather than comparing to a single predecessor, this compares $\hat{e}_t$ against the top-$K$ most similar embeddings in a lookback window of size $W$. The reasoning: a genuine topic shift should diverge not just from the immediately preceding utterance but from the broader recent context.

This is smoother and more robust to single-utterance noise. It lags, however. By the time the long-term cost rises enough to trigger a boundary, the conversation is already several utterances deep into the new topic.
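A minimal sketch of the long-term cost (parameter defaults are illustrative, not tuned values from the pipeline):

```python
import numpy as np

def long_term_cost(history: list, e_t: np.ndarray, K: int = 5, W: int = 20) -> float:
    """c_t^long: 1 minus the mean similarity of e_t to its top-K most
    similar embeddings in the lookback window of size W.
    All vectors are assumed unit-norm, so dot product = cosine."""
    window = history[-W:]
    if not window:
        return 0.0
    sims = sorted((float(np.dot(e_t, h)) for h in window), reverse=True)[:K]
    return 1.0 - sum(sims) / len(sims)
```

Because only the most similar neighbors are averaged, one noisy utterance in the window cannot inflate the cost; the whole recent context has to disagree with $\hat{e}_t$ before $c_t^{\text{long}}$ rises.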

Formulation 3: Drift Trend

$$c_t^{\text{trend}} = 1 - \hat{e}_{t-1} \cdot \frac{\bar{e}_{\text{hist}}}{\|\bar{e}_{\text{hist}}\|}$$

where $\bar{e}_{\text{hist}} = \frac{1}{|W|}\sum_{j \in W} \hat{e}_j$ is the mean embedding of recent history.

Rather than asking how different the current utterance is, this asks how much the context itself has been moving, the directional drift of the embedding stream as a whole. Designed to catch slow, gradual thematic migration that neither of the above formulations would catch clearly.
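A sketch of the drift-trend cost, following the formula above (the history list is assumed to end at the most recent utterance):

```python
import numpy as np

def drift_trend_cost(history: list, W: int = 20) -> float:
    """c_t^trend: 1 minus the similarity between the most recent embedding
    and the renormalized mean of recent history - the directional drift
    of the context as a whole, not the surprise of one utterance."""
    window = np.array(history[-W:])
    mean = window.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    return 1.0 - float(np.dot(history[-1], mean))
```

While the stream keeps pointing in one semantic direction, the mean stays aligned with the latest embedding and the cost stays near zero; as the context migrates, the mean trails behind and the cost climbs gradually.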

Figure: Three Cost Signal Formulations, computed over one synthetic conversation with three topics: Topic 1 (u₀–u₁₂, "Carbon emissions targets were revised for 2030."), Topic 2 (u₁₃–u₂₅, "Interest rates rose sharply after the Fed announcement."), and Topic 3 (u₂₆–u₃₉, "NASA confirmed water ice on the lunar south pole."). Same conversation, three lenses: short-term spikes exactly at boundaries; long-term lags and decays slowly; drift-trend rises before the shift and captures gradual migration.


05: Detection

Drift Detection Algorithms

With a scalar cost signal $\{c_t\}$ in hand, the remaining task is distinguishing genuine distribution shifts from random fluctuation. Two algorithms from the river library were evaluated for this, and a third (BOCPD) was considered.

KSWIN

Kolmogorov-Smirnov Windowed

KSWIN maintains a sliding window of the cost signal and continuously asks: do the older and more recent observations in this window look like they came from the same distribution?

It partitions the window into two buffers: a reference buffer (older observations) and a test buffer (recent observations). It then compares their statistical profiles, asking how different the two distributions look. If they look significantly different (beyond what random chance would produce), drift is declared.

KSWIN is non-parametric: it makes no assumption about the shape of the underlying distribution. It is also conservative: it requires the test buffer to accumulate enough observations to build a statistically reliable empirical CDF before a declaration can be made. This makes it suited to detecting major structural shifts (moments where the semantic character of the conversation has fundamentally changed). On a 700-utterance conversation, KSWIN surfaces a small number of such boundaries, each corresponding to a clear domain transition.
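The mechanism can be sketched in a few lines. This is a simplified stand-in, not river's implementation: river's KSWIN samples the reference buffer and converts the KS statistic to a p-value via the significance level `alpha`, whereas this sketch uses a fixed threshold on the raw statistic.

```python
def ks_statistic(ref, test):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    d = 0.0
    for x in set(ref) | set(test):
        cdf_r = sum(v <= x for v in ref) / len(ref)
        cdf_t = sum(v <= x for v in test) / len(test)
        d = max(d, abs(cdf_r - cdf_t))
    return d

class KSWINSketch:
    """Sliding-window KS drift check in the spirit of river's KSWIN:
    older observations (reference) vs the most recent stat_size (test)."""
    def __init__(self, window_size=100, stat_size=30, threshold=0.4):
        self.window = []
        self.window_size, self.stat_size = window_size, stat_size
        self.threshold = threshold  # simplification: fixed cutoff, not a p-value

    def update(self, c):
        self.window.append(c)
        self.window = self.window[-self.window_size:]
        if len(self.window) < self.window_size:
            return False  # conservative: no verdict until the window fills
        ref = self.window[:-self.stat_size]
        test = self.window[-self.stat_size:]
        return ks_statistic(ref, test) > self.threshold
```

The `return False` until the window fills is exactly the conservatism described above: no declaration before the empirical CDFs are statistically meaningful.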

Page Hinkley

PageHinkley

PageHinkley monitors the cumulative sum of deviations of the cost signal from its running mean. It triggers when that cumulative sum exceeds a threshold, indicating the mean has shifted upward, which means the cost signal has sustained a new, higher level.

Algorithmically, PageHinkley keeps a running average of the cost signal and tracks how much the signal has drifted above that average. It accumulates these deviations over time. Whenever the accumulated drift exceeds a threshold, drift is declared.

PageHinkley is directional: it looks specifically for sustained upward shifts in the cost signal, not downward or oscillating patterns. It is also memory-efficient: it tracks only the current state and requires constant time per update. Because it reacts to any sustained rise, it is considerably more sensitive than KSWIN, picking up not just major domain changes but every sub-topic transition, tangential digression, or shift in speaker focus. On the same 700-utterance conversation, it produces a much denser set of boundaries.
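The whole state fits in four numbers. A minimal sketch of the scheme (river's PageHinkley follows the same idea; parameter names here are illustrative, not the library's exact API):

```python
class PageHinkleySketch:
    """Page-Hinkley test for a sustained upward shift in a stream's mean."""
    def __init__(self, delta=0.005, threshold=0.5):
        self.delta, self.threshold = delta, threshold
        self.n, self.mean = 0, 0.0
        self.cum, self.min_cum = 0.0, 0.0

    def update(self, c):
        self.n += 1
        self.mean += (c - self.mean) / self.n        # running mean, O(1) memory
        self.cum += c - self.mean - self.delta       # cumulative deviation above mean
        self.min_cum = min(self.min_cum, self.cum)
        if self.cum - self.min_cum > self.threshold: # sustained rise detected
            self.__init__(self.delta, self.threshold)  # reset after detection
            return True
        return False
```

While the cost hovers around its mean, `cum` drifts downward (the `delta` tolerance eats small fluctuations) and tracks `min_cum`; a sustained rise opens a gap between the two, and the gap crossing `threshold` is the detection.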

Comparison

Two Detectors, Two Granularities

KSWIN and PageHinkley are not alternatives to the same task. They operate at different granularities of topic change.

                   KSWIN                               PageHinkley
Sensitivity        Conservative                        Sensitive
What it captures   Major structural domain shifts      Every sub-topic transition
Mechanism          Distributional test (KS statistic)  Mean shift (cumulative sum)
Lag                Higher (needs window to fill)       Lower (reacts immediately)
Online-native      Requires window accumulation        Fully stateful, O(1) per step

KSWIN finds where the conversation fundamentally changes its domain. PageHinkley tracks every micro-shift in semantic focus. Which is appropriate depends on the downstream task: coarse scene segmentation or fine-grained dip detection.

Figure: Same Signal, Two Granularities. On a five-topic conversation (carbon emissions targets, Fed interest rates, lunar water ice, the infrastructure bill, a festival lineup), KSWIN detects 4 boundaries and PageHinkley 23. KSWIN fires only at major distributional shifts, the 4 moments where the conversation's domain fundamentally changes; PageHinkley fires at every semantic ripple, including sub-topic micro-shifts within each segment.


06: Scene Distance

Optimal Transport: Measuring Scene Similarity

At this point, the conversation stream has been divided into scenes, discrete windows between detected boundaries. But boundaries alone tell only half the story: they answer when the topic changes. The remaining question is: how similar are different scenes? This matters when the conversation circles back. If the discussion returns to climate policy after an hour of economics, we want to recognize that return, to surface the thread that connects these separated scenes.

Once boundaries $\mathcal{B}$ are detected, the conversation divides into scenes. Each scene is a contiguous stretch of utterances between two consecutive boundaries. If boundaries occur at utterances #13 and #26, there are three scenes: [1–12], [13–25], and [26–end]. Each scene is a window where the topic held coherent.
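Cutting the stream into scenes from a boundary list is mechanical. A small sketch, using 0-indexed utterances (so boundaries at indices 13 and 26 in a 40-utterance stream yield scenes [0–12], [13–25], [26–39]):

```python
def split_into_scenes(n_utterances: int, boundaries: list) -> list:
    """Cut utterance indices 0..n-1 into contiguous scenes.
    A boundary at index b means utterance b starts a new scene."""
    cuts = [0] + sorted(boundaries) + [n_utterances]
    return [range(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]

print(split_into_scenes(40, [13, 26]))
# → [range(0, 13), range(13, 26), range(26, 40)]
```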

We need to measure how semantically similar two scenes are, in order to surface recurring themes and group related discussions. If a podcast spends time on climate policy early and returns to it later, those two scenes should score as close in semantic space, close enough that a downstream system recognizes them as belonging to the same thread, even though they are separated by unrelated conversation.

Notation: We use subscripts for utterance indices ($u_i$, $e_i$, $\hat{e}_i$) and superscripts for scene membership. Thus $\hat{e}_i^A$ means "the normalized embedding of utterance $i$ from scene $A$".

Figure: Themes Emerging Over Time. Each dot is an utterance plotted on the cost signal, colored by detected theme (Climate Policy, Economics, Space Science); the same color reappearing after an absence marks a recurring thread the pipeline recognized. Each theme occupies a distinct cost band: Economics runs higher, Climate lower.

The Problem with Mean Embeddings

A natural first approach is to represent each scene by the mean of its utterance embeddings:

$$\bar{e}_A = \frac{1}{|A|} \sum_{i \in A} \hat{e}_i$$

Then compute the distance between scenes as the angle between their means:

$$d(A, B) = 1 - \cos\!\left(\bar{e}_A,\ \bar{e}_B\right)$$

This is computationally cheap and intuitive. But it discards a critical piece of information: the internal structure of each scene.

Consider two scenes: Scene A contains five utterances about emissions targets, carbon pricing, regulatory timelines, international treaties, and enforcement mechanisms. Scene B contains three utterances about emissions targets, renewable investment, and ocean acidification. Both discuss climate, so their mean embeddings land in the same region of semantic space. Yet their internal composition is different: Scene A clusters tightly around policy mechanics; Scene B sprawls across climate topics. A distance metric that collapses each scene to a single point treats these as equally similar to some third scene, which may not reflect their actual semantic character.

There is also a structural imbalance: Scene A has five utterances; Scene B has three. A mean computed from three points carries less statistical weight than one from five, yet the metric ignores this asymmetry.
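The failure mode is easy to demonstrate: two scenes whose mean directions coincide but whose spreads differ score as identical. A sketch of the centroid baseline (function name is illustrative):

```python
import numpy as np

def centroid_distance(A: np.ndarray, B: np.ndarray) -> float:
    """1 - cos between the mean embeddings of two scenes.
    A, B: arrays of shape (n, d) whose rows are unit-norm embeddings."""
    a, b = A.mean(axis=0), B.mean(axis=0)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s = 0.5 ** 0.5
A = np.array([[1.0, 0.0]])            # one tightly focused utterance
B = np.array([[s, s], [s, -s]])       # two utterances 90 degrees apart
print(centroid_distance(A, B))        # → ~0.0: centroids agree, shapes don't
```

The mean of `B` points along the same direction as `A`, so the centroid metric reports a near-zero distance even though `B` sprawls across a much wider semantic region.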

Optimal Transport

Rather than collapse each scene to a single point, treat it as what it actually is: a cloud of utterance embeddings scattered across semantic space. The climate policy scene in hour 1 contains four utterances, each landing at a specific location in the 384-dimensional embedding space. The climate policy scene in hour 3 contains six utterances, different sentences, but about the same topic, landing in roughly the same region.

The question becomes: what is the minimum effort required to rearrange the utterances of Scene A into the configuration of Scene B? Effort here means moving an utterance's semantic mass across the embedding space: short moves are cheap, long moves are expensive.

Each scene is a discrete probability distribution over its utterance embeddings. Every utterance contributes equal mass:

$$\mu_A = \frac{1}{n} \sum_{i=1}^n \delta_{\hat{e}_i^A}, \quad \mu_B = \frac{1}{m} \sum_{j=1}^m \delta_{\hat{e}_j^B}$$

Here, $\delta_{\hat{e}_i^A}$ means "one unit of probability mass placed at the location of embedding $\hat{e}_i^A$." Think of it as: each utterance in a scene is a point in semantic space, and they collectively define the scene's profile. The total mass in each distribution is 1, spread equally across utterances, whether Scene A has 3 utterances or 30.

For each pair of utterances (one from Scene A, one from Scene B) we compute the semantic distance (the transport cost):

$$C_{ij} = 1 - \cos\!\left(\hat{e}_i^A,\ \hat{e}_j^B\right)$$

Now we ask: what matching between utterances minimizes the total transport cost? Climate utterances from Scene A should preferentially match with climate utterances from Scene B (short distance). Scene A economics utterances should match with Scene B economics utterances. The optimal matching is found by solving the transport problem, which yields the Wasserstein distance, a single number representing the total minimum effort.

This metric respects the full structure of each scene, accounts for different scene lengths naturally (unequal distributions are penalized), and gives a principled measure of semantic similarity.
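A tiny worked example. When the two scenes have equal utterance counts and uniform mass, the optimal transport plan is a perfect matching, which this sketch finds by brute force over permutations (fine for toy scenes); a real pipeline would hand the general unequal-size case to an OT solver such as POT's `ot.emd2` (an assumption, not part of the text above).

```python
import itertools
import numpy as np

def scene_transport_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Exact OT distance between two equal-size scenes of unit-norm
    embeddings with uniform mass 1/n per utterance."""
    n = len(A)
    C = 1.0 - A @ B.T                  # cost matrix: C_ij = 1 - cosine similarity
    best = min(sum(C[i, p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))
    return best / n                    # each matched utterance carries mass 1/n

# Scene B repeats Scene A's directions in a different order; the optimal
# matching pairs each utterance with its twin, so the distance is zero.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(scene_transport_distance(A, B))  # → 0.0
```

Note that the centroid view would also call these scenes identical; the difference shows up when the scenes share a mean direction but differ in internal composition, where transport cost is positive and centroid distance is not.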

To build intuition, the accompanying visualization presents the transport in three views:

  • 1D · The Metaphor: particles move from source to target, cost accumulating step by step
  • 2D · Scene Transport: real utterance embeddings, mass rings shrinking and growing, the semantic gap visualized
  • Why Not Centroids?: two scenes with identical centers but different shapes, showing where centroids fail

Geometric Consistency

The cost matrix uses $1 - \cos(\hat{e}_i^A, \hat{e}_j^B)$, the same metric as the cost signal $c_t = 1 - \hat{e}_t \cdot \hat{e}_{t-1}$. This is deliberate: by reusing the cosine metric throughout, all distances, whether between consecutive utterances or across scenes, live in the same geometric framework. The cost signal, the drift detectors, and the scene similarity measure all operate within this unified space.

Conclusion

Conversations are one of the richest and most structurally complex things humans produce, and they have historically been treated by machines as a flat bag of words. The premise here is different: a conversation has geometry. Topics cluster. Transitions spike. Scenes that circle back to the same idea land near each other in the embedding space, even when separated by an hour of intervening content.

What this pipeline does is make that geometry legible. Not by training a classifier, not by defining what a topic is in advance, but by treating the conversation as a stream and listening for where the stream changes. The cost signal catches the moment. The drift detectors decide whether the moment was real. Optimal transport measures how far apart two stretches of that stream actually are, accounting for every utterance, not just an averaged center of mass.

The approach is deliberately domain-agnostic. The same formulation that finds topic boundaries in a financial podcast applies equally to a medical consultation, a research interview, or a multi-hour deposition, with no retraining and no domain-specific tuning. That generality is a direct consequence of grounding everything in geometry rather than labels.

References

  1. river drift detection: PageHinkley
  2. river drift detection: KSWIN
  3. Wasserstein / Earth Mover's Distance: POT, the Python Optimal Transport library
  4. Bayesian Online Changepoint Detection: Adams & MacKay, 2007