8. Probing Setup

May 2, 2026

In the last post, we gave an overview of the methods our framework will use. In this post, we look at the three probing methods.

All the probing methods focus on the internal representations of the LLM. To train the probes, we take the following steps. We evaluate the internal truthfulness of an LLM using the TruthfulQA dataset on HuggingFace. Each instance in the dataset consists of a question Q, a set of correct answers A_true, and a set of plausible but incorrect (hallucinated) answers A_hall. For each question, we construct contrastive pairs, matching the question once with a correct answer and once with an incorrect one, to isolate the model’s internal representation of truth versus falsehood.
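As a rough illustration of the pair construction, here is a minimal sketch. It assumes each record exposes `question`, `correct_answers`, and `incorrect_answers` fields; those field names, the helper `build_contrastive_pairs`, and the toy record are assumptions made for this example, not a description of our actual pipeline.

```python
from typing import Dict, List, Tuple


def build_contrastive_pairs(record: Dict) -> List[Tuple[str, str, int]]:
    """Turn one record into (question, answer, label) triples.

    Label 1 marks an answer from A_true, label 0 an answer from A_hall.
    The field names are assumed for illustration.
    """
    question = record["question"]
    pairs = [(question, ans, 1) for ans in record["correct_answers"]]
    pairs += [(question, ans, 0) for ans in record["incorrect_answers"]]
    return pairs


# Made-up toy record, shaped like the description above (not real dataset content).
toy_record = {
    "question": "What is the capital of France?",
    "correct_answers": ["Paris", "The capital of France is Paris"],
    "incorrect_answers": ["Lyon"],
}

pairs = build_contrastive_pairs(toy_record)
for question, answer, label in pairs:
    print(label, "|", question, "|", answer)
```

Each triple then becomes one input text T=[Q;A] with a binary label for the probe.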

Now, to extract the latent LLM representations, we process each question-answer pair through the frozen backbone of the model. Given the input text T=[Q;A], we extract the hidden states H from the transformer blocks.

To capture the most relevant information about the model’s “belief,” we use a pooling strategy. Specifically, we identify the span of tokens corresponding only to the answer and extract the hidden states from the final 4 layers. For each layer, we compute the mean-pooled representation across the answer tokens, then concatenate these layer-wise means into a single high-dimensional feature vector. Because only answer tokens are pooled, the probe focuses on the answer rather than on the question tokens.
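The pooling step can be sketched as follows. This is a minimal NumPy version that assumes the hidden states are already available as one `(seq_len, d_model)` array per layer and that the answer occupies the token range `[answer_start, answer_end)`. The function name, argument names, and the random toy data are hypothetical.

```python
import numpy as np


def pool_answer_features(hidden_states, answer_start, answer_end, num_layers=4):
    """Mean-pool answer tokens in each of the last `num_layers` layers,
    then concatenate the per-layer means into one feature vector.

    hidden_states: sequence of arrays, one per layer, each (seq_len, d_model).
    """
    if answer_end <= answer_start:
        raise ValueError("answer span is empty")
    layer_means = [
        np.asarray(h)[answer_start:answer_end].mean(axis=0)
        for h in hidden_states[-num_layers:]
    ]
    return np.concatenate(layer_means)


# Toy stand-in for model outputs: 6 layers, 10 tokens, d_model = 8.
rng = np.random.default_rng(0)
toy_states = [rng.normal(size=(10, 8)) for _ in range(6)]

# Assume tokens 6..9 are the answer; the rest are the question.
features = pool_answer_features(toy_states, answer_start=6, answer_end=10)
print(features.shape)  # 4 layers * 8 dims
```

With 4 layers the feature vector has 4 × d_model entries, which is why the probe input is high-dimensional even for a modest hidden size.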

Now for the three detection methods themselves:

Truthfulness Separator Vector (TSV): We train a linear probe to identify a “truth direction” within the model’s residual stream. The TSV identifies two centroids, mu_truth and mu_hall, representing truthful and hallucinatory states, respectively. The TSV score is the relative cosine similarity of the current hidden state to these centroids. If the activation leans toward the hallucination centroid, the Risk Score spikes. Our TSV differs slightly from the original implementation to keep complexity low. In the original implementation, TSV learns a single d-dimensional vector that is added to the model’s hidden states at an intermediate layer. This steers the internal representations toward a latent space where truthful and hallucinated data are more easily separated. It then models the final-layer embeddings using a von Mises-Fisher distribution, which treats data points as directions on a high-dimensional sphere. By maximizing the likelihood that truthful and hallucinated examples cluster around distinct class centroids, the system learns the optimal TSV that pushes these clusters apart.
Lookback Lens (LB): Next, we train a classifier that monitors attention weights to determine “groundedness.” It looks at whether the model is “looking back” at the context (grounding its answer in the prompt) or “hallucinating in a vacuum” (ignoring the context). High lookback indicates context-grounded generation, while low lookback signals potential drift that could point to hallucinations.
Sparse Autoencoder (SAE): We use an SAE to decompose hidden states into interpretable features. The controller monitors the activation of specific “error-correlated” features that act as semantic red flags.
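To make the first two signals more concrete, here is a small sketch of a centroid-based TSV score and a lookback ratio. Both formulas are assumptions chosen for illustration: the risk score maps the difference of the two cosine similarities into [0, 1], and the lookback ratio divides the mean attention on context tokens by the sum of the mean attention on context tokens and on newly generated tokens. The function names and toy numbers are hypothetical.

```python
import numpy as np


def fit_centroids(features, labels):
    """Return (mu_truth, mu_hall) as the class means of the pooled features."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return features[labels == 1].mean(axis=0), features[labels == 0].mean(axis=0)


def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def tsv_risk_score(h, mu_truth, mu_hall):
    """Assumed scoring rule: 0.5 is neutral, values near 1 lean toward mu_hall."""
    diff = cosine(h, mu_hall) - cosine(h, mu_truth)  # lies in [-2, 2]
    return 0.5 + diff / 4.0


def lookback_ratio(attn_row, context_len):
    """Share of one head's mean attention that falls on the context tokens.

    attn_row: 1-D attention weights over all previous tokens at one step.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    ctx = attn_row[:context_len].mean() if context_len > 0 else 0.0
    new = attn_row[context_len:].mean() if attn_row.size > context_len else 0.0
    total = ctx + new
    return float(ctx / total) if total > 0 else 0.0


# Toy 2-D features: label 1 = truthful, label 0 = hallucinated.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
mu_truth, mu_hall = fit_centroids(X, y)

risk_hall = tsv_risk_score(np.array([0.0, 1.0]), mu_truth, mu_hall)
risk_truth = tsv_risk_score(np.array([1.0, 0.0]), mu_truth, mu_hall)
ratio = lookback_ratio([0.2, 0.2, 0.2, 0.2, 0.1, 0.1], context_len=4)
print(risk_hall, risk_truth, ratio)
```

In a full Lookback Lens setup, ratios like this would be computed per head and per layer and fed to the classifier as features; the single-head version here only shows the quantity being measured.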

View more of Anna D.'s posts.
