3. MIND
March 21, 2026
Progress Updates
This week I was finally able to run some more experiments! As outlined in last week's post, I went through the repository Awesome-Hallucination-Detection-and-Mitigation and selected MIND: Unsupervised Hallucination Detection Framework for LLMs to replicate. The repository includes HELM, a dataset of LLM-generated texts with human-annotated hallucination labels, contextualized embeddings, self-attentions, and hidden-layer activations, which could be useful for other experiments. It also contains well-documented steps to train the hallucination detector. Since I don't have powerful compute, I decided to use the framework to analyze Qwen1.5-0.5B-Chat. With fewer parameters, I knew the hallucination detector's results might not be as good; the original work used much larger models, including Llama 7B and Falcon 40B. Swapping in a different model required a lot of modification to the original code, since every LLM has different parameters and token arrangements. Even with a smaller model, running the whole repository took my laptop around three days nonstop. Now I will go through what exactly is happening in MIND and the results I got.
Step 1: generate_data.py
This step artificially creates hallucinated sentences for training. The repository includes titles and excerpts of Wikipedia articles, and the NLP library spaCy is used to extract entities from the excerpts. For example, in "The Big Goodbye is the 12th episode of Star Trek", the entities would be ["12th", "Star Trek"]. The LLM is then prompted to "Tell me something about [topic]." Meanwhile, the code checks the model's top-k predictions to find a substitute for each entity, with some filtering to ensure the substitute is genuinely different and an actual hallucination. These hallucinated sentences are added to datasets with training, testing, and validation splits; in some cases no hallucination is generated, but the sentence is still added to the datasets. This was the most compute-intensive part of running the code: my computer averaged about 10 seconds per generation, and there were thousands of prompts.

Step 2: generate_hd.py
In this step, the code extracts the model's internal hidden states for each sentence and saves them as feature vectors for the classifier. Each sentence is passed through the model again, as if the model were reading/continuing it. The hidden states at every layer are then saved, in particular the last token's hidden state, the mean of the first layer's hidden states across all tokens, and the mean of the last layer's hidden states across all tokens. The data is also split into right (non-hallucinated) and hallu (hallucinated) subsets.
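As a rough sketch of the feature construction (my own code, not the repository's; I'm assuming the hidden states have already been extracted as one NumPy array of shape (num_tokens, hidden_dim) per layer, and that the classifier consumes the concatenation of two 1024-dimensional vectors, matching Qwen1.5-0.5B's hidden size):

```python
import numpy as np

def build_features(hidden_states: list[np.ndarray]) -> np.ndarray:
    """hidden_states: one (num_tokens, hidden_dim) array per layer,
    e.g. obtained from a forward pass with output_hidden_states=True."""
    feats = {
        "last_token": hidden_states[-1][-1],           # last layer, last token
        "first_mean": hidden_states[0].mean(axis=0),   # mean over tokens, first layer
        "last_mean": hidden_states[-1].mean(axis=0),   # mean over tokens, last layer
    }
    # Classifier input: two 1024-d vectors concatenated -> 2048 dims.
    return np.concatenate([feats["last_token"], feats["last_mean"]])

# Mock hidden states: 25 layers (embeddings + 24 blocks), 5 tokens, dim 1024.
layers = [np.random.randn(5, 1024) for _ in range(25)]
print(build_features(layers).shape)  # -> (2048,)
```

The 2048-dimensional output lines up with the classifier input size described in Step 3 below.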

Step 3: train.py
This step trains the classifier on the hidden-state data (specifically the last_token_mean vector concatenated with the last_mean vector) to detect hallucinations. It is a simple four-layer multi-layer perceptron, trained for 20 epochs:
Input (2048)
→ Dropout
→ Linear(2048 → 256) + ReLU
→ Linear(256 → 128) + ReLU
→ Linear(128 → 64) + ReLU
→ Linear(64 → 2) ← outputs 2 scores (real vs hallucinated)
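The layer list above translates into roughly this PyTorch module. This is my own reconstruction from the architecture sketch, not the repository's exact code, and the dropout rate is an assumption:

```python
import torch
import torch.nn as nn

class HallucinationMLP(nn.Module):
    def __init__(self, input_dim: int = 2048, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),            # dropout rate assumed, not from the repo
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2),               # two scores: real vs hallucinated
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = HallucinationMLP()
logits = model(torch.randn(4, 2048))  # batch of 4 feature vectors
print(logits.shape)  # -> torch.Size([4, 2])
```

Training would then be a standard cross-entropy loop over the right/hallu labels from Step 2.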

Unfortunately, the best accuracy was around 63.6%, which is only slightly above a 50% coin flip. A bit disappointing, but expected given how much smaller my model was than those in the original work.
Step 4: detection_score.py
This step comprehensively evaluates the classifier's performance on the HELM dataset. However, HELM only includes data for the larger models, so I didn't actually complete this step. There is code to generate HELM-style data for a specific model, but that would mean several more hours of runtime.
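For context, once per-sentence hallucination scores exist, this kind of evaluation boils down to standard metrics such as accuracy and AUC-ROC. Here is a minimal pure-Python AUC sketch (my own illustration of the metric, not the repository's evaluation code):

```python
def auc_roc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a randomly chosen hallucinated example
    scores higher than a randomly chosen non-hallucinated one (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]  # hallucinated
    neg = [s for l, s in zip(labels, scores) if l == 0]  # non-hallucinated
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```

An AUC of 0.5 corresponds to the coin-flip baseline mentioned above, which makes it a useful complement to raw accuracy on imbalanced data.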
Summary: I learned a lot about the code through running it and got a classifier working, even though it was not the most accurate.
Next Steps: I will experiment with more repositories to figure out how to improve the accuracy. There is a more recent follow-up paper to this one called RACE for large reasoning models (LRMs). While I probably cannot run such large models, I saw that the paper outlines how “prior black-box hallucination detection methods are fundamentally flawed when applied to LRMs,” showing the constraints of such approaches. While most papers simply involve training a classifier on hallucinated vs un-hallucinated internal states, I hope to test out a more distinct approach like TSVs mentioned in the last post.
That’s it for now, and I’ll hopefully be back next week with more experimentation results and explanations! Thank you for reading.
