4. Scaling Up
March 30, 2026
For the last week I was running into a lot of technical issues: my laptop's battery seemed to be dying, eternally stuck at 4% even while plugged in. As a result, this blog post was severely delayed; sorry, readers! Recently, though, I was able to secure a machine with much better compute, so I could run MIND with Llama 7B and use the HELM dataset to output detection scores. Next week I will focus on combining more methods, but my results so far are below. Please bear with the unaesthetic formatting; I promise the next post will be better!
Step 1: generating LLM hallucinations
6000it [3:55:13, 2.35s/it] <– training
1000it [34:58, 2.10s/it] <– testing
1304it [44:16, 2.04s/it] <– validation
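To give a sense of what this step produces: MIND continues real passages with the LLM and then auto-labels the continuations, roughly by checking whether entity-like tokens in the generated text ever appear in the source passage. Here's a toy, dependency-free sketch of that labeling idea; the function name and the crude capitalized-word heuristic are mine, not MIND's actual code:

```python
import re

def auto_label_continuation(source_text, continuation):
    """Label each capitalized, entity-like token in the model's continuation
    as hallucinated (True) if it never appears in the source passage.
    A toy stand-in for MIND's entity-based auto-labeling."""
    source_tokens = set(re.findall(r"[A-Za-z]+", source_text.lower()))
    labels = []
    for entity in re.findall(r"\b[A-Z][a-z]+\b", continuation):
        labels.append((entity, entity.lower() not in source_tokens))
    return labels

source = "Marie Curie was a physicist who studied radioactivity in Paris."
continuation = "Curie later moved to Berlin and won three Nobel prizes."
print(auto_label_continuation(source, continuation))
# [('Curie', False), ('Berlin', True), ('Nobel', True)]
```

The real pipeline uses a proper entity extractor, but the logic (flag generated entities unsupported by the source) is the same shape.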
Step 2: features
100%|███████████████████████████████████████████████████████████████████████████████| 6000/6000 [17:58<00:00, 5.56it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:03<00:00, 5.45it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 1304/1304 [03:55<00:00, 5.53it/s]
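Step 2 turns each generated passage into a fixed-size feature vector from the model's internal states. The exact layers and pooling are MIND's design choices, so treat this as a minimal sketch of the operation's shape only (fake activations stand in for Llama 7B's 32 layers of hidden size 4096):

```python
import numpy as np

def extract_features(hidden_states, layers=(-1, -2)):
    """hidden_states: (num_layers, seq_len, hidden_dim) array of per-layer
    activations for one passage. Returns the last-token embeddings from the
    chosen layers, concatenated into a single feature vector."""
    return np.concatenate([hidden_states[l, -1, :] for l in layers])

# Fake activations with Llama-7B-like dimensions (32 layers, hidden size 4096).
rng = np.random.default_rng(0)
states = rng.normal(size=(32, 10, 4096))
features = extract_features(states)
print(features.shape)  # (8192,)
```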
Step 3: train classifier
Valid Epoch 19 …
Training Epoch 20 – 5.99% – Loss : 0.3800174340605736
Training Epoch 20 – 11.98% – Loss : 0.3866754673421383
Training Epoch 20 – 17.96% – Loss : 0.3731822396318118
Training Epoch 20 – 23.95% – Loss : 0.3778659448027611
Training Epoch 20 – 29.94% – Loss : 0.3716168469190598
Training Epoch 20 – 35.93% – Loss : 0.37682656943798065
Training Epoch 20 – 41.92% – Loss : 0.37396922068936483
Training Epoch 20 – 47.90% – Loss : 0.37817635796964166
Training Epoch 20 – 53.89% – Loss : 0.3758967376417584
Training Epoch 20 – 59.88% – Loss : 0.3822471442818642
Training Epoch 20 – 65.87% – Loss : 0.3804460739547556
Training Epoch 20 – 71.86% – Loss : 0.38419307172298434
Training Epoch 20 – 77.84% – Loss : 0.3859799671631593
Training Epoch 20 – 83.83% – Loss : 0.39186887315341407
Training Epoch 20 – 89.82% – Loss : 0.39165537277857465
Training Epoch 20 – 95.81% – Loss : 0.3899928640574217
Training Epoch 20 …
Train Epoch 20 end ! Loss : 65.192378282547; Train Acc: 0.8021068472535741
Valid Epoch 20 …
Best acc : 0.7210526315789474 from epoch 9th;
llamabase7b
(As you can tell, this is much better accuracy than last week's.)
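For readers unfamiliar with step 3: a small supervised classifier is trained on the step-2 features to predict the hallucination labels. The sketch below is a plain numpy logistic-regression loop on synthetic data, just to show the shape of the epoch/loss/accuracy loop in the logs above; it is not the actual MIND classifier:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for hidden-state features and hallucination labels.
X = rng.normal(size=(600, 64))
true_w = rng.normal(size=64)
y = (X @ true_w + rng.normal(size=600) > 0).astype(float)

w, b, lr = np.zeros(64), 0.0, 0.1
for epoch in range(20):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # sigmoid probabilities
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad = p - y                                  # dBCE/dlogit
    w -= lr * (X.T @ grad) / len(y)               # gradient step on weights
    b -= lr * grad.mean()                         # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == y.astype(bool))
print(f"train acc after 20 epochs: {acc:.3f}")
```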
Step 4: comparing to HELM (human-annotated hallucination data from Llama 7B)
result_psg_corr: 0.5040323
result_psg_halu: 91.355026
result_sent_corr: 0.4866830
result_sent_halu: 80.774148
Explanation: the probe does better at passage-level detection than at sentence-level. This is a huge improvement over last week's Qwen 1.5 run. However, the correlation coefficients are only moderate, indicating a good amount of noise.
| Filename | What is inside? |
| --- | --- |
| result_psg_corr.xlsx | How well the scores match reality for entire paragraphs. |
| result_psg_halu.xlsx | The AUC score for detecting hallucinations in entire paragraphs. |
| result_sent_corr.xlsx | How well the scores match reality for individual sentences. |
| result_sent_halu.xlsx | The AUC score for detecting hallucinations in individual sentences. |
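As far as I can tell, the two numbers per granularity are a correlation against the human scores and an ROC AUC reported as a percentage. Here's a dependency-light sketch of both metrics (my own implementations on toy data, not HELM's evaluation code):

```python
import numpy as np

def pearson(scores, targets):
    """Pearson correlation between detector scores and human scores."""
    return float(np.corrcoef(scores, targets)[0, 1])

def auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen hallucinated example scores above a clean one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

detector_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
is_hallucination = [1, 0, 1, 0, 0]
print(auc(detector_scores, is_hallucination))  # 0.8333...
```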
Comments

Hi Anna! Cool progress this week! Seeing the results like this is actually pretty nice!
I found it interesting that there is a noticeable difference between passage level and sentence level. Is this because hallucination is inherently contextual or is there just more noise at the sentence level?
I'm also wondering how you plan to calibrate your rankings to align more with human scoring. I'm assuming you're training with cross-entropy loss, so do you think that plays a part in the moderate correlation?
I’m not very knowledgeable on CEL (not super experienced in ML) but I know it tends to optimize for classification ranking over calibration. Could you explain how that plays into your results a little better?
Hi Anna! Looks like you had a super productive week 🙂
I’m curious about how you decided on your hyperparameters: for example, it looks like you have about 20 epochs. Have you tried experimenting with numbers like that, and is it possible that doing so could affect your results slightly?