4. Scaling Up
March 30, 2026
For the last week I was running into a lot of technical issues: my laptop's battery seemed to be dying, eternally stuck at 4% even while plugged in. As a result, this blog post was severely delayed; sorry, readers! Recently, though, I was able to secure a machine with much better compute, so I could run MIND with Llama 7B and use the HELM dataset to output detection scores. Next week I will focus on combining more methods, but my results so far are below. Please bear with the unaesthetic formatting; I promise the next post will be better!
Step 1: generating LLM hallucinations
6000it [3:55:13, 2.35s/it] <– training
1000it [34:58, 2.10s/it] <– testing
1304it [44:16, 2.04s/it] <– validation
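To give a sense of what this step produces: MIND continues real passages with the LLM and then auto-labels the continuations, roughly by checking whether entity-like tokens in the generated text ever appear in the source passage. Here's a toy, dependency-free sketch of that labeling idea; the function name and the crude capitalized-word heuristic are mine, not MIND's actual code:

```python
import re

def auto_label_continuation(source_text, continuation):
    """Label each capitalized, entity-like token in the model's continuation
    as hallucinated (True) if it never appears in the source passage.
    A toy stand-in for MIND's entity-based auto-labeling."""
    source_tokens = set(re.findall(r"[A-Za-z]+", source_text.lower()))
    labels = []
    for entity in re.findall(r"\b[A-Z][a-z]+\b", continuation):
        labels.append((entity, entity.lower() not in source_tokens))
    return labels

source = "Marie Curie was a physicist who studied radioactivity in Paris."
continuation = "Curie later moved to Berlin and won three Nobel prizes."
print(auto_label_continuation(source, continuation))
# [('Curie', False), ('Berlin', True), ('Nobel', True)]
```

The real pipeline uses a proper entity extractor, but the logic (flag generated entities unsupported by the source) is the same shape.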
Step 2: features
100%|███████████████████████████████████████████████████████████████████████████████| 6000/6000 [17:58<00:00, 5.56it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:03<00:00, 5.45it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 1304/1304 [03:55<00:00, 5.53it/s]
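Step 2 turns each generated passage into a fixed-size feature vector from the model's internal states. The exact layers and pooling are MIND's design choices, so treat this as a minimal sketch of the operation's shape only (fake activations stand in for Llama 7B's 32 layers of hidden size 4096):

```python
import numpy as np

def extract_features(hidden_states, layers=(-1, -2)):
    """hidden_states: (num_layers, seq_len, hidden_dim) array of per-layer
    activations for one passage. Returns the last-token embeddings from the
    chosen layers, concatenated into a single feature vector."""
    return np.concatenate([hidden_states[l, -1, :] for l in layers])

# Fake activations with Llama-7B-like dimensions (32 layers, hidden size 4096).
rng = np.random.default_rng(0)
states = rng.normal(size=(32, 10, 4096))
features = extract_features(states)
print(features.shape)  # (8192,)
```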
Step 3: train classifier
Valid Epoch 19 …
Training Epoch 20 – 5.99% – Loss : 0.3800174340605736
Training Epoch 20 – 11.98% – Loss : 0.3866754673421383
Training Epoch 20 – 17.96% – Loss : 0.3731822396318118
Training Epoch 20 – 23.95% – Loss : 0.3778659448027611
Training Epoch 20 – 29.94% – Loss : 0.3716168469190598
Training Epoch 20 – 35.93% – Loss : 0.37682656943798065
Training Epoch 20 – 41.92% – Loss : 0.37396922068936483
Training Epoch 20 – 47.90% – Loss : 0.37817635796964166
Training Epoch 20 – 53.89% – Loss : 0.3758967376417584
Training Epoch 20 – 59.88% – Loss : 0.3822471442818642
Training Epoch 20 – 65.87% – Loss : 0.3804460739547556
Training Epoch 20 – 71.86% – Loss : 0.38419307172298434
Training Epoch 20 – 77.84% – Loss : 0.3859799671631593
Training Epoch 20 – 83.83% – Loss : 0.39186887315341407
Training Epoch 20 – 89.82% – Loss : 0.39165537277857465
Training Epoch 20 – 95.81% – Loss : 0.3899928640574217
Training Epoch 20 …
Train Epoch 20 end ! Loss : 65.192378282547; Train Acc: 0.8021068472535741
Valid Epoch 20 …
Best acc : 0.7210526315789474 from epoch 9th;
llamabase7b
(As you can tell, this is much better accuracy than last week's.)
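For readers unfamiliar with step 3: a small supervised classifier is trained on the step-2 features to predict the hallucination labels. The sketch below is a plain numpy logistic-regression loop on synthetic data, just to show the shape of the epoch/loss/accuracy loop in the logs above; it is not the actual MIND classifier:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for hidden-state features and hallucination labels.
X = rng.normal(size=(600, 64))
true_w = rng.normal(size=64)
y = (X @ true_w + rng.normal(size=600) > 0).astype(float)

w, b, lr = np.zeros(64), 0.0, 0.1
for epoch in range(20):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # sigmoid probabilities
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad = p - y                                  # dBCE/dlogit
    w -= lr * (X.T @ grad) / len(y)               # gradient step on weights
    b -= lr * grad.mean()                         # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == y.astype(bool))
print(f"train acc after 20 epochs: {acc:.3f}")
```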
Step 4: comparing to HELM (human-annotated hallucination data from Llama 7B)
result_psg_corr: 0.5040323
result_psg_halu: 91.355026
result_sent_corr: 0.4866830
result_sent_halu: 80.774148
Explanation: the probe does better at passage-level detection than at sentence-level. This is a huge improvement over last week's Qwen 1.5 run. However, the correlation coefficients are only moderate, indicating a good amount of noise.
| Filename | What is inside? |
| --- | --- |
| result_psg_corr.xlsx | How well the scores match reality for entire paragraphs. |
| result_psg_halu.xlsx | The AUC score for detecting hallucinations in entire paragraphs. |
| result_sent_corr.xlsx | How well the scores match reality for individual sentences. |
| result_sent_halu.xlsx | The AUC score for detecting hallucinations in individual sentences. |
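As far as I can tell, the two numbers per granularity are a correlation against the human scores and an ROC AUC reported as a percentage. Here's a dependency-light sketch of both metrics (my own implementations on toy data, not HELM's evaluation code):

```python
import numpy as np

def pearson(scores, targets):
    """Pearson correlation between detector scores and human scores."""
    return float(np.corrcoef(scores, targets)[0, 1])

def auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen hallucinated example scores above a clean one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

detector_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
is_hallucination = [1, 0, 1, 0, 0]
print(auc(detector_scores, is_hallucination))  # 0.8333...
```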
Comments

Hi Anna! Cool progress this week! Seeing the results like this is actually pretty nice!
I found it interesting that there is a noticeable difference between passage level and sentence level. Is this because hallucination is inherently contextual or is there just more noise at the sentence level?
I'm also wondering how you plan to calibrate your rankings to align more with human scoring. I'm assuming you're training with cross-entropy loss, so do you think that plays a part in the moderate correlation?
I’m not very knowledgeable on CEL (not super experienced in ML) but I know it tends to optimize for classification ranking over calibration. Could you explain how that plays into your results a little better?
Hi Anna! Looks like you had a super productive week 🙂
I’m curious about how you decided on your hyperparameters: for example, it looks like you have about 20 epochs. Have you tried experimenting with numbers like that, and is it possible that doing so could affect your results slightly?