Blog #9: Results and Housekeeping

May 4, 2025

Hello and welcome back!

The end is in sight. Thanks for being along for the ride. This week, I finished grading the model outputs, and here are the results:

Metric	Zero-Shot	Finetuned 1	Finetuned 1 with Prompt Engineering	Finetuned 2 with Prompt Engineering
Hit@10	23.81%	25.48%	27.86%	27.86%
Macro-Averaged Hit@10 (with Multi-Label Credit)	24.79%	23.42%	25.06%	28.19%

For a reality check, it would have been nice to have human doctors diagnose the patients in the test set. Since I couldn’t connect with rare disease experts, I approximated the Hit@10 of random guessing instead. There are 445 classes (rare diseases) and 420 patients in total. To simplify my calculations, I made two idealized assumptions:

a) Each patient has one rare disease
b) The rare diseases are evenly distributed in the test set

So, if I randomly guess 10 distinct diseases per patient, there’s a 10/445 chance that the correct one is among them. The expected number of correctly diagnosed patients is 10/445 × 420 ≈ 9.44, giving a Hit@10 of about 2.25%.
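Here’s a quick sketch of that estimate in Python, under the same two simplifying assumptions. The function names and the Monte Carlo check are just for illustration; they aren’t taken from the project code.

```python
import random

NUM_DISEASES = 445   # classes (rare diseases) in the test set
NUM_PATIENTS = 420   # patients in the test set
K = 10               # guesses per patient


def random_hit_at_k(num_diseases, k):
    """Chance that k distinct random guesses include the one true disease."""
    return k / num_diseases


def simulate_hit_at_k(num_diseases, num_patients, k, trials=100, seed=0):
    """Monte Carlo check: draw k distinct guesses per patient and count hits."""
    rng = random.Random(seed)
    total = trials * num_patients
    hits = 0
    for _ in range(total):
        truth = rng.randrange(num_diseases)            # assumption (b): uniform labels
        guesses = rng.sample(range(num_diseases), k)   # k distinct random guesses
        hits += truth in guesses
    return hits / total


expected_rate = random_hit_at_k(NUM_DISEASES, K)
expected_correct = expected_rate * NUM_PATIENTS

print(f"Expected Hit@{K}: {expected_rate:.2%}")               # 2.25%
print(f"Expected correct patients: {expected_correct:.2f}")   # 9.44
print(f"Simulated Hit@{K}: {simulate_hit_at_k(NUM_DISEASES, NUM_PATIENTS, K):.2%}")
```

The simulated value should land close to the analytic 10/445, which is a decent sanity check that the shortcut is reasonable.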

Through fine-tuning and prompt engineering, the overall Hit@10 increased by 4.05 percentage points (from 23.81% to 27.86%), a 17.01% relative improvement over the zero-shot baseline. Not bad. There were other techniques I wanted to try, like more hyperparameter optimization and RAG, but I probably don’t have the time or budget for them. Still, they might boost performance further.
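To keep the two kinds of "percent" apart, here is the arithmetic spelled out. The variable names are mine; the numbers come straight from the results table.

```python
zero_shot = 23.81   # Hit@10 (%), zero-shot model
best = 27.86        # Hit@10 (%), fine-tuned with prompt engineering

absolute_gain = best - zero_shot            # difference in percentage points
relative_gain = absolute_gain / zero_shot   # gain as a fraction of the baseline

print(f"Absolute gain: {absolute_gain:.2f} percentage points")   # 4.05
print(f"Relative gain: {relative_gain:.2%}")                     # 17.01%
```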

While I treated the rare diseases as evenly distributed in the random diagnosis scenario, this assumption isn’t true, even if it’s good enough for an approximation. Because the dataset is imbalanced, Hit@10 doesn’t fully reflect performance. As an extreme example, if 75 out of 100 test patients all have the same disease, a model diagnosing that one disease every time would get at least 75% Hit@10, even though it isn’t helpful.
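The extreme example can be made concrete with a toy test set. This is a hypothetical setup: I’m assuming the other 25 patients each have a different disease, and the "model" is a stub that always returns the same single guess.

```python
from collections import defaultdict

# Hypothetical test set: 75 patients share one disease, 25 have distinct others.
true_labels = ["common"] * 75 + [f"rare_{i}" for i in range(25)]

# A degenerate "model" that always guesses the same disease.
predictions = [["common"] for _ in true_labels]

hits = [truth in guesses for truth, guesses in zip(true_labels, predictions)]
plain_hit_rate = sum(hits) / len(hits)

# Macro-averaging: compute the hit rate per disease, then weight each disease equally.
per_disease = defaultdict(list)
for truth, hit in zip(true_labels, hits):
    per_disease[truth].append(hit)
macro_hit_rate = sum(sum(v) / len(v) for v in per_disease.values()) / len(per_disease)

print(f"Plain hit rate: {plain_hit_rate:.2%}")   # 75.00%
print(f"Macro hit rate: {macro_hit_rate:.2%}")   # 3.85%
```

The plain score looks great, but averaged per disease the stub only gets 1 of 26 diseases right.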

To account for the accuracy across different diseases, I calculated the macro-averaged Hit@10 values with multi-label credit. When a case is marked as correct, the success is attributed to all disease labels of the patient, which explains why some macro-averaged Hit@10 values are higher than the plain Hit@10. Multi-label credit isn’t ideal, but it suffices for my purpose. If I had the time and patience to annotate all of the correct LLM guesses in each case instead of just marking “TRUE/FALSE” at the patient level, I would have preferred using Hamming loss.
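Roughly, the metric can be sketched like this. It’s my reading of the description above rather than the exact grading script: the function name, the input format, and the disease names "A" through "D" are all made up for the example.

```python
from collections import defaultdict


def macro_hit_rate_multilabel(cases):
    """cases: iterable of (disease_labels, hit) pairs, one per patient.

    A patient-level hit (True/False) is credited to every disease label the
    patient carries; per-disease hit rates are then averaged with equal weight.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for labels, hit in cases:
        for label in set(labels):
            totals[label] += 1
            correct[label] += int(hit)
    if not totals:
        raise ValueError("no labelled cases")
    return sum(correct[d] / totals[d] for d in totals) / len(totals)


# Toy data with placeholder disease names.
cases = [
    (["A"], True),
    (["A"], False),
    (["B", "C"], True),   # one hit, credited to both B and C
    (["D"], False),
]

plain = sum(hit for _, hit in cases) / len(cases)
macro = macro_hit_rate_multilabel(cases)

print(f"Plain hit rate: {plain:.2%}")   # 50.00%
print(f"Macro hit rate: {macro:.2%}")   # 62.50%
```

In the toy data, the single hit on the two-label patient counts once in the plain score but lifts two per-disease rates, so the macro average ends up above the plain one.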

Anyway, I found it interesting that overall Hit@10 increased while macro-averaged Hit@10 dropped after my first round of fine-tuning without prompt engineering. This suggests that while the LLM did better on the better-represented rare diseases, it got worse on the rarest ones. Similar to the 75-patient scenario, I suspect the model overpredicted the better-represented diseases. It likely picked up this behavior from the training set, which has a similar distribution of rare diseases. Instead of purely learning the patterns for diagnosis in the training set, it might have learned which diseases were more common and been rewarded for guessing them.

Fortunately, I think I suppressed this behavior by incorporating Orphadata rare disease-phenotype associations to better balance the training set. After the second round of fine-tuning with prompt engineering, macro-averaged Hit@10 rose to 28.19%, the highest across all models. While its plain Hit@10 matches the previous model, it performs better on average across the individual rare diseases.

With fine-tuning finished, I zoomed out and realized my repository was a bit messy. CSV files were everywhere and the lack of (or poor) commenting made the project hard to read. So, I spent a good amount of time on housekeeping – reorganizing files into folders, adding comments, renaming variables, etc.

Now that I have a working model, running it from the command line started feeling too clunky. I built a basic website to host it. It’s simple – just vanilla HTML/CSS/JS with a Flask API. This week, I focused primarily on the backend to get the model up and running, though the user interface is still kinda ugly.

For the rest of the time, I’ll probably continue polishing – improving the UI, writing the README, and putting together the final poster.

See you next week!

– Luoxi
